[GCP Provider] Upgrade Report: The Swamp of 'API Enablement' and Least Privilege
When upgrading the GCP provider or adding new resources, the errors you encounter most frequently are "API not enabled" and "Permission denied". Unlike AWS, GCP requires each API to be enabled on a per-project basis, and the way Terraform handles this step often leads to race conditions.
1. Symptoms
You attempt to create a google_container_cluster (GKE) via Terraform code using a Service Account (SA) that clearly has Admin privileges, but an error occurs.
[Log]
Error: googleapi: Error 403: Kubernetes Engine API has not been used in project 12345 before or it is disabled.
Enable it by visiting https://console.developers.google.com/apis/api/container.googleapis.com/...
Alternatively, even though you included code to enable the API, a timeout occurs with an "Enabling..." message.
2. Root Cause
- Race Condition: Terraform attempts to create the actual resource (GKE) before the google_project_service resource (API enablement) has fully completed on Google's side.
- Insufficient Permissions: As the provider version increases, the internal API endpoints called may change (e.g., Beta -> v1), causing existing Custom Roles to lack the necessary permissions.
3. Fix
[Strategy 1: Explicit Dependency Configuration]
Set disable_on_destroy = false on the API enablement resource to prevent accidental API disabling, and enforce strict module dependencies.
resource "google_project_service" "container" {
service = "container.googleapis.com"
disable_on_destroy = false # Important! Keep API on even if destroyed by mistake
}
resource "google_container_cluster" "primary" {
# ...
# Force execution only after API enablement is complete
depends_on = [google_project_service.container]
}
[Strategy 2: Least Privilege Template]
Granting Owner or Editor permissions is a security nightmare. Grant only the necessary roles to the CI/CD SA.
- Terraform State: Storage Object Admin
- Resource Management: Compute Admin, Kubernetes Engine Admin, Service Account User
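The role bindings above can be sketched in Terraform itself. This is a minimal illustration, not the article's exact setup: the project ID and service account email are placeholders, and the role list simply mirrors the bullets above.

```hcl
# Sketch: binding least-privilege roles to the CI/CD service account.
# "my-project" and the SA email are placeholders.
locals {
  cicd_sa = "serviceAccount:terraform-ci@my-project.iam.gserviceaccount.com"
  cicd_roles = [
    "roles/storage.objectAdmin",    # Terraform state bucket (Storage Object Admin)
    "roles/compute.admin",          # Compute Admin
    "roles/container.admin",        # Kubernetes Engine Admin
    "roles/iam.serviceAccountUser", # Service Account User
  ]
}

resource "google_project_iam_member" "cicd" {
  for_each = toset(local.cicd_roles)
  project  = "my-project"
  role     = each.value
  member   = local.cicd_sa
}
```

Using for_each here keeps each binding as a separate resource in state, so adding or removing one role never touches the others.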
4. Validation
- Verify that terraform apply passes through the API enablement stage without hanging.
- Verify that after a terraform destroy, running apply again proceeds without errors (idempotent) even if the API is already enabled.
[GKE] Upgrade: Preventing the "My Node Pool Got Deleted" Disaster
The most terrifying accident when managing GKE (Google Kubernetes Engine) with Terraform is "All pods dying because the node pool was recreated during a cluster update."
1. Anti-Pattern
This occurs when you define the node_config block directly inside the google_container_cluster resource.
- Problem: Even a slight change in cluster configuration (e.g., version upgrade) causes Terraform to attempt a "Full Cluster Recreation" or "Default Node Pool Recreation". This leads to a major outage.
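For reference, the anti-pattern looks like this. This is an illustrative sketch of what NOT to write; names are placeholders.

```hcl
# ANTI-PATTERN (do not use): node_config defined inside the cluster resource.
resource "google_container_cluster" "bad" {
  name               = "my-gke-cluster"
  location           = "asia-northeast3"
  initial_node_count = 3

  # Many changes to this inline block force the default node pool
  # (or the whole cluster) to be recreated, taking every pod down with it.
  node_config {
    machine_type = "e2-medium"
  }
}
```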
2. Fix: Decoupling Node Pools
You must completely separate the "Cluster Shell" and the "Worker Nodes".
[Step 1: Remove Default Node Pool]
resource "google_container_cluster" "primary" {
name = "my-gke-cluster"
location = "asia-northeast3"
# Core: Delete the default node pool immediately after creation
remove_default_node_pool = true
initial_node_count = 1 # Required by the API; this pool is deleted right after creation
}
[Step 2: Use Separate Node Pool Resources]
resource "google_container_node_pool" "primary_nodes" {
name = "my-node-pool-v1"
location = google_container_cluster.primary.location
cluster = google_container_cluster.primary.name
node_count = 3
node_config {
machine_type = "e2-medium"
# ...
}
# Limit the number of nodes that go down at once during upgrades
upgrade_settings {
max_surge = 1
max_unavailable = 0
}
}
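For changes that still force node pool replacement (e.g., machine_type), one optional hardening step, sketched below under the assumption that the pool name is versioned as in the example above, is to let Terraform create the replacement pool before destroying the old one:

```hcl
# Sketch: zero-downtime replacement for forced-recreate changes.
# Bump the versioned name (my-node-pool-v1 -> v2) together with the change;
# create_before_destroy then brings the new pool up before the old one drains.
resource "google_container_node_pool" "primary_nodes" {
  name    = "my-node-pool-v2"
  cluster = google_container_cluster.primary.name
  # ... same node_config and upgrade_settings as above ...

  lifecycle {
    create_before_destroy = true
  }
}
```

Note that create_before_destroy requires the new pool's name to differ from the old one, which is why the versioned naming scheme matters.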
3. Validation
- Check Plan: When upgrading the cluster version, google_container_cluster must show "update in-place", and the node pool must undergo a rolling update separately.
- IAM: Verify that the SA (service_account) used by the nodes (VMs) has the logging.logWriter and monitoring.metricWriter roles. (Logs will not be collected without them.)
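The node SA roles from the IAM check can also be managed in Terraform. A minimal sketch, assuming a dedicated node service account (the project ID and SA email are placeholders):

```hcl
# Sketch: granting the node service account the observability roles
# checked in the validation step above.
locals {
  node_sa = "serviceAccount:gke-nodes@my-project.iam.gserviceaccount.com"
}

resource "google_project_iam_member" "node_log_writer" {
  project = "my-project"
  role    = "roles/logging.logWriter"
  member  = local.node_sa
}

resource "google_project_iam_member" "node_metric_writer" {
  project = "my-project"
  role    = "roles/monitoring.metricWriter"
  member  = local.node_sa
}
```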