[GCP Provider] Upgrade Report: The Swamp of 'API Enablement' and Least Privilege
When upgrading the GCP provider or adding new resources, the errors you encounter most frequently are "API not enabled" and "Permission denied". Unlike AWS, GCP requires each API to be enabled on a per-project basis, and the way Terraform handles this step often leads to race conditions.
1. Symptoms
You attempt to create a google_container_cluster (GKE) via Terraform code using a Service Account (SA) that clearly has Admin privileges, but an error occurs.
[Log]
Error: googleapi: Error 403: Kubernetes Engine API has not been used in project 12345 before or it is disabled.
Enable it by visiting https://console.developers.google.com/apis/api/container.googleapis.com/...
Alternatively, even though you included code to enable the API, a timeout occurs with an "Enabling..." message.
2. Root Cause
- Race Condition: Terraform attempts to create the actual resource (GKE) before the google_project_service resource (API enablement) has fully completed on Google's side.
- Insufficient Permissions: As the provider version increases, the internal API endpoints called may change (e.g., Beta -> v1), causing existing Custom Roles to lack the necessary permissions.
3. Fix
[Strategy 1: Explicit Dependency Configuration]
Set disable_on_destroy = false on the API enablement resource to prevent accidental API disabling, and enforce strict module dependencies.
resource "google_project_service" "container" {
service = "container.googleapis.com"
disable_on_destroy = false # Important! Keep API on even if destroyed by mistake
}
resource "google_container_cluster" "primary" {
# ...
# Force execution only after API enablement is complete
depends_on = [google_project_service.container]
}
[Strategy 2: Least Privilege Template]
Granting Owner or Editor permissions is a security nightmare. Grant only the necessary roles to the CI/CD SA.
- Terraform State: Storage Object Admin
- Resource Management: Compute Admin, Kubernetes Engine Admin, Service Account User
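The role bindings above can be sketched in Terraform itself. This is a minimal illustration, not the article's exact setup: the project ID and service account email are placeholders, and the role list simply mirrors the bullets above.

```hcl
# Sketch: binding least-privilege roles to the CI/CD service account.
# "my-project" and the SA email are placeholders.
locals {
  cicd_sa = "serviceAccount:terraform-ci@my-project.iam.gserviceaccount.com"
  cicd_roles = [
    "roles/storage.objectAdmin",    # Terraform state bucket (Storage Object Admin)
    "roles/compute.admin",          # Compute Admin
    "roles/container.admin",        # Kubernetes Engine Admin
    "roles/iam.serviceAccountUser", # Service Account User
  ]
}

resource "google_project_iam_member" "cicd" {
  for_each = toset(local.cicd_roles)
  project  = "my-project"
  role     = each.value
  member   = local.cicd_sa
}
```

Using for_each here keeps each binding as a separate resource in state, so adding or removing one role never touches the others.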
4. Validation
- Verify that terraform apply passes through the API enablement stage without hanging.
- Verify that after a terraform destroy, running apply again proceeds without errors (idempotent) even if the API is already enabled.
[GKE] Upgrade: Preventing the "My Node Pool Got Deleted" Disaster
The most terrifying accident when managing GKE (Google Kubernetes Engine) with Terraform is "All pods dying because the node pool was recreated during a cluster update."
1. Anti-Pattern
This occurs when you define the node_config block directly inside the google_container_cluster resource.
- Problem: Even a slight change in cluster configuration (e.g., version upgrade) causes Terraform to attempt a "Full Cluster Recreation" or "Default Node Pool Recreation". This leads to a major outage.
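For reference, the anti-pattern looks like this. This is an illustrative sketch of what NOT to write; names are placeholders.

```hcl
# ANTI-PATTERN (do not use): node_config defined inside the cluster resource.
resource "google_container_cluster" "bad" {
  name               = "my-gke-cluster"
  location           = "asia-northeast3"
  initial_node_count = 3

  # Many changes to this inline block force the default node pool
  # (or the whole cluster) to be recreated, taking every pod down with it.
  node_config {
    machine_type = "e2-medium"
  }
}
```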
2. Fix: Decoupling Node Pools
You must completely separate the "Cluster Shell" and the "Worker Nodes".
[Step 1: Remove Default Node Pool]
resource "google_container_cluster" "primary" {
name = "my-gke-cluster"
location = "asia-northeast3"
# Core: Delete the default node pool immediately after creation
remove_default_node_pool = true
initial_node_count = 1 # Required by the API; this pool is deleted right after creation
}
[Step 2: Use Separate Node Pool Resources]
resource "google_container_node_pool" "primary_nodes" {
name = "my-node-pool-v1"
location = google_container_cluster.primary.location
cluster = google_container_cluster.primary.name
node_count = 3
node_config {
machine_type = "e2-medium"
# ...
}
# Limit the number of nodes that go down at once during upgrades
upgrade_settings {
max_surge = 1
max_unavailable = 0
}
}
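For changes that still force node pool replacement (e.g., machine_type), one optional hardening step, sketched below under the assumption that the pool name is versioned as in the example above, is to let Terraform create the replacement pool before destroying the old one:

```hcl
# Sketch: zero-downtime replacement for forced-recreate changes.
# Bump the versioned name (my-node-pool-v1 -> v2) together with the change;
# create_before_destroy then brings the new pool up before the old one drains.
resource "google_container_node_pool" "primary_nodes" {
  name    = "my-node-pool-v2"
  cluster = google_container_cluster.primary.name
  # ... same node_config and upgrade_settings as above ...

  lifecycle {
    create_before_destroy = true
  }
}
```

Note that create_before_destroy requires the new pool's name to differ from the old one, which is why the versioned naming scheme matters.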
3. Validation
- Check Plan: When upgrading the cluster version, google_container_cluster must show "update in-place", and the node pool must undergo a rolling update separately.
- IAM: Verify that the SA (service_account) used by the nodes (VMs) has the logging.logWriter and monitoring.metricWriter roles. (Logs will not be collected without them.)
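The node SA roles from the IAM check can also be managed in Terraform. A minimal sketch, assuming a dedicated node service account (the project ID and SA email are placeholders):

```hcl
# Sketch: granting the node service account the observability roles
# checked in the validation step above.
locals {
  node_sa = "serviceAccount:gke-nodes@my-project.iam.gserviceaccount.com"
}

resource "google_project_iam_member" "node_log_writer" {
  project = "my-project"
  role    = "roles/logging.logWriter"
  member  = local.node_sa
}

resource "google_project_iam_member" "node_metric_writer" {
  project = "my-project"
  role    = "roles/monitoring.metricWriter"
  member  = local.node_sa
}
```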