☁️ Cloud & DevOps

Building Kubernetes Cellular Architecture with GitOps

Marcus Cole
Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

GitOps · Argo CD · blast radius reduction · Kubernetes isolation · system stability

We have all been there. It is 3 AM, the pager is screaming, and you are staring at a terminal trying to understand why the entire production environment is down. You eventually trace it back to a minor configuration typo in a non-critical background service. That single typo caused a memory leak, which starved the shared ingress controller, which then took down the routing for all 500 of your microservices.

This is the reality of the modern, flat Kubernetes cluster. In our rush to embrace microservices, we built massive, multi-tenant clusters that look great on a whiteboard but act like a house of cards in production. Recently, engineers at Duolingo shared their journey of migrating hundreds of services to a Kubernetes cellular architecture. They did not do this because it was trendy; they did it to survive their own scale.

Today, we are going to look at how to build a pragmatic Kubernetes cellular architecture using GitOps. No hype, no magic—just solid engineering fundamentals designed to let you sleep through the night.

The Core Problem: Blast Radius

The real bottleneck in our infrastructure is not the speed of our deployments or the size of our nodes. It is the blast radius of our failures.

When you deploy hundreds of services into a single cluster—relying solely on logical isolation like Kubernetes Namespaces—you are building a ship without bulkheads. If a rogue deployment exhausts the IP space (a common pain point that drives teams toward IPv6-only pods), or if a bad query overloads the shared etcd database, the entire ship sinks.

Namespaces are just labels. They do not stop a noisy neighbor from consuming shared control plane resources. We need physical, hard boundaries. We need cells.
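To make that limitation concrete: a ResourceQuota can cap what the pods in a namespace consume, but nothing in a manifest like this hypothetical one touches API server request load or etcd pressure — the control plane stays shared. (The namespace name and values below are illustrative.)

```yaml
# Hypothetical quota for a "background-jobs" namespace. It caps the pods'
# CPU, memory, and count, but it cannot throttle that namespace's API
# server traffic or its share of etcd -- those bottlenecks remain shared.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: background-jobs-quota
  namespace: background-jobs
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
```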

Under the Hood: How Components Actually Fail

Before we build a solution, let us look at what happens under the hood of a failing cluster.

Think of a Kubernetes cluster like a busy restaurant kitchen. The API server is the expeditor, taking orders (YAML manifests) and assigning them to cooks (kubelets on worker nodes). etcd is the ticket rail holding all the orders.

If one waiter suddenly drops 10,000 complex, malformed orders onto the expeditor's desk, the expeditor stops processing everything else. The cooks stand around waiting, and the customers (your users) starve. It does not matter if the orders were for appetizers or main courses; the shared bottleneck brings the whole operation to a halt.

A cellular architecture changes this. Instead of one massive kitchen, you build multiple, independent food trucks. If one food truck catches fire, the others keep serving tacos. In Kubernetes terms, a cell is a fully isolated environment—often its own cluster or a strictly partitioned set of dedicated nodes and control planes—that handles a specific shard of traffic.

[Diagram: Flat Cluster vs. Cellular Architecture. Flat architecture (high risk): one shared control plane, so a single failure impacts all services. Cellular architecture (resilient): Cell A and Cell B each run their own control plane, so a failure is contained to Cell B.]

The Pragmatic Solution: Step-by-Step Tutorial

We are going to build a foundational cellular architecture using GitOps. Why GitOps? Because managing multiple isolated cells manually is a fast track to configuration drift. If Cell A and Cell B are configured differently, you no longer have a reliable system; you have two unique, fragile snowflakes. Argo CD will be our delivery mechanism, ensuring every cell is an exact replica of our Git repository.

Prerequisites

Before we start, you will need the following tools installed on your local machine:

  • Docker (to run our local clusters)

  • Kind (Kubernetes IN Docker) to simulate our cells

  • kubectl (the Kubernetes command-line tool)

  • Argo CD CLI (to interact with our GitOps controller)

  • A GitHub account to host your configuration repository


Step 1: Define the Cell Structure

Before we write any configuration, we need to understand why we structure our repository this way. We use the "App of Apps" pattern. Instead of telling Argo CD to deploy 50 individual services, we tell it to deploy one "Root Application." This root application contains the definitions for all other applications that belong in a cell. This means bootstrapping a new cell is as simple as pointing Argo CD at the root application.

Create a Git repository with the following directory structure:

clusters/
  ├── cell-us-east-1a/
  │   └── root-app.yaml
  └── cell-us-east-1b/
      └── root-app.yaml
apps/
  ├── ingress-controller/
  ├── identity-service/
  └── core-backend/
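Each directory under apps/ holds one child application. With the App of Apps pattern, the root application deploys Application resources like this hypothetical one for the ingress controller (the repoURL, path, and namespace are placeholders to adapt):

```yaml
# apps/ingress-controller/app.yaml -- illustrative child Application.
# The root app syncs this file; Argo CD then syncs the manifests it points at.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-controller
  namespace: argocd
spec:
  project: default
  source:
    repoURL: 'https://github.com/your-username/your-cell-repo.git'
    path: apps/ingress-controller/manifests
    targetRevision: HEAD
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: ingress
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```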

Step 2: Set up the Base Clusters (The Foundation)

We need to create the physical boundaries. In production, these would be separate EKS, GKE, or bare-metal clusters. For this tutorial, we will use Kind to spin up two isolated clusters representing our cells.

Run the following commands in your terminal:

# Create Cell A
kind create cluster --name cell-us-east-1a

# Create Cell B
kind create cluster --name cell-us-east-1b

Why are we doing this? By separating the clusters at the infrastructure level, a CPU spike or an exhausted IP pool in cell-us-east-1a cannot physically impact the workloads running in cell-us-east-1b.
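If you want each cell to look a little more like production, kind accepts a cluster config file. A minimal sketch (assuming a file named cell.yaml) gives each cell its own control plane plus two workers:

```yaml
# cell.yaml -- illustrative kind config. Pass it to both create commands:
#   kind create cluster --name cell-us-east-1a --config cell.yaml
#   kind create cluster --name cell-us-east-1b --config cell.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```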

Step 3: Bootstrap Argo CD (The Delivery Mechanism)

Now we need to install our operator inside each cell. Argo CD will sit inside the cluster, watch our Git repository, and pull the desired state inward. This is much more secure than pushing changes from an external CI server, which requires giving your CI pipeline cluster-admin credentials.

Switch your context to the first cell and install Argo CD:

kubectl config use-context kind-cell-us-east-1a

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Repeat this process for cell-us-east-1b.

Step 4: Deploy the Cell via GitOps

Now we define our root-app.yaml. This file tells Argo CD to look at the apps/ directory in our repository and deploy everything it finds.

Why do we define this in YAML instead of clicking through the Argo CD UI? Because changes made through a UI cannot be version controlled, peer-reviewed, or rolled back.

Create the root-app.yaml file:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cell-bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: 'https://github.com/your-username/your-cell-repo.git'
    path: apps
    targetRevision: HEAD
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Apply this file to your cluster:

kubectl apply -f clusters/cell-us-east-1a/root-app.yaml

Argo CD will instantly wake up, read the repository, and begin stamping out the ingress controllers, identity services, and core backends exactly as defined in your Git repository.

Verification

To confirm your cellular architecture is working, check the status of your applications in both cells.

First, port-forward the Argo CD UI for Cell A:

kubectl port-forward svc/argocd-server -n argocd 8080:443

Log in (the default username is admin, and you can retrieve the password using kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d).

You should see your cell-bootstrap application glowing green, indicating that the cell's state perfectly matches your Git repository. If you repeat this for Cell B, you will see an identical, perfectly isolated replica.

Troubleshooting

Even the most pragmatic systems hit bumps in the road. Here is what to look out for:

1. Argo CD is stuck in a Sync Loop
The Symptom: Your application constantly toggles between 'Synced' and 'Out of Sync'.
The Fix: This usually happens when a Kubernetes Mutating Webhook (like a service mesh sidecar injector) modifies your YAML after Argo CD applies it. Argo CD sees the change, thinks it drifted from Git, and reapplies. To fix this, use Argo CD's ignoreDifferences feature in your Application manifest to tell it to ignore specific fields like sidecar container injections.
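As a sketch, an ignoreDifferences stanza in the Application spec might look like this (the jsonPointers path here is illustrative; point it at whichever field your webhook actually mutates):

```yaml
# Added to the Application spec -- tells Argo CD not to treat
# webhook-injected fields on Deployments as drift from Git.
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/template/spec/containers/1
```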

2. Pods are stuck in Pending state
The Symptom: You deployed your cell, but nothing is running.
The Fix: In a cellular architecture, resources are strictly partitioned. You likely forgot to define proper Resource Requests and Limits in your application manifests, or the cell is simply out of compute. Run kubectl describe pod and look at the events at the bottom. If it says FailedScheduling, you need to provision larger nodes for the cell or adjust your resource requests.
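For reference, a minimal resources block on a container spec looks like this (the values are illustrative; size them from real usage data):

```yaml
# Illustrative container resources. Requests drive scheduling decisions;
# limits cap runtime consumption. An undersized cell fails at scheduling
# time with FailedScheduling events.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```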

3. Cross-Cell Communication Failures
The Symptom: Service A in Cell 1 cannot talk to Service B in Cell 2.
The Fix: This is actually by design! Cells should be independent. If they strictly require synchronous communication, you are coupling them, which defeats the purpose of the blast radius reduction. Re-evaluate your architecture to use asynchronous event streaming (like Kafka) between cells, or route traffic through your external global load balancer.

What You Built

You just built the foundation of a highly resilient, distributed system. By separating your infrastructure into isolated cells and using GitOps to ensure configuration consistency, you have guaranteed that a failure in one environment will not cascade and destroy your entire platform. You have traded the illusion of a single, easily managed pane of glass for the reality of a robust, survivable architecture.

There is no perfect system. There are only recoverable systems.


Frequently Asked Questions

How big should a Kubernetes cell be? There is no universal rule, but pragmatically, a cell should be sized to a specific slice of your user base or a specific geographic region. Work backwards from your tolerance for loss: if losing a single cell must cost you no more than 20% of total capacity, you need at least five cells. Keep them small enough that losing one is an inconvenience, not a disaster.
Does running multiple cells cost more than one large cluster? Initially, yes. You are duplicating control planes (like the Kubernetes API server and etcd) and foundational services (like ingress controllers and logging daemonsets). However, this overhead is the premium you pay for insurance. The cost of a single total-system outage at 3 AM almost always dwarfs the compute cost of running redundant control planes.
How do we route user traffic to the correct cell? You need a global load balancer or an intelligent DNS routing layer sitting above your Kubernetes clusters. When a user makes a request, the global router determines which cell they belong to (often using a shard key like their User ID or geographic location) and forwards the traffic to that specific cell's ingress controller.
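A minimal sketch of deterministic shard-key routing, assuming two cells and a hash of the user ID (a real global load balancer or DNS layer would do the equivalent at the edge; the IDs and cell names are placeholders):

```shell
# Hypothetical shard-key router: hash the user ID with cksum and map it
# to one of two cells. The point is that the mapping is deterministic --
# the same user always lands in the same cell.
route_user() {
  user_id="$1"
  hash=$(printf '%s' "$user_id" | cksum | awk '{print $1}')
  if [ $((hash % 2)) -eq 0 ]; then
    echo "cell-us-east-1a"
  else
    echo "cell-us-east-1b"
  fi
}

route_user "user-12345"
```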
What if a cell needs to be updated? This is where cellular architecture shines. You update your Git repository to roll out the new version to Cell A only. You monitor Cell A for errors. If it remains stable, you update the Git definitions for Cell B, Cell C, and so on. This is a true canary deployment at the infrastructure level.
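One way to express this in Git is to pin each cell's root application to a release tag (the tags below are placeholders): the canary cell tracks the new tag while the other cells stay a release behind.

```yaml
# clusters/cell-us-east-1a/root-app.yaml -- canary cell gets the new tag
spec:
  source:
    targetRevision: v2.4.0   # illustrative tag
---
# clusters/cell-us-east-1b/root-app.yaml -- stable cells lag one release
spec:
  source:
    targetRevision: v2.3.0   # illustrative tag
```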
