☁️ Cloud & DevOps

Cellular vs Flat Kubernetes: Which Architecture Scales?

Marcus Cole
Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

flat clusters, GitOps deployment, Argo CD scaling, cluster isolation, blast radius

We have all been there. It is 3 AM, your pager is screaming, and you are staring at a dashboard painted entirely in red. You didn't deploy anything. Your team didn't deploy anything. But somewhere in your massive, sprawling Kubernetes cluster, a minor internal tool got stuck in a crash loop, overwhelmed the control plane, and took down the entire production environment.

This is the reality of scaling modern infrastructure. We like to pretend that Kubernetes is a magical orchestrator that effortlessly handles whatever we throw at it. It is not. It is a distributed database duct-taped to a fleet of node agents. And when you push it past its limits, it breaks in spectacular, unpredictable ways.

Recently, Franka Passing shared insights into Duolingo's migration of 500+ backend services on Kubernetes. To survive that level of scale and complexity, they had to make a fundamental shift away from traditional infrastructure patterns.

Today, we are looking at the two primary ways to structure your environments at scale: Kubernetes cellular architecture versus traditional flat clusters.

The Reality Check

For the past few years, the default advice in the DevOps community has been to consolidate. "Put everything in one big cluster," they said. "It improves resource utilization and simplifies management."

And for a while, it works. But as your engineering organization grows, a flat cluster becomes a massive open-plan warehouse: highly efficient for moving things around, but if a fire starts in one corner, the entire building burns down.

In a flat cluster, every service shares the same control plane, the same network space, and the same DNS resolution. A single misconfigured ingress controller or a runaway service that exhausts your AWS API rate limits will cause collateral damage to completely unrelated services. I have seen entire e-commerce platforms go dark because a staging deployment consumed all available IP addresses in the VPC.

The Core Problem: Blast Radius

The bottleneck in scaling infrastructure is rarely compute power; it is the blast radius.

Think of a major shipping harbor. A flat cluster is like having one massive dock where every ship, from tiny fishing boats to massive oil tankers, unloads at the same time. If a crane breaks or a ship catches fire, the entire port halts operations.

Cellular architecture is the process of building multiple, physically isolated docks (cells). Each dock has its own cranes, its own staff, and its own entry channels. If Dock A catches fire, Docks B, C, and D continue operating as if nothing happened. It costs more to build and maintain, but it guarantees that a single failure cannot sink your entire logistics network.

Under the Hood: The Hard Way

Before we compare them, we need to understand what is physically happening inside the servers.

In a flat cluster, every time you deploy a new service, the Kubernetes control plane updates etcd (its brain). The kubelet and kube-proxy on every single node in the cluster must then update their local routing rules (iptables or IPVS) to know how to reach that new service. If you have 500 services and 1,000 nodes, a single deployment triggers a massive wave of network updates. Eventually, the control plane chokes, or you run out of IPv4 addresses in your VPC—which is exactly why companies like Duolingo are forced to migrate to IPv6-only pods.
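One common mitigation inside a flat cluster is switching kube-proxy from iptables to IPVS mode, which handles large service counts with hash-table lookups instead of ever-growing rule chains. Below is a minimal sketch of a KubeProxyConfiguration; the sync periods and scheduler are illustrative assumptions, not tuned recommendations:

```yaml
# Minimal kube-proxy sketch: IPVS mode routes services via hash
# tables, so each node pays far less per service update than with
# long iptables chains. Values below are illustrative, not tuned.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  syncPeriod: 30s
  minSyncPeriod: 5s
  scheduler: "rr"   # round-robin across service endpoints
```

Note that IPVS only softens the per-node cost of each update; it does nothing about the control-plane and etcd load itself.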

In a cellular architecture, you divide your infrastructure into self-contained units (cells). A cell might be a dedicated Kubernetes cluster, or a strictly isolated node group in its own VPC. Each cell contains a replica of your core services. A global routing layer (like Route53 or a global load balancer) sits above the cells and distributes user traffic. If a cell's control plane dies, the global router simply stops sending traffic to it. The state is localized. The routing tables are small. The blast radius is contained.
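The global routing layer can be as simple as weighted DNS. The CloudFormation sketch below (the zone and cell endpoints are hypothetical) publishes one weighted record per cell; dropping a cell's weight to 0 drains user traffic toward the healthy cells during an incident:

```yaml
# Hypothetical weighted Route53 records, one per cell.
# Setting a cell's Weight to 0 stops routing users to it.
Resources:
  CellARecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: cell-a
      Weight: 100
      ResourceRecords:
        - cell-a-ingress.example.com
  CellBRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: cell-b
      Weight: 100
      ResourceRecords:
        - cell-b-ingress.example.com
```

In production you would pair each record with a Route53 health check so the drain happens automatically instead of waiting for a human to edit weights.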

Side-by-Side Analysis

Let's break down how these two approaches compare across the realities of daily operations.

1. Blast Radius & Fault Tolerance

Flat Clusters: High risk. A failure in the control plane, a bad Custom Resource Definition (CRD) rollout, or a network plugin crash will affect all workloads simultaneously.

Cellular Architecture: Low risk. Because each cell operates independently, a catastrophic failure in Cell 1 has zero impact on Cell 2. You can safely route traffic away from the burning cell while you investigate.

2. Network Overhead & IP Exhaustion

Flat Clusters: As you scale, you will hit the limits of your VPC's IP space. Every pod needs an IP. Workarounds like custom CNIs or dual-stack IPv6 add immense complexity to a system that is already hard to debug.

Cellular Architecture: Network spaces are compartmentalized. Because cells are smaller and isolated, you can reuse IP ranges across different cells (if they don't need to peer directly), drastically reducing the pressure on your network infrastructure.

3. Operational Complexity

Flat Clusters: Relatively low. You have one API server to authenticate against, one set of monitoring tools to check, and one deployment pipeline.

Cellular Architecture: Very high. You are now managing a distributed system of distributed systems. You cannot manually apply configurations to 10 different cells.

4. Cost Efficiency

Flat Clusters: Highly efficient. You can bin-pack workloads tightly onto large nodes, sharing resources and minimizing wasted compute.

Cellular Architecture: Less efficient. You are paying for redundant control planes, redundant ingress controllers, and baseline overhead for every single cell you provision.

The GitOps Requirement

If you choose to move to a cellular architecture, you must fundamentally change how you deploy software.

When you have five isolated cells, manually running kubectl apply is a recipe for disaster. Human error guarantees that Cell A will eventually be running version 1.2 of your app, while Cell B runs version 1.3. This configuration drift will cause unpredictable bugs that are nearly impossible to trace.

We need a way to declare our desired state in one place and have all cells automatically pull and apply that state. This is why a GitOps deployment tool like Argo CD becomes mandatory.

Instead of pushing code to clusters, we write an ApplicationSet that tells Argo CD to stamp out our configuration across all registered cells. Here is what that looks like under the hood:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: backend-services
spec:
  generators:
  # This dynamically targets all our isolated cells
  - clusters: {}
  template:
    metadata:
      name: '{{name}}-backend'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/manifests.git
        targetRevision: HEAD
        path: apps/backend
      destination:
        server: '{{server}}'
        namespace: backend-prod
```

This isn't just automation for the sake of being fancy. It is a structural requirement to ensure that your isolated cells remain identical clones of each other.
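For the clusters generator to find your cells, each cell has to be registered with Argo CD. Registration is just a labeled Secret in the argocd namespace; the sketch below uses hypothetical cell names and endpoints:

```yaml
# Hypothetical cell registration: the clusters generator discovers
# any Secret labeled argocd.argoproj.io/secret-type: cluster.
apiVersion: v1
kind: Secret
metadata:
  name: cell-a
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: cell-a
  server: https://cell-a.k8s.example.com
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": { "caData": "<redacted>" }
    }
```

In practice you rarely write this by hand; running `argocd cluster add <kubeconfig-context>` creates the Secret for you.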

Feature Comparison

| Criteria | Flat Kubernetes Cluster | Cellular Architecture |
| --- | --- | --- |
| Primary Goal | Resource efficiency and simplicity | Fault isolation and high availability |
| Blast Radius | Cluster-wide | Contained to a single cell |
| Control Plane Load | High (exponential growth) | Low (distributed across cells) |
| Deployment Model | Direct CI/CD push or GitOps | Strict GitOps required (Argo CD/Flux) |
| Infrastructure Cost | Optimized (shared resources) | Higher (redundant control planes) |
| IP Address Pressure | Severe at scale | Manageable per cell |


Architecture Decision Flowchart (in text form): Are you managing more than 100 services? If not, stick to a flat cluster. If so, do single failures cause cluster-wide outages? If not, stick to a flat cluster. If so, do you have a strict GitOps pipeline? If not, build one before migrating; if you do, adopt cellular architecture.


Which Should You Choose?

If you are running fewer than 100 services, or if your engineering team is smaller than 50 people, stick to a flat cluster.

I know the temptation to over-engineer is strong. We all want to build the architectures we read about on engineering blogs. But the best code is code you don't write, and the best infrastructure is infrastructure you don't have to manage. A well-tuned flat cluster with proper namespace isolation and resource quotas will carry you incredibly far. Don't take on the burden of distributed state management until your business absolutely demands it.
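"Proper namespace isolation and resource quotas" is concrete, boring Kubernetes. Here is a sketch of the guardrails that keep one team's runaway deployment from starving the rest of a flat cluster; the namespace name and limits are illustrative, not recommendations:

```yaml
# Illustrative per-team guardrails for a flat cluster.
# The ResourceQuota caps the namespace's total footprint.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-checkout-quota
  namespace: team-checkout
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "200"
---
# The LimitRange fills in sane defaults for pods that omit limits,
# so the quota above can actually be enforced.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-checkout-defaults
  namespace: team-checkout
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
```

These two objects are cheap to maintain and solve most "noisy neighbor" incidents without any cells at all.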

However, if you are hitting hard limits—if your etcd latency is spiking, if you are running out of IP addresses, or if a single bad deployment frequently takes down unrelated services—it is time to look at Kubernetes cellular architecture.
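"etcd latency is spiking" should be an alert, not a post-incident discovery. Below is a minimal PrometheusRule sketch, assuming you scrape etcd and run the Prometheus Operator; the metric is the standard etcd WAL fsync histogram, but the threshold and duration are assumptions to verify against your own baselines:

```yaml
# Hypothetical alert: sustained slow etcd WAL fsyncs usually
# precede control-plane instability in an overloaded flat cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-latency
  namespace: monitoring
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdWALFsyncSlow
          expr: |
            histogram_quantile(0.99,
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "etcd p99 WAL fsync latency is above 500ms"
```

If this alert fires regularly despite healthy disks, that is your signal that the cluster has outgrown a single control plane.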

When you reach the scale of a company like Duolingo, the priority shifts from "how cheaply can we run this?" to "how do we guarantee this never goes down?" Cellular architecture provides the physical bulkheads needed to stop cascading failures. Just ensure you have the operational maturity (and a rock-solid GitOps foundation) to handle the complexity before you make the leap.

There is no perfect system. There are only recoverable systems.

FAQ

What exactly defines a "cell" in Kubernetes?

A cell is a fully self-contained unit of infrastructure capable of serving a subset of your user traffic independently. Pragmatically, this usually means an entirely separate Kubernetes cluster with its own control plane, worker nodes, and local data caches, isolated within its own network boundary.

Can I use namespaces instead of a cellular architecture?

Namespaces provide logical isolation, not physical isolation. While a namespace can restrict RBAC and resource usage (via quotas), all namespaces still share the same underlying control plane, etcd database, and network ingress. If the control plane crashes, all namespaces fail.

Why is GitOps mandatory for cellular architecture?

When operating multiple independent cells, configuration drift becomes your biggest enemy. If you rely on manual deployments or push-based CI/CD scripts, one cell will inevitably end up with a different configuration than the others. GitOps tools like Argo CD ensure that a single source of truth in Git is continuously synchronized across all cells automatically.

