☁️ Cloud & DevOps

Kubernetes Platform Engineering: The RBC Case Study

Marcus Cole
Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

cluster lifecycle management · configuration drift · immutable infrastructure · enterprise DNS integration · GitOps deployment

Let's start with a reality check. We spend years building beautiful, stateless microservices, wrapping them in neat little containers, and deploying them with sophisticated CI/CD pipelines. We think we've abstracted away the messy reality of hardware. But then, at 3 AM on a Sunday, your pager goes off. A pod won't schedule. Network traffic is blackholing. You dig through the logs, bleary-eyed, only to realize the problem isn't your elegant application code. The problem is that the underlying virtual machine running your Kubernetes node has a slightly different kernel version than the rest of the fleet because a patch script failed three months ago.

Kubernetes didn't eliminate our infrastructure problems; it just pushed them down a layer.

Recently, Erick Bourgeois and the team at RBC Capital Markets shared their journey of modernizing their Kubernetes platform across 50+ clusters. Their story isn't about chasing the latest shiny object. It's a masterclass in pragmatic Kubernetes platform engineering in a highly regulated environment. They didn't build a complex abstraction layer to hide their problems; they fixed the plumbing.

Here is how they tackled the hard reality of operating distributed systems at scale, and what we can learn from their approach.

The Challenge: When the Abstraction Leaks

If you run a single Kubernetes cluster, you can afford to hand-craft a few things. But RBC Capital Markets operates over 50 clusters spanning on-premises VMware environments and multiple public clouds. In the capital markets sector, this isn't just an operational headache; it's a strict compliance issue. Frameworks like SOX and PCI-DSS don't care about your deployment velocity; they care about auditability and drift prevention.

The core bottleneck for the platform team wasn't the technology itself—it was the manual, ticket-driven lifecycle of the infrastructure. They identified three massive gaps:

1. Configuration Drift: Virtual machines that had been patched, mutated, and tweaked over time were becoming impossible to reason about. They were 'snowflake' servers.
2. Cluster Provisioning: Spinning up new clusters for trading desks was a multi-day manual exercise.
3. DNS Integration: Every new service endpoint required a manual ticket to the network team to update enterprise DNS.

Think of a busy shipping harbor. Kubernetes is the crane system, efficiently moving containers around. But if the concrete docks (your nodes) are crumbling, and the harbor master's ledger (your DNS) takes three days to update via paper mail, the efficiency of your cranes is completely irrelevant.

Under the Hood: Fixing the Plumbing

To solve these problems, the team didn't write a massive, fragile orchestration script. They looked for targeted tools that solved specific underlying problems: Kairos for immutable nodes, k0rdent for cluster lifecycle management, and bindy for DNS.

Let's break down how these components interact without the marketing fluff.

The End of Snowflake Nodes with Kairos

For years, our industry relied on configuration management tools (like Ansible or Chef) to log into running servers and mutate them into a desired state. But over time, servers drift. A developer manually installs a package to debug an issue and forgets to remove it. A disk fills up.

The pragmatic solution? Stop patching servers. The best code is code you don't write, and the best server patching strategy is not patching servers at all.

This is where immutable infrastructure comes in. RBC adopted Kairos to ensure every node is reproducible and tamper-evident at boot.

Figure: The Immutable Node Boot Process. Traditional (mutable) node: boot from persistent disk → apply configuration scripts (drift risk) → run workloads on mutated state. Immutable node (Kairos): boot from verified container image → mount OS as read-only → join cluster (identical state guaranteed).

Under the hood, Kairos allows you to build bootable OS images from standard container images. When a node boots, it pulls this image, verifies its cryptographic signature, and runs it entirely in memory (or as an immutable layer). If a node starts behaving strangely, you don't SSH into it to debug. You kill it and let the cluster provision a fresh one. It's like a restaurant kitchen: if a cutting board gets contaminated, you don't try to scrub it while the chef is chopping vegetables. You throw it in the wash and grab a fresh, identical board.
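The tamper-evidence idea can be sketched in a few lines. This is not Kairos itself — Kairos verifies cryptographic signatures over full container images during its boot process — but a minimal, hypothetical digest check (stdlib only, invented names) that shows the core rule: a node never runs an image that doesn't match the pinned value from the release pipeline.

```python
import hashlib

# Hypothetical sketch of an immutable node's boot-time check.
# PINNED_DIGEST stands in for the signed digest published by the
# image build pipeline; the "image" here is just bytes.
PINNED_DIGEST = hashlib.sha256(b"node-os-image-v1.28.4").hexdigest()

def verify_image(image_bytes: bytes, pinned_digest: str) -> bool:
    """Return True only if the image hashes to the expected digest."""
    return hashlib.sha256(image_bytes).hexdigest() == pinned_digest

def boot(image_bytes: bytes) -> str:
    if not verify_image(image_bytes, PINNED_DIGEST):
        # A mutated or corrupted image never boots -- the node gets
        # replaced, not debugged in place.
        raise RuntimeError("image digest mismatch: refusing to boot")
    return "booted read-only from verified image"

print(boot(b"node-os-image-v1.28.4"))
```

The point of the sketch: drift isn't detected after the fact by an audit script, it is structurally impossible, because the only path onto the node is through the verification gate.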

Cluster Lifecycle Management with k0rdent

Managing one cluster is a task. Managing 50 requires a system. RBC utilized k0rdent to handle the lifecycle of the clusters themselves.

Before we look at any configuration, understand why we need a declarative approach to clusters. If you create clusters by clicking through a cloud provider's web console, you have no audit trail. If the cluster dies, rebuilding it requires human memory. By defining a cluster as code, the infrastructure becomes self-documenting.

Here is what a simplified declarative cluster definition looks like in a GitOps workflow. We define the desired state, and a controller running in a management cluster constantly reconciles reality against this file:

apiVersion: k0rdent.mirantis.com/v1alpha1
kind: Cluster
metadata:
  name: trading-desk-prod-01
  namespace: fleet-management
spec:
  version: 1.28.4
  topology:
    controlPlane:
      replicas: 3
      machineTemplate: vsphere-cp-template
    workers:
      - name: high-frequency-pool
        replicas: 10
        machineTemplate: vsphere-worker-template

No magic here. Just a control loop reading a file and making API calls to VMware or AWS to ensure 3 control plane nodes and 10 worker nodes exist. If a node vanishes, the loop replaces it.
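That control loop can be sketched in a dozen lines. This is an illustrative stand-in, not k0rdent's actual code — the real controller makes authenticated calls to VMware or AWS, while this hypothetical version just returns the plan it would execute for each node pool:

```python
# Hypothetical sketch of the reconcile step behind a declarative
# cluster controller: read desired state, observe actual state,
# and emit the create/delete actions that close the gap.

def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to make `observed` match `desired`."""
    actions = []
    for pool, want in desired.items():
        have = observed.get(pool, 0)
        if have < want:
            actions += [f"create node in {pool}"] * (want - have)
        elif have > want:
            actions += [f"delete node in {pool}"] * (have - want)
    return actions

# Desired state mirrors the manifest above: 3 control plane nodes,
# 10 workers. Observed state has lost a control plane node and
# somehow gained an extra worker.
desired = {"control-plane": 3, "high-frequency-pool": 10}
observed = {"control-plane": 2, "high-frequency-pool": 11}
print(reconcile(desired, observed))
```

Run this loop every few seconds against live infrastructure and you have the essence of GitOps: the file in Git is the desired state, and reality is continually bent toward it.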

Bridging the DNS Gap with Bindy

This is perhaps the most painful part of enterprise infrastructure. You deploy a new service to Kubernetes. It gets an internal IP. But for the rest of the company to talk to it, you need a DNS record. In many organizations, this means submitting a Jira ticket to the network team, waiting three days, and hoping they don't make a typo in the Infoblox console.

RBC used bindy to integrate Kubernetes service discovery directly with their enterprise DNS infrastructure.

Figure: Automated enterprise DNS flow. Inside the Kubernetes cluster: (1) an Ingress resource is created; (2) the bindy controller watches for it; (3) it makes an API call (RFC 2136 or a provider API) to the enterprise DNS server, which gains the record: api.trading.rbc.internal A 10.45.2.100.

How does this work? It's just an event listener. When a developer deploys an Ingress or Service of type LoadBalancer, the Kubernetes API server broadcasts an event. The bindy controller catches that event, extracts the desired hostname and the assigned IP address, and makes an authenticated API call to the enterprise DNS provider to create or update the record. When the service is deleted, the record is removed.

No tickets. No stale records pointing to dead IPs. Just clean, deterministic plumbing.
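The event-handler pattern is simple enough to sketch. This is not bindy's code — the real controller watches the Kubernetes API server and talks to enterprise DNS (for example via RFC 2136) — but a hypothetical, in-memory version where a plain dict stands in for the zone, so the control flow is visible:

```python
# Hypothetical sketch of an event-driven DNS sync controller.
# `zone` stands in for the enterprise DNS zone; in reality each
# branch would be an authenticated call to the DNS server.
zone = {}  # hostname -> A record IP

def handle_event(event_type: str, hostname: str, ip: str) -> None:
    """Mirror Ingress lifecycle events into DNS records."""
    if event_type in ("ADDED", "MODIFIED"):
        zone[hostname] = ip          # upsert the A record
    elif event_type == "DELETED":
        zone.pop(hostname, None)     # no stale record left behind

# A service appears, then is torn down:
handle_event("ADDED", "api.trading.rbc.internal", "10.45.2.100")
handle_event("DELETED", "api.trading.rbc.internal", "10.45.2.100")
```

Because deletion is handled by the same listener as creation, records can't outlive the services they point to — which is exactly the stale-record problem that ticket-driven DNS never solves.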

Results & Numbers: The Pragmatic Impact

When you stop fighting your infrastructure and start treating it as disposable, the metrics speak for themselves. While exact internal figures vary based on the specific workload, the architectural shift at RBC Capital Markets represents a night-and-day difference in operational overhead.

Metric | Before (Mutable & Manual) | After (Immutable & GitOps)
Cluster Provisioning Time | Days to weeks (ticket-based) | Minutes (declarative API)
Node Configuration Drift | High (snowflake servers) | Zero (tamper-evident boot)
DNS Update Lead Time | 24-72 hours (network team queue) | Seconds (event-driven controller)
Compliance Auditability | Manual log gathering | Git commit history as source of truth
3 AM Operator Panic | High (unknown system states) | Low (kill and replace nodes)

Lessons for Your Team

If you are staring down a multi-cluster Kubernetes environment, don't start by installing a dozen service meshes or complex traffic routing tools. Look at your base layer.

1. Standardize the Base: If your nodes aren't identical, your clusters will never be stable. Move away from configuration management and toward immutable OS images. Treat your nodes like cattle, not pets.
2. Automate the Tedious, Not the Complex: DNS updates aren't complex; they are tedious. By automating the bridge between Kubernetes and enterprise systems, you remove human bottlenecks without adding unnecessary architectural complexity.
3. Embrace GitOps for Infrastructure: Your clusters should be defined in a Git repository, just like your application code. If a data center burns down, you shouldn't have to remember how to rebuild the control plane.

We spend too much time arguing about which shiny new tool to use, and not enough time ensuring our foundational layers are rock solid. Technology is just a tool for solving problems. If your tool requires a human to manually patch a kernel or submit a DNS ticket, it's not solving the problem; it's just keeping you busy.

There is no perfect system. There are only recoverable systems.


FAQ

What is configuration drift and why is it dangerous? Configuration drift occurs when a server's actual state diverges from its intended state over time due to manual patches, hotfixes, or failed updates. It is dangerous because it makes systems unpredictable; a deployment might succeed on Node A but fail on Node B, making troubleshooting incredibly difficult during an outage.
How does an immutable OS like Kairos differ from traditional Linux? Traditional Linux distributions are mutable; you can write to the root filesystem, update packages in place, and change configurations on the fly. An immutable OS boots from a verified, read-only image. If you need to update a package, you don't patch the running system—you build a new image and reboot the node into the new state.
Why not just use external-dns instead of bindy? While external-dns is a fantastic standard tool for syncing Kubernetes services to cloud DNS providers (like Route53 or Cloudflare), highly regulated enterprise environments often use complex, on-premises DNS infrastructure (like Infoblox or custom Bind setups) that require specific authentication, policy enforcement, and integration patterns that purpose-built tools handle more gracefully.
Does GitOps replace Terraform for cluster provisioning? Not necessarily. They often work together. Terraform is excellent for provisioning the raw underlying infrastructure (VPCs, subnets, IAM roles), while GitOps controllers (like FluxCD or k0rdent) excel at continually reconciling the state of the Kubernetes clusters and workloads running on top of that infrastructure.

