☁️ Cloud & DevOps

Pragmatic Cloud-Native Infrastructure Management

📅 May 27, 2026

Marcus Cole

Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

infrastructure-as-code, platform engineering, observability pipeline, state drift, kubernetes ecosystem

It is 3:14 AM. Your phone vibrates on the nightstand. The incident alert simply says: High Latency on Payment Gateway. You groggily open your laptop, pull up a dashboard that looks like the cockpit of a commercial airliner, and stare at a wall of red lines. You didn't write the payment service. You didn't deploy the latest Helm chart. But you are the one holding the bag because the infrastructure abstraction leaked, and the system is failing silently.

Look, I've been there. We have spent the last decade building incredibly complex systems to manage... other complex systems. We call it cloud-native infrastructure management, but most days, it feels like we are just stacking abstractions until the tower inevitably collapses. We treat Kubernetes like magic, assuming that if we just write enough YAML, the system will take care of itself.

Today's news highlights this exact struggle. Platform Engineering Labs just announced that their open-source Infrastructure-as-Code (IaC) platform, formae, is adding full Kubernetes support and native Helm integration. On the other side of the spectrum, observability vendors like Selector AI are pushing platforms that ingest multi-domain data to make sense of our sprawling networks, while analysts like Kin Lane are warning us that the financial bill for all this complexity is rapidly coming due.

Let's strip away the marketing fluff and look at what is actually happening under the hood. Because before you adopt another tool to manage your tools, you need to understand the fundamental plumbing of your infrastructure.

The Reality Check: Abstraction is a Liability

In our quest for developer velocity, we have created a plumbing nightmare. Think of your infrastructure like a city's water system. You have the main reservoir (your cloud provider), the main pipes (your network), and the faucets in people's homes (your applications).

Instead of just connecting the pipes to the faucets, we've installed automated pressure regulators, smart meters, routing meshes, and dynamic valve controllers. When water stops flowing, you no longer just check for a leak. You have to check if the smart meter misread the pressure, which caused the routing mesh to divert water to a backup reservoir, which triggered a security policy that shut down the valve entirely.

This is what happens when we blindly adopt microservices and complex CI/CD pipelines. The complexity doesn't disappear; it just shifts from the application code to the infrastructure layer. And when it breaks, it breaks hard.

The Core Problem: Disconnected State and Telemetry Overload

The real bottleneck in our infrastructure isn't a lack of features. It is a lack of unified state and actionable visibility. We are fighting two distinct battles:

1. State Drift: Your IaC tool (like Terraform or formae) thinks the infrastructure looks one way, but Kubernetes—which is constantly adjusting itself—has changed the actual state.
2. Telemetry Overload: We are generating terabytes of logs, metrics, and traces, but when the system crashes, we still can't find the root cause because the data is siloed.

Let's break down how these systems actually interact.

Under the Hood: The State Reconciliation Conflict

To understand why integrating Helm and Kubernetes into an IaC tool like formae is a big deal, we have to look at how these systems handle "state."

Think of Kubernetes like a busy restaurant kitchen. You (the user) write an order on a ticket (a declarative YAML manifest). You hand it to the Head Chef (the Kubernetes Control Plane). The Head Chef looks at the ticket and yells at the line cooks (the Kubelets running on worker nodes) to make the kitchen match the ticket. If a line cook drops a steak, the Head Chef sees the discrepancy and orders another steak to be cooked. This is a continuous reconciliation loop.
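If the kitchen metaphor feels too cute, here is the same idea as a toy Python sketch. It is not how the Kubernetes control plane is implemented; the Kitchen class and reconcile_once function are names I made up for illustration. It only shows the shape of the loop: compare desired to actual, then correct the gap.

```python
from dataclasses import dataclass, field


@dataclass
class Kitchen:
    """Toy stand-in for a cluster: desired vs. actual counts per workload."""
    desired: dict[str, int] = field(default_factory=dict)
    actual: dict[str, int] = field(default_factory=dict)


def reconcile_once(kitchen: Kitchen) -> list[str]:
    """Run one pass of the loop: compare desired to actual and correct any gap."""
    actions = []
    for name, want in kitchen.desired.items():
        have = kitchen.actual.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} x {name}")
        elif have > want:
            actions.append(f"stop {have - want} x {name}")
        kitchen.actual[name] = want
    # Anything running that nobody ordered gets removed.
    for name in list(kitchen.actual):
        if name not in kitchen.desired:
            actions.append(f"stop {kitchen.actual.pop(name)} x {name}")
    return actions


kitchen = Kitchen(desired={"steak": 3}, actual={"steak": 3})
kitchen.actual["steak"] -= 1      # a line cook drops a steak
print(reconcile_once(kitchen))    # ['start 1 x steak']
print(reconcile_once(kitchen))    # [] (state already matches the ticket)
```

The important property is that the second pass does nothing. A reconciliation loop is safe to run forever because it only acts on a difference.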

Now, think of Infrastructure-as-Code (IaC) as the restaurant's architect. The architect comes in once a month with blueprints, builds the kitchen, and leaves.

When you try to manage Kubernetes resources with traditional IaC, the architect and the Head Chef get into a fight. The IaC tool applies a state, but Kubernetes immediately starts changing things (scaling pods, updating load balancers). The next time the IaC tool runs, it sees that the kitchen doesn't match the blueprints, assumes something is wrong, and tries to tear down the Head Chef's work.
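To make the fight concrete, here is a minimal sketch of a point-in-time diff, assuming a flat dictionary for both the blueprint and the live object. The plan_changes function and the controller_owned parameter are hypothetical names, not any tool's API. Terraform's ignore_changes lifecycle argument serves a similar purpose, but this sketch does not model it.

```python
def plan_changes(blueprint: dict, live: dict,
                 controller_owned: frozenset = frozenset()) -> dict:
    """Return {field: (live_value, blueprint_value)} for fields an IaC run would overwrite.

    Fields listed in controller_owned are skipped, on the assumption that an
    in-cluster controller (an autoscaler, say) is allowed to change them.
    """
    changes = {}
    for key, wanted in blueprint.items():
        if key in controller_owned:
            continue
        if live.get(key) != wanted:
            changes[key] = (live.get(key), wanted)
    return changes


blueprint = {"image": "payments:1.4.2", "replicas": 3}
live = {"image": "payments:1.4.2", "replicas": 10}  # autoscaler scaled up under load

# Naive run: the architect "fixes" the replica count back to 3.
print(plan_changes(blueprint, live))
# {'replicas': (10, 3)}

# Ownership-aware run: replicas belong to the autoscaler, so nothing changes.
print(plan_changes(blueprint, live, controller_owned=frozenset({"replicas"})))
# {}
```

The design choice worth noticing is that the fix is not a smarter diff. It is an explicit statement of who owns which field.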

What formae is attempting to do by adding native Kubernetes and Helm support is bridge this gap. Instead of treating Kubernetes as a black box that it occasionally throws YAML at, it is trying to understand the cluster's continuous state. It reads your declared configuration, renders the Helm charts natively, and continuously discovers changes made by external tools.

But remember: a tool doesn't fix a broken process. If your developers are manually editing deployments via kubectl edit in production, no IaC platform will save you from the resulting chaos.

Under the Hood: The Telemetry Pipeline

Once your infrastructure is running, you have to monitor it. The recent push toward AIOps platforms, like the one presented by Selector AI, addresses a very real pain point: telemetry fragmentation.

Imagine a massive shipping harbor. You have thousands of containers arriving every hour. Some containers hold metrics (CPU usage, memory), some hold logs (application errors), and some hold traces (the path a user took through your microservices).

If you just dump all these containers into a giant pile on the dock, you have data, but you don't have observability. When a ship sinks, you can't find the manifest to figure out what went wrong.

Before you can apply any advanced analytics or "AIOps" magic, you need a robust, unified data pipeline. You need a system that ingests raw telemetry, normalizes it, and correlates it before trying to draw conclusions.
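Here is roughly what "normalize, then correlate" means in practice, as a small sketch. The field aliases, the 60-second window, and the sample records are all assumptions for illustration; a real pipeline would correlate on richer keys such as trace IDs and topology.

```python
from collections import defaultdict

# Hypothetical mapping from each source's field names to one shared schema.
FIELD_ALIASES = {"svc": "service", "service_name": "service", "app": "service",
                 "ts": "timestamp", "time": "timestamp"}


def normalize(record: dict, kind: str) -> dict:
    """Rename source-specific keys to the shared schema and tag the signal type."""
    out = {FIELD_ALIASES.get(k, k): v for k, v in record.items()}
    out["kind"] = kind
    out["service"] = str(out.get("service", "unknown")).lower()
    return out


def correlate(records: list[dict], window_s: int = 60) -> dict:
    """Group normalized records by (service, time bucket) so signals line up."""
    buckets = defaultdict(list)
    for r in records:
        bucket = int(r["timestamp"]) // window_s * window_s
        buckets[(r["service"], bucket)].append(r)
    return dict(buckets)


raw = [
    ({"svc": "Payments", "ts": 1000, "cpu": 0.97}, "metric"),
    ({"service_name": "payments", "time": 1012, "msg": "upstream timeout"}, "log"),
    ({"app": "checkout", "ts": 1015, "duration_ms": 40}, "trace"),
]
grouped = correlate([normalize(rec, kind) for rec, kind in raw])
for key, items in grouped.items():
    print(key, [i["kind"] for i in items])
# ('payments', 960) ['metric', 'log']
# ('checkout', 960) ['trace']
```

Without the normalize step, "Payments" and "payments" land in different buckets, and the CPU spike never meets the timeout log that explains it.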

Teams running modern distributed systems are moving toward platforms like Selector AI not because of the buzzwords, but because these platforms prioritize the data-centric foundation. They ingest the metrics, logs, and topology into a single analytics layer. If you feed garbage, uncorrelated data into an analytics engine, you will just get highly confident garbage out.

Furthermore, as Kin Lane rightly points out, this level of observability isn't free. If you log every single debug event in a 500-pod microservice cluster, your observability bill will quickly eclipse your compute bill. You must be pragmatic about what you measure.
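A back-of-the-envelope calculation makes the point. Every number below is an assumption I picked for illustration (event rates, event size, a flat per-GB ingest price), not a measurement or a vendor quote. Plug in your own.

```python
def daily_log_volume_gb(pods: int, events_per_pod_per_s: float, bytes_per_event: int) -> float:
    """Raw log volume per day in GB (10^9 bytes), before compression or indexing overhead."""
    seconds_per_day = 86_400
    return pods * events_per_pod_per_s * bytes_per_event * seconds_per_day / 1e9


def monthly_ingest_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Ingest cost for a month at a flat per-GB price (an assumed pricing model)."""
    return gb_per_day * price_per_gb * days


# Illustrative inputs only: 500 pods, 300-byte events, $0.50 per GB ingested.
debug_on = daily_log_volume_gb(pods=500, events_per_pod_per_s=200, bytes_per_event=300)
debug_off = daily_log_volume_gb(pods=500, events_per_pod_per_s=10, bytes_per_event=300)

print(f"debug on:  {debug_on:.1f} GB/day, ${monthly_ingest_cost(debug_on, 0.50):,.0f}/month")
print(f"debug off: {debug_off:.1f} GB/day, ${monthly_ingest_cost(debug_off, 0.50):,.0f}/month")
# debug on:  2592.0 GB/day, $38,880/month
# debug off: 129.6 GB/day, $1,944/month
```

Under these assumptions the debug firehose costs twenty times more than the quiet configuration, and that is before retention and indexing charges.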

The Pragmatic Solution: Fundamentals Over Flash

So, how do we manage cloud-native infrastructure without losing our minds? We stick to the fundamentals. The best code is code you don't write, and the best infrastructure is infrastructure you don't have to manage.

Here is how different approaches stack up:

Approach	State Management	Observability Strategy	Best For
Traditional IaC	Point-in-time apply. High risk of drift.	Siloed tools (separate metrics and logs).	Static infrastructure (VMs, managed databases).
K8s Native (GitOps)	Continuous reconciliation via controllers.	In-cluster collectors (Prometheus scraping, Fluentd log agents).	Mature teams fully committed to the K8s ecosystem.
Unified Platforms	Integrated discovery (like formae).	Multi-domain ingestion (like Selector AI).	Hybrid environments managing legacy and cloud-native.

If you are running a simple web application, you do not need a multi-region Kubernetes cluster managed by a unified IaC platform with an AIOps observability layer. You need a managed PaaS and a good night's sleep.

But if you are operating at scale, you need to enforce strict boundaries.

1. Stop fighting Kubernetes. If you use IaC, use it to provision the cluster and the base controllers. Let Kubernetes-native tools (like Argo CD or Flux) handle the application state inside the cluster.
2. Clean up your telemetry. Before buying an expensive observability platform, audit your logs. Drop debug logs in production. Ensure your metrics have consistent labeling.
3. Understand the cost. Every abstraction has a compute cost, a network cost, and a cognitive cost for your team.
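The telemetry cleanup in step 2 is less glamorous than it sounds. A minimal sketch, assuming events arrive as dictionaries with a level and a labels map: the alias table and the canonical label names (service, env) are hypothetical, so substitute whatever your team standardizes on.

```python
# Hypothetical label aliases; the canonical names are a team decision.
LABEL_ALIASES = {"svc": "service", "app": "service", "environment": "env", "stage": "env"}
DROPPED_LEVELS = {"debug", "trace"}


def canonical_labels(labels: dict) -> dict:
    """Map alias label names to canonical ones and lowercase the values."""
    return {LABEL_ALIASES.get(k.lower(), k.lower()): str(v).lower()
            for k, v in labels.items()}


def clean(events: list[dict], env: str = "prod") -> list[dict]:
    """Drop debug-level events in production and normalize labels on everything kept."""
    kept = []
    for event in events:
        labels = canonical_labels(event.get("labels", {}))
        if labels.get("env") == env and event.get("level", "").lower() in DROPPED_LEVELS:
            continue
        kept.append({**event, "labels": labels})
    return kept


events = [
    {"level": "DEBUG", "msg": "cache miss", "labels": {"App": "payments", "stage": "Prod"}},
    {"level": "ERROR", "msg": "timeout", "labels": {"svc": "payments", "environment": "prod"}},
]
print(clean(events))
# [{'level': 'ERROR', 'msg': 'timeout', 'labels': {'service': 'payments', 'env': 'prod'}}]
```

Note the order of operations: labels are normalized before the filter runs. Otherwise "stage: Prod" would never match "env: prod" and the debug event would slip through.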

What You Should Do Next

1. Audit Your State Drift: Run a plan or dry-run with your current IaC tool today (terraform plan, for example). If you see dozens of unexpected changes pending, something is changing production outside your IaC, and most of the time that something is your team making manual edits. Fix the culture before you change the tool.
2. Review Your Helm Charts: Are you passing hundreds of variables through .tfvars just to render a simple deployment? Simplify your charts. Put sensible defaults in the chart's values.yaml and override only what genuinely differs between environments.
3. Calculate Your Observability ROI: Look at your top 10 most expensive metrics or log streams. Ask your team if anyone has queried that data in the last 30 days. If the answer is no, drop the telemetry at the source.
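The ROI check in step 3 is easy to script once you can export cost and last-query time per stream. How you get that export depends on your vendor, so the Stream record and the sample data here are assumptions; only the filtering logic is the point.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Stream:
    name: str
    monthly_cost: float
    last_queried: Optional[date]  # None means no recorded query at all


def drop_candidates(streams: list[Stream], today: date,
                    top_n: int = 10, max_idle_days: int = 30) -> list[Stream]:
    """Among the top_n most expensive streams, return those unqueried for max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    priciest = sorted(streams, key=lambda s: s.monthly_cost, reverse=True)[:top_n]
    return [s for s in priciest if s.last_queried is None or s.last_queried < cutoff]


# Made-up inventory for illustration.
streams = [
    Stream("payments-debug-logs", 4200.0, None),
    Stream("http-request-metrics", 1800.0, date(2026, 5, 26)),
    Stream("legacy-batch-traces", 950.0, date(2026, 3, 1)),
]
stale = drop_candidates(streams, today=date(2026, 5, 27))
print([s.name for s in stale])
# ['payments-debug-logs', 'legacy-batch-traces']
```

Treat the output as a list of questions for the owning team, not an automatic delete list. Some data is queried once a year during an audit and is still worth keeping.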

There is no perfect system. There are only recoverable systems.

FAQ

Why shouldn't I use traditional IaC tools to manage Kubernetes deployments?

Traditional IaC tools are designed for point-in-time state application. Kubernetes is a continuous reconciliation engine. When you use static IaC to manage dynamic Kubernetes resources, the two end up fighting over the true state of the system: every IaC run overwrites changes the Kubernetes controllers made since the last run, and the controllers promptly change things again. The result is perpetual configuration drift and deployment failures.

What is a data-centric foundation in observability?

A data-centric foundation means prioritizing the collection, normalization, and correlation of raw telemetry (metrics, logs, traces) into a single source of truth before applying analytics. Without this foundation, advanced observability tools will simply generate false positives based on fragmented data.

How does formae differ from standard Terraform when managing Helm?

Standard tools often treat Helm as a black box, simply executing a helm upgrade command and hoping for the best. Platforms like formae aim to render and understand the Helm charts natively, tracking the resulting Kubernetes manifests as part of the overall infrastructure state, which provides better visibility into what is actually running.

Why is my observability bill so high?

Observability costs usually skyrocket due to "cardinality explosions" in metrics (labels with too many unique values, such as user or request IDs, where every new combination creates another time series) or logging raw, unstructured data at the debug level in production. You are paying to store and index data that no human or system will ever realistically query during an incident.
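The cardinality arithmetic is worth doing once by hand. In the worst case, the number of time series for one metric is the product of the unique values of each label. The label names and counts below are assumed for the example.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of unique values per label."""
    return prod(label_value_counts.values())


# Illustrative counts, not taken from a real system.
base = {"service": 40, "endpoint": 25, "status_code": 8}
with_user = {**base, "user_id": 50_000}

print(max_series(base))       # 8000
print(max_series(with_user))  # 400000000
```

One innocent-looking user_id label takes the ceiling from eight thousand series to four hundred million. Real systems rarely hit the full product, but the bill tracks the trend.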