Kubernetes Infrastructure Reality: Metrics, Security, AI

The Reality Check
It is 3:14 AM. Your phone vibrates on the nightstand. It's PagerDuty.
A node in your production cluster just went NotReady. You groggily open your laptop, connect to the VPN, and run kubectl get events. You discover that a massive new application workload just starved the node's system processes of memory. The cluster went blind, the workloads crashed, and you are left cleaning up the mess while the rest of the world sleeps.
Listen, I've been there more times than I care to admit. As engineers, we are constantly bombarded with the pressure to adopt complex architectures. Today, the industry is obsessed with running massive large-model AI workloads on Kubernetes, building intricate CI/CD pipelines, and deploying service meshes that require a PhD to operate. We add layer upon layer of abstraction, hoping the technology will magically manage itself.
It won't.
The reality is that complexity is the enemy of reliability. The best code is code you don't write, and the best infrastructure is the simplest one that solves the business problem. Technology is just a tool. When we forget the fundamentals—how to monitor basic system health, how to build secure and minimal containers, and how to safely schedule workloads—our systems collapse under their own weight.
Today, we are looking at three major conversations happening in the cloud-native ecosystem: Kubernetes metrics, container security, and scaling heavy compute infrastructure. We are going to strip away the marketing fluff and look at how these systems actually work under the hood.
The Core Problem: Ignoring the Plumbing
We treat Kubernetes infrastructure like a magical black box. We throw massive workloads at it, ignore container security until compliance forces our hand, and collect gigabytes of metrics without understanding what they mean.
The real bottleneck in our infrastructure isn't a lack of features in Kubernetes. The bottleneck is our lack of respect for the underlying plumbing.
Think of a Kubernetes cluster like a city's water system. You can build the most beautiful, modern high-rise building in the world (your application), but if the underground pipes (CPU, memory, network) are undersized, leaking, or unmonitored, the building is useless.
Let's break down the three pillars of a stable cluster based on today's industry movements.
Under the Hood: Kubernetes Metrics
The Cloud Native Computing Foundation (CNCF) recently highlighted the importance of Kubernetes metrics, specifically focusing on Node CPU and Node memory usage.
Before you rely on auto-scalers or fancy dashboards, you need to understand how Kubernetes actually knows what is happening on a node.
It isn't magic. It's just basic process monitoring.
Every worker node runs an agent called the kubelet. Inside the kubelet is a tool called cAdvisor (Container Advisor). cAdvisor constantly reads the Linux cgroups (control groups) of the running containers to see exactly how much CPU and memory they are consuming. The kubelet exposes this data, and a central component called the Metrics Server scrapes it so the Kubernetes API can use it to make scheduling decisions.
If you don't define resource limits on your pods, a single memory leak in an application will consume the node's memory until the Linux kernel steps in and violently kills processes (OOMKill) to save itself. Often, it kills your application. Sometimes, it kills the kubelet, causing the node to drop off the cluster entirely.
Understanding the difference between "Working Set Memory" (memory actively in use that cannot be swapped or reclaimed) and total memory is crucial. If you are flying blind without these metrics, you are just waiting for a crash.
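The OOMKill scenario above is exactly what a resources stanza prevents. Here is a minimal sketch of a pod spec with requests and limits; the names and sizes are hypothetical and purely illustrative:

```yaml
# Hypothetical pod spec. Names and sizing are illustrative, not a recommendation.
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  containers:
  - name: app
    image: registry.example.com/payments-api:1.4.2
    resources:
      requests:
        cpu: "250m"      # what the scheduler reserves on the node
        memory: "256Mi"
      limits:
        cpu: "500m"      # the container is throttled above this
        memory: "512Mi"  # the container is OOMKilled above this, not the node
```

The key point: with a memory limit, the kernel kills only the offending container when it misbehaves, instead of letting it drag the kubelet down with it.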
Under the Hood: Container Security the Hard Way
The New Stack recently covered how Chainguard believes DevOps teams are solving container security the hard way. They are right, but we need to understand why.
Most developers start building a Docker image by pulling ubuntu:latest or node:alpine. These images contain package managers (apt, apk), shells (bash, sh), and network utilities (curl, wget).
Think of a container like a shipping container at a harbor logistics terminal. The goal of a shipping container is to transport cargo from point A to point B safely. You pack the cargo inside. You do not pack the loading crane, the forklift, and a spare truck inside the shipping container.
Yet, by including shells and package managers in our production images, we are packing the forklift. If an attacker finds a vulnerability in your application, they can use the included bash shell and curl utility to download malware and compromise your system.
The pragmatic approach is to use minimal or "distroless" images. These images contain nothing but your compiled application and the absolute bare minimum system libraries required to run it. No shell. No package manager.
Here is a breakdown of the trade-offs:
| Feature | Standard Base Image (e.g., Ubuntu) | Minimal / Distroless Image |
|---|---|---|
| Image Size | 100MB - 500MB+ | 10MB - 50MB |
| Included Tools | bash, curl, apt, coreutils | None |
| CVE Count (Average) | Dozens to hundreds | Near zero |
| Debugging | Easy (kubectl exec -it pod -- bash) | Harder (requires ephemeral debug containers) |
| Security Posture | Poor out-of-the-box | Excellent |
Yes, debugging a distroless container is slightly harder because you can't just exec into it and run curl. But that friction is a feature, not a bug. It forces you to rely on proper logging and metrics (which we just discussed) rather than cowboy-patching production servers.
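When you genuinely need a shell against a distroless pod, ephemeral debug containers give you one without baking tools into the image. A sketch, with hypothetical pod and container names:

```shell
# Attach a temporary busybox container to a running pod.
# "payments-api-7d4f8" and "app" are hypothetical names.
# --target shares the app container's process namespace so you can inspect it.
kubectl debug -it payments-api-7d4f8 --image=busybox --target=app
```

The debug container disappears when you're done; nothing extra ships in the production image.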
Under the Hood: Heavy Workloads at Scale
Finally, we see teams trying to run massive AI and compute-heavy workloads natively on Kubernetes. The assumption is that because Kubernetes can orchestrate web servers, it can effortlessly orchestrate GPU-bound batch jobs.
Think of a restaurant kitchen. Your standard web APIs are the prep cooks—they chop vegetables and prep ingredients quickly. They need standard counter space (CPU) and mixing bowls (Memory).
Heavy compute workloads, especially those requiring GPUs, are the specialty ovens. They take a long time to heat up, they consume massive amounts of power, and they are incredibly expensive. If you let the prep cooks pile their ingredients on top of the specialty ovens, the whole kitchen grinds to a halt.
In Kubernetes, if you do not explicitly isolate your heavy workloads, the scheduler will treat a GPU-enabled node just like any other node. It will schedule your basic internal DNS pods or logging daemonsets onto the expensive GPU nodes, wasting resources and potentially causing resource contention.
Before you write the YAML, understand the 'Why'. We need a way to tell Kubernetes: "Keep regular pods off this machine, and only allow specific, heavy workloads to run here."
We achieve this using Taints (applied to the node to repel pods) and Tolerations (applied to the pod to allow it to bypass the taint).
```yaml
# We apply this toleration to our heavy workload pod.
# WHY: The node has a taint "workload=heavy:NoSchedule".
# Without this toleration, the Kubernetes scheduler will refuse
# to place this pod on that node.
tolerations:
- key: "workload"
  operator: "Equal"
  value: "heavy"
  effect: "NoSchedule"
```
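The matching taint goes on the node itself. A sketch, with a hypothetical node name:

```shell
# Repel every pod that does not tolerate workload=heavy.
# "gpu-node-1" is a hypothetical node name.
kubectl taint nodes gpu-node-1 workload=heavy:NoSchedule
```

One caveat: a toleration only permits scheduling on the tainted node; it does not attract the pod there. To guarantee heavy workloads actually land on the GPU nodes, pair the toleration with a nodeSelector or node affinity.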
By combining taints, tolerations, and strict resource requests/limits, we ensure that our heavy workloads get the dedicated hardware they need without starving the rest of the cluster.
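For GPU workloads specifically, the request is expressed as an extended resource. A sketch, assuming the NVIDIA device plugin is installed on the node:

```yaml
# Fragment of a heavy-workload container spec.
resources:
  limits:
    nvidia.com/gpu: 1  # extended resource exposed by the NVIDIA device plugin;
                       # GPUs must be set in limits, and they are never shared
```

The scheduler will only place this pod on a node that actually advertises a free GPU, which is the other half of the isolation story.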
The Pragmatic Solution
So, how do we bring this all together without over-engineering our infrastructure? We focus on the fundamentals.
1. Stop Flying Blind: Ensure the metrics-server is deployed and healthy. You cannot manage what you cannot measure. Set up basic Prometheus alerts for Node Memory Pressure and CPU saturation.
2. Enforce Resource Limits: Never deploy a pod to production without defining CPU and memory requests and limits. This is the only way the kubelet can protect the node from a runaway application.
3. Shrink Your Attack Surface: Transition your build pipelines to use minimal or distroless base images. Stop shipping package managers to production. If developers push back because they "need to debug," implement Kubernetes Ephemeral Containers so they can attach a debug shell only when necessary, without leaving it in the permanent image.
4. Isolate Expensive Compute: Use taints and tolerations to keep standard workloads off your expensive, heavy-compute nodes.
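Point 1 above can be made concrete with a single alerting rule. A sketch, assuming node-exporter metrics are being scraped; the threshold is illustrative:

```yaml
# Prometheus alerting rule for node memory pressure.
groups:
- name: node-health
  rules:
  - alert: NodeMemoryPressure
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Node {{ $labels.instance }} has less than 10% memory available"
```

One rule like this would have paged you before the 3:14 AM NotReady, not after.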
You don't need a massive, expensive vendor tool to do these things. You just need discipline and a solid understanding of the underlying primitives.
What You Should Do Next
If you want to sleep through the night without pager alerts, take these concrete steps tomorrow morning:
- Run kubectl top nodes and kubectl top pods. If you get an error, your metrics pipeline is broken. Fix it immediately.
- Audit your container images. Run a tool like Trivy against your production images. If you see hundreds of vulnerabilities, it is time to evaluate distroless base images.
- Review your workload scheduling. Ensure your expensive nodes (like GPU instances) are tainted so standard workloads aren't wasting premium compute cycles.
There is no perfect system. There are only recoverable systems.