☁️ Cloud & DevOps

Managing Kubernetes AI Workloads Without 3 AM Pages

Marcus Cole
Cloud & DevOps Lead

Platform engineer who's been through every infrastructure era — bare metal, VMs, containers, serverless. Has strong opinions about YAML files and even stronger opinions about over-engineering.

Dynamic Resource Allocation · infrastructure drift · GPU scheduling · continuous compliance · cloud native AI

The Reality Check

Let's be honest about what happens when the business decides it's time to "do AI."

They hand you a massive, 20-gigabyte container image. It requires four specific GPUs, specialized drivers, and a highly specific network topology. They expect you to deploy it onto your existing infrastructure and assume it will "just work" like any other stateless web service.

But Kubernetes AI workloads are not stateless web services. When a standard microservice fails, Kubernetes quietly restarts it on another node in milliseconds. When a massive inference model fails because of resource contention, it takes five minutes just to pull the image, another three minutes to load the model weights into memory, and in the meantime, your users are seeing 502 Bad Gateway errors. And who gets the alert at 3 AM when the cluster runs out of GPU memory because of infrastructure drift? You do.

We've spent years building abstractions to hide the underlying hardware from developers. But when it comes to heavy compute, pretending the hardware doesn't exist is exactly what wakes you up in the middle of the night.

The Core Problem

The real bottleneck in running these workloads isn't the models themselves; it's infrastructure drift and resource contention.

For years, we managed GPUs in Kubernetes using the Device Plugin framework. It was a blunt instrument. You asked for nvidia.com/gpu: 1, and the scheduler found a node with an available slot. But GPUs aren't just generic compute units. They have specific memory capacities, they share PCIe buses, and they communicate over specialized links like NVLink.
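For contrast, this is roughly what the old device-plugin request looks like (the pod name here is illustrative). It is just an opaque counter; there is no way to express memory, bus, or interconnect requirements:

```yaml
# Classic device-plugin request: the scheduler only counts slots.
# It cannot express "2 GPUs on the same NVLink" or "at least 80Gi of VRAM".
apiVersion: v1
kind: Pod
metadata:
  name: legacy-gpu-pod
spec:
  containers:
  - name: worker
    image: our-registry/inference-server:v2.1
    resources:
      limits:
        nvidia.com/gpu: 1   # an anonymous slot on whatever node has one free
```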

When someone manually tweaks a node's configuration to fix a temporary issue, or when an unauthorized workload sneaks into a namespace and requests a GPU, your cluster drifts from its desired state. Suddenly, your critical inference pod is scheduled on a node where the GPU is technically "free" but doesn't have the memory bandwidth required. The pod crash-loops. The pager goes off.

Under the Hood

Think of a Kubernetes cluster like a busy commercial harbor. Standard web pods are like standard shipping containers. The harbor master (the Kubernetes scheduler) can stack them anywhere. They fit on any truck, on any ship.

GPUs are not shipping containers. They are specialized, heavy-duty dry docks.

The old Device Plugin method was like telling a blindfolded valet, "Park this massive cargo ship somewhere." The valet just counts the available docks and points. If the dock doesn't have the right cranes (NVLink) or deep enough water (VRAM), the ship gets stuck.

Dynamic Resource Allocation (DRA), which reached General Availability in Kubernetes 1.34, changes this. Instead of a blind valet, DRA acts as a dedicated harbor logistics coordinator. It separates the request for a resource from the pod itself, similar to how PersistentVolumeClaims separate storage requests from compute.

Before we look at the YAML, let's visualize how this plumbing actually connects.

[Diagram: an AI Pod needing 2 GPUs submits a ResourceClaim (the request); the DRA driver (the harbor master) matches the claim against topology-aware node infrastructure and reserves GPU 0 and GPU 1, connected by NVLink.]

The Pragmatic Solution

We are going to implement a stable, fundamentals-focused approach to scheduling an AI inference workload using DRA, and we will enforce continuous compliance so that infrastructure drift doesn't break it later.

Prerequisites

Before you start applying YAML, verify your environment. You cannot fake hardware.

  • A Kubernetes cluster running version 1.34 or higher.

  • Nodes with physical GPUs installed.

  • A DRA-compatible device driver installed on the cluster (e.g., the official vendor DRA driver).

  • kubectl configured and authenticated.


Step 1: Define the DeviceClass (The Blueprint)

Before anyone can request a GPU, we need to define what kind of hardware is available. We don't want developers guessing. We define a DeviceClass (in the GA resource.k8s.io/v1 API, this kind replaced the alpha-era ResourceClass). Think of this as the blueprint that tells the cluster, "We have high-memory GPUs available, and here is how to identify the devices that qualify."

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: high-vram-gpu
spec:
  # Select devices published by the DRA driver installed on your nodes
  selectors:
  - cel:
      expression: device.driver == "gpu.vendor.com"

Why do we do this? By abstracting the hardware into classes, if we swap out the physical hardware next year, we just update the class. The developer's request doesn't need to change. We isolate the infrastructure complexity from the application layer.
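Selectors can go further than matching the driver name. As a sketch (the class name `ultra-vram-gpu` and the capacity key are hypothetical; which attributes and capacities exist depends entirely on what your vendor's DRA driver publishes), a class can filter on advertised device capacity with a CEL expression:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: ultra-vram-gpu
spec:
  selectors:
  - cel:
      # Match only devices from our driver that advertise at least 80Gi of
      # memory. The capacity key is driver-specific; inspect your driver's
      # published ResourceSlices to see what is actually available.
      expression: >-
        device.driver == "gpu.vendor.com" &&
        device.capacity["gpu.vendor.com"].memory.compareTo(quantity("80Gi")) >= 0
```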

Step 2: Create the ResourceClaim (The Request)

Now, instead of adding a resources.limits block to our Pod and hoping for the best, we create a dedicated ResourceClaim. This is an explicit request for hardware.

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: inference-gpu-claim
  namespace: ai-production
spec:
  devices:
    requests:
    # We are explicitly asking for 2 devices from the same class
    - name: dual-gpu
      exactly:
        deviceClassName: high-vram-gpu
        allocationMode: ExactCount
        count: 2

Why do we do this? A ResourceClaim has a lifecycle independent of the Pod. If the Pod crashes and restarts, the claim holds the reservation on the hardware. No other workload can steal those GPUs while your massive container is rebooting. This eliminates the race conditions that cause 3 AM pages.
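A standalone ResourceClaim works for a singleton Pod, but replicas of a Deployment each need their own devices. For that case the API provides ResourceClaimTemplate, which stamps out a fresh claim per Pod. A sketch (the template name is illustrative):

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: inference-gpu-template
  namespace: ai-production
spec:
  # The inner spec is a full ResourceClaim spec, instantiated once per Pod
  spec:
    devices:
      requests:
      - name: dual-gpu
        exactly:
          deviceClassName: high-vram-gpu
          count: 2
```

A pod template then references it with resourceClaimTemplateName instead of resourceClaimName, and each replica gets its own independent reservation.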

Step 3: Bind the Claim to the Pod (The Execution)

Now we write the Pod specification. We don't put a GPU count in the container's resource limits; instead, the Pod references the claim we just created, and the container opts into it by name.

apiVersion: v1
kind: Pod
metadata:
  name: inference-model-server
  namespace: ai-production
spec:
  containers:
  - name: model-server
    image: our-registry/inference-server:v2.1
    resources:
      requests:
        cpu: "4"
        memory: "16Gi"
      # The container must reference the pod-level claim by name
      claims:
      - name: gpu-access
  # Here is where the magic happens: the Pod binds to the ResourceClaim
  resourceClaims:
  - name: gpu-access
    resourceClaimName: inference-gpu-claim

Why do we do this? Separation of concerns. The container only cares about CPU and RAM. The Pod-level configuration handles the heavy machinery. If the claim cannot be satisfied (e.g., no node has 2 available GPUs), the Pod simply stays Pending: it never lands on a node, so it never wastes time pulling a 20GB image only to crash.

Step 4: Enforce Continuous Compliance (Security as Code)

We've solved the scheduling problem, but we haven't solved the human problem. GPUs cost money. If a junior developer deploys a test pod into the default namespace and claims our high-vram-gpu class, they will block production workloads.

Compliance isn't a spreadsheet you fill out once a quarter. It must be enforced continuously. We will use a simple Kyverno policy to ensure that only authorized namespaces can create claims on this hardware.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-gpu-claims
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-namespace-for-gpus
    match:
      any:
      - resources:
          kinds:
          - ResourceClaim
    validate:
      message: "GPU claims are only allowed in the ai-production namespace."
      pattern:
        metadata:
          namespace: "ai-production"

Why do we do this? The best code is code you don't write, and the best outage is the one that never happens. By treating security and compliance as code, we prevent infrastructure drift at the API server level. The request is rejected before it ever reaches the scheduler.
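Admission policy controls who can create claims; a standard object-count ResourceQuota can additionally cap how many, even inside the authorized namespace. A sketch (the limit of 8 is arbitrary; size it to your fleet):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-claim-quota
  namespace: ai-production
spec:
  hard:
    # Object-count quota: at most 8 ResourceClaim objects in this namespace
    count/resourceclaims.resource.k8s.io: "8"
```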

Verification

To confirm your setup is working, you don't just look at the Pod. You look at the claim.

Run this command:
kubectl describe resourceclaim inference-gpu-claim -n ai-production

You should see an Allocation section that explicitly lists the node and the exact hardware device IDs reserved for your workload. Next, check the pod:
kubectl get pod inference-model-server -n ai-production

If it's running, the hardware was successfully attached.

Troubleshooting

The Claim is stuck in Pending:
If your ResourceClaim is pending, the DRA driver cannot find hardware that matches your request. Check that the device class name in your claim matches an existing class, and ensure the DRA driver DaemonSet is actually running on your GPU nodes.

The Pod is stuck in Pending, but the Claim is Allocated:
This means the scheduler found the GPUs, reserved them, but something else is preventing the Pod from scheduling on that specific node. Check for node taints, affinity rules, or insufficient standard CPU/memory resources on that node.

Policy Rejection:
If you get an error like admission webhook "validate.kyverno.svc" denied the request, your continuous compliance policy is working. You are trying to deploy a GPU workload in an unauthorized namespace.

What You Built

You replaced a fragile, implicit hardware request system with a declarative, topology-aware resource pipeline. You separated the hardware blueprint (the device class) from the reservation (the ResourceClaim) and the execution (the Pod). Finally, you wrapped it in a continuous compliance policy to prevent infrastructure drift.

This isn't flashy. It requires writing a bit more YAML than the old device plugin method. But it is predictable, it is stable, and it respects the reality of physical hardware constraints.

There is no perfect system. There are only recoverable systems.


FAQ

Why can't I just use the old device plugin method for my AI models? You can, but device plugins lack topology awareness. They cannot guarantee that the GPUs assigned to your pod share a high-speed NVLink connection or sit on the same PCIe switch. For large language models or distributed training, this lack of topology awareness leads to massive latency spikes and unpredictable performance.
Does Dynamic Resource Allocation (DRA) work with cloud provider managed Kubernetes like EKS or GKE? Yes, as of Kubernetes 1.34, DRA is GA and supported by major cloud providers. However, you must ensure that your cloud provider's specific DRA drivers are installed on your node groups, as standard device plugins will not interface with ResourceClaims.
How does this prevent infrastructure drift? By using ResourceClaims combined with admission controllers (like Kyverno or OPA Gatekeeper), you define exactly who, what, and where hardware can be consumed. If a node's configuration drifts manually, the DRA driver will report the actual state to the API, and the scheduler will refuse to allocate claims to a degraded node, preventing workloads from failing silently.

