The 5% problem
Cast AI sampled roughly 23,000 production Kubernetes clusters in early 2026. Average GPU utilization: 5%. Ninety-five of every hundred GPU-hours billed sit idle. A p5.48xlarge costs $98.32 an hour on-demand. A p4d.24xlarge runs $32.77. An idle H100 burns about $2,200 a month, and AWS raised H200 Capacity Block prices 15% in January, the first upward move on GPU pricing in two decades.
The waste isn't a Kubernetes bug. It's a default. The vanilla kube-scheduler treats a GPU the way it treats a CPU socket from 2015: one device, one pod, one number. That stopped working the day an inference workload arrived needing 2 GB of VRAM out of a 16 GB card.
Why kube-scheduler can't see your GPU properly
The native resource type nvidia.com/gpu: 1 is opaque. The scheduler doesn't know if your model wants NVLink to its sibling, doesn't know if a MIG slice would do, doesn't know whether the card has 40 GB or 80 GB. So your 80 GB A100 gets handed to a Llama-3 8B inference pod that uses 12 GB and 11% of the SMs. The other 68 GB sits there.
Worse: there's no fairness. The first job in eats the cluster. Other teams queue forever with no ETA, no quota visibility, no preemption. A research team can reserve 8 A100s for a two-hour run and sit on the nodes for twelve. At the p4d.24xlarge rate of $32.77 an hour, those ten idle hours are roughly $330 of cloud GPU per node, gone.
DRA changed the floor in August 2025
Dynamic Resource Allocation (DRA) graduated to GA in Kubernetes 1.34, released in late August 2025, and v1.35 locked the feature gate on. You can't turn it off anymore. NVIDIA donated its k8s-dra-driver to the CNCF at KubeCon NA 2025, and Google followed with the matching TPU driver. DRA reached GA in OpenShift 4.21 and is generally available on GKE.
Practically, this means GPU requests no longer have to be integer counts. A pod references a ResourceClaim describing what it actually needs (20 GB of VRAM, MIG slice 2g.20gb, an NVLink-attached neighbor) and the scheduler binds a real device. Prioritized lists let you say "one H100, or two A100s, or four L4s" and the scheduler tries each in order. In published case studies, that alone moves utilization on mixed-shape workloads from a 20–30% baseline toward 70–80%.
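The "tries each in order" behavior of a prioritized list can be pictured with a toy model. This is only a sketch of the selection rule, not the scheduler's implementation: the first_available function and the inventory numbers are hypothetical, and a real allocation also has to account for nodes, selectors, and topology.

```python
from typing import Optional

# Each alternative is (device_type, count), listed in order of preference.
Alternative = tuple[str, int]


def first_available(
    alternatives: list[Alternative], free: dict[str, int]
) -> Optional[Alternative]:
    """Return the first alternative the free inventory can satisfy, else None."""
    for device_type, count in alternatives:
        if free.get(device_type, 0) >= count:
            return (device_type, count)
    return None


# "One H100, or two A100s, or four L4s", in that order.
request = [("H100", 1), ("A100", 2), ("L4", 4)]

# Hypothetical inventory: no H100 is free, so the request falls through to the A100s.
print(first_available(request, {"H100": 0, "A100": 3, "L4": 8}))  # ('A100', 2)
```

The point of the ordering is that a job degrades to cheaper or more plentiful hardware instead of waiting in the queue for its first choice.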
Picking a scheduler in 2026
Three names are worth knowing.
NVIDIA's KAI Scheduler hit v0.10.0 with topology-aware gang scheduling, hierarchical PodGroups, and time-based fair-share. KAI launches all pods in a multi-node distributed training job together or not at all — no half-admitted gangs starving the queue. Best fit when your workload is mostly GPU AI/ML and you want native DRA integration.
Volcano has been doing this since 2019. It replaces kube-scheduler for batched workloads, ships a PodGroup CRD for gang semantics, and has the most mature MPI and Slurm-style operators. Pick it if your teams already think in HPC terms.
Kueue (Kubernetes SIG-Scheduling) is a job-level layer that sits on top. It manages quotas, decides admission, handles preemption, and supports fair-share across cohorts of teams. CoreWeave runs Kueue under several frontier AI labs in production. It composes well: Kueue admits a job, Volcano or KAI schedules its pods.
The stack for most teams: Kueue for quota and admission, KAI or Volcano for pod scheduling, DRA underneath both for the actual device claim.
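Gang semantics are easy to state and easy to get wrong, so here is a minimal sketch of the all-or-nothing rule. It is a toy model under simplifying assumptions (GPUs are the only resource, placement is first-fit, node names are made up), not how KAI or Volcano are implemented.

```python
from typing import Optional


def admit_gang(
    gpus_per_pod: list[int], free_per_node: dict[str, int]
) -> Optional[dict[int, str]]:
    """Place every pod of a gang or none of them.

    Returns {pod_index: node_name} and deducts capacity on success;
    returns None and leaves free_per_node untouched on failure.
    """
    trial = dict(free_per_node)  # work on a copy so a failure has no side effects
    placement: dict[int, str] = {}
    # Place the largest pods first (first-fit decreasing).
    for idx in sorted(range(len(gpus_per_pod)), key=lambda i: -gpus_per_pod[i]):
        need = gpus_per_pod[idx]
        node = next((n for n, free in trial.items() if free >= need), None)
        if node is None:
            return None  # one pod cannot fit, so the whole gang waits
        trial[node] -= need
        placement[idx] = node
    free_per_node.update(trial)  # commit only after every pod has a slot
    return placement


nodes = {"node-a": 8, "node-b": 4}  # hypothetical free GPUs per node
print(admit_gang([8, 8], nodes), nodes)  # None {'node-a': 8, 'node-b': 4}
print(admit_gang([8, 4], nodes), nodes)  # {0: 'node-a', 1: 'node-b'} {'node-a': 0, 'node-b': 0}
```

The first gang is rejected without consuming anything, which is exactly what keeps a half-admitted job from holding GPUs that a smaller job could have used.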
Autoscale on the right signal, not GPU utilization
GPU utilization is a trailing indicator. By the time nvidia-smi reads 90%, the inference queue is already 80 requests deep. Scale on what's about to hurt the user: pending request count, queue duration, or P99 latency.
For Triton, scrape :8002/metrics and let KEDA v2.19+ react to nv_inference_pending_request_count and nv_inference_queue_duration_us. For vLLM, scale on vllm:num_requests_waiting and KV-cache pressure. KServe and the llm-d stack expose both natively.
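Scaling on queue depth comes down to a proportional rule: divide the total pending requests by the number each replica should carry, round up, and clamp. The sketch below shows that arithmetic, which is the rule the Horizontal Pod Autoscaler applies to an average-value target. The target of 10 pending requests per replica and the replica bounds are assumptions to tune per model, and desired_replicas is an illustrative helper, not a KEDA API.

```python
import math


def desired_replicas(
    pending: int,
    target_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 16,
) -> int:
    """Replicas needed so each one carries at most target_per_replica pending requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(pending / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(80, 10))                 # 8
print(desired_replicas(0, 10, min_replicas=0))  # 0 (scale to zero when the queue is empty)
print(desired_replicas(500, 10))                # 16 (capped at max_replicas)
```

A queue 80 requests deep asks for eight replicas immediately, while GPU utilization on the existing pods may still be climbing.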
Node autoscaling is Karpenter's job. A Karpenter release compatible with Kubernetes 1.35+ provisions GPU nodes in under 60 seconds and supports scale-to-zero. A team running eight hours of GPU workloads a day pays for a third of the node-hours and saves roughly 67% versus always-on nodes. Pair Karpenter with EC2 Spot for GPU instances (Spot saves 50–70% on P4d and P5) and put a PodDisruptionBudget on anything you can't lose mid-batch.
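The 67% figure is just the share of the day the nodes are not needed. A quick check, ignoring node startup time and any warm-pool overhead (both are assumptions that would shave the number slightly):

```python
def scale_to_zero_savings(active_hours_per_day: float) -> float:
    """Fraction of always-on node-hours avoided by scaling to zero when idle."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    return 1 - active_hours_per_day / 24


print(f"{scale_to_zero_savings(8):.0%}")  # 67%
```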
What the math actually looks like
Take a small inference cluster: four p4d.24xlarge (8x A100 each), 24/7 on-demand. That's $32.77 × 4 × 730 = ~$95,700 a month, before egress.
Add MIG partitioning to give up to seven slices per card. Run KAI gang scheduling. Autoscale on queue depth. You typically pack three to four times more workloads into the same hardware. Drop the always-on count to two p4ds with Karpenter scaling up to four on Spot during peaks. Final bill: $28,000–$35,000 a month, assuming the two baseline nodes also run on discounted pricing (Spot or a commitment plan); at on-demand rates they alone cost $32.77 × 2 × 730 = ~$47,800. Production case studies consistently report 50–70% spend reductions.
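The scaled-down bill moves a lot with how the two always-on nodes are priced, so it is worth running the numbers yourself. The sketch below reuses the $32.77 rate and 730-hour month from the text; the 60% Spot discount and the eight peak hours a day are assumptions, not figures from the case studies.

```python
HOURS_PER_MONTH = 730
P4D_ON_DEMAND = 32.77  # USD per hour for a p4d.24xlarge, from the text


def monthly_cost(nodes: int, hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost in USD of running `nodes` instances for `hours` at `hourly_rate`."""
    return nodes * hourly_rate * hours


baseline = monthly_cost(4, P4D_ON_DEMAND)  # four nodes, 24/7 on-demand

# Assumptions (not from the text): Spot is 60% off, peaks last 8 hours a day.
SPOT_RATE = P4D_ON_DEMAND * (1 - 0.60)
BURST_HOURS = 8 * 365 / 12  # about 243 peak hours a month

burst = monthly_cost(2, SPOT_RATE, BURST_HOURS)  # two extra Spot nodes during peaks
scenarios = {
    "2 always-on on-demand + Spot burst": monthly_cost(2, P4D_ON_DEMAND) + burst,
    "2 always-on Spot + Spot burst": monthly_cost(2, SPOT_RATE) + burst,
}

print(f"baseline: ${baseline:,.0f}")
for label, total in scenarios.items():
    print(f"{label}: ${total:,.0f} ({1 - total / baseline:.0%} below baseline)")
```

Under these assumptions the on-demand baseline variant lands in the mid-$50,000s and the all-Spot variant in the mid-$20,000s, so the pricing model for the always-on nodes matters more than the burst capacity.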
vLLM is the other lever. Its v0.8.0+ engine with FlashAttention 3 throws 4,741 tok/s through 2x H100 on GPT-OSS-120B at 100 concurrent requests. Continuous batching plus PagedAttention delivers up to 24x the throughput of naive HuggingFace Transformers serving. If you're not running vLLM (or its newer sibling SGLang) on Kubernetes in 2026, you're paying for headroom you can't use.
What I'd build today
Start with EKS 1.34+ (or GKE Standard with DRA enabled). Install the NVIDIA GPU Operator v25.3.2 — it handles drivers, DCGM exporter, MIG configuration, and the DRA driver as one Helm release. Run Kueue at the cluster level for quotas. Put KAI underneath for AI workloads and keep kube-scheduler for everything else. Use Karpenter with a GPU NodePool that prefers Spot. Scale inference pods with KEDA on queue depth, never on GPU percent.
Two pieces of cluster hygiene matter more than the scheduler choice.
First: pod-level GPU resource requests must reflect real VRAM, not whole cards. DRA makes this possible. Your platform team enforces it.
Second: every GPU pod gets a Prometheus DCGM target and an SLO. If you can't see queue depth and P99 latency on one dashboard, you can't autoscale on the right signal.
Optimize the boring decisions. The exotic ones don't matter.