The $60,000 Problem
Character.ai was burning millions per month on inference in 2023 before they optimized their serving stack. Meta built an entirely new inference engine to serve Llama at scale. These aren't edge cases — they're what happens when a model that needs 80 GB of GPU memory meets real production traffic.
The pattern is always the same. Your model runs great on a single GPU during development. You deploy it. The first 100 concurrent users bring your server to its knees — latency spikes to 10 seconds, costs explode, users leave. Understanding inference infrastructure isn't optional anymore. It's the gap between a demo and a product.
Why One-Request-at-a-Time Dies Immediately
Most teams start simple: one GPU, one model, one request at a time. A user sends a prompt. The model generates token by token, each taking about 30 milliseconds. A 500-token response takes 15 seconds. While that's happening, every other user waits.
A single A100 costs $2/hour on cloud. Processing one request at a time, you serve 240 requests per hour — $0.008 per request. Sounds cheap. Then you hit 10,000 requests per hour and need 42 GPUs at $84/hour. That's $60,000/month for a model that sits idle between tokens.
The GPU isn't the bottleneck. The scheduling is. 90% of the time, your most expensive hardware is waiting — waiting for attention computation, waiting for memory transfers, waiting between tokens. You're paying for a $2/hour GPU to do nothing.
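To make that arithmetic concrete, here's a quick back-of-the-envelope script using the same assumptions as the paragraphs above ($2/hour, 30 ms/token, 500-token responses). These are illustrations, not benchmarks.

```python
# Back-of-the-envelope cost model for serial (one-request-at-a-time) serving.
# Assumptions match the text: $2/hr GPU, ~30 ms per generated token, 500-token responses.
GPU_COST_PER_HOUR = 2.00
SECONDS_PER_TOKEN = 0.030
TOKENS_PER_RESPONSE = 500

seconds_per_request = SECONDS_PER_TOKEN * TOKENS_PER_RESPONSE        # 15 s
requests_per_gpu_hour = 3600 / seconds_per_request                   # 240
cost_per_request = GPU_COST_PER_HOUR / requests_per_gpu_hour         # ~$0.008

target_requests_per_hour = 10_000
gpus_needed = -(-target_requests_per_hour // int(requests_per_gpu_hour))  # ceiling division -> 42
hourly_fleet_cost = gpus_needed * GPU_COST_PER_HOUR                   # $84/hr
monthly_fleet_cost = hourly_fleet_cost * 24 * 30                      # ~$60,480/mo

print(f"{requests_per_gpu_hour:.0f} req/GPU-hr, ${cost_per_request:.4f}/req")
print(f"{gpus_needed} GPUs -> ${hourly_fleet_cost:.0f}/hr, ${monthly_fleet_cost:,.0f}/mo")
```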
Five Techniques, 90% Cost Reduction
1. Continuous Batching
Static batching waits until it has a full batch, processes it, then waits again. Continuous batching slots new requests into the running batch the moment earlier ones finish, token step by token step. The difference is staggering.
vLLM popularized this approach. Throughput jumps 10-24x on the same hardware. A GPU that served 240 requests per hour now serves 2,400-5,760. Per-request cost drops from $0.008 to under $0.001. If you're running any production LLM inference and you're not using continuous batching, you're lighting money on fire.
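With vLLM you get continuous batching without writing any scheduling code; the engine handles it internally. A minimal sketch using the offline API (the model name and sampling settings below are placeholders, not recommendations):

```python
# Minimal vLLM sketch: the engine batches these prompts continuously on its own,
# adding and removing sequences from the running batch at every decoding step.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder; any supported checkpoint works
params = SamplingParams(temperature=0.7, max_tokens=500)

prompts = [f"Summarize ticket #{i} in one sentence." for i in range(64)]
outputs = llm.generate(prompts, params)               # scheduled with continuous batching internally

for out in outputs:
    print(out.outputs[0].text[:80])
```

For online traffic, the same engine runs behind an OpenAI-compatible HTTP server (recent vLLM versions expose this as `vllm serve`), and the batching happens across concurrent requests instead of a pre-built list.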
2. KV Cache Management
During generation, the model computes key-value pairs for every token in the context. Without caching, it recomputes them for every new token. Wasteful.
PagedAttention (also from the vLLM team) manages the KV cache like virtual memory, allocating and freeing fixed-size blocks on demand. It recovers the 60-80% of cache memory that naive contiguous allocation wastes to fragmentation and over-reservation, which means more concurrent requests on the same GPU or longer contexts without OOM crashes. This is the technique that made serving 128K-token contexts practical on hardware that would otherwise run out of memory.
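The real PagedAttention lives in CUDA kernels, but the bookkeeping idea fits in a few lines. The sketch below is a toy allocator (class name, block size, and API are invented for illustration, not vLLM's internals) showing why block-level allocation avoids reserving a contiguous max-length region per request.

```python
# Toy illustration of paged KV-cache bookkeeping (not vLLM's real implementation).
# Each sequence grows its cache one fixed-size block at a time, and freed blocks
# go straight back into a shared pool, so no contiguous max-length reservation is needed.
BLOCK_TOKENS = 16

class PagedKVCache:
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))   # shared pool of physical blocks
        self.tables = {}                               # seq_id -> list of block ids

    def append_token(self, seq_id: str, position: int) -> None:
        """Reserve a new block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_TOKENS == 0:               # block boundary reached
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: evict or preempt a sequence")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))

cache = PagedKVCache(total_blocks=1024)
for pos in range(500):                # a 500-token generation touches ceil(500/16) = 32 blocks
    cache.append_token("request-42", pos)
cache.free("request-42")              # blocks instantly reusable by other requests
```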
3. Quantization
Llama 70B at FP16 needs 140 GB of GPU memory. That's two A100s just to load the model. Quantize to 4-bit with GPTQ or AWQ and it fits on a single GPU with room for the KV cache.
The accuracy cost: less than 1% on most benchmarks. The dollar cost: 75% less GPU spend. This is the single highest-impact optimization most teams can make right now, and it takes an afternoon to implement.
One caveat. Aggressive quantization (below 4-bit) hurts mathematical reasoning and structured output. If your use case involves precise calculations or JSON generation, benchmark carefully before going below INT4.
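In practice this is often a one-line change in the serving config. A minimal sketch with vLLM, assuming a pre-quantized AWQ checkpoint (the repo name below is a placeholder):

```python
# A 4-bit 70B model on a single 80 GB GPU via vLLM.
# The checkpoint name is a placeholder -- point it at any pre-quantized AWQ repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Llama-3.1-70B-Instruct-AWQ",   # ~35 GB of weights at 4-bit vs ~140 GB at FP16
    quantization="awq",                            # use "gptq" for GPTQ checkpoints
    gpu_memory_utilization=0.90,                   # leave headroom; the rest goes to the KV cache
)

out = llm.generate(["Explain KV caching in two sentences."],
                   SamplingParams(max_tokens=120))
print(out[0].outputs[0].text)
```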
4. Model Parallelism
For models too large for one GPU even after quantization, you split across multiple GPUs. Tensor parallelism splits individual layers. Pipeline parallelism distributes sequential layers across cards.
NVIDIA's TensorRT-LLM handles this automatically and optimizes the split pattern for your hardware. The key decision: tensor parallelism adds inter-GPU communication overhead but reduces latency; pipeline parallelism minimizes communication but adds pipeline bubbles. For interactive workloads, tensor parallelism usually wins.
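As a sketch of what the knob looks like in practice, here is the vLLM version of the same decision (TensorRT-LLM makes it at engine-build time instead); the model name and the two-GPU assumption are placeholders.

```python
# Tensor parallelism sketch: shard each layer's weight matrices across 2 GPUs.
# Model name and GPU count are assumptions -- size tensor_parallel_size to your node.
from vllm import LLM

llm = LLM(
    model="your-org/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,       # split attention/MLP weights across 2 GPUs (adds inter-GPU traffic per layer)
    # pipeline_parallel_size=2,   # alternative: stack layers across GPUs; fewer syncs, but pipeline bubbles
)
```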
5. Intelligent Routing
A load balancer that understands LLM workloads makes routing decisions based on request complexity. Short prompts go to smaller quantized models. Complex reasoning goes to full-precision. Similar prompts get routed to GPUs with warm KV caches, skipping redundant computation.
Anyscale and Together AI have built this into their inference platforms. The routing layer alone cuts costs 30-40% by matching request complexity to model capability. You don't need your 70B model to answer "what's the weather."
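There's no standard library for this layer; the sketch below is a hypothetical router where every endpoint, model name, and threshold is invented, but it shows the shape of the decision: cheap heuristics pick the pool before the request ever touches a GPU.

```python
# Hypothetical complexity-based router -- endpoints, model names, and thresholds are made up.
# Real platforms (Anyscale, Together) bake similar logic into their serving layer.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    url: str

SMALL = Backend("llama-8b-int4", "http://pool-small/v1")    # cheap, quantized
LARGE = Backend("llama-70b-fp16", "http://pool-large/v1")   # expensive, full precision

REASONING_HINTS = ("prove", "step by step", "write code", "calculate", "json")

def route(prompt: str) -> Backend:
    """Send short, simple prompts to the small pool; long or reasoning-heavy ones to the large pool."""
    long_prompt = len(prompt.split()) > 200
    needs_reasoning = any(h in prompt.lower() for h in REASONING_HINTS)
    return LARGE if (long_prompt or needs_reasoning) else SMALL

print(route("what's the weather in Paris?").name)                 # -> llama-8b-int4
print(route("Prove the sum of two even numbers is even").name)    # -> llama-70b-fp16
```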
The Trade-Offs Are Real
Every optimization has a cost. Quantization saves memory but can hurt quality on math-heavy tasks. Continuous batching increases throughput but adds latency per request when the batch is large. KV cache management adds system complexity you'll need to monitor and debug.
The question that actually matters: what's your latency budget?
A chatbot needs sub-second time-to-first-token. A batch processing pipeline tolerates 5-second latency for 10x throughput. A coding assistant needs high accuracy, so aggressive quantization is risky. Start by measuring three numbers — GPU utilization, time-to-first-token, tokens per second — and optimize whichever one is furthest from target.
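A small sketch for the latter two numbers, pointed at any OpenAI-compatible endpoint such as a vLLM server (base URL and model name are placeholders; GPU utilization comes from nvidia-smi or DCGM, not from this script):

```python
# Measure time-to-first-token and tokens/sec against an OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

start = time.perf_counter()
first_token_at = None
n_tokens = 0

stream = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Write a 200-word product description."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1          # counting streamed chunks as a rough proxy for tokens

total = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s, ~{n_tokens / total:.1f} tok/s")
```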
Key Takeaways
- Continuous batching is non-negotiable for production. If you're processing one request at a time, you're wasting 90% of your GPU. Start with vLLM.
- Quantize first — FP16 to INT4 cuts GPU costs 75% with minimal accuracy loss. Highest ROI for an afternoon of work.
- Don't optimize everything at once. Measure GPU utilization, find the bottleneck, fix that. Then measure again.
Related
- Building Efficient RAG Pipelines with Vector Databases — the retrieval layer that feeds your inference stack
- Multi-Agent AI: How Teams of Agents Replace Single Models — when your inference costs multiply across agent calls