A fintech infra lead I spoke with last month was paying $47,000 a month to Anthropic. He moved his classification workload — about 320 million tokens a day — to a pair of H100s on RunPod. New monthly bill: $4,360. Same accuracy. Same p95 latency. Five-month payback on the migration work.
That is the self-hosting pitch in 2026. It is also the trap. The same engineer's team tried to self-host their customer-facing chat — 14 million tokens a day — and gave up after eight weeks. APIs won at that volume by 4×.
The break-even is not a number you guess. It is roughly 500 million tokens per day at current GPU rental rates and current frontier API pricing. Below that, you are buying complexity. Above it, you are leaving money on the table.
The numbers, with sources
Two H100s on RunPod Community at $1.99/GPU-hr cost about $2,860/month at 100% uptime. Add storage, networking, monitoring, and you land near $4,400. That stack — vLLM serving Llama 3.3 70B at FP8 — handles roughly 1,200 tokens/second sustained. Multiply across a 30-day month at 70% utilization and you get a usable ceiling of about 2.2 billion tokens per month per node pair.
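If you want to check those figures rather than take them on faith, the arithmetic fits in a dozen lines of Python. The constants are the rates quoted above; the storage/networking overhead is the rough figure implied by the ~$4,400 all-in number, so swap in your own quotes.

```python
# Capacity and cost of one 2x H100 node pair, using the rates quoted above.
GPU_HOURLY = 1.99          # $/GPU-hr, RunPod Community
GPUS = 2
OTHER_MONTHLY = 1_540      # storage, networking, monitoring (rough, implied by ~$4,400 all-in)
TOK_PER_SEC = 1_200        # sustained, Llama 3.3 70B at FP8 on the pair
UTILIZATION = 0.70

hours = 24 * 30
raw_rental = GPU_HOURLY * GPUS * hours                 # ~$2,866/month
all_in = raw_rental + OTHER_MONTHLY                    # ~$4,400/month
ceiling = TOK_PER_SEC * 3_600 * hours * UTILIZATION    # ~2.2B usable tokens/month

print(f"raw rental ${raw_rental:,.0f}  all-in ${all_in:,.0f}  ceiling {ceiling/1e9:.1f}B tok/mo")
```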
Frontier API equivalent at $3/M input + $15/M output (blended around $9/M) runs $9,000 per billion tokens. At 500M tokens/day — 15 billion a month — the API bill is $135,000. Serving that volume yourself takes about seven of those node pairs at the 2.2-billion-token ceiling, call it $31,000 a month in raw infra before redundancy.
The catch is everything not labelled "GPU." Senior MLOps time runs $750 to $3,000 a month for a system serving real traffic. Total real cost runs 3 to 5× the raw GPU rental once you include ops, model updates, and idle capacity. Below 500M tokens a day, that multiplier eats your savings.
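Put together, the comparison looks like this. It is a rough sketch, assuming the blended $9/M API rate, the per-pair numbers above, and the midpoint of that 3 to 5× multiplier applied to the all-in node cost; none of these are laws of physics, so rerun it with your own quotes and prompt mix.

```python
import math

BLENDED_API_PER_M = 9.00   # $/M tokens, blended input + output (see above)
NODE_PAIR_MONTHLY = 4_400  # all-in cost of one 2x H100 pair per month
PAIR_CEILING = 2.2e9       # usable tokens/month per pair at 70% utilization
REAL_COST_MULTIPLIER = 4   # assumption: midpoint of the 3-5x ops/idle multiplier,
                           # applied here to the all-in node cost; in practice it
                           # shrinks as fixed ops work amortizes across more pairs

def monthly_costs(tokens_per_day: float) -> tuple[float, float]:
    """Return (api_bill, self_hosted_all_in) in dollars per month."""
    monthly_tokens = tokens_per_day * 30
    api_bill = monthly_tokens / 1e6 * BLENDED_API_PER_M
    pairs = math.ceil(monthly_tokens / PAIR_CEILING)
    self_hosted = pairs * NODE_PAIR_MONTHLY * REAL_COST_MULTIPLIER
    return api_bill, self_hosted

for daily in (100e6, 500e6):
    api, hosted = monthly_costs(daily)
    print(f"{daily/1e6:>4.0f}M tok/day: API ${api:>9,.0f}  self-hosted ~${hosted:>9,.0f}/month")
```

At 100M tokens a day the API still wins; at 500M the self-hosted number edges ahead, which is why the break-even sits roughly where it does.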
The four conditions that flip the math
Cost alone almost never justifies self-hosting in 2026. Four other conditions do.
Regulatory data residency. HIPAA, PCI DSS, ITAR, FedRAMP, and the EU AI Act's high-risk-system obligations make "send the prompt to OpenAI" a legal blocker, not a preference. Zero-retention API endpoints satisfy some legal teams. They do not satisfy all of them. If your DPO has already said no twice, self-hosting is not a cost decision.
Single-digit millisecond latency. Network round trip to any cloud API adds 80–200ms. For industrial control, high-frequency trading classifiers, and in-vehicle inference, that is a hard fail. Local inference on a co-located GPU eliminates the round trip entirely.
Fine-tuned models that outperform frontier on your task. If you trained a 7B on three years of your support tickets and it beats GPT-5.5 on your eval set, you have no choice. No API hosts your weights for free.
Sustained volume above 500M tokens/day. This is the only purely economic condition. Below it, the operational tax is bigger than the API bill. Above it, the math inverts hard — 5× to 10× cheaper at billion-token-per-day scale.
Everything else — "we want control," "we don't trust OpenAI," "what if prices go up" — is feelings, not strategy. The API providers cut prices 40% to 70% every nine months. They will keep doing it. Betting against that with a $200K/year GPU lease is a bet you will probably lose.
The inference engine choice has changed
vLLM was the obvious answer in 2024. In 2026, it is one of three real options.
vLLM still wins on time-to-production. 62-second cold start. Model swaps without recompilation. 24× throughput over raw HuggingFace Transformers thanks to PagedAttention and continuous batching. If you are shipping in two weeks, this is the answer.
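Week one can be this short. Here is a minimal sketch of the stack above using vLLM's offline Python API; in production you would put the same settings behind vLLM's OpenAI-compatible server, but the load-and-generate path is the same. The model id, FP8 flag, and memory settings are reasonable defaults rather than gospel, so check them against the vLLM release you actually deploy.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed HF repo id for the model above
    tensor_parallel_size=2,        # split the weights across the two H100s
    quantization="fp8",            # FP8, as in the cost math above
    max_model_len=8192,            # cap context to keep the KV cache in budget
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Classify this ticket: 'My card was charged twice.'"], params)
print(outputs[0].outputs[0].text)
```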
TensorRT-LLM wins on absolute throughput once you commit to a model. 8 to 13% faster than vLLM at 50 concurrent requests on H100. p50 TTFT of 105ms versus 120ms for vLLM. The cost is a 28-minute compile step every time you change anything. Reserve this for the single model you will run for six months.
SGLang wins on prefix-heavy workloads. RAG pipelines, multi-turn chat, structured generation — anything where the system prompt is identical across requests — see up to 6× throughput over vLLM when that shared prefix gets cached and reused. In PremAI's small-model benchmark the gap was more modest: 16,200 tok/s versus vLLM's 12,500. If your traffic is 90% the same prompt template, switch.
The right answer in 2026 is rarely "vLLM by default." It is "vLLM for week one, benchmark all three against your actual prompts in week three."
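The week-three benchmark does not need to be fancy. All three engines can be fronted by an OpenAI-compatible HTTP endpoint, so one probe script covers them; the URL, model name, and prompt file below are placeholders for your own, and the script assumes the server reports token usage in the standard response field.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # point at each engine in turn
MODEL = "meta-llama/Llama-3.3-70B-Instruct"            # whatever name the server registered
prompts = open("sample_prompts.txt").read().splitlines()  # your actual prompt mix, not a synthetic set

start = time.monotonic()
completion_tokens = 0
for prompt in prompts:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.2,
    )
    completion_tokens += resp.usage.completion_tokens

elapsed = time.monotonic() - start
print(f"{completion_tokens / elapsed:.0f} output tok/s over {len(prompts)} prompts")
```

Serial requests understate what continuous batching buys, so fire the same prompts concurrently (asyncio or any load-testing tool) before you trust the ranking.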
The model choice is also no longer obvious
Llama 3.3 70B is the default for general assistants. Strong, 128K context, the largest community of tooling around it. On 2× H100 at FP8 it serves about 1,200 tok/s and quality is within 5% of Claude Sonnet on most production evals.
Qwen3 30B-A3B is the dark horse. 30.5B total parameters but only 3.3B active per token via MoE. Runs at 120 to 196 tok/s on a single RTX 4090. For a startup with one consumer GPU and 50M tokens a day, this is the answer. The output quality surprises people coming from Llama.
DeepSeek-R1-Distill-Llama-70B is the reasoning pick. 94.5 on MATH-500, 57.5 on LiveCodeBench. Same hardware envelope as Llama 3.3 70B, but the reasoning traces push wall-clock latency up 2–3×. Use it for code review, not chat.
Default to Llama 3.3 70B unless you have a specific reason. The cost of getting the model choice wrong is one weekend of retesting. The cost of getting the deployment architecture wrong is a quarter.
What to actually do this month
If your API bill is under $5,000 a month, stop reading and stay on the API. Nothing here pays back.
If your API bill is $5,000 to $20,000 a month, audit which workloads can move to a smaller open model on a managed inference provider — Together AI, Fireworks, DeepInfra. You get 60% of the savings with none of the ops. This is the right answer for most teams pretending they want to self-host.
If your API bill is over $20,000 a month, run the math at your actual token volume and prompt mix. Spend two weeks benchmarking Llama 3.3 70B on a single H100 pair with vLLM. If quality holds, the payback is usually under six months.
If your data cannot leave your VPC, you do not have a choice. Start with vLLM on Llama 3.3 70B inside your existing Kubernetes cluster, add monitoring with Prometheus and Grafana, and accept that you are buying a 0.5 FTE of MLOps work permanently.
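The monitoring piece is less work than it sounds, because vLLM already exports Prometheus metrics on the serving port. Here is a minimal probe, assuming recent vLLM metric names (verify them against the version you deploy); in the real cluster, Prometheus scrapes the same endpoint and Grafana charts it.

```python
import requests

# Metric names below match recent vLLM releases; confirm against your deployed version.
WATCH = ("vllm:num_requests_running", "vllm:num_requests_waiting",
         "vllm:gpu_cache_usage_perc")

def snapshot(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Return current values for the queue-depth and KV-cache metrics."""
    values = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith(WATCH):
            name, value = line.rsplit(" ", 1)
            values[name] = float(value)
    return values

if __name__ == "__main__":
    for name, value in snapshot().items():
        print(f"{name:45s} {value:.2f}")
```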
The teams shipping production self-hosted LLMs in 2026 are not the ones chasing the cost win. They are the ones whose lawyers said no, or whose API bill crossed six figures while nobody was looking. Everyone else should keep paying the bill.