A customer support RAG system running 10 queries per second on GPT-3.5-turbo costs roughly $15,000 a month in inference alone. Switch that to GPT-4 and you're looking at $300,000. We ran the same workload on a self-hosted DeepSeek V4 instance on a single A100 — $2,500/month, comparable quality on our internal evals. That's not a rounding error. That's a business model difference.
The API Tax Is Real, and It Compounds
Here's the math most teams skip. A 10 QPS RAG pipeline sends roughly 26 million requests a month. Each request embeds a 50-token query, retrieves 500 tokens of context, and generates a 200-token response. At OpenAI's GPT-3.5-turbo pricing ($0.0005/1K input, $0.0015/1K output), that's $0.000575 per query. Sounds tiny. It's $15,000/month.
The embedding layer adds another $1,400. Your vector DB adds hosting costs. Suddenly the "simple API integration" is a $20K line item that scales linearly with traffic. Double your users, double your bill. No volume discounts that matter at this scale.
This is the part that burns teams: the API is cheapest when you don't need it (prototyping, low traffic) and most expensive when you do (production, growing users).
DeepSeek V4 Changed the Self-Hosting Math
DeepSeek V4 dropped in April 2026 and shifted the calculus. 671B total parameters, but it's a Mixture-of-Experts model — only 37B active per token. That means it runs on hardware that would choke on a dense 671B model.
The numbers that matter:
| Metric | DeepSeek V4-Pro | DeepSeek V4-Flash | GPT-5.5 | Claude Opus 4.7 | |:-------|:----------------|:------------------|:--------|:----------------| | SWE-bench Verified | 80.6% | — | ~80% | 80.9% | | HumanEval | ~90% | — | — | — | | Context Length | 1M tokens | 1M tokens | — | — | | Input (per 1M tokens) | $0.04 | $0.004 | 5-10x higher | — | | Output (per 1M tokens) | $0.08 | $0.008 | 5-10x higher | — |

DeepSeek V4-Pro matches GPT-5.5 on SWE-bench (80.6% vs ~80%) and beats Claude on Terminal-Bench. The Flash variant is where it gets wild — $0.004 per million input tokens. That's 125x cheaper than GPT-4 for workloads where Flash-tier quality is sufficient.
The 1M token context window is the other story. Long-document RAG, multi-turn agents, code repositories — these are use cases where context length isn't a nice-to-have, it's the constraint that shapes your entire architecture.
When to Stay on the API (and When to Leave)
The break-even point is straightforward: if your monthly OpenAI bill consistently exceeds $5,000-$10,000, self-hosting pays for itself. Below that, the operational overhead isn't worth it.
A single A100 rents for $3-5/hour from AWS, Azure, or GCP. That's $2,500-$3,500/month. Add vLLM as your serving framework, and one A100 handles 10-20 QPS for a 7B model with sub-second latency. For the DeepSeek V4 MoE architecture, you'll want more VRAM, but the active parameter count (37B) keeps it tractable.
What you're trading for that cost savings: you now own GPU provisioning, model updates, monitoring, autoscaling, and the 3 AM pages when inference latency spikes. This isn't a weekend project. You need someone who's done MLOps before, or you'll spend the savings on engineering time debugging CUDA out-of-memory errors.
The honest recommendation: start on the API. Measure your actual spend for 60 days. If you're under $5K, stay. If you're over $10K and growing, start the self-hosting migration. The messy middle ($5K-$10K) depends on whether you have MLOps talent on the team already.
The Embedding Layer Is the Easy Win
Before you touch inference, self-host your embeddings. This is the lowest-risk, highest-return optimization and most teams overlook it.
OpenAI's text-embedding-ada-002 costs $0.0001 per 1K tokens. At 10 QPS, that's $1,400/month just for embeddings. Deploy bge-small-en-v1.5 or all-MiniLM-L6-v2 on a single T4 GPU ($0.50/hour, ~$360/month) and you get the same quality embeddings with 2-3ms local latency instead of 50-100ms API round-trips.
That single change saves $1,000/month and cuts your P95 latency. Do it first.
Key Takeaways
- API costs scale linearly — 10 QPS on GPT-3.5-turbo is $15K/month, GPT-4 is $300K. No volume break.
- DeepSeek V4 matches GPT-5.5 on coding benchmarks at 125x lower token cost (Flash tier).
- Break-even for self-hosting is $5K-$10K/month in API spend. Below that, stay on the API.
- Self-host embeddings first — easiest win, saves $1K/month, improves latency.
- The real cost of self-hosting is ops, not hardware. Budget for MLOps talent, not just GPUs.
Further Reading
- DeepSeek V4 Release — Official specs, pricing, and benchmark results
- Introducing GPT-5.5 — OpenAI's capabilities overview and pricing
- vLLM Project — The serving framework that makes self-hosting practical
- Self-Hosted LLMs vs OpenAI API: True Cost Analysis — Detailed financial comparison for startups