The Hallucination Problem Has a Fix
GPT-4 will confidently cite a paper that doesn't exist. Claude will quote API documentation from two versions ago. Every LLM hallucinates, and no amount of RLHF has fixed it. RAG (Retrieval-Augmented Generation) sidesteps most of the problem: instead of asking the model to remember, you retrieve the relevant context first, inject it into the prompt, and let the model reason over real data. The result is answers grounded in your actual documents, with citations you can verify.
This isn't a nice-to-have pattern anymore. It's the default architecture for any LLM application that touches proprietary data — customer support bots, internal search, code assistants, compliance tools. Here's how to build one that actually works in production.
Five Stages, Five Places to Get It Wrong
A production RAG pipeline has five stages: document loading, chunking, embedding, vector storage, and retrieval-augmented querying. Most tutorials focus on stage five. Most production failures happen in stage two.
Document Loading
Your pipeline starts with raw data: PDFs, web pages, Notion docs, Slack exports. The loader extracts clean text while preserving structure. LangChain's PyPDFLoader handles most text-based PDFs. Scanned documents need OCR preprocessing, which adds latency and introduces recognition errors. The unglamorous truth: expect something like 30% of your development time to go here, cleaning up extraction edge cases that no library handles perfectly.
Chunking — Where Pipelines Fail Silently
Chunk too large and your embeddings become diluted — a 2,000-token chunk representing six different concepts matches everything weakly and nothing strongly. Chunk too small and you lose context — a 50-token fragment has no idea what paragraph it came from.
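The sweet spot for most use cases: 512 tokens with a 50-token overlap, meaning each chunk repeats the last 50 tokens of the previous one so that text near a boundary keeps some of its surrounding context.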
Recursive character splitting works because it respects natural boundaries — paragraphs first, then sentences, then words. This preserves semantic coherence within each chunk. Fixed-size splitting ignores boundaries and will cut a sentence in half mid-thought. The difference shows up as a 15-20% recall gap in our testing, which is enormous when your users are asking questions about specific clauses in a contract.
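To make the boundary-first idea concrete, here is a minimal sketch of recursive splitting in plain Python. It is not LangChain's implementation. The function name `recursive_split`, the separator list, and the character-based length limit are assumptions for illustration. A real pipeline would measure length in tokens with the embedding model's tokenizer and add the overlap between neighbouring chunks, both of which are left out here for brevity.

```python
from typing import List, Sequence

# Separators tried in order: paragraph break, sentence end, single space.
# (Assumed list for illustration; tune it to your documents.)
SEPARATORS = ("\n\n", ". ", " ")


def recursive_split(text: str, max_len: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters,
    preferring the coarsest boundary that still fits."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next finer one.
        return recursive_split(text, max_len, finer)

    chunks: List[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate          # keep packing parts together
            continue
        if current:
            chunks.append(current)       # separator at the cut is dropped
        if len(part) <= max_len:
            current = part
        else:
            # A single part is still too long: split it on finer boundaries.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa."
for chunk in recursive_split(sample, max_len=20):
    print(repr(chunk))
```

Note how the sketch only cuts inside a sentence when the sentence itself exceeds the limit, which is the property that fixed-size splitting lacks.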
Embedding Models
The embedding model converts each chunk into a dense vector — a point in high-dimensional space where similar meanings cluster together. Your choice here directly controls retrieval quality.
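As a rough illustration of "similar meanings cluster together", the sketch below compares vectors with cosine similarity, the measure most vector stores use by default. The three-dimensional vectors are hand-made stand-ins, not output from any real embedding model, and the variable names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, for illustration only).
refund_policy = [0.9, 0.1, 0.0]
return_an_item = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, return_an_item))   # close to 1
print(cosine_similarity(refund_policy, weather_report))   # close to 0
```

A real model produces hundreds of dimensions per chunk, but the comparison at query time works the same way.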
For most production use cases, sentence-transformers/all-MiniLM-L6-v2 hits the right balance: 384 dimensions, fast local inference, strong semantic understanding, and open weights you can run for free. Self-hosting it on a T4 GPU costs around $0.50/hour and handles thousands of embeddings per second with 2-3ms latency.
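If you need higher accuracy on technical content, BGE-large or Cohere's embed-v3 are measurably better, but they cost more. BGE-large is a larger model that is slower to run, and embed-v3 is a hosted API billed per token, so its cost scales linearly with the volume of text you embed. For a 500K-document corpus, the difference between free local embeddings and $0.0001/1K-token API embeddings is $800/month. That figure corresponds to about 8 billion tokens embedded per month, or roughly 16,000 tokens per document, so it applies when you re-embed at that volume continuously. If you do, you pay it every month, forever.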
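The cost claim is easier to sanity-check with the arithmetic written out. The sketch below is a back-of-the-envelope calculator, not a pricing tool. The function names are hypothetical, and the 500-tokens-per-document figure in the second example is an assumption chosen only to show how a one-time embedding pass compares.

```python
def embedding_cost(tokens: float, price_per_1k_tokens: float) -> float:
    """Dollar cost of embedding `tokens` tokens at a per-1K-token price."""
    return tokens / 1_000 * price_per_1k_tokens


def tokens_covered_by(budget: float, price_per_1k_tokens: float) -> float:
    """How many tokens a dollar budget buys at a per-1K-token price."""
    return budget / price_per_1k_tokens * 1_000


PRICE_PER_1K = 0.0001      # from the article
DOCUMENTS = 500_000        # from the article

# What monthly token volume does an $800/month bill imply?
implied_tokens = tokens_covered_by(800, PRICE_PER_1K)
print(f"{implied_tokens:,.0f} tokens per month")
print(f"{implied_tokens / DOCUMENTS:,.0f} tokens per document per month")

# One-time pass, assuming (hypothetically) 500 tokens per document.
one_time = embedding_cost(DOCUMENTS * 500, PRICE_PER_1K)
print(f"${one_time:,.2f} to embed the corpus once")
```

The gap between the two results is the point: the recurring bill is driven by how often you re-embed and how much query traffic you embed, not by corpus size alone.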
Vector Storage
The vector database stores your embeddings and handles similarity search at query time. At 100K vectors, every option is fast — Pinecone, Qdrant, ChromaDB, pgvector all return results in under 20ms. The choice only matters at scale.
Pinecone's free tier gives you 100K vectors with serverless scaling and zero infrastructure work. For self-hosted production past 1M vectors, Qdrant offers the best performance-per-dollar — 6ms p50 latency, pre-filtering on metadata, and a Rust engine that doesn't fall over under concurrent load. pgvector is the right call if you already run Postgres and want to skip managing a separate service.
We wrote a full comparison of vector databases at scale — read that before picking.
Retrieval + Generation
The final stage retrieves the top-k most relevant chunks and formats them into a prompt. LangChain's LCEL (LangChain Expression Language) makes this clean: the retriever pipes into a prompt template, which pipes into the LLM, which pipes into an output parser.
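Stripped of any framework, the retrieve-then-format step looks roughly like the sketch below. This is plain Python rather than LCEL, and everything in it is an assumption for illustration: the function names, the in-memory index of (text, vector) pairs, the unit-normalized toy vectors, and the prompt wording. The LLM call itself is left out.

```python
import heapq
from typing import List, Sequence, Tuple

IndexEntry = Tuple[str, Sequence[float]]  # (chunk text, embedding vector)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query_vec: Sequence[float], index: List[IndexEntry],
                   k: int = 4) -> List[str]:
    """Return the k chunk texts most similar to the query, best first.
    Assumes all vectors are unit-normalized, so dot product equals
    cosine similarity."""
    best = heapq.nlargest(k, index, key=lambda entry: dot(query_vec, entry[1]))
    return [text for text, _ in best]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Number each chunk so the model can cite its sources."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy index with hand-made unit vectors (not real embeddings).
toy_index = [
    ("Refunds are issued within 14 days.", [1.0, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.6, 0.8]),
    ("The office is closed on public holidays.", [0.0, 1.0]),
]
top = retrieve_top_k([1.0, 0.0], toy_index, k=2)
print(build_prompt("How long do refunds take?", top))
```

Numbering the chunks in the prompt is what makes the citations in the answer checkable against the retrieved text.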
k=4 is a reasonable default — enough context for grounded answers without blowing up your token budget. Going higher (k=8 or k=10) adds cost and can actually hurt accuracy by drowning the signal in noise. Going lower (k=2) risks missing the relevant chunk entirely. Measure retrieval recall on a test set before committing to a number.
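Measuring retrieval recall on a test set can be as simple as the sketch below. It assumes you have a small labelled set where each question lists the chunk IDs a correct answer needs. The chunk IDs, the dictionary layout, and the function names are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each test question needs at least one relevant chunk")
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(results: List[Dict], k: int) -> float:
    """Average recall@k over every question in the test set."""
    scores = [recall_at_k(r["retrieved"], r["relevant"], k) for r in results]
    return sum(scores) / len(scores)


# Hypothetical retriever output (ranked IDs) and hand-labelled relevant IDs.
test_set = [
    {"retrieved": ["c1", "c7", "c3", "c9", "c2"], "relevant": {"c1", "c2"}},
    {"retrieved": ["c4", "c5", "c6", "c8", "c0"], "relevant": {"c5"}},
]

for k in (2, 4, 5):
    print(f"recall@{k} = {mean_recall_at_k(test_set, k):.2f}")
```

In this toy data, k=4 misses a relevant chunk that k=5 picks up, which is the kind of evidence that should drive the choice of k rather than a default.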
Three Mistakes That Kill RAG Pipelines
Oversized chunks. The most common failure. Teams use 1,000- or 2,000-token chunks because "more context is better." It isn't. Larger chunks dilute embedding quality and return vaguely relevant results instead of precisely relevant ones. Start at 512 tokens, measure recall, and only go larger if your data genuinely requires it (long-form legal documents, for example).
No metadata filtering. Your vector search returns the 4 most similar chunks across your entire corpus. But the user asked about Q1 2026 specifically. Without metadata filters (date, source, tenant, document type), you retrieve semantically similar content from the wrong time period, wrong customer, or wrong document. Add metadata at ingestion time. Filter at query time.
Measuring generation quality without measuring retrieval quality. When the LLM gives a wrong answer, teams blame the model. Often the retriever returned the wrong chunks and the model reasoned correctly over bad input. Always evaluate retrieval recall separately: what percentage of relevant chunks actually appeared in the top-k results? Fix retrieval first.
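The metadata filtering described in the second mistake can be sketched as follows. This is an in-memory illustration only: the `Chunk` class, the metadata keys, and the precomputed similarity scores are assumptions, and a real vector database would apply the filter inside the search rather than over a scored list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    score: float                      # similarity to the query (assumed precomputed)
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_then_rank(chunks: List[Chunk], filters: Dict[str, str], k: int = 4) -> List[Chunk]:
    """Keep only chunks whose metadata matches every filter, then take the top k."""
    matching = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    return sorted(matching, key=lambda c: c.score, reverse=True)[:k]


# Metadata is attached at ingestion time (values here are made up).
corpus = [
    Chunk("Q1 2026 revenue grew.", 0.81, {"quarter": "2026-Q1", "tenant": "acme"}),
    Chunk("Q1 2025 revenue grew.", 0.93, {"quarter": "2025-Q1", "tenant": "acme"}),
    Chunk("Q1 2026 churn fell.", 0.64, {"quarter": "2026-Q1", "tenant": "acme"}),
]

unfiltered = filter_then_rank(corpus, {}, k=1)
filtered = filter_then_rank(corpus, {"quarter": "2026-Q1"}, k=2)
print(unfiltered[0].text)             # best match overall is from the wrong year
print([c.text for c in filtered])     # only Q1 2026 chunks remain
```

Without the filter, the highest-scoring chunk comes from the wrong period even though it is the closest semantic match, which is exactly the failure described above.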
Key Takeaways
- 512 tokens, 50-token overlap is the chunking sweet spot for most RAG use cases
- Self-host embeddings on a T4 GPU for around $0.50/hour instead of paying per-token API costs that grow with every document you embed
- pgvector is fine under 5M vectors; Qdrant wins past 10M; Pinecone if you have no infra team
- Measure retrieval recall separately from generation quality — most RAG failures are retrieval failures
- k=4 is a safe default; go higher only if recall testing shows you're missing relevant chunks
Related
- Vector Databases at Scale: pgvector vs Pinecone vs Qdrant — choosing the right vector store for your pipeline
- LLM Inference: How to Cut Your GPU Bill from $60K to $6K — optimizing the inference layer behind your RAG pipeline
- Multi-Agent AI: How Teams of Agents Replace Single Models — orchestrating multiple agents that each use RAG for grounded answers