Building RAG on Amazon Bedrock — Knowledge Bases, Guardrails, and Agents in 2026

By Neuroscale Engineering
Tags: Amazon Bedrock, RAG, Knowledge Bases, Bedrock Guardrails, Bedrock Agents, S3 Vectors, AWS

December 3, 2025: AWS made S3 Vectors generally available, and the math on Bedrock RAG flipped overnight. A Pinecone Performance tier ran $1,750/month plus $0.40 per 1,000 queries. OpenSearch Serverless still bills a $345/month floor before a single document is ingested. S3 Vectors: up to roughly 90% cheaper than either, 2 billion vectors per index, sub-second cold latency.

That single launch killed the most common reason teams skipped Bedrock Knowledge Bases.

The Three Pieces That Finally Compose

Bedrock's RAG story is three services that finally fit together. Knowledge Bases handle ingestion, chunking, embedding, and retrieval. Guardrails sit between the user input and the model output. Agents wrap both with tool use through OpenAPI schemas and Lambda.

Each is mediocre alone. Stacked, they replace about 400 lines of LangChain glue with a single RetrieveAndGenerate call. Bedrock orchestrates the lookup, the prompt assembly, the model invocation, and the guardrail check. One round-trip.
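To make the single call concrete, here is a minimal sketch of the request shape for `retrieve_and_generate` on boto3's `bedrock-agent-runtime` client. The helper names `build_rag_request` and `ask` are ours, and the knowledge base ID, model ARN, and guardrail ID are placeholders you would supply, not values from this article.

```python
from typing import Any, Dict, Optional


def build_rag_request(
    question: str,
    knowledge_base_id: str,
    model_arn: str,
    guardrail_id: Optional[str] = None,
    guardrail_version: str = "DRAFT",
    top_k: int = 5,
) -> Dict[str, Any]:
    """Build keyword arguments for bedrock-agent-runtime retrieve_and_generate."""
    kb_config: Dict[str, Any] = {
        "knowledgeBaseId": knowledge_base_id,
        "modelArn": model_arn,
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }
    if guardrail_id is not None:
        # Attach the guardrail to the generation step of the same call.
        kb_config["generationConfiguration"] = {
            "guardrailConfiguration": {
                "guardrailId": guardrail_id,
                "guardrailVersion": guardrail_version,
            }
        }
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": kb_config,
        },
    }


def ask(question: str, **ids: Any) -> str:
    """Send the request; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(**build_rag_request(question, **ids))
    return response["output"]["text"]
```

Keeping the request builder separate from the network call makes the payload easy to unit test before any AWS resources exist.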

This is the part most architecture posts get wrong. They treat Knowledge Bases as a managed vector DB. It is not. It is a managed RAG runtime, and the difference matters when you cost out the alternative.

Knowledge Bases: Pick Your Chunker Or Lose

Bedrock supports four chunking modes: fixed-size, hierarchical, semantic, and custom Lambda. The defaults are wrong for most documents.

Fixed-size is what every tutorial shows: 300 tokens with a 60-token (20%) overlap. Fast to set up. Terrible for contracts, financial filings, or any document where headings carry meaning. The chunker happily splits "Section 4.2.1: Termination Clauses" across two chunks, and retrieval accuracy on heading-dependent queries can drop by 20 points or more.

Hierarchical chunking is the move for structured content. Parent chunks at 1,500 tokens. Child chunks at 300. Retrieval matches the child, returns the parent. Pinpoint precision with full context.

Semantic chunking uses embeddings to find natural breakpoints. It is the right default for unstructured prose — research papers, support transcripts, knowledge wikis. Slower to ingest. Cheaper at query time because retrieved chunks are more coherent.
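The three built-in modes map onto the `chunkingConfiguration` structure that a Knowledge Base data source accepts. The sketch below uses the token sizes quoted above; the hierarchical overlap and the semantic buffer and threshold values are illustrative assumptions, and the helper name `chunking_configuration` is ours.

```python
from typing import Any, Dict


def chunking_configuration(strategy: str) -> Dict[str, Any]:
    """Return a chunkingConfiguration dict for a Knowledge Base data source."""
    if strategy == "FIXED_SIZE":
        return {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,
                "overlapPercentage": 20,  # 60 of 300 tokens
            },
        }
    if strategy == "HIERARCHICAL":
        return {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # First level is the parent chunk, second is the child chunk.
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,  # assumed; the article gives no overlap here
            },
        }
    if strategy == "SEMANTIC":
        return {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                # Illustrative values, not recommendations from the article.
                "maxTokens": 300,
                "bufferSize": 1,
                "breakpointPercentileThreshold": 95,
            },
        }
    raise ValueError(f"unsupported strategy: {strategy}")
```

The returned dict would be passed as `vectorIngestionConfiguration={"chunkingConfiguration": ...}` when calling `create_data_source` on the `bedrock-agent` client. Chunking is fixed per data source, so changing strategy means re-ingesting.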

For embeddings: Titan Text Embeddings V2 at 1,024 dimensions is fine. Drop to 512 and you keep 99% of retrieval accuracy at half the storage cost. Drop to 256 and you still hold 97%. Cohere Embed English narrowly beats Titan on standard MTEB benchmarks — pick it if response quality matters more than the no-egress AWS-native pricing.
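The storage half of that trade-off is simple arithmetic. The sketch below assumes vectors are stored as raw float32 values and ignores metadata and index overhead, so treat the numbers as a lower bound. The request body follows the Titan Text Embeddings V2 format for `invoke_model`; the function names are ours.

```python
import json

BYTES_PER_FLOAT32 = 4
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
TITAN_V2_DIMENSIONS = (256, 512, 1024)


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw float32 payload size; excludes metadata and index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def titan_v2_request_body(text: str, dimensions: int = 512) -> str:
    """JSON body for invoke_model against Titan Text Embeddings V2."""
    if dimensions not in TITAN_V2_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {TITAN_V2_DIMENSIONS}")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": True}
    )


if __name__ == "__main__":
    for dims in TITAN_V2_DIMENSIONS:
        gib = raw_vector_bytes(10_000_000, dims) / 2**30
        print(f"{dims:>5} dims, 10M vectors: {gib:.1f} GiB")
```

The dimension is chosen at embedding time, so a knowledge base built at 1,024 dimensions has to be re-embedded to move to 512.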

The $350 OpenSearch Trap

Click through the Bedrock console and the default vector store is OpenSearch Serverless. It bills two compute units at $0.24/hour minimum. That is $345/month before you ingest a single document. With redundancy enabled, around $700/month.

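For prototypes and pilots this is the silent budget killer. Three teams running side-by-side experiments with redundancy enabled will spend about $2,100/month (3 × $700) before they have a working demo; even without redundancy the combined floor is over $1,000/month.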
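The floor is easy to reproduce. This sketch assumes a 30-day month (720 hours) and that enabling redundancy doubles the minimum to four compute units, which is our reading of the roughly $700 figure rather than something stated above.

```python
HOURS_PER_MONTH = 720  # assumes a 30-day month


def opensearch_monthly_floor(
    ocus: int = 2,
    price_per_ocu_hour: float = 0.24,
    hours: int = HOURS_PER_MONTH,
) -> float:
    """Minimum monthly bill for always-on compute units, before any data."""
    return ocus * price_per_ocu_hour * hours


if __name__ == "__main__":
    single = opensearch_monthly_floor()            # two units
    redundant = opensearch_monthly_floor(ocus=4)   # assumed redundant minimum
    print(f"one collection:        ${single:,.2f}/month")
    print(f"with redundancy:       ${redundant:,.2f}/month")
    print(f"three redundant teams: ${3 * redundant:,.2f}/month")
```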

Better defaults in 2026:

  • Trillion-vector scale, latency-tolerant: S3 Vectors. GA December 2025. Up to 90% cheaper than alternatives. Sub-second cold queries, around 100ms warm. Use this unless you genuinely need sub-50ms p99.
  • Small to mid workloads: Aurora Serverless v2 with pgvector. Floor under $50/month. Reported 90% cost cut versus OpenSearch in production migrations.
  • Hot, low-latency, high-throughput: OpenSearch Serverless. Pay the tax because you need the speed.

The decision tree is simple. Start on S3 Vectors. Migrate up only when measured latency dictates it.

Guardrails Stopped Being Optional

Guardrails dropped 80% in price last year. Content filters and denied-topic filters now cost $0.15 per 1,000 text units. PII detection and contextual grounding cost $0.10 per 1,000 units. A text unit is up to 1,000 characters, so a typical 2,000-character RAG response is 2 text units. Counting one charge at each of the two rates, that is 2 × ($0.15 + $0.10) / 1,000 = $0.0005 per call. Five hundred microdollars. There is no economic argument left for skipping them.
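The per-call figure can be checked with a few lines. This sketch assumes partial text units round up and takes the list of per-policy rates as an input, because the total depends on how many policies are billed. The function names are ours.

```python
import math
from typing import Iterable

CHARS_PER_TEXT_UNIT = 1000


def text_units(num_chars: int) -> int:
    """Number of text units, assuming partial units round up."""
    return math.ceil(num_chars / CHARS_PER_TEXT_UNIT)


def guardrail_cost_per_call(
    num_chars: int, prices_per_1000_units: Iterable[float]
) -> float:
    """Sum the charge for each billed rate over the text units in one call."""
    units = text_units(num_chars)
    return sum(units * price / 1000 for price in prices_per_1000_units)


if __name__ == "__main__":
    # One charge at each of the two rates quoted above.
    print(guardrail_cost_per_call(2000, [0.15, 0.10]))
    # If all four policies are billed separately, the total doubles.
    print(guardrail_cost_per_call(2000, [0.15, 0.15, 0.10, 0.10]))
```

Either way the result stays at or below a tenth of a cent per response, so the conclusion holds under both readings.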

The feature that actually changes the architecture shipped in 2025: Automated Reasoning checks. Formal mathematical verification against a policy document you encode — not probabilistic, not pattern-matching. PwC used it for EU AI Act compliance in financial services. AWS reports up to 99% accuracy on factual error detection inside constrained domains.

This is the first production-grade hallucination defense that proves a response is consistent with policy rather than guessing. Available today in us-east-1, us-east-2, us-west-2, eu-central-1, eu-west-1, and eu-west-3. If you ship into regulated workflows — healthcare, financial services, legal — turn it on. Skip it and a regulator eventually asks why.

Agents: Worth It When Grounded

Bedrock Agents are an orchestration loop plus action groups. An action group is an OpenAPI schema plus a Lambda function. The agent reads user input, decides which action to invoke, fills in parameters, and calls the Lambda. The OpenAPI schema is the contract.

Stand-alone agents are gimmicks. Agents over Knowledge Bases are the pattern that ships. The agent's planning step runs Retrieve against your KB, gets grounded chunks, then decides whether to invoke a tool or answer directly.

Multi-agent collaboration shipped in 2025 with a supervisor pattern. One orchestrator agent breaks the request into subtasks and routes them to specialists. Useful for any workflow with more than two natural phases — quote → underwrite → bind in insurance, research → draft → review in legal.

Common failure: skipping OpenAPI schema discipline. Vague parameter descriptions break orchestration. Specify required vs optional. Document parameter types. Include examples. The Lambda receives whatever parameter values the agent filled in from the schema, so a sloppy schema means a sloppy invocation.
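On the Lambda side, the handler should validate what it receives rather than trust the agent. The sketch below follows the event and response shapes Bedrock Agents use for OpenAPI-based action groups, where request body fields arrive as a list of name, type, and value entries. The `/quotes` operation and its two parameters are hypothetical, invented for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical parameters the schema would mark as required.
REQUIRED = ("policy_type", "coverage_amount")


def _body_properties(event: Dict[str, Any]) -> Dict[str, str]:
    """Flatten the name/type/value list into a plain dict of string values."""
    content = event.get("requestBody", {}).get("content", {})
    props = content.get("application/json", {}).get("properties", [])
    return {p["name"]: p["value"] for p in props}


def _reply(event: Dict[str, Any], status: int, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a payload in the response envelope the agent expects."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status,
            "responseBody": {"application/json": {"body": json.dumps(payload)}},
        },
    }


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    args = _body_properties(event)
    missing = [name for name in REQUIRED if name not in args]
    if missing:
        return _reply(event, 400, {"error": f"missing parameters: {missing}"})
    quote = {
        "policy_type": args["policy_type"],
        "coverage_amount": float(args["coverage_amount"]),  # values arrive as strings
        "status": "quoted",
    }
    return _reply(event, 200, quote)
```

Returning a 400 with a readable error gives the agent a chance to ask the user for the missing value instead of failing the whole turn.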

What This Actually Costs

A mid-scale RAG assistant serving 50,000 monthly queries on Claude Sonnet 4.5 with a Bedrock Knowledge Base lands between $1,500 and $3,500/month for inference, or roughly $0.03 to $0.07 per query. Add the vector store. S3 Vectors at that volume runs around $50/month. OpenSearch Serverless: $350–$700/month.

Three optimizations that compound:

  1. Prompt caching. 90% discount on cached input tokens. Cuts costs 30–50% for repetitive RAG prompts where the system message and retrieved chunks repeat across a conversation.
  2. Lower embedding dimensions. Titan V2 at 512 dims = half the storage, 99% retrieval accuracy.
  3. Reranker on demand only. Amazon Rerank 1.0 is $1.00 per 1,000 queries. Skip it for narrow, high-precision queries. Use it on broad, ambiguous searches where top-5 retrieval is unreliable.
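To see how prompt caching moves the inference bill, here is a rough cost model. The token counts and the per-million-token prices ($3 input, $15 output) are assumptions chosen for illustration, not figures from this article, and the model ignores any surcharge for writing to the cache.

```python
def monthly_inference_cost(
    queries: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    cached_fraction: float = 0.0,
    cache_discount: float = 0.90,
) -> float:
    """Estimate monthly spend; cached input tokens are billed at a discount."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1 - cache_discount)
    per_query = (
        billable_input * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return queries * per_query


if __name__ == "__main__":
    # Assumed workload: 6,000 input and 800 output tokens per query.
    base = monthly_inference_cost(50_000, 6_000, 800, 3.0, 15.0)
    cached = monthly_inference_cost(50_000, 6_000, 800, 3.0, 15.0, cached_fraction=0.6)
    print(f"no caching:  ${base:,.0f}/month")
    print(f"60% cached:  ${cached:,.0f}/month ({1 - cached / base:.0%} saved)")
```

Under these assumptions the saving lands near the low end of the 30–50% range, because output tokens are never cached and carry a large share of the per-query cost.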

The Architecture Diagram

[Figure: Production RAG on Amazon Bedrock. User Query (app / API) → Guardrails (PII, denied topics, prompt-attack) → Knowledge Base (hierarchical chunking, S3 Vectors, Titan V2) → Bedrock Agent (OpenAPI schema, Lambda action groups) → Foundation Model (Claude Sonnet 4.5 + Automated Reasoning) → Response (+ citations)]

The arrows are the architecture. The boxes are the bill.

Ship This, Skip That

Ship: Bedrock Knowledge Bases on S3 Vectors with hierarchical chunking for structured docs and semantic chunking for prose; Guardrails with PII filter, contextual grounding, and Automated Reasoning where you can encode a policy; Agents only when you already have at least two distinct tools and a KB to ground them.

Skip: OpenSearch Serverless as the default vector store, fixed-size chunking on documents where headings carry meaning, multi-agent collaboration before single-agent retrieval works end-to-end, and Lambda action groups without precise OpenAPI schemas.

The architecture stopped being expensive in December 2025. The remaining question is whether your team is willing to read the chunking docs instead of cargo-culting the console defaults.
