NeuroscaleEngineering
AI Architecture

Amazon Bedrock Pricing Deep Dive — Real Costs at 1M, 10M, and 100M Tokens

8 min readBy Neuroscale Engineering
amazon bedrockawsllm pricingcost optimizationai infrastructureclaudenova

A financial services team opened their AWS bill on a Tuesday in March. Bedrock line item: $42,800. The month before: $11,200. They hadn't shipped new code. They'd just hit real traffic.

Six weeks later that bill was $18,000. No model downgrade. No quality drop. They turned on three things that don't appear on the Bedrock pricing headline. We'll get there.

First: the actual math at 1M, 10M, and 100M tokens. Most teams budget Bedrock off a single number from a marketing page, then meet the rest of the bill in production.

Bedrock Cost Stack — Per Million Tokens (2026)User Request~70% input / 30% outputRouting LayerRule-based or intelligent prompt routingNova Micro / Lite$0.035 / $0.14 — Micro$0.06 / $0.24 — Lite~$0.07 / 1M (70/30)Nova Pro$0.80 in / $3.20 out— per million~$1.52 / 1M (70/30)Claude Sonnet 4.6$3 in / $15 out— per million~$6.60 / 1M (70/30)Discount StackPrompt Cacheup to 90% off inputBatch Inference50% off, 24h SLASmart Routingup to 30% off bill

The token math everyone gets wrong

Claude Sonnet 4.6 lists at $3 per million input tokens and $15 per million output tokens on Bedrock — identical to Anthropic's direct API in standard regions. The number that actually matters is the ratio. Output costs 5x more than input.

Most teams average the two and multiply. Wrong. RAG runs roughly 70/30 input-to-output. Chatbots are closer to 50/50. Code completion is heavy output. Pick the wrong ratio and your forecast misses by 40%.

Run the numbers honestly. Then run them again at scale.

At 1M, 10M, 100M tokens: what you actually pay

Sonnet 4.6 monthly spend at a 70/30 input/output split:

  • 1M tokens: $6.60
  • 10M tokens: $66
  • 100M tokens: $660
  • 1B tokens: $6,600

That looks cheap. Now the same volumes on Nova Pro, Amazon's mid-tier model at $0.80 input and $3.20 output per million: $1.52, $15.20, $152, $1,520.

Nova Pro is 4.3x cheaper than Sonnet at the same token mix. For classification, entity extraction, summarization, or simple Q&A, the quality gap is single digits. Nova Micro sits at the bottom at $0.035 input and $0.14 output — about $0.07 per million tokens at 70/30. A workload that costs $660 on Sonnet runs roughly $7 on Micro, for the right tasks.

Llama 3.3 70B is $0.72 flat for both directions. Mistral Large 2 is $3 in, $9 out. The Sonnet number is what teams quote out loud. The Nova number is what gets shipped to production once the spreadsheet exists.

The OpenSearch trap

Bedrock Knowledge Bases costs nothing. What sits underneath does.

The default vector store, OpenSearch Serverless, has a floor of 2 OpenSearch Compute Units at $0.24 per hour. That is $345 per month before you embed a single document. Run a proof-of-concept over a weekend, leave it on, come back Monday — you spent $32 hosting nothing. At 100K queries a month, OpenSearch is still the biggest line on your Bedrock bill. Bigger than the foundation model.

In December 2025 AWS launched Amazon S3 Vectors. Up to 90% cheaper than OpenSearch Serverless. Trillions of vectors. Sub-second latency. If you're starting a new Knowledge Base today, S3 Vectors is the right default. Teams who built on OpenSearch through 2024 are migrating now and reporting $300-$400 monthly savings per index.

Provisioned throughput vs on-demand: when the math flips

Most teams should never touch Provisioned Throughput.

A Claude Model Unit on a one-month commitment runs roughly $50 per hour. That's $36,000 per month per MU. To break even against on-demand Sonnet pricing, you need to push around 12 billion input tokens or 2.4 billion output tokens through that single MU every month. Steady, predictable traffic at that scale wins on Provisioned. Spiky traffic loses by half.

Most production Bedrock workloads are spiky. Stay on-demand until your p95 throughput sits at 80% of an MU for four consecutive weeks. Anything earlier is paying for idle capacity.

The three discounts that cut bills in half

Batch inference is 50% off the on-demand rate. Submit a JSONL file to S3, results come back within 24 hours. Use it for embeddings, FM-as-judge evaluations, nightly report summarization, anything that doesn't need a real-time response. AWS estimates 30-40% of typical Bedrock workloads have no latency requirement. Almost nobody moves them.

Prompt caching gives up to 90% off cached input tokens. Cache a 4,000-token system prompt across 100 requests per minute and you save roughly $0.27 per 1,000 requests on Sonnet input alone. At 1 million requests per day that is $8,100 per month in savings from a single configuration flag. The cache TTL is five minutes — short, but long enough for almost every chat session and RAG batch.

Intelligent prompt routing between Sonnet and Haiku cuts up to 30% with no measurable accuracy hit. One AWS customer's ticket classification task scored 94% on Haiku versus 97% on Sonnet — at one-fifteenth the cost. Build a rule-based router at the application layer first. Use Bedrock's managed router only after you've measured your own traffic.

That financial services team from the opening? Batch inference saved 22%. Caching saved 18%. Routing saved 25%. Compound them and a $42,800 bill becomes $18,000. None of those changes touched a single line of model-facing code.

The hidden surcharges

Cross-region inference adds 10% to every token. Teams turn it on for reliability and never check the markup. The newer global cross-region inference for Claude Sonnet 4.5 inverts the math — it runs roughly 10% cheaper than geographic cross-region. Read the docs on which mode your inference profile is using before you assume.

GovCloud adds 20-30% on top of standard pricing. Guardrails are billed separately: content filters at $0.15 per 1,000 text units, PII detection at $0.10 per 1,000, contextual grounding at $0.10 per 1,000. A text unit caps at 1,000 characters, so a 5,600-character prompt is six units. Heavy guardrail use on a chat product can add $200-$400 a month to a $1,000 inference bill.

Bedrock Agents carry no separate fee. The orchestration tokens are the fee. A five-tool agent task consumes roughly 5x the tokens of a single direct invocation because every reasoning step generates input and output. Teams quote agent costs based on the user-facing prompt and ignore the reasoning loop. That gap is where the surprise bill lives.

What to do this week

If you're already on Bedrock and your monthly bill is over $5,000, do these three things in this order. Turn on prompt caching for your system prompts. Move anything not latency-sensitive to batch. Rule-route easy queries to a smaller model. Each is a config change, not an architecture change. Expect a 40-60% reduction inside one billing cycle.

If you're evaluating Bedrock against direct Anthropic or OpenAI APIs, the per-token math is roughly even. The real lock-in is AWS infrastructure — VPC isolation, IAM, KMS, no data leaving your account. That's worth a 10% premium for regulated workloads. It's worth nothing if you're a startup with no compliance pressure.

Real companies spend 1.5 to 2x their initial Bedrock estimate. A $50K annual budget becomes $85K. A $200K budget becomes $340K. Build your forecast at 2x. If you come in under, you bought runway. If you come in at 2x, you weren't surprised.


Want the video version of this analysis? Subscribe to Neuroscale Engineering for weekly deep-dives on AI infrastructure and system design.

Get notified when we publish

One email per article. No spam. Unsubscribe anytime.

Comments