Context Engineering: Why a Bigger Context Window Makes Your Agent Worse

Chroma ran a test in 2025 that should have ended the million-token marketing war. They took 18 frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5 — and gave them a task a child could pass: copy a repeated word, find a semantic match. Then they made the input longer. Every single model got worse. Accuracy fell 20 to 50 percent between 10K and 100K tokens, on a task that never changed. The context windows were nowhere near full.

That result has a name now. Context rot. And it's why your agent gets dumber exactly when you feed it more.

The window is not the budget

Gemini 2.5 Pro holds 2 million tokens. GPT-4.1 and Claude's beta tier hold 1 million. The spec sheet says you can pour a small library into a single prompt. The benchmarks say don't. NoLiMa, the ICML 2025 long-context benchmark from Adobe Research, found that at just 32K tokens, 11 of 13 models had dropped to half their short-context accuracy. Databricks Mosaic measured the same cliff: correctness starts sliding after 32K. So you paid for a 2-million-token window, and your reliable working budget is 32,000.

Picture a desk with fixed surface area, not a filing cabinet you keep loading. Pile on more paper and older sheets slide off the edge — they don't stack neatly out of the way. The window is the model's attention, and attention is rivalrous. Every token you add quietly steals influence from the tokens already on the desk.

Why the middle disappears

The failure mode has been documented since Liu's 2023 "Lost in the Middle" paper. Models over-weight the start and end of a prompt and systematically under-attend everything between. Bury a critical fact in the middle of 150K tokens and the model treats it like it isn't there. An agent carrying 150K tokens of tool output is effectively ignoring 100K of it — and it won't tell you which 100K.

This is where agents die. A 15-step refactor with file reads, shell output, and sub-agent hand-offs balloons past 100,000 tokens before the model has reasoned about anything. Engineers call it agent suicide by context: the agent takes a reasonable action, generates 250K tokens of output, blows past its 200K window, and the task dies. The agent never understood why. It couldn't — the explanation was sitting in the part of the context it had already stopped reading.

Context engineering is the actual job

Andrej Karpathy named it in June 2025. By September, Anthropic had shipped a formal framework. The definition is boring and correct: selecting, structuring, and maintaining the information a model uses to reason. The hard part is deciding what to leave out.

Three techniques carry most of the weight.

Compaction. When conversation history grows, summarize the old turns and drop the raw transcript. Keep the decisions, discard the keystrokes. A 40-turn debugging session compresses to a paragraph of "what we tried, what failed, what's left."

Structured note-taking. Write important state to a file outside the conversation — a plan, a list of constraints, a running summary — so it survives after the turns that created it rotate out of the window. The model rereads the note; it doesn't relive the transcript.

Sub-agents. This one wins biggest. Hand risky, token-heavy work to a sub-agent with its own clean window. It can burn 50K tokens crawling a codebase and return a 2K-token summary to the main agent, which never sees the mess. Claude Code is built exactly this way: the orchestrator holds a high-level plan, and the sub-agents do the dirty reads in isolation.

The numbers say it pays

This stopped being theory in 2026, and the lift is measurable. Anthropic's own evaluations put context editing alone at a 29% performance gain, and pairing it with a memory tool at 39%. On April 23, 2026, they shipped Memory for Managed Agents into public beta — agents persist what they learn as files, with per-write audit logs, and share it across a workspace. Rakuten ran it in production and reported a 97% drop in error rate, 27% lower cost, and 34% lower latency on agent workloads. Those three numbers move together for one reason: a smaller, cleaner context is cheaper to process, faster to process, and easier to get right.

Adoption tracked the results. Datadog's State of AI Engineering found agentic framework use nearly doubled in a year — from 9% of organizations in early 2025 to 18% by early 2026. The teams pulling ahead aren't the ones with the longest context window. They're the ones who treat context as a budget they actively spend down.

What to actually do Monday

Stop measuring your agent by how much you can cram in. Measure the floor where it breaks. Run your real task at 8K, 32K, and 64K tokens and watch where accuracy falls off — for most models it starts around 32K, not at the 200K limit. Set your working budget below that line and engineer to stay under it: compact aggressively, push state to files, and route any operation that reads more than ~20K tokens to a sub-agent that returns a summary instead of a raw dump. The million-token window is real. It's just not where your agent does its best thinking, and pretending otherwise is the most expensive mistake in production AI right now.