Designing Netflix's Recommendation Engine With AI Agents
In March 2025, Netflix's engineering team published a paper that quietly retired 30+ specialized recommendation models in favor of one autoregressive transformer-based foundation model. The new architecture ingests every user signal — clicks, pauses, scrolls, time-of-day — as tokens. Like words in a sentence. That single model now drives roughly 80% of everything watched on the platform, and the recommendation engine is widely credited with generating around $1B in retained subscriber revenue annually.
So what's left to design? The agents around it.
The 200ms wall
Anyone proposing to drop an LLM agent into a recommendation hot path needs to stare at one number first: the P99 latency budget for end-to-end personalization is roughly 200ms. Retrieval gets under 20ms. Ranking gets the rest. Amazon's old finding still applies — every 100ms of added latency costs about 1% of revenue.
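The budget arithmetic is easy to make concrete. The following is a minimal sketch, not a description of Netflix's tooling: `LatencyBudget` is a hypothetical helper, the 200ms and 20ms figures come from the text above, and the per-stage timings in the usage note are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyBudget:
    """Tracks how much of one request's end-to-end budget each stage used."""
    total_ms: float = 200.0  # P99 end-to-end budget quoted in the text
    stage_caps_ms: dict = field(default_factory=lambda: {"retrieval": 20.0})
    spent_ms: dict = field(default_factory=dict)

    def record(self, stage: str, elapsed_ms: float) -> None:
        """Add elapsed time to a stage's running total."""
        self.spent_ms[stage] = self.spent_ms.get(stage, 0.0) + elapsed_ms

    def remaining_ms(self) -> float:
        """Time left in the end-to-end budget (negative once it is blown)."""
        return self.total_ms - sum(self.spent_ms.values())

    def violations(self) -> list:
        """Stages over their own cap, plus "total" if the whole budget is gone."""
        over = [s for s, cap in self.stage_caps_ms.items()
                if self.spent_ms.get(s, 0.0) > cap]
        if self.remaining_ms() < 0:
            over.append("total")
        return over
```

Recording 18ms of retrieval and 150ms of ranking leaves 32ms of headroom. Recording a single hypothetical 400ms agent call on top of that puts the request far over budget, which is the whole argument for keeping agents out of the synchronous path.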
A reasoning agent that calls tools and inspects memory does not finish in 20ms. Not even close. That is why "agentic recommendations" in 2026 is not about putting a chatty LLM in front of your retrieval layer. It is about adding a slower, smarter outer loop that compiles, refines, and supervises the fast inner loop.
Think of it as a control system. The foundation model handles the millisecond decisions. Agents handle the minute-to-minute strategy.
The architecture
A defensible design at Netflix scale looks like this:
Inner loop (sync, every request). Two-stage funnel. Candidate generation pulls a few thousand items via ANN search over embeddings produced by the foundation model. Ranking scores them with a multi-task model. Netflix's UniCoRn, the unified contextual recommender presented at RecSys 2024, reportedly lifted search quality by 7% and recommendation quality by 10% by training one model on both jobs instead of two. Output: an ordered row of titles in under 200ms.
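The two-stage funnel can be sketched in a few lines. This is an illustration under loud assumptions: exact cosine search stands in for a real ANN index, the "multi-task model" is reduced to a weighted blend of precomputed per-task scores, and every function name and data shape here is hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_candidates(user_vec, item_vecs, k):
    """Stage 1: top-k items by similarity (exact search, standing in for ANN)."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: cosine(user_vec, item_vecs[item_id])
    )


def rank(candidates, task_scores, weights):
    """Stage 2: order candidates by a weighted blend of per-task predictions."""
    def blended(item_id):
        return sum(weights[t] * task_scores[item_id][t] for t in weights)
    return sorted(candidates, key=blended, reverse=True)


def recommend(user_vec, item_vecs, task_scores, weights, k=3, row_size=2):
    """Run both stages and return one ordered row of item ids."""
    candidates = generate_candidates(user_vec, item_vecs, k)
    return rank(candidates, task_scores, weights)[:row_size]
```

The design point is the split itself: the first stage is cheap and recall-oriented, the second is expensive and precision-oriented, so the costly model only ever sees a short list.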
Outer loop (async, runs every few minutes to every few hours). An agent system. This is where the new design lives.
The outer loop has four roles, mapped to four agents:
- Planner agent. Decides what to recompute and why. New episode dropped? Refresh the "continue watching" row. User finished a thriller at 2 AM? Generate a "wind down" candidate set. The planner sequences which downstream agents run, in what order, and with what budget.
- Memory agent. Maintains the durable interpretation of the user. Not the embedding (that is the foundation model's job) but the symbolic layer on top. "User prefers 45-minute episodes on weekday nights." "User dropped three crime dramas in a row, back off the genre." Queried by the planner, updated by the evaluator.
- Tool agent. Calls the actual systems. Vector search. The real-time distributed graph Netflix built for member actions, fed by Kafka. External metadata. The foundation model's embedding endpoint. MCP-style tool contracts make this layer swappable without retraining anything upstream.
- Evaluator agent. Runs after recommendations are served. Did the user click? Skip? Add to list? It updates memory and flags drift to the planner. Most teams cut corners here. Most production systems silently degrade because of it.
This is the orchestrator-plus-isolated-subagent pattern that Anthropic, OpenAI, LangChain, AutoGen and Cognition all converged on by early 2026. Each subagent has a narrow job, a bounded context, and a clean handoff. That is why it survives production.
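The handoffs between the four roles can be sketched as plain functions around a shared memory object. Everything below is hypothetical and heavily simplified: rule-based planning stands in for an LLM planner, the `back_off:` fact format is invented, and feedback is passed in directly, whereas in a real system it would arrive after the rows were served.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Symbolic facts about each user, kept separate from any embedding."""
    facts: Dict[str, List[str]] = field(default_factory=dict)

    def read(self, user_id: str) -> List[str]:
        return list(self.facts.get(user_id, []))

    def write(self, user_id: str, fact: str) -> None:
        self.facts.setdefault(user_id, []).append(fact)


@dataclass
class Task:
    row: str        # which row to recompute
    tool: str       # which tool produces its candidates
    budget_ms: int  # carried along so an executor could enforce it


def plan(event: dict, facts: List[str]) -> List[Task]:
    """Planner: decide what to recompute, skipping rows memory says to back off."""
    tasks = []
    if event["type"] == "new_episode":
        tasks.append(Task("continue_watching", "vector_search", 500))
    if event["type"] == "finished_title" and event.get("hour", 12) < 5:
        tasks.append(Task("wind_down", "vector_search", 500))
    return [t for t in tasks if f"back_off:{t.row}" not in facts]


def run_tools(tasks: List[Task], tools: Dict[str, Callable[[Task], List[str]]]):
    """Tool agent: call the named tool for each task."""
    return {t.row: tools[t.tool](t) for t in tasks}


def evaluate(user_id, rows, feedback, memory: Memory) -> List[str]:
    """Evaluator: write a back-off fact for rows nobody clicked; return them."""
    flagged = []
    for row, titles in rows.items():
        clicks = sum(1 for title in titles if feedback.get(title) == "click")
        if titles and clicks == 0:
            memory.write(user_id, f"back_off:{row}")
            flagged.append(row)
    return flagged


def outer_loop(user_id, event, memory, tools, feedback):
    """One pass: plan from memory, run tools, evaluate, update memory."""
    rows = run_tools(plan(event, memory.read(user_id)), tools)
    return rows, evaluate(user_id, rows, feedback, memory)
```

Each function sees only what it needs: the planner never touches tools directly, and the evaluator is the only writer to memory. That narrow surface is what makes each piece testable in isolation.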
What the numbers actually say
The honest version: agents on top of a strong foundation model do not double your metrics. The AgentRec paper (arXiv 2510.01609, October 2025) ran a hierarchical multi-agent recommender on three real-world datasets and reported a 1.9% lift in NDCG@10, a 2.8% better conversation success rate, and 3.2% better conversation efficiency over state-of-the-art baselines.
Small numbers. At Netflix scale they could still be worth tens of millions of dollars, provided an offline gain like that carries over to production engagement. A 1.9% lift on a system driving 80% of watch time is not marginal; it is the kind of gain that can pay for the entire infrastructure rebuild.
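For readers who have not computed the metric by hand, here is a minimal sketch of NDCG@10 and of what a relative lift means. It assumes the common linear-gain form of DCG and derives the ideal ordering from the same relevance list. The paper's exact evaluation protocol is not reproduced here, and the 0.500 baseline in the usage note is purely illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the served order divided by DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_lift(baseline, treatment):
    """Fractional improvement of treatment over baseline."""
    return (treatment - baseline) / baseline
```

Under these definitions, moving a hypothetical baseline NDCG@10 of 0.500 to 0.5095 is a 1.9% relative lift. The absolute change is under one point, which is why the number looks small on paper.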
The bigger unlock is qualitative. Conversational requests like "something funny but not stupid, under 90 minutes, no romance" were out of reach for the old funnel because intent compression was lossy. Netflix's CRAG (Collaborative Retrieval-Augmented Generation), published in 2025, is described by its authors as the first system to fuse collaborative filtering with LLM-driven conversational recommendation. That capability only exists with an agent in the loop.
Where it breaks
Three failure modes show up immediately in production.
Latency creep. A planner that "just calls one more tool" adds 400ms. Stack three agents and the outer loop now runs every 15 minutes instead of every 5. Refresh staleness becomes the user-visible bug, not the reasoning quality. Fix: enforce per-agent latency budgets and kill any agent that exceeds them. No retries.
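One way to enforce that rule is a hard timeout around each agent call. The sketch below uses `asyncio.wait_for`, which cancels the awaited coroutine when the timeout expires. The agent names, budgets, and the convention of returning `None` for a killed agent are assumptions made for illustration.

```python
import asyncio


async def run_with_budget(name, agent_fn, budget_s):
    """Run one agent; if it exceeds its budget, cancel it and do not retry."""
    try:
        return name, await asyncio.wait_for(agent_fn(), timeout=budget_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the agent; report the miss and move on.
        return name, None


async def run_agents(agents):
    """agents maps name -> (coroutine function, budget in seconds)."""
    results = await asyncio.gather(
        *(run_with_budget(name, fn, budget) for name, (fn, budget) in agents.items())
    )
    return dict(results)
```

A caller would invoke this with `asyncio.run(run_agents({...}))` and treat any `None` as "skip this agent's contribution for this cycle". The previous cycle's output stays in place, so a slow agent costs freshness for one row rather than stalling the whole loop.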
Memory consistency. Multi-agent memory consistency is the open research problem of 2026. Two agents read the same user state, both write updates, neither sees the other's change. Three-layer memory hierarchies — I/O, cache, durable — are showing up in production designs to fence this off. Most teams underestimate the engineering work here by 3x.
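The lost-update scenario described above has a well-known mitigation: version-checked writes, also called optimistic concurrency. The sketch below shows that standard technique, not the three-layer hierarchy the text mentions. `VersionedUserState` and `StaleWriteError` are hypothetical names, and the in-process lock stands in for whatever compare-and-set primitive a real store provides.

```python
import threading


class StaleWriteError(Exception):
    """Raised when a writer's view of the state is older than the stored one."""


class VersionedUserState:
    """User state where every write must name the version it was based on."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # user_id -> (version, data dict)

    def read(self, user_id):
        """Return (version, copy of data); unknown users start at version 0."""
        with self._lock:
            version, data = self._state.get(user_id, (0, {}))
            return version, dict(data)

    def write(self, user_id, expected_version, updates):
        """Apply updates only if nobody else has written since our read."""
        with self._lock:
            version, data = self._state.get(user_id, (0, {}))
            if version != expected_version:
                raise StaleWriteError(
                    f"expected v{expected_version}, store is at v{version}"
                )
            self._state[user_id] = (version + 1, {**data, **updates})
            return version + 1
```

If two agents read version 0 and both try to write, the second write raises instead of silently overwriting the first. The losing agent then has to re-read and decide whether its update still makes sense, which turns an invisible inconsistency into an explicit one.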
Cost. The foundation model alone is expensive. Agents call it repeatedly. A naive implementation can 5x your inference bill in a quarter. Caching planner decisions and reusing tool outputs across users with similar profiles is the only thing that keeps unit economics sane.
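Sharing planner decisions across similar users comes down to choosing a coarse cache key. Here is a minimal sketch, assuming a hypothetical bucketing by top genre and daypart, plus a simple time-to-live. A real system would pick its own profile features and invalidation rules.

```python
import time


def profile_bucket(profile):
    """Collapse a user profile into a coarse key shared by similar users."""
    hour = profile["hour"]
    daypart = "night" if hour < 6 or hour >= 22 else "day"
    return (profile["top_genre"], daypart)


class PlannerCache:
    """Caches one planner decision per profile bucket for ttl_s seconds."""

    def __init__(self, plan_fn, ttl_s=300.0, clock=time.monotonic):
        self._plan_fn = plan_fn  # the expensive call, e.g. an LLM planner
        self._ttl_s = ttl_s
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # bucket -> (stored_at, plan)
        self.misses = 0          # number of times plan_fn actually ran

    def get_plan(self, profile):
        key = profile_bucket(profile)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]
        self.misses += 1
        plan = self._plan_fn(key)
        self._entries[key] = (now, plan)
        return plan
```

The trade-off is explicit in the key: coarser buckets mean more cache hits and a lower bill, at the price of less personalized planning. The `misses` counter gives a direct read on how many expensive planner calls the cache is not absorbing.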
Build the inner loop first
If you are designing a recommender today, the order matters. Build the two-stage funnel with a strong embedding model. Get P99 under 200ms with real production traffic. Ship a multi-task ranker before you ship a planner agent. The foundation model does 90% of the work. Agents add a thin, expensive, high-value layer on top.
Skip that order and you end up with an agent system that hallucinates recommendations nobody watches, served on top of a retrieval layer that was never ready. The agent gets blamed. The retrieval layer was the real problem.
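The "P99 under 200ms" gate is simple enough to automate before any agent work starts. Below is a minimal sketch using the nearest-rank percentile definition. The function names are hypothetical, and a production check would read latencies from real traffic logs rather than a list.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def inner_loop_ready(latencies_ms, budget_ms=200.0):
    """True when the observed P99 latency fits inside the budget."""
    return percentile(latencies_ms, 99) <= budget_ms
```

The check looks at the tail, not the mean. With 100 requests, two slow outliers are enough to fail the gate even when every other request is fast.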
Netflix did the model consolidation first. The agent layer is the next chapter. Build in that order, or do not build it at all.
Architecture diagram
The full multi-agent walkthrough (planner, memory, tools, evaluator) is covered in a video on the Neuroscale Engineering YouTube channel about how AI agents actually work in production systems.