A new model drops. The obvious move.
Anthropic released Claude Fable 5 today. I have been running the Sonnet line for day-to-day orchestration work, so whenever a new model lands the question is always the same: does it handle complexity better, and does it know when to delegate?
The test I use is a multi-agent workflow review. Give the model a production site, a clear goal, and explicit permission to spawn as many sub-agents as it needs. Then watch the output.
Today it ran 62 agents in a single session.
How the workflow is structured
The model was not given a list of things to check. It was given a goal: increase conversion. Everything else it figured out.
The pattern it chose:
- Phase 1: Fan out. Multiple finder agents run simultaneously, each searching a different angle: UX, copy, analytics, email flow, SEO, code bugs, pricing logic.
- Phase 2: Adversarial verify. Every finding gets a separate skeptic agent whose only job is to refute it. If the skeptic cannot, the finding survives. If it can, the finding is dropped.
- Phase 3: Synthesise. Surviving findings are ranked by impact and grouped into action tiers.
The result was a prioritised plan: fix-now bugs, week-1 improvements, week-2 experiments, long-term bets. Each item independently shippable.
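To make the three phases concrete, here is a minimal Python sketch. It is not the workflow tool's API and not what the model actually ran: the `Finding` fields, the `tier` thresholds and the `review` function are all names I made up for illustration, and the finders and the skeptic are plain callables standing in for sub-agents. The finders run one after another here to keep the shape readable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    angle: str        # which finder produced it, e.g. "copy" or "seo"
    claim: str        # what the finder believes is wrong
    impact: int       # hypothetical 1-10 score
    effort_days: int  # hypothetical effort estimate


Finder = Callable[[str], list[Finding]]  # site -> findings
Skeptic = Callable[[Finding], bool]      # True means "refuted"


def tier(f: Finding) -> str:
    """Hypothetical tiering rule; the real ranking logic is not known."""
    if f.effort_days <= 1 and f.impact >= 8:
        return "fix-now"
    if f.effort_days <= 5:
        return "week-1"
    if f.effort_days <= 10:
        return "week-2"
    return "long-term"


def review(site: str, finders: list[Finder], skeptic: Skeptic) -> dict[str, list[Finding]]:
    # Phase 1: fan out. Every finder looks at the same site from its own angle.
    raw = [f for finder in finders for f in finder(site)]
    # Phase 2: adversarial verify. A finding survives only if the skeptic
    # fails to refute it.
    survivors = [f for f in raw if not skeptic(f)]
    # Phase 3: synthesise. Rank by impact, then group into action tiers.
    plan: dict[str, list[Finding]] = {
        "fix-now": [], "week-1": [], "week-2": [], "long-term": [],
    }
    for f in sorted(survivors, key=lambda f: f.impact, reverse=True):
        plan[tier(f)].append(f)
    return plan
```

The point of the shape is that the skeptic sits between discovery and ranking, so a refuted finding never reaches the plan at all.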
What the adversarial pass actually catches
This is the part that surprised me even knowing how it works. The adversarial pass is not just a quality filter. It changes which findings survive.
A model reviewing code on its own will flag things that feel like problems. A separate model, prompted to actively look for reasons the finding is wrong, catches a different class of failure. It asks: does this actually matter in production? Is the context the first model had complete? Is the proposed fix actually safe?
Two findings were killed in verification this way. Not because they were hallucinated, but because they did not hold up under pressure from a model that was actively trying to break them. That is exactly what you want.
Parallel execution at this scale is fast
62 agents sounds slow. It is not. The workflow tool caps concurrent execution and queues the rest, so a free slot is refilled as soon as an agent finishes and nothing runs strictly one after another. Multiple finders run in parallel. Verification for finding A runs while finding B is still being discovered.
Wall-clock time is roughly the slowest single agent chain, not the sum of all agents, as long as the concurrency cap is not the bottleneck. In practice this means a 62-agent review takes a fraction of the time a sequential review would.
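A small simulation shows why a capped pool still finishes well ahead of a sequential run. This is a sketch under assumptions: `run_agent` and `run_all` are hypothetical names, the agents are faked with `asyncio.sleep`, and the cap is enforced with a plain `asyncio.Semaphore` rather than whatever the workflow tool does internally.

```python
import asyncio
import time


async def run_agent(name: str, seconds: float, gate: asyncio.Semaphore) -> str:
    # The semaphore is the concurrency cap: agents beyond the cap wait here
    # until a slot frees up.
    async with gate:
        await asyncio.sleep(seconds)  # stand-in for real agent work
        return name


async def run_all(durations: dict[str, float], cap: int) -> tuple[list[str], float]:
    gate = asyncio.Semaphore(cap)
    start = time.perf_counter()
    # gather() returns results in the order the agents were submitted.
    done = await asyncio.gather(
        *(run_agent(name, secs, gate) for name, secs in durations.items())
    )
    return list(done), time.perf_counter() - start


if __name__ == "__main__":
    durations = {f"agent-{i}": 0.1 for i in range(6)}
    names, elapsed = asyncio.run(run_all(durations, cap=3))
    print(f"{len(names)} agents, {elapsed:.2f}s wall clock, "
          f"{sum(durations.values()):.2f}s if run sequentially")
```

With six equal agents and a cap of three, the work runs in two waves, so the elapsed time lands near twice one agent's duration instead of six times.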
The token cost is real. Millions of tokens in one session. That is the trade-off. You are buying thoroughness and confidence, not speed alone.
What Fable 5 did differently
A few observations from today's run specifically.
The model structured the work without being told how. When I said "review this site for conversion improvements, you can use dynamic workflow," it chose the fan-out and adversarial-verify pattern on its own. It assigned the right agents to the right phases. It did not get lost in its own context when synthesising across 62 agent outputs.
The handoff quality was also noticeably good. After the review, I asked it to write implementation instructions for the next engineer picking up the work. It produced a fully self-contained brief: exact file paths, line numbers, constraints, things not to touch, and definitions of done. No context from the session was assumed. A new agent could walk in cold and execute.
And then that new agent did. Same session, different sub-agent, all four work items built and deployed before the session ended.
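The brief is easiest to picture as a small data structure. The sketch below is my reconstruction of its shape, not the model's actual output: `WorkItem`, its field names, `missing_fields` and `render` are all hypothetical, chosen to mirror the ingredients listed above (file paths, line numbers, constraints, things not to touch, definitions of done).

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    title: str
    file_path: str
    lines: tuple[int, int]  # first and last line the change applies to
    constraints: list[str] = field(default_factory=list)
    do_not_touch: list[str] = field(default_factory=list)
    definition_of_done: list[str] = field(default_factory=list)


def missing_fields(item: WorkItem) -> list[str]:
    """Name the fields a cold-start agent would need but that are empty."""
    missing = []
    if not item.file_path:
        missing.append("file_path")
    if not item.definition_of_done:
        missing.append("definition_of_done")
    return missing


def render(item: WorkItem) -> str:
    """Flatten one work item into a self-contained text brief."""
    start, end = item.lines
    parts = [
        f"Task: {item.title}",
        f"Where: {item.file_path}:{start}-{end}",
        "Constraints: " + "; ".join(item.constraints),
        "Do not touch: " + "; ".join(item.do_not_touch),
        "Done when: " + "; ".join(item.definition_of_done),
    ]
    return "\n".join(parts)
```

Checking `missing_fields` before handing off is a cheap way to catch a brief that still leans on session context.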
The fun numbers
Some things that stand out from a day of running this:
- The single orchestrating model held coherent context across the outputs of 62 sub-agents without losing the thread.
- The verification phase cut the raw findings down to 52 verified ones in a single pass, with a rejection rate low enough that you trust what remains.
- Four separate work items were implemented, tested (tsc, 54 unit tests, full build), and deployed to production in the same session. The model ran the gate checks itself after each commit.
- The total elapsed time from "review this site" to "production deployment confirmed ready" was one conversation session.
- At no point did I write any code manually or run any commands myself.
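The gate checks after each commit follow a simple pattern: run every check in order and stop at the first failure. Here is a minimal sketch of that pattern. `run_gates` is a hypothetical helper, and the demo gates are stand-in Python one-liners so the example runs anywhere; I do not know the exact commands the model used. In a TypeScript project the type-check gate might be something like `["npx", "tsc", "--noEmit"]`, which is an assumption about the setup.

```python
import subprocess
import sys


def run_gates(gates: list[tuple[str, list[str]]]) -> tuple[bool, list[str]]:
    """Run each (name, command) gate in order; stop at the first failure."""
    log = []
    for name, cmd in gates:
        result = subprocess.run(cmd, capture_output=True, text=True)
        ok = result.returncode == 0
        log.append(f"{name}: {'pass' if ok else 'fail'}")
        if not ok:
            # Later gates are skipped: a failed type check makes the
            # test and build results meaningless anyway.
            return False, log
    return True, log


# Stand-in gates (hypothetical); swap in the project's real commands.
DEMO_GATES = [
    ("typecheck", [sys.executable, "-c", "print('types ok')"]),
    ("unit-tests", [sys.executable, "-c", "print('tests ok')"]),
    ("build", [sys.executable, "-c", "print('build ok')"]),
]

if __name__ == "__main__":
    passed, log = run_gates(DEMO_GATES)
    print("\n".join(log))
    print("safe to deploy" if passed else "blocked")
```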
Where this fits in the stack
I already wrote about the multi-agent setup I run day to day: Reidar as orchestrator, specialised local agents underneath, pipelines for repeating work. That stack handles volume and reliability.
What today's test shows is a different mode: high-stakes, one-shot, resource-heavy review. You would not run 62 agents every day. You run it when you need thorough, confident analysis of something that matters. A product before a launch. A codebase before a refactor. A site before a growth push.
The economics only make sense if the output quality justifies the token spend. In this case it did. 52 verified findings, four of them immediately shippable and now live, is a real output.
What I would watch
Two things I am paying attention to as I use Fable 5 more:
Context coherence at scale. Synthesising across dozens of agent outputs while keeping constraints straight is hard. Today it held up. I want to see if that holds across longer sessions and larger agent pools.
Self-delegation quality. The model choosing which sub-agents to spawn, with what prompts, is the hardest part to evaluate. The output quality depends entirely on the quality of those choices. Today they were good. I am curious how it handles domains it knows less well.
The bottom line
Fable 5 handles multi-agent orchestration well. The adversarial verification pattern works at this scale. Millions of tokens in one session is a real cost, but in this run the output was more reliable than what I would expect from a single-pass review.
If you are already running workflows with Claude, it is worth switching to Fable 5 for the orchestrator layer. The coherence improvement alone is noticeable.
More as I run it through more tests. Follow @HeVeMi99 for updates.