A new model drops. The obvious move.

Anthropic released Claude Fable 5 today. I have been running the Sonnet line for day-to-day orchestration work, so whenever a new model lands the question is always the same: does it handle complexity better, and does it know when to delegate?

The test I use is a multi-agent workflow review. Give the model a production site, a clear goal, and explicit permission to spawn as many sub-agents as it needs. Then watch the output.

Today it ran 62 agents in a single session.

62 agents spawned
52 verified findings
2 rejected as false
4 priority tiers

How the workflow is structured

The model was not given a list of things to check. It was given a goal: increase conversion. Everything else it figured out.

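The pattern it chose: fan out finder agents in parallel, then hand each finding to a separate verifier prompted to look for reasons it is wrong.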

The result was a prioritised plan: fix-now bugs, week-1 improvements, week-2 experiments, long-term bets. Each item independently shippable.
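To make the shape of that plan concrete, here is a minimal sketch of grouping work items into the four tiers. The tier names come from the plan above; the `Tier` enum, the `build_plan` helper, and the placeholder item titles are hypothetical and not taken from the session.

```python
from collections import defaultdict
from enum import IntEnum


class Tier(IntEnum):
    # Ordered by urgency, mirroring the four tiers in the plan.
    FIX_NOW = 0
    WEEK_1 = 1
    WEEK_2 = 2
    LONG_TERM = 3


def build_plan(items):
    """Group (title, tier) pairs by tier, most urgent tier first."""
    grouped = defaultdict(list)
    for title, tier in items:
        grouped[tier].append(title)
    # Iterating over Tier yields members in definition order.
    return {tier: grouped[tier] for tier in Tier if tier in grouped}


# Placeholder items, for illustration only.
plan = build_plan([
    ("placeholder experiment", Tier.WEEK_2),
    ("placeholder bug", Tier.FIX_NOW),
    ("placeholder bet", Tier.LONG_TERM),
])
for tier, titles in plan.items():
    print(tier.name, titles)
```

Keeping each item as its own entry, rather than nesting items under each other, is what lets every one ship independently.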

52 verified findings. 2 rejected. That is a false-positive rate under 4% among the findings that reached verification, and both false positives were caught before they reached the plan. You do not get that from a single-pass review, no matter how good the model is.
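A quick check of that rate, assuming the 52 verified and 2 rejected findings are the full set of candidates that reached verification. Note that the rate is counted over findings, not over the 62 agents.

```python
verified = 52
rejected = 2

candidates = verified + rejected        # findings that reached verification
rejection_rate = rejected / candidates  # share killed by the adversarial pass

print(f"{candidates} candidates, {rejection_rate:.1%} rejected")
# prints "54 candidates, 3.7% rejected"
```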

What the adversarial pass actually catches

This is the part that surprised me even knowing how it works. The adversarial pass is not just a quality filter. It changes which findings survive.

A model reviewing code alone will flag things that feel like problems. A separate model, prompted to actively look for reasons the finding is wrong, catches a different class of failure. It asks: does this actually matter in production? Is the context the first model had complete? Is the proposed fix actually safe?

Two findings were killed in verification this way. Not because they were hallucinated, but because they did not hold up under pressure from a model that was actively trying to break them. That is exactly what you want.
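A minimal sketch of what such a pass can look like in code. The three challenge questions are the ones listed above; the `Finding` type, the `Verifier` signature, and `stub_verifier` are hypothetical stand-ins, and in a real run the verifier would be a separate model call rather than a local function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# The challenges a verifier puts to every finding.
CHALLENGES = (
    "Does this actually matter in production?",
    "Was the context the finder had complete?",
    "Is the proposed fix actually safe?",
)


@dataclass(frozen=True)
class Finding:
    title: str
    evidence: str


# A verifier returns an objection string if the finding fails the
# challenge, or None if the finding survives it.
Verifier = Callable[[Finding, str], Optional[str]]


def adversarial_pass(
    findings: List[Finding], verifier: Verifier
) -> Tuple[List[Finding], List[Tuple[Finding, List[str]]]]:
    """Split findings into survivors and rejects with their objections."""
    survived, rejected = [], []
    for finding in findings:
        objections = [
            objection
            for challenge in CHALLENGES
            if (objection := verifier(finding, challenge)) is not None
        ]
        if objections:
            rejected.append((finding, objections))
        else:
            survived.append(finding)
    return survived, rejected


def stub_verifier(finding: Finding, challenge: str) -> Optional[str]:
    # Stand-in for a model call: objects only when no evidence was given.
    if challenge == CHALLENGES[1] and not finding.evidence:
        return "no evidence supplied, context may be incomplete"
    return None


survived, rejected = adversarial_pass(
    [Finding("placeholder A", "some evidence"), Finding("placeholder B", "")],
    stub_verifier,
)
print(len(survived), "survived,", len(rejected), "rejected")
```

The point of the design is that a finding has to survive every challenge, so a single well-founded objection is enough to keep it out of the plan.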

Parallel execution at this scale is fast

62 agents sounds slow. It is not. The workflow tool caps concurrent execution and queues the rest, so agents run side by side up to the cap instead of one after another. Multiple finders run in parallel. Verification for finding A runs while finding B is still being discovered.

Wall-clock time is roughly that of the slowest single agent chain, not the sum of all agents. In practice, a 62-agent review takes a fraction of the time a sequential review would.
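The cap-and-queue behaviour can be sketched with `asyncio.Semaphore`. This is an illustration under stated assumptions, not the workflow tool's implementation: `run_agent` and `run_all` are hypothetical names, the sleep stands in for a model call, and the durations and the cap of 4 are made up.

```python
import asyncio
import time


async def run_agent(seconds: float, cap: asyncio.Semaphore) -> None:
    # The semaphore admits at most `max_concurrent` agents at once;
    # the rest wait here, which is the queue.
    async with cap:
        await asyncio.sleep(seconds)  # stand-in for the agent's work


async def run_all(durations, max_concurrent: int) -> float:
    """Run every agent under the cap and return elapsed wall-clock seconds."""
    cap = asyncio.Semaphore(max_concurrent)
    start = time.perf_counter()
    await asyncio.gather(*(run_agent(d, cap) for d in durations))
    return time.perf_counter() - start


durations = [0.05] * 12  # twelve agents, 0.6 s of work in total
elapsed = asyncio.run(run_all(durations, max_concurrent=4))
print(f"sequential: {sum(durations):.2f}s, capped parallel: {elapsed:.2f}s")
```

With twelve equal agents and a cap of four, the work runs in three waves, so elapsed time lands near a quarter of the sequential total. The "slowest chain" approximation holds when the cap is generous relative to the number of agents; a tight cap pushes wall-clock time back up towards total work divided by the cap.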

The token cost is real. Millions of tokens in one session. That is the trade-off. You are buying thoroughness and confidence, not speed alone.

What Fable 5 did differently

A few observations from today's run specifically.

The model structured the work without being told how. When I said "review this site for conversion improvements, you can use dynamic workflow," it chose the fan-out and adversarial-verify pattern on its own. It assigned the right agents to the right phases. It did not get lost in its own context when synthesising across 62 agent outputs.

The handoff quality was also noticeably good. After the review, I asked it to write implementation instructions for the next engineer picking up the work. It produced a fully self-contained brief: exact file paths, line numbers, constraints, things not to touch, and definitions of done. No context from the session was assumed. A new agent could walk in cold and execute.
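One way to picture such a brief is as a record where every field a cold-start reader needs is explicit. The field list follows the description above; the `WorkItem` class, the `missing_fields` check, and all the values are hypothetical placeholders, not the brief the model wrote.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkItem:
    title: str
    file_path: str
    line_numbers: Tuple[int, int]  # first and last line of the change
    constraints: List[str]
    do_not_touch: List[str]
    definition_of_done: List[str]

    def missing_fields(self) -> List[str]:
        """Names of fields left empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


# Placeholder values, for illustration only.
item = WorkItem(
    title="placeholder item",
    file_path="path/to/file",
    line_numbers=(10, 20),
    constraints=["placeholder constraint"],
    do_not_touch=[],
    definition_of_done=["placeholder check"],
)
print(item.missing_fields())  # prints "['do_not_touch']"
```

A check like this is a cheap gate before handing the brief to a fresh agent: anything it reports is context the next reader would otherwise have to guess.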

And then that new agent did. Same session, different sub-agent, all four work items built and deployed before the session ended.

Review, plan, implement, deploy. All in one session. The model handed off to itself.

The fun numbers

Some things that stand out from a day of running this:

Where this fits in the stack

I already wrote about the multi-agent setup I run day to day: Reidar as orchestrator, specialised local agents underneath, pipelines for repeating work. That stack handles volume and reliability.

What today's test shows is a different mode: high-stakes, one-shot, resource-heavy review. You would not run 62 agents every day. You run it when you need thorough, confident analysis of something that matters. A product before a launch. A codebase before a refactor. A site before a growth push.

The economics only make sense if the output quality justifies the token spend. In this case it did. 52 verified findings, four of them immediately shippable and now live, is a real output.

What I would watch

Two things I am paying attention to as I use Fable 5 more:

Context coherence at scale. Synthesising across dozens of agent outputs while keeping constraints straight is hard. Today it held up. I want to see if that holds across longer sessions and larger agent pools.

Self-delegation quality. The model choosing which sub-agents to spawn, with what prompts, is the hardest part to evaluate. The output quality depends entirely on the quality of those choices. Today they were good. I am curious how it handles domains it knows less well.

The bottom line

Fable 5 handles multi-agent orchestration well. The adversarial verification pattern works at this scale. Millions of tokens in one session is a real cost, but the output is more reliable than a single-pass review, and in this run that justified the spend.

If you are already running workflows with Claude, it is worth switching to Fable 5 for the orchestrator layer. The coherence improvement alone is noticeable.

More as I run it through more tests. Follow @HeVeMi99 for updates.