The problem with most AI setups
Everyone is building chatbots. One model, one prompt, one hope. Fine for simple tasks. But the moment you ask it to do something complex, such as researching a topic, writing production code, or reviewing its own output, the quality falls off a cliff.
The reason is simple. You are asking one model to be a researcher, a coder, a reviewer, and a manager at the same time. No human team works that way. Neither should an AI.
The solution: an org chart
Treat the AI setup exactly like a company. Each agent has a defined role, a model suited to that role, and a clear reporting line. Reasoning-heavy roles get heavier models; fast, mechanical roles get lighter ones.
The actual org chart in use:
I talk to Reidar. Reidar delegates to Bent and Steinar. They delegate down further. Nothing reaches me unless it has passed through the chain.
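The reporting lines can be sketched as plain data. Everything below is illustrative: the agent names come from this post, but the dictionary layout, the reporting links below the two leads, and the helper function are my own, not part of any framework.

```python
# Org chart as data. Agent names are from the post; the reports_to links
# for Harald and Sigrid are assumptions made for illustration.
ORG_CHART = {
    "Reidar":  {"role": "orchestrator",  "model": "claude-sonnet-4.6", "reports_to": None},
    "Bent":    {"role": "research lead", "model": "qwen3.5:35b",       "reports_to": "Reidar"},
    "Steinar": {"role": "QA lead",       "model": "qwen3.5:35b",       "reports_to": "Reidar"},
    "Harald":  {"role": "coder",         "model": "qwen2.5-coder:32b", "reports_to": "Bent"},
    "Sigrid":  {"role": "coder",         "model": "qwen2.5-coder:32b", "reports_to": "Steinar"},
}

def chain_of_command(agent: str) -> list[str]:
    """Walk the reporting line from an agent up to the orchestrator."""
    chain = [agent]
    while ORG_CHART[chain[-1]]["reports_to"] is not None:
        chain.append(ORG_CHART[chain[-1]]["reports_to"])
    return chain
```

Keeping the chart as data rather than hard-coded wiring makes "nothing reaches me unless it has passed through the chain" a checkable property instead of a convention.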
The hardware
Local models run on the desktop:
- CPU: AMD Ryzen 9950X3D
- GPU: NVIDIA RTX 5090 (32GB VRAM)
- RAM: 64GB
The orchestrator (Reidar) runs on Claude Sonnet 4.6 via API. Everything else runs locally on Ollama at localhost:11434, so agent tasks use zero cloud compute. The Claude API cost stays minimal: Reidar only orchestrates; the heavy lifting happens on local models.
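On the local side, a worker talking to Ollama is just an HTTP POST. A minimal sketch using Ollama's standard /api/chat endpoint; the helper names and defaults are my own:

```python
import json
import urllib.request

# Ollama's local REST endpoint, as used in this setup.
OLLAMA_URL = "http://localhost:11434/api/chat"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the JSON request Ollama's /api/chat expects."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask_local(model: str, prompt: str) -> str:
    """Send a prompt to a local model and return its reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["message"]["content"]
```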
The models and why
Claude Sonnet 4.6 for orchestration. Best reasoning at the orchestrator layer. It reads context, delegates correctly, and synthesises output from multiple agents into something coherent.
qwen3.5:35b for Bent and Steinar. A Mixture-of-Experts (MoE) model that fits in VRAM, handles complex reasoning, and follows structured instructions reliably.
qwen2.5-coder:32b for Harald and Sigrid. Purpose-built for code. Significantly better than a general model when you need Python, PowerShell, or Bash to be correct.
The pipelines
Every repeating task has a pipeline. These run through Agno, a production-first agent framework chosen over CrewAI for its better runtime handling.
Permanent agents on the org chart handle orchestration and oversight. The actual work in pipelines is done by ephemeral workers that spin up on demand and disappear. No memory, no persona, just a model with a specific prompt and a job to finish.
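An ephemeral worker can be as small as a function: a role prompt, a task, a model call, and nothing kept afterwards. A sketch of that shape; the function and parameter names are assumptions, not Agno's API:

```python
from typing import Callable

def run_ephemeral(role_prompt: str, task: str,
                  call_model: Callable[[str], str]) -> str:
    """Spin up a worker for one job: no memory, no persona object.

    `call_model` stands in for whatever client the pipeline uses
    (e.g. a local Ollama call). Nothing is stored once this returns.
    """
    prompt = f"{role_prompt}\n\nTask: {task}"
    return call_model(prompt)
```

The permanent agents keep state and identity; these workers deliberately have neither, which is what makes them cheap to spin up and throw away.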
Research pipeline
Three ephemeral workers: Researcher gathers sources, Analyst synthesises findings, Writer produces the output. In: a topic. Out: a formatted report posted to Discord. Time: 50 seconds.
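That three-step chain is just a fold: each worker's output becomes the next worker's input. A toy sketch, with made-up role prompts standing in for the production ones:

```python
def run_pipeline(topic, steps, call_model):
    """Feed each worker's output to the next worker in the chain."""
    result = topic
    for role_prompt in steps:
        result = call_model(f"{role_prompt}\n\nInput:\n{result}")
    return result

# Illustrative role prompts, not the real ones from this setup.
RESEARCH_STEPS = [
    "You are the Researcher. Gather sources on the input topic.",
    "You are the Analyst. Synthesise the findings below.",
    "You are the Writer. Produce a formatted report from the analysis.",
]
```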
Code review pipeline
Four steps: Planner breaks down the task, Coder writes the implementation, Reviewer checks logic and edge cases, Steinar (permanent QA) does the final pass. Nothing ships without clearing all four.
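The four-step gate can be modelled as a sequence of checks where any stage can veto. A sketch with stand-in stage functions; in the real pipeline each stage is a model call:

```python
def review_gate(artifact, stages):
    """Run an artifact through ordered review stages.

    Each stage returns (ok, updated_artifact). The artifact only
    ships if every stage passes; otherwise we report who rejected it.
    """
    for name, check in stages:
        ok, artifact = check(artifact)
        if not ok:
            return (False, name)  # rejected at this stage
    return (True, artifact)
```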
Content pipeline
Researcher, Drafter, Formatter. Blog posts, LinkedIn, Discord announcements. Same ephemeral chain, different output format.
What surprised me
A dedicated QA agent improved quality more than expected. A single model reviewing its own code misses things it wrote confidently. A separate model with no context of the original intent catches different failure modes.
The other surprise: specialisation beats size. qwen2.5-coder:32b writes better Python than a general 70B model. Smaller, faster, better at the specific job.
What I would do differently
Start with two agents, not eight. One orchestrator, one worker. Get that working reliably before adding complexity. The full org chart on day one is a trap.
Also: write decisions to memory files immediately, not at the end of the session. If something breaks and you context-switch, that context is gone.
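That habit fits in a few lines: append each decision to a memory file the moment it is made, not at session end. The file name and line format here are my own choices:

```python
from datetime import datetime, timezone
from pathlib import Path

def log_decision(text: str, path: str = "memory/decisions.md") -> None:
    """Append a timestamped decision line immediately, never batched."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    with p.open("a", encoding="utf-8") as f:
        f.write(f"- [{stamp}] {text}\n")
```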
What is next
Building a hybrid memory architecture so agents can retrieve relevant context from past sessions without loading everything into the context window. Goal: agents that remember without ballooning token costs.
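To make the retrieval half of that idea concrete, here is a deliberately tiny sketch: score stored snippets against the current task and load only the top few into the context window. A real hybrid memory would use embeddings; plain word overlap just shows the shape:

```python
def retrieve(query: str, memory: list[str], k: int = 3) -> list[str]:
    """Return the k snippets sharing the most words with the query.

    Toy scoring: word overlap. The point is that only the top-k
    snippets enter the context window, not the whole history.
    """
    q = set(query.lower().split())
    scored = sorted(memory,
                    key=lambda m: len(q & set(m.lower().split())),
                    reverse=True)
    return scored[:k]
```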
More when it is done. Follow @HeVeMi99 for updates.