The AI community is building multi-agent systems — teams of LLMs that divide work, communicate, and collaborate to solve problems no single model can handle. A 'researcher' agent gathers information, a 'coder' agent writes the implementation, a 'reviewer' agent checks quality, a 'coordinator' agent manages the workflow. The pitch is compelling: specialized agents collaborating like a well-run engineering team.
If this sounds familiar, it should. Distributed systems researchers have been studying how to coordinate independent processes for 40 years. Multi-agent LLM systems are rediscovering, from first principles, problems that the distributed systems community solved (or proved unsolvable) decades ago. Understanding these parallels saves you from reinventing solutions — and from building systems that fail in predictable, well-understood ways.
The Coordination Tax
The first lesson of distributed systems: coordination has overhead. Two processes working independently on separate tasks are roughly twice as fast as one. Two processes working on the same task, needing to coordinate, are often slower than one — because the communication and synchronization overhead exceeds the parallelism benefit.
Multi-agent LLM systems hit this immediately. A 'researcher' agent produces a report. The 'coder' agent needs to understand the report to write code. The 'reviewer' agent needs to understand both the report and the code to give useful feedback. Each handoff requires serializing context into a prompt, which the receiving agent must parse and understand. This is exactly the distributed systems problem of shared state — and it gets worse as you add agents.
Single agent vs. multi-agent for a coding task:
Single agent:
1. Understand the problem (1 LLM call)
2. Research relevant APIs (2 LLM calls)
3. Write implementation (1 LLM call)
4. Review and fix (1 LLM call)
Total: 5 LLM calls, full context throughout
Multi-agent (researcher + coder + reviewer):
1. Coordinator describes task (1 LLM call)
2. Researcher reads and plans (1 LLM call)
3. Researcher does research (3 LLM calls)
4. Researcher summarizes findings (1 LLM call)
5. Coder reads summary (1 LLM call) ← context lost here
6. Coder writes implementation (1 LLM call)
7. Reviewer reads code + summary (1 LLM call) ← context lost here
8. Reviewer provides feedback (1 LLM call)
9. Coordinator synthesizes (1 LLM call)
Total: 11 LLM calls, context degraded at each handoff
More agents = more communication = more cost = worse context.
In distributed systems, this is Amdahl's Law applied to communication: the speedup from parallelism is limited by the sequential communication that can't be parallelized. Adding more agents to a task that requires tight coordination makes the system slower, not faster.
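The effect of a per-agent coordination cost is easy to see numerically. Below is a small sketch of Amdahl's Law with an added serial communication term; the linear-in-workers overhead model is an illustrative assumption of mine, not something the article specifies.

```python
def speedup(parallel_fraction: float, n_workers: int, comm_overhead: float = 0.0) -> float:
    """Amdahl's Law with a communication term added to the serial part.

    parallel_fraction: share of the work that parallelizes (0..1).
    comm_overhead: extra serial coordination cost per worker, as a fraction
    of the single-worker runtime. (Illustrative model, not a standard formula.)
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers + comm_overhead * n_workers)

# A task that is 50% parallelizable with a 5% coordination cost per agent:
# speedup peaks early, then drops below 1.0 as agents are added.
for n in (1, 2, 4, 8):
    print(n, round(speedup(0.5, n, comm_overhead=0.05), 2))
```

With these numbers, two agents help slightly, but by eight agents the coordination term dominates and the "team" is barely faster than a single worker — the same shape as the 5-call vs. 11-call comparison above.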
The Consensus Problem
When multiple agents work on the same task, they need to agree on things. What are the requirements? What approach should we take? Is this implementation correct? In distributed systems, this is the consensus problem, and it's provably hard — the FLP impossibility result shows that deterministic consensus is impossible in an asynchronous system with even one faulty process.
LLM agents are worse than traditional distributed processes for consensus because they're non-deterministic. Ask the same agent the same question twice and you might get different answers. Two agents reviewing the same code might disagree on whether it's correct. A 'coordinator' agent asked to resolve the disagreement might flip a coin.
Multi-agent frameworks typically handle this by designating a coordinator agent with final authority (a centralized consensus model — simple but creates a single point of failure) or by majority voting (expensive — you need at least three agents per decision). Both approaches work, but they're solving a problem that only exists because you split the work across multiple agents in the first place.
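The majority-voting approach can be sketched in a few lines. The `agent_verdict` function below is a hypothetical stand-in for an LLM reviewer call; its randomness simulates the non-determinism described above.

```python
import random
from collections import Counter

def agent_verdict(agent_id: int, code: str) -> str:
    """Stand-in for an LLM reviewer call (hypothetical). Real agents are
    non-deterministic; here randomness simulates their disagreement."""
    return random.choice(["correct", "correct", "incorrect"])

def majority_vote(code: str, n_agents: int = 3) -> str:
    """Centralized tally of independent verdicts. Use an odd n_agents
    so a binary question cannot tie."""
    votes = Counter(agent_verdict(i, code) for i in range(n_agents))
    verdict, _count = votes.most_common(1)[0]
    return verdict

print(majority_vote("def add(a, b): return a + b", n_agents=3))
```

Note the cost structure: every decision now takes `n_agents` LLM calls, which is exactly the "at least three agents per decision" expense mentioned above.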
Failure Modes That Should Be Familiar
Multi-agent systems fail in ways that distributed systems engineers will immediately recognize.
- Cascading failures. Agent A produces bad output. Agent B, working from A's output, produces worse output. Agent C, reviewing B's work, doesn't catch the original error because it inherited the wrong assumptions. This is the distributed systems equivalent of error propagation — garbage in, garbage out, amplified at each stage.
- Deadlocks. Agent A waits for Agent B's output to proceed. Agent B waits for Agent A's feedback to proceed. Neither can make progress. In practice, this manifests as infinite loops where agents pass work back and forth without converging.
- Split brain. Two agents working on related tasks develop inconsistent views of the problem. One assumes the API returns JSON; the other assumes XML. Their outputs are individually correct but mutually incompatible.
- Thundering herd. A coordinator agent dispatches work to multiple agents simultaneously. All of them hit the same API, exhaust the same rate limit, or attempt to modify the same file. Resource contention that a single agent would never encounter.
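The deadlock failure mode above is usually broken the same way distributed systems break it: with a bounded budget and an explicit escalation path. Here is a minimal sketch; `approve` and `revise` are hypothetical stubs standing in for reviewer and coder LLM calls.

```python
def approve(draft: str):
    """Hypothetical reviewer stub: accepts once the draft has a docstring,
    otherwise returns feedback. A real version would be an LLM call."""
    return None if '"""' in draft else "add a docstring"

def revise(draft: str, feedback: str) -> str:
    """Hypothetical coder stub: naively applies the feedback."""
    return draft + '\n"""docs"""'

def review_loop(draft: str, max_rounds: int = 3) -> str:
    """Bounded coder<->reviewer loop. The hop budget guarantees the two
    agents cannot pass work back and forth forever."""
    for _round in range(max_rounds):
        feedback = approve(draft)
        if feedback is None:  # reviewer accepted; loop terminates
            return draft
        draft = revise(draft, feedback)
    # Budget exhausted: escalate to a human instead of looping forever.
    raise TimeoutError(f"no convergence after {max_rounds} rounds")
```

The key design choice is that non-convergence is a loud, observable failure (an exception to escalate) rather than a silent infinite loop.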
When Multi-Agent Actually Helps
The parallel to distributed systems cuts both ways. Distributed systems have real benefits for specific problems. Multi-agent systems do too — when used for problems that genuinely benefit from decomposition.
Embarrassingly parallel tasks. If you need to analyze 50 codebases for the same pattern, 50 agents working independently are genuinely 50x faster. No coordination needed — each agent works on a separate input and produces an independent output. This is the MapReduce pattern, and it works as well for LLM agents as it does for distributed data processing.
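The fan-out pattern for independent inputs needs no agent framework at all; Python's standard `concurrent.futures` is enough. `scan_codebase` below is a placeholder for one agent's work on one input.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_codebase(path: str) -> dict:
    """Stand-in for one agent scanning one codebase for a pattern.
    Each call is independent: no shared state, no coordination.
    A real version would make LLM and tool calls here."""
    return {"path": path, "matches": path.count("repo")}

codebases = [f"repo-{i}" for i in range(50)]

# Map step: fan out across workers; each result is produced independently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scan_codebase, codebases))

# Reduce step: aggregate the independent outputs.
total = sum(r["matches"] for r in results)
```

Because no agent reads another agent's output, there are no handoffs to lose context in — the coordination tax from the first section never applies.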
Diverse perspectives. Asking three agents with different system prompts to review the same code can surface issues that a single review misses. This is the redundancy pattern — like having multiple reviewers on a PR. The cost is 3x the compute, but the coverage is broader.
Specialization with clean interfaces. A translation agent that takes text and returns translated text has a clean interface. A summarization agent that takes a document and returns a summary has a clean interface. Composing these — translate, then summarize — works well because the interface between them is simple. The distributed systems equivalent: microservices with well-defined APIs work better than microservices with chatty, complex interfaces.
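The translate-then-summarize composition works precisely because both agents share the same narrow interface (text in, text out). A sketch, with placeholder stubs where the LLM calls would go:

```python
from typing import Callable

def translate(text: str) -> str:
    """Stub for a translation agent (text -> translated text).
    Placeholder logic; a real version would call an LLM."""
    return text.upper()

def summarize(text: str) -> str:
    """Stub for a summarization agent (document -> summary).
    Placeholder logic; a real version would call an LLM."""
    return text[:40]

def compose(*stages: Callable[[str], str]) -> Callable[[str], str]:
    """Chain agents whose interfaces line up: each stage's output
    becomes the next stage's input."""
    def pipeline(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text
    return pipeline

translate_then_summarize = compose(translate, summarize)
```

When the interface is this simple, adding or reordering stages is cheap — the opposite of the "chatty, complex interfaces" case, where every new agent multiplies the handoff surface.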
Lessons from Distributed Systems
If you're building multi-agent LLM systems, decades of distributed systems wisdom applies directly.
- Prefer fewer, more capable agents over many specialized ones. Just as a well-designed monolith outperforms a poorly designed microservice architecture, a single capable agent with good prompting outperforms a team of narrow agents for most tasks. Add agents only when you've proven that a single agent can't handle the workload.
- Define clear interfaces between agents. The input and output of each agent should be well-defined. Vague handoffs ('figure out what the researcher found and implement it') lead to context loss and misinterpretation. Structured data contracts between agents work better than freeform text.
- Make agents idempotent. If an agent fails midway, you should be able to restart it with the same input and get a correct result. This requires agents that don't have hidden state and don't cause side effects until they've confirmed their output is correct.
- Add observability. Log every inter-agent message, every decision, every failure. When a multi-agent system produces wrong output, you need to trace the error back to which agent made the wrong decision and why. Without observability, debugging is guesswork.
- Design for partial failure. Any agent can fail, produce garbage, or time out. The system should handle this gracefully — retry, fall back to a simpler approach, or ask a human. Never assume all agents will succeed.
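Several of these lessons — structured contracts, idempotent agents, logged handoffs, retry with fallback — can be illustrated in one small pipeline. The agents, field names, and schema below are all hypothetical stand-ins, not a standard API.

```python
import json
import logging
from dataclasses import dataclass, asdict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agents")

# Lesson: a structured data contract between agents, not freeform text.
# The fields pin down exactly what the coder needs (illustrative schema).
@dataclass(frozen=True)
class ResearchFindings:
    api_name: str
    response_format: str  # e.g. "json" -- stated explicitly so agents can't diverge
    notes: str

def research(task: str) -> ResearchFindings:
    """Hypothetical researcher agent. A pure function of its input,
    so restarting it with the same input is safe (idempotence)."""
    return ResearchFindings(api_name="payments", response_format="json",
                            notes=f"researched: {task}")

def implement(findings: ResearchFindings) -> str:
    """Hypothetical coder agent consuming the structured contract."""
    return f"client = Client('{findings.api_name}', format='{findings.response_format}')"

def run_pipeline(task: str, max_attempts: int = 2) -> str:
    """Lessons: observability (log every handoff) and partial failure
    (bounded retries, then an explicit fallback)."""
    for attempt in range(1, max_attempts + 1):
        try:
            findings = research(task)
            log.info("handoff researcher->coder: %s", json.dumps(asdict(findings)))
            return implement(findings)
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
    return f"# fallback: could not complete '{task}' automatically"
```

The logged handoff is the debugging payoff: when the final code is wrong, the serialized `ResearchFindings` in the log shows whether the researcher or the coder introduced the error.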
The Uncomfortable Truth
Most multi-agent LLM systems would work better as a single agent with a good prompt. The coordination overhead, context loss, and failure modes of multi-agent architectures almost always outweigh the benefits for tasks that a single model can handle. As AI tooling evolves, the temptation to build elaborate multi-agent systems grows, but the distributed systems lesson is clear: don't distribute what doesn't need to be distributed.
The exceptions are real — embarrassingly parallel tasks, genuine specialization with clean interfaces, and problems that exceed a single model's context window. For everything else, a single agent with a well-structured prompt, good tools, and a thoughtful retry strategy will outperform a team of agents. It's less architecturally exciting, but it works better. The best distributed system is the one you didn't have to build.