The Cost Nobody Talks About
When teams talk about the cost of AI-powered code generation, they usually focus on generation itself. But a new empirical study on token consumption in LLM Multi-Agent (LLM-MA) systems flips that assumption entirely.
Analyzing 30 software development tasks performed by the ChatDev framework with a GPT-5 reasoning model, the researchers mapped ChatDev's internal phases to standard SDLC stages: Design, Coding, Code Completion, Code Review, Testing, and Documentation. The results are striking.
Key Finding: Code Review Is the Hungriest Stage
The iterative Code Review stage accounts for 59.4% of total token consumption on average. That's not a rounding error — it's the dominant cost driver in agentic software engineering.
Worse, input tokens make up 53.9% of total consumption on average. Agents are spending more tokens reading existing code and context than producing new code. The bottleneck isn't generation; it's the repeated back-and-forth of automated refinement.
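To make the input-share arithmetic concrete, here is a minimal sketch. It assumes the 53.9% average applies to your workload and that the remaining tokens are billed at a separate rate. The function names and the per-million-token prices are hypothetical placeholders, not figures from the study or from any provider.

```python
INPUT_SHARE = 0.539  # average input-token share reported in the study


def total_to_non_input_ratio(input_share: float = INPUT_SHARE) -> float:
    """How many times larger the total token count is than the non-input count."""
    if not 0.0 <= input_share < 1.0:
        raise ValueError("input_share must be in [0, 1)")
    return 1.0 / (1.0 - input_share)


def cost_breakdown(
    total_tokens: int,
    price_in_per_m: float,   # hypothetical USD price per 1M input tokens
    price_out_per_m: float,  # hypothetical USD price per 1M non-input tokens
    input_share: float = INPUT_SHARE,
) -> dict[str, float]:
    """Split a total token count into input and non-input cost."""
    input_tokens = total_tokens * input_share
    other_tokens = total_tokens - input_tokens
    input_cost = input_tokens / 1_000_000 * price_in_per_m
    other_cost = other_tokens / 1_000_000 * price_out_per_m
    return {
        "input_cost": input_cost,
        "other_cost": other_cost,
        "total_cost": input_cost + other_cost,
    }


if __name__ == "__main__":
    # Token-count ratio only; it says nothing about dollars until prices are applied.
    print(round(total_to_non_input_ratio(), 2))
    print(cost_breakdown(1_000_000, price_in_per_m=1.0, price_out_per_m=1.0))
```

At a 53.9% input share, the total token count is roughly 2.17 times the non-input count. How that translates into dollars depends entirely on the input and output prices you plug in.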
What This Means for Builders
For technical founders and product-minded developers, these numbers carry real implications:
- Cost prediction is broken. Most teams estimate AI costs based on output tokens. But input-heavy stages like Code Review can silently inflate bills by 2–3×.
- Agent collaboration has hidden overhead. Multi-agent pipelines feel efficient because tasks parallelize. But each agent's iteration cycle compounds token usage in ways a single-agent setup doesn't.
- The real cost is verification, not creation. Automated refinement and quality checks are where agentic engineering burns budget. This suggests the biggest gains come from optimizing review loops, not generation prompts.
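The compounding effect of review iterations can be illustrated with a toy model. It assumes that every review round re-reads the full code plus all feedback accumulated in earlier rounds. This is a simplification for illustration, not a description of how ChatDev actually manages context, and all names and numbers are hypothetical.

```python
def review_round_input_tokens(
    code_tokens: int, feedback_tokens: int, rounds: int
) -> list[int]:
    """Input tokens read in each review round under the toy model.

    Assumption: round r re-reads the whole code (code_tokens) plus the
    feedback from all r earlier rounds (feedback_tokens each).
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return [code_tokens + r * feedback_tokens for r in range(rounds)]


def total_review_input_tokens(
    code_tokens: int, feedback_tokens: int, rounds: int
) -> int:
    """Cumulative input tokens across all review rounds."""
    return sum(review_round_input_tokens(code_tokens, feedback_tokens, rounds))


if __name__ == "__main__":
    # Hypothetical sizes: a 2,000-token code base, 300 tokens of feedback per round.
    for n in (1, 3, 5):
        print(n, total_review_input_tokens(2_000, 300, n))
```

Under these assumptions, cumulative input grows faster than linearly with the number of rounds, because each round carries the history of the previous ones. Capping rounds or trimming the history passed to reviewers are the obvious levers to test.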
Token Distribution by Stage
| Stage | Token Share |
|---|---|
| Code Review | 59.4% |
| Coding + Completion | ~25% |
| Testing + Documentation | ~15% |
| Design | <1% |
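The shares in the table can be turned into a rough per-stage token budget. The sketch below treats the approximate table values as exact and sets Design to 0.6%, which is simply the remainder after the other three rows. That value is an assumption, since the table only says "<1%".

```python
# Shares taken from the table above; "~" values are treated as exact here.
STAGE_SHARES = {
    "Code Review": 0.594,
    "Coding + Completion": 0.25,
    "Testing + Documentation": 0.15,
    "Design": 0.006,  # assumption: remainder, the table only states "<1%"
}


def tokens_by_stage(total_tokens: int) -> dict[str, int]:
    """Allocate a total token budget across stages using the table's shares."""
    return {
        stage: round(total_tokens * share)
        for stage, share in STAGE_SHARES.items()
    }


def dominant_stage(total_tokens: int) -> str:
    """Return the stage that consumes the most tokens."""
    budget = tokens_by_stage(total_tokens)
    return max(budget, key=budget.get)


if __name__ == "__main__":
    for stage, tokens in tokens_by_stage(1_000_000).items():
        print(f"{stage:<26}{tokens:>10,}")
```

For a hypothetical run of one million tokens, this allocates about 594,000 tokens to Code Review alone, which is why review loops are the first place to look when trimming spend.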
Practical Takeaways
The Bigger Picture
This study is preliminary — 30 tasks in one framework (ChatDev) with one model (GPT-5). But the methodology is sound and the direction is clear: agentic software engineering's cost structure is fundamentally different from single-shot AI coding.
The next frontier isn't faster models. It's smarter collaboration protocols that reduce unnecessary review cycles. Teams that understand this will build more cost-efficient AI systems — and more importantly, they'll know where to actually optimize.
Source: Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering — arXiv:2601.14470