The Inference Plateau
We've optimized KV caches. We've compiled kernels for our GPUs. But inference latency and throughput still hit a wall. The culprit? Inefficient token batching.
As models grow and inference workloads diversify, how well you batch requests determines whether your inference cluster runs hot or cold. A modest improvement in batching strategy can yield an outsized throughput gain, on the order of 30–40% in favorable cases, without touching hardware.
Why Batching Matters Now
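In the prefill phase, GPUs love work: the whole prompt is processed in parallel, and batching 32 requests together keeps compute units saturated. The decode phase is different. Each request produces one token per step, and every step has to wait for the previous one. A single request, or a small batch, can starve the GPU: the model weights are read from memory on every step, yet only a handful of tokens come out.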
Dynamic batching (accepting new requests mid-decode, pausing low-priority ones, and reordering work) lets you fill idle GPU cycles. The math is simple: more tokens per second per GPU means lower cost per inference.
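To put numbers on that last sentence, here is a back-of-the-envelope calculation. The GPU price and the throughput figures are made up for illustration; plug in your own.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD spent to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: a $2.00/hour GPU, before and after a 40% throughput gain.
baseline = cost_per_million_tokens(gpu_hourly_usd=2.00, tokens_per_second=1000)
improved = cost_per_million_tokens(gpu_hourly_usd=2.00, tokens_per_second=1400)

print(f"baseline: ${baseline:.3f} per 1M tokens")
print(f"improved: ${improved:.3f} per 1M tokens")
print(f"saving:   {1 - improved / baseline:.1%}")
```

Cost scales with the inverse of throughput, so a 40% throughput gain cuts cost per token by 1 - 1/1.4, or roughly 29%, not 40%.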
The Practical Bottleneck
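Many inference servers still use static batching: lock in a batch size at startup and hope it matches traffic patterns. Reality is messier. Peak hours might see 100 concurrent requests; off-peak, 5. A fixed batch either waits to fill up, which adds latency, or runs partly empty, which wastes the GPU memory reserved for it. And once a batch is running, slots freed by short requests sit idle until the longest request in the batch finishes.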
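A toy simulation shows how much a static batch can leave idle. This is a sketch under simplifying assumptions: every request is already queued, each decode step costs the same regardless of how full the batch is, and the function names are invented for this example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 10
```

Both runs generate the same 28 tokens. The static version needs 16 steps because each batch is pinned to its longest request, while the slot-refilling version finishes in 10.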
Dynamic batching systems (like vLLM's continuous batching) solve this by:
- Accepting requests on the fly without waiting for a full batch
- Pausing (preempting) low-priority requests so high-priority ones run first
- Reordering compute to maximize GPU utilization
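The three behaviors above fit in a few dozen lines. What follows is a minimal sketch, not vLLM's scheduler: the `Request` and `Scheduler` classes, their fields, and the "lower number means more urgent" convention are all assumptions made for this example, and a real system would also have to save or recompute a preempted request's KV cache.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Request:
    priority: int                          # lower value = more urgent
    seq: int                               # arrival order, breaks priority ties
    name: str = field(compare=False)
    remaining: int = field(compare=False)  # tokens still to generate


class Scheduler:
    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.waiting = []   # min-heap: most urgent request first
        self.running = []
        self._seq = count()

    def submit(self, name, tokens, priority=1):
        heapq.heappush(self.waiting, Request(priority, next(self._seq), name, tokens))

    def step(self):
        """Run one decode step and return the names of requests that got a token."""
        # 1. Accept on the fly: fill free slots without waiting for a full batch.
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(heapq.heappop(self.waiting))
        # 2. Preempt: swap out the least urgent running request while a strictly
        #    more urgent one is waiting.
        while self.waiting and self.running:
            worst = max(self.running)
            if self.waiting[0].priority >= worst.priority:
                break
            self.running.remove(worst)
            # heapreplace pops the most urgent waiter and pushes `worst` back.
            self.running.append(heapq.heapreplace(self.waiting, worst))
        # 3. Decode one token per running request, then drop finished requests.
        batch = [r.name for r in self.running]
        for r in self.running:
            r.remaining -= 1
        self.running = [r for r in self.running if r.remaining > 0]
        return batch


sched = Scheduler(max_batch=2)
sched.submit("a", tokens=3)
sched.submit("b", tokens=3)
print(sched.step())                      # a and b run together
sched.submit("urgent", tokens=1, priority=0)
print(sched.step())                      # b is paused so urgent can run
print(sched.step())                      # b resumes in the freed slot
```

The paused request keeps its `remaining` count and goes back on the waiting heap, so it resumes where it left off as soon as a slot opens.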
The Trade-off: Complexity
Dynamic batching isn't free. It requires:
- Sophisticated scheduling logic
- Memory fragmentation management
- Priority queue overhead
- Careful tuning per model/hardware combo
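To give a feel for the memory side, here is one common way to keep fragmentation in check: hand out KV-cache memory in small fixed-size blocks instead of one contiguous region per request, the idea behind paged KV caches such as vLLM's PagedAttention. The `BlockAllocator` class below is a hypothetical bookkeeping sketch; it tracks block IDs only and does not touch GPU memory.

```python
class BlockAllocator:
    """Tracks fixed-size KV-cache blocks so a sequence needs no contiguous region."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size         # tokens per block
        self.free = list(range(num_blocks))  # IDs of unused blocks
        self.tables = {}                     # seq_id -> list of block IDs
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; return False if no block is free."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:    # first token, or current block is full
            if not self.free:
                return False                 # caller must wait or preempt something
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id):
        """Return all of a finished (or preempted) sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")           # 3 tokens -> 2 blocks
for _ in range(4):
    alloc.append_token("b")           # 4 tokens -> 2 blocks
print(alloc.append_token("b"))        # False: the pool is exhausted
alloc.release("a")
print(alloc.append_token("b"))        # True: b reuses one of a's blocks
```

Because any free block can serve any sequence, the only waste is the unused tail of each sequence's last block. The scheduler still has to decide what to do when `append_token` returns False, which is where the tuning effort goes.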
What's Next
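The frontier is speculative batching: predicting the tokens a request is likely to need next and computing them ahead of time, the idea behind speculative decoding, where a small draft model proposes several tokens and the main model verifies them in one pass. Combined with dynamic batching, this could plausibly unlock up to another 2x throughput gain.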
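To make the speculate-then-verify idea concrete, here is a toy sketch of greedy speculative decoding. The two "models" are stand-in functions invented for this example, and the verification loop calls the target once per token for readability; in a real system those checks happen in a single batched forward pass, which is where the speedup comes from.

```python
def speculative_step(target_next, draft_next, context, k):
    """Draft proposes k tokens; keep the prefix the target agrees with, plus one target token."""
    # Cheap model guesses k tokens ahead.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Expensive model checks the guesses in order.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # every guess was right: one bonus token
    return accepted


# Stand-in models: the target counts upward mod 10; the draft agrees except after a 4.
def target(ctx):
    return (ctx[-1] + 1) % 10


def draft(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step(target, draft, context=[1], k=4))  # [2, 3, 4, 5]
print(speculative_step(target, draft, context=[4], k=4))  # [5]
```

The output always matches what the target alone would have produced. The gain depends entirely on how often the draft guesses right: four tokens per verification in the first call, one in the second.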
For builders: if your inference latency feels stuck, batching strategy is the first lever to pull. Measure your GPU utilization during decode. If it's below 70%, you're leaving throughput on the table.
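A quick way to run that check is sketched below. It assumes `nvidia-smi` is on the PATH; the helper names and the 70% default are this example's choices. One caveat: the utilization figure `nvidia-smi` reports is the share of time some kernel was running, not how full the compute units were, so treat it as a first signal rather than a verdict.

```python
import subprocess
import time


def sample_gpu_utilization(gpu_index=0):
    """Read the current utilization (0-100) of one GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


def collect_samples(n, interval_s=1.0, gpu_index=0):
    """Take n utilization readings, interval_s seconds apart."""
    samples = []
    for _ in range(n):
        samples.append(sample_gpu_utilization(gpu_index))
        time.sleep(interval_s)
    return samples


def fraction_below(samples, threshold=70.0):
    """Share of readings that fall under the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s < threshold) / len(samples)
```

Run `collect_samples(60)` while the server is under decode-heavy load, then pass the result to `fraction_below`. If most readings land under the threshold, batching is the place to look.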