DeepSeek recently dropped the technical report for V4, revealing a massive 1.6-trillion-parameter MoE (Mixture of Experts) model with a native 1-million-token context window. But the real story isn't the sheer size; it's the architectural gamble DeepSeek is making on efficiency: sparse activation, compressed attention, and low-precision training.
Here is a breakdown of the core innovations in DeepSeek V4 and what they mean for the future of AI.
1. The Cost Paradox: 1.6T Total, 49B Active
DeepSeek V4-Pro has 1.6 trillion parameters, yet it activates only 49 billion of them per token during inference. That is about 3% of the weights, which makes it a highly sparse MoE architecture.
- The Benefit (Compute): Per-token inference compute is roughly that of a dense 49B model, which makes serving fast and cheap. This helps explain DeepSeek's aggressively low API pricing.
- The Cost (Memory): To run it, you still need enough VRAM to hold all 1.6T parameters (roughly 1.1 TB, which calls for something like an 8x H200 node).
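The compute/memory split above can be checked with back-of-envelope arithmetic. The sketch below uses only the figures quoted in this section (1.6T total, 49B active, roughly 1.1 TB). The function and constant names are illustrative, and the bytes-per-parameter values are assumptions about storage precision, not numbers taken from the report.

```python
# Back-of-envelope arithmetic for a sparse MoE model.
# Figures come from the text above; all names are illustrative only.

TOTAL_PARAMS = 1.6e12          # total parameters (1.6T)
ACTIVE_PARAMS = 49e9           # parameters activated per token (49B)
QUOTED_MEMORY_BYTES = 1.1e12   # "roughly 1.1 TB" quoted above


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in one token's forward pass."""
    return active / total


def weight_memory_tb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in terabytes (1 TB = 1e12 bytes)."""
    return params * bytes_per_param / 1e12


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active share per token: {frac:.2%}")

    # Assumed storage precisions: 1 byte (FP8) and 0.5 byte (FP4) per parameter.
    for label, nbytes in [("FP8", 1.0), ("FP4", 0.5)]:
        print(f"{label}: {weight_memory_tb(TOTAL_PARAMS, nbytes):.2f} TB")

    # Average bytes per parameter implied by the quoted 1.1 TB figure.
    implied = QUOTED_MEMORY_BYTES / TOTAL_PARAMS
    print(f"implied average: {implied:.4f} bytes/param")
```

The implied average falls between the assumed FP4 and FP8 cases, which would be consistent with a checkpoint that mixes 4-bit and 8-bit weights. Note that this counts weights only; the KV cache and activations need memory on top of it.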
2. Hybrid Attention: CSA and HCA
How do you handle 1 million tokens without melting your GPUs? DeepSeek abandoned traditional full attention for a hybrid of two compressed mechanisms:
- Compressed Sparse Attention (CSA): Compresses the KV cache at a 4:1 ratio, then uses a "Lightning Indexer" to pick only the top-k most relevant token summaries to attend to.
- Heavily Compressed Attention (HCA): Compresses the cache far more aggressively (128:1) and then attends to the entire shortened sequence.
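To make the two strategies concrete, here is a toy sketch in plain Python. It is not V4's implementation: mean pooling as the compression step and dot-product scoring as the "indexer" are assumptions made for illustration, and every function name is hypothetical. Only the 4:1 and 128:1 ratios and the top-k idea come from the text above.

```python
# Toy illustration of the two cache strategies described above.
# Mean pooling and dot-product scoring are assumptions made for this sketch;
# the text does not say how V4 builds its summaries or scores them.
import math
from typing import List

Vector = List[float]


def compressed_length(n_tokens: int, ratio: int) -> int:
    """Number of cache entries left after grouping `ratio` tokens per entry."""
    return math.ceil(n_tokens / ratio)


def compress(keys: List[Vector], ratio: int) -> List[Vector]:
    """Replace each run of `ratio` key vectors with their element-wise mean."""
    summaries = []
    for start in range(0, len(keys), ratio):
        block = keys[start:start + ratio]
        dim = len(block[0])
        summaries.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return summaries


def top_k_indices(query: Vector, summaries: List[Vector], k: int) -> List[int]:
    """Indices of the k summaries with the highest dot-product score."""
    scores = [sum(q * s for q, s in zip(query, summary)) for summary in summaries]
    ranked = sorted(range(len(summaries)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    n = 1_000_000
    print("CSA-style entries (4:1):  ", compressed_length(n, 4))
    print("HCA-style entries (128:1):", compressed_length(n, 128))

    # Tiny example: 8 two-dimensional keys, compressed 4:1 into 2 summaries.
    keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
    summaries = compress(keys, 4)
    # CSA-style: attend only to the best-matching summary.
    print(top_k_indices([0.0, 1.0], summaries, k=1))
```

The trade-off is visible in the counts: the 4:1 path still leaves hundreds of thousands of entries, so it needs a selection step, while the 128:1 path leaves few enough entries to read them all.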
3. Stability at Scale: mHC and Muon
Training a 60+ layer MoE model is prone to exploding signals and crashed training runs. V4 introduces two stabilizers:
- Manifold-Constrained Hyper-Connections (mHC): Replaces standard residual connections with parallel lanes whose mixing is constrained so that it cannot amplify signals by more than a factor of 1, which keeps the network from blowing up.
- Muon Optimizer: Replaces the standard AdamW optimizer for most modules. Muon orthogonalizes the weight updates, so no single direction dominates a step and training stays stable.
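As a rough illustration of what "orthogonalizing the weight updates" means, the sketch below applies a Newton-Schulz iteration to a small update matrix until all of its singular values are close to 1. Public Muon implementations use a tuned variant of this kind of iteration; the classic cubic form, the `orthogonalize` name, and the step count here are choices made for this sketch, and the text does not say which variant V4 uses.

```python
# Sketch of the orthogonalization idea behind Muon, in plain Python.
# Assumption: the classic cubic Newton-Schulz iteration is used for clarity;
# the exact iteration and coefficients used in V4 are not given in the text.
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(row) for row in zip(*a)]


def orthogonalize(update: Matrix, steps: int = 15) -> Matrix:
    """Push all singular values of a nonzero `update` toward 1."""
    # Scale by the Frobenius norm so every singular value starts at or below 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in update for v in row))
    x = [[v / norm for v in row] for row in update]
    for _ in range(steps):
        # Cubic Newton-Schulz step: X <- 1.5 X - 0.5 (X X^T) X
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


if __name__ == "__main__":
    # A lopsided update: one direction is much stronger than the other.
    raw_update = [[3.0, 1.0], [0.0, 2.0]]
    q = orthogonalize(raw_update)
    gram = matmul(q, transpose(q))  # close to the identity if q is orthogonal
    for row in gram:
        print([round(v, 6) for v in row])
```

After orthogonalization the update has roughly equal strength in every direction, which is the sense in which it avoids skewing toward one direction.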
4. FP4 Quantization-Aware Training (QAT)
To save memory, V4 trains its heaviest components natively in 4-bit precision (FP4). Unlike compressing a model *after* training (which typically costs accuracy), Quantization-Aware Training forces the model to adapt to low-precision math *during* the training process. DeepSeek also engineered a lossless conversion that dequantizes FP4 back to FP8.
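The core move in quantization-aware training is "fake quantization": weights are rounded to the low-precision grid in the forward pass, so the rest of the network learns with the rounding error present. The sketch below shows that rounding step only. It assumes the E2M1 flavor of FP4 and a single scale per block of weights; the text does not specify V4's FP4 format or scaling scheme, and the function names are hypothetical.

```python
# Sketch of "fake quantization" as used in quantization-aware training.
# Assumptions: the E2M1 FP4 value grid and one scale factor per block;
# the text does not specify V4's FP4 format or scaling scheme.
from typing import List

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_to_grid(x: float) -> float:
    """Round x to the nearest FP4 grid value, clamping at the largest magnitude."""
    magnitude = min(abs(x), FP4_GRID[-1])
    nearest = min(FP4_GRID, key=lambda g: abs(g - magnitude))
    return nearest if x >= 0 else -nearest


def fake_quantize(block: List[float]) -> List[float]:
    """Quantize a block to FP4 and map it back, keeping the rounding error."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return list(block)
    scale = max_abs / FP4_GRID[-1]  # the largest weight maps to the top grid value
    return [quantize_to_grid(v / scale) * scale for v in block]


if __name__ == "__main__":
    weights = [0.12, -0.03, 0.6, -0.45]
    print(fake_quantize(weights))
```

In a typical QAT setup this rounding runs in the forward pass while gradients are passed through it unchanged, so the full-precision weights keep receiving updates. On the FP4-to-FP8 point, one relevant observation is that every value on the E2M1 grid is also exactly representable in a common 8-bit float format such as E4M3; whether that is the mechanism DeepSeek relies on is not stated here.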
The Real Bet: Agentic Coding over General Reasoning
The V4 report openly admits that the model trails GPT-5.4 and Claude 4.5 on top-tier reasoning benchmarks by "3 to 6 months." So why release it?
Because those benchmarks test short contexts. DeepSeek is betting that the future isn't about solving complex logic puzzles in a vacuum; it's about Agentic Coding: feeding a 500K-token Git repository into the context window and letting the model see the whole picture at a fraction of the cost.
DeepSeek V4 isn't just a cheaper alternative; it's a model optimized for a race the US tech giants haven't fully committed to yet: hyper-efficient, long-context orchestration.
Credit
- Original article: DeepSeek V4 deep dive: CSA, HCA, mHC và canh bạc 1 triệu token context (in English: "DeepSeek V4 deep dive: CSA, HCA, mHC and the 1-million-token context gamble")
- Original author: Nguyễn Anh Bình (Omelet)
- Source: Omelet.tech
- Rewritten by: Lugon (TeguFy)