The Misconception: "Compilers Are Becoming Obsolete"
With AI agents now capable of writing code, a tempting narrative has emerged: if large language models can generate CUDA kernels, why do we need a sophisticated compiler toolchain? Why not simply point the model at each hardware vendor's backend and let it emit native code directly?
Skeptics cast compilers as historical artifacts, destined to go the way of hand-written assembly in the 1980s and soon to be displaced by agentic code synthesis. But this reasoning commits a category error.
LLMs and Compilers: Fundamentally Different Problems
A compiler is a deterministic function. Given identical source code, flags, and compiler version, it produces identical machine code every time. This property is not incidental: it underpins why we can trust billions of lines of production code on hardware we've never touched.
An LLM is a probabilistic function. The same input can yield different output, varying with temperature, sampling strategy, model version, or other stochastic factors. This property makes LLMs invaluable for ideation, exploration, and synthesis. It is precisely the wrong property for the metal layer.
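To make the contrast concrete, here is a toy sketch of the two behaviors. It is not how any particular model or compiler works: the logits, the `greedy` and `sample` helpers, and the seeds are all made up for illustration. Greedy selection returns the same choice on every call, while temperature sampling over the same scores can return different choices from one draw to the next.

```python
import math
import random

def softmax(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Deterministic: always pick the highest-scoring option."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature, seed):
    """Probabilistic: pick an option in proportion to its probability."""
    probs = softmax(logits, temperature)
    draw = random.Random(seed).random()  # seed stands in for run-to-run randomness
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if draw < cumulative:
            return index
    return len(probs) - 1

LOGITS = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

greedy_picks = {greedy(LOGITS) for _ in range(20)}
sampled_picks = {sample(LOGITS, 1.0, seed) for seed in range(20)}

print("greedy picks:", greedy_picks)
print("sampled picks:", sampled_picks)
```

The greedy set contains a single choice; the sampled set contains more than one. That variety is the useful part of a model and the unacceptable part of a code generator for the metal layer.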
You do not want stochastic correctness on floating-point multiply-accumulate operations, memory fences, or atomic operations. A kernel producing subtly different results across runs—because the model decided to get creative—is a silent correctness failure.
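A small sketch of why accumulation order matters, using plain Python floats (IEEE 754 doubles) rather than a real GPU kernel. The input values are chosen for illustration. Floating-point addition is not associative, so two summation orders over the same numbers can disagree, which is roughly what happens when a reduction's order changes from one generated kernel to the next.

```python
import math

# Same four numbers, summed in two different orders.
values = [1e16, 1.0, -1e16, 1.0]

left_to_right = 0.0
for v in values:
    left_to_right += v  # the first 1.0 is absorbed by 1e16, the second survives

reordered = 0.0
for v in sorted(values):
    reordered += v  # both 1.0s are absorbed by -1e16 before it cancels

exact = math.fsum(values)  # correctly rounded reference sum

print("left to right:", left_to_right)
print("sorted order: ", reordered)
print("reference:    ", exact)
```

None of the three results is a crash or an error. A toolchain that fixes the reduction order gives the same answer on every build; a generator that varies the order does not.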
The brain decides what to compute. The compiler forges how it runs, deterministically.
What Agents Actually Need From Their Substrate
Once you accept that agents and compilers operate in different categories, a productive question emerges: what does the agent need from its substrate—the compiler, runtime, and semantics—to be maximally productive?
Three things, mostly:
1. Fast and Structured Feedback
An agent iterating on a kernel needs compile errors it can parse, deterministic failure modes, and reproducible builds. The faster the feedback loop, the fewer iterations required to reach a working solution. A toolchain emitting cryptic template errors is nearly useless to an agent with no intuition for what "feels wrong."
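As a rough illustration of what "errors it can parse" might look like, the sketch below assumes a compiler that emits diagnostics as a JSON array with a fixed schema. The schema, the field names, the error codes, and the file `reduce.cu` are all hypothetical and do not describe any existing toolchain's output.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    code: str        # stable identifier the agent can key on
    severity: str    # "error" or "warning"
    file: str
    line: int
    message: str
    suggestion: str  # concrete next step, empty if none


# Hypothetical compiler output: one JSON array, same fields for every entry.
RAW_OUTPUT = """
[
  {"code": "W0101", "severity": "warning", "file": "reduce.cu", "line": 12,
   "message": "unused variable 'tmp'", "suggestion": "remove 'tmp'"},
  {"code": "E0204", "severity": "error", "file": "reduce.cu", "line": 27,
   "message": "shared array 'tile' indexed out of bounds",
   "suggestion": "clamp the index to blockDim.x - 1"}
]
"""


def parse_diagnostics(raw: str) -> list[Diagnostic]:
    """Decode the JSON array into typed records; fails loudly on schema drift."""
    return [Diagnostic(**entry) for entry in json.loads(raw)]


def blocking_errors(diagnostics):
    """Keep only the entries that must be fixed before the build can succeed."""
    return [d for d in diagnostics if d.severity == "error"]


diagnostics = parse_diagnostics(RAW_OUTPUT)
errors = blocking_errors(diagnostics)
for d in errors:
    print(f"{d.file}:{d.line} [{d.code}] {d.message} -> {d.suggestion}")
```

With a stable schema, the agent's next step is a lookup on `code` and `line` rather than a guess at what a wall of template text means.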
2. One Mental Model, Not N
Every fragmentation of the substrate fragments the training distribution and multiplies hallucination surfaces. If the agent must remember that intrinsics are named differently on NVIDIA, AMD, and Intel, and that memory models carry subtly different guarantees on each, its competence degrades on all of them. The model that deeply knows one substrate beats the model that knows N substrates shallowly.
3. An Environment That Doesn't Lie
Agent productivity is bounded by how quickly it can iterate against something whose behavior matches its documentation. A toolchain with silent miscompiles, undocumented edge cases, or platform-specific surprises at runtime is a productivity sink. Humans can develop intuition that something smells wrong; agents cannot.
The Volume Question: Why Scale Changes Everything
Human-authored GPU code is a small corpus: tens of thousands of serious kernels, mostly written and reviewed by specialists. The substrate can afford sharp edges because the humans know where they are.
Agent-authored GPU code is a different order of volume: potentially millions of kernels, increasingly directed by engineers who aren't GPU specialists and shouldn't have to be. A product builder thinking about inference paths shouldn't need to reason about memory coalescing patterns. The stack should handle that.
Without a robust compiler and runtime, you don't get 10x engineering output; you get 10x the number of failed attempts. The thing that turns "generated code" into "running code" at scale is the toolchain. Remove it and the agent's apparent productivity collapses into broken kernels and agents burning through inference tokens trying to fix their own mistakes.
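A back-of-the-envelope model of that collapse, under loud assumptions: each attempt at a kernel succeeds independently with a fixed probability, so the expected number of attempts is the reciprocal of that probability. The success rates, kernel count, and token cost per attempt below are illustrative placeholders, not measurements.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of independent attempts until the first success."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def expected_tokens(kernels: int, success_rate: float, tokens_per_attempt: int) -> float:
    """Total expected inference tokens to get every kernel working."""
    return kernels * expected_attempts(success_rate) * tokens_per_attempt


# Illustrative placeholders only.
KERNELS = 1_000_000
TOKENS_PER_ATTEMPT = 2_000

with_good_toolchain = expected_tokens(KERNELS, 0.5, TOKENS_PER_ATTEMPT)
with_poor_toolchain = expected_tokens(KERNELS, 0.05, TOKENS_PER_ATTEMPT)

ratio = with_poor_toolchain / with_good_toolchain
print(f"token cost ratio: {ratio:.1f}x")
```

Under this simple model, cutting the per-attempt success rate by a factor of ten multiplies the token bill by the same factor, before counting the engineering time spent reviewing failures.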
CUDA Is the Substrate Agents Already Know
Of all GPU programming substrates, CUDA has by far the largest public corpus: kernels, documentation, error messages, well over a decade of Stack Overflow archaeology, and stable semantics over the longest window.
When frontier models are asked to write GPU code, they are considerably stronger at CUDA than at any other GPU programming model.
The "emit native code for each backend" plan throws this away. It asks the model to do its worst-performing task on every backend except NVIDIA's, fragmenting the training distribution it does have. The agent ends up weaker at all of them.
The better strategy: let the agent write the substrate it knows well (CUDA), and let the compiler handle the targeting. The agent's competence stays concentrated. The silicon stays interchangeable.
The Compiler Workload Is Growing, Not Shrinking
As AI agents become first-class consumers of the compiler stack, the substrate must evolve:
- Machine-parseable diagnostics: error messages designed for LLM consumption rather than for humans, schema-stable and dense with actionable information
- Incremental builds and tight loops: making exploration cheap at scale
- Autotuning surfaces that agents can drive end-to-end
- Profile-guided feedback that agents can act on to optimize kernels
- Runtime introspection: exposing device behavior in forms agents can reason about
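As one example from this list, an autotuning surface an agent could drive end-to-end might look something like the sketch below. Everything here is a stand-in: `measure_kernel_ms` is a hypothetical, deterministic cost model in place of a real compile-and-benchmark step, and the `tile` and `unroll` parameters are assumed tuning knobs.

```python
from itertools import product


def measure_kernel_ms(tile: int, unroll: int) -> float:
    """Hypothetical stand-in for compiling and timing a kernel variant.

    A real surface would build the kernel and report measured time;
    this toy model is simply fastest at tile=64, unroll=4.
    """
    return 1.0 + abs(tile - 64) / 64 + abs(unroll - 4) / 4


def autotune(tiles, unrolls):
    """Exhaustively try every configuration and keep the fastest one."""
    results = []
    for tile, unroll in product(tiles, unrolls):
        elapsed = measure_kernel_ms(tile, unroll)
        results.append({"tile": tile, "unroll": unroll, "ms": elapsed})
    best = min(results, key=lambda r: r["ms"])
    return best, results


best, results = autotune(tiles=[16, 32, 64, 128], unrolls=[1, 2, 4, 8])
print("tried", len(results), "configurations")
print("best:", best)
```

The point of the design is that the loop needs no hardware knowledge: the agent proposes configurations, the toolchain returns structured and repeatable measurements, and the search converges because the same configuration always gets the same answer.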
The Closing Argument
Agentic AI doesn't make compilers obsolete. It raises their leverage.
The winners in the agent era will be companies whose substrate agents can target without ever thinking about the hardware underneath. Stable semantics, structured diagnostics, deterministic behavior, one mental model that works everywhere—these are compiler properties.
The brain decides what to compute. The hammer forges the code to make it run. With better compilers, agents become more productive, not less.