The Compounding Reliability Problem
When you chain LLM calls into a multi-step agentic workflow, every step multiplies your failure rate. Ninety percent per-step accuracy sounds decent — but stack five steps together and you've got a 40% failure rate before lunch. No existing framework seemed to address this mechanical reliability issue. They were all tailor-made for cloud frontier APIs.
Antoine Zambelli, AI Director at Texas Instruments, found this out the hard way building home automation agents on a budget GPU. So he built Forge — an open-source reliability layer that adds domain-and-tool-agnostic guardrails to self-hosted LLM tool-calling. The result: an 8B model jumps from ~53% to ~99% on multi-step agentic tasks without touching the model weights.
What Forge Actually Does
Forge is a Python framework that wraps local models running on Ollama, Llamafile, or any OpenAI-compatible endpoint. It adds five independently toggleable guardrail layers:
ToolResolutionError exception class so the model can retry instead of silently passing garbage downstream.nvidia-smi at startup and derives a VRAM-safe token budget. Both Ollama and Llamafile silently fall back to CPU when VRAM runs out — no warning, just 10–100x slower inference. Forge prevents this from happening.The Numbers
Forged with a peer-reviewed eval harness across 97 model/backend configurations, 18 scenarios, 50 runs each. Published results:
- Ministral 8B + Forge: 99.3% vs. Ministral 8B alone: ~53%
- Claude Sonnet + Forge: 100%
- Ministral 8B + Forge (99.3%) > Claude Sonnet alone (87.2%) — a free local 8B with the right framework beats the best result from a frontier API without guardrails
- Every model tested scored 0% on error recovery without Forge — not a capability gap, an architectural absence
Why This Matters for Builders
If you're running agentic workflows — coding assistants, home automation, data pipelines, internal tools — and you're paying for frontier API calls, Forge's proxy server mode lets you drop in a local model without rewriting anything. Point any OpenAI-compatible client at Forge and it handles guardrails transparently.
The ACM CAIS 2026 demo in San Jose (May 26–29) covers the full peer-reviewed methodology and the interactive eval dashboard so anyone can reproduce the numbers.
Getting Started
pip install forge-ai
Or clone the repo and run the eval harness on your model. Results get shared on the community dashboard.
- Repo: antoinezambelli/forge (1,948 stars)
- Paper: forge_ieee_preprint.pdf
- Demo video: YouTube