EngineeringJune 1, 2026Updated: June 1, 20265 min read

Forge Takes an 8B Local Model from 53% to 99% on Agentic Tasks

Forge is an open-source Python framework that adds reliability guardrails to self-hosted LLM tool-calling — taking a local 8B model from 53% to 99% accuracy on multi-step agentic tasks without changing the model itself.

L

Lugon

Vibe Engineer

Share article
Forge Takes an 8B Local Model from 53% to 99% on Agentic Tasks

The Compounding Reliability Problem

When you chain LLM calls into a multi-step agentic workflow, every step multiplies your failure rate. Ninety percent per-step accuracy sounds decent — but stack five steps together and you've got a 40% failure rate before lunch. No existing framework seemed to address this mechanical reliability issue. They were all tailor-made for cloud frontier APIs.

Antoine Zambelli, AI Director at Texas Instruments, found this out the hard way building home automation agents on a budget GPU. So he built Forge — an open-source reliability layer that adds domain-and-tool-agnostic guardrails to self-hosted LLM tool-calling. The result: an 8B model jumps from ~53% to ~99% on multi-step agentic tasks without touching the model weights.

What Forge Actually Does

Forge is a Python framework that wraps local models running on Ollama, Llamafile, or any OpenAI-compatible endpoint. It adds five independently toggleable guardrail layers:

  • Retry nudges — when a step fails, the model gets a structured nudge to retry with a corrected prompt. The biggest impact, accounting for 24–49 point swings in ablation studies.
  • Error recovery — handles the distinction between "tool ran and returned data" vs. "tool ran and found nothing." Most systems treat both as success. Forge introduces a ToolResolutionError exception class so the model can retry instead of silently passing garbage downstream.
  • Step enforcement — keeps models with weaker sequencing discipline on track. Less critical for frontier models, significant for local ones.
  • Rescue parsing — handles malformed function-calling outputs on tricky local backends.
  • Context compaction — queries nvidia-smi at startup and derives a VRAM-safe token budget. Both Ollama and Llamafile silently fall back to CPU when VRAM runs out — no warning, just 10–100x slower inference. Forge prevents this from happening.
  • The Numbers

    Forged with a peer-reviewed eval harness across 97 model/backend configurations, 18 scenarios, 50 runs each. Published results:

    • Ministral 8B + Forge: 99.3% vs. Ministral 8B alone: ~53%
    • Claude Sonnet + Forge: 100%
    • Ministral 8B + Forge (99.3%) > Claude Sonnet alone (87.2%) — a free local 8B with the right framework beats the best result from a frontier API without guardrails
    • Every model tested scored 0% on error recovery without Forge — not a capability gap, an architectural absence
    The serving backend also matters far more than expected. The same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone — a number nobody had published because standard benchmarks don't control for serving backend.

    Why This Matters for Builders

    If you're running agentic workflows — coding assistants, home automation, data pipelines, internal tools — and you're paying for frontier API calls, Forge's proxy server mode lets you drop in a local model without rewriting anything. Point any OpenAI-compatible client at Forge and it handles guardrails transparently.

    The ACM CAIS 2026 demo in San Jose (May 26–29) covers the full peer-reviewed methodology and the interactive eval dashboard so anyone can reproduce the numbers.

    Getting Started

    pip install forge-ai

    Or clone the repo and run the eval harness on your model. Results get shared on the community dashboard.

    forgellmagenticopen-sourcepythonlocal-aitool-callingguardrails
    Share article
    Start Your Project

    Ready to transform?

    Discover how TeguFy can help your business simplify, amplify, and fortify with AI, Blockchain, and cutting-edge technology.

    Forge Takes an 8B Local Model from 53% to 99% on Agentic Tasks