The Accountant: Optimizing AI Costs with Semantic Routing

We’ve solved the Reliability problem with The Judge. We have a system that can scientifically prove whether our Forensic Team is accurate. But there’s a new problem that keeps Directors and CFOs up at night: Sustainability.

In an enterprise environment, using a massive, high-reasoning model (like Claude 3.5 or GPT-4o) for every single bibliography lookup is a “Cognitive Budget” disaster. It’s like hiring a Senior Architect to fix a broken link.

Today, we introduce The Accountant: A Semantic Router that classifies task complexity and routes requests to the cheapest model capable of passing the Judge’s rubric.

1. The Concept of “Tiered Intelligence”

Not all forensic tasks require the same level of “gray matter.” To scale effectively, we must categorize our workload:

  • LEVEL 1 (Operational): “Find the standard page count for the 1925 edition of Gatsby.” This is a lookup and retrieval task. Local SLMs (Small Language Models) like Phi-4 or Llama 3.2 excel here.
  • LEVEL 2 (Forensic): “Compare the binding grain and typography inconsistencies between two suspected forgeries.” This requires high-dimensional analysis and deep reasoning. This is a job for the Cloud.

Architectural diagram of a Semantic Router called The Accountant. A user request enters the router, which classifies it into Level 1 (Simple/Metadata) or Level 2 (Complex Forensic). Level 1 is routed to a local Tier 1 SLM like Phi-4 or Llama 3.2, while Level 2 is routed to a Tier 2 Frontier Cloud model like Claude 3.5. Both paths converge to produce a final Audit Report.
The Semantic Router Architecture: implementing Tiered Intelligence to optimize cognitive budget and reduce inference costs.

2. Implementing the Router (The Gatekeeper Pattern)

We’ve added router.py to our repository. The logic acts as a gatekeeper.
1. Classification: A lightweight model (the Accountant) reviews the user’s query against our config/prompts.yaml.
2. Economic Decision: If the query is “Level 1”, we trigger the ollama provider. If it’s “Level 2,” we escalate to the anthropic provider.

# The Accountant's Decision Engine in router.py
level = await classify_query(query)
provider = get_provider_for_level(level)

if level == "LEVEL_1":
    print("Accountant Decision: LEVEL_1 - Routing to Local SLM to save budget")
else:
    print("Accountant Decision: LEVEL_2 - Routing to High-Reasoning Cloud Model")

Because the router defaults to LEVEL_2 whenever classification fails, a failed or ambiguous classification costs us money rather than accuracy. We only save when the classifier positively identifies a task as simple.
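
To make the fail-safe concrete, here is a minimal sketch of how the two helpers called above could behave. It is not the code from router.py: the `Route` dataclass, the `ask_classifier` callable, and `parse_level` are hypothetical names introduced for illustration, and the model names are simply the ones used elsewhere in this series.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

VALID_LEVELS = ("LEVEL_1", "LEVEL_2")
FALLBACK_LEVEL = "LEVEL_2"  # when in doubt, pay for accuracy


@dataclass(frozen=True)
class Route:
    """Hypothetical container for a routing decision."""
    provider: str
    model: str


# Assumed mapping; the real repository may configure this differently.
ROUTES = {
    "LEVEL_1": Route(provider="ollama", model="phi4"),
    "LEVEL_2": Route(provider="anthropic", model="claude-3-5-sonnet"),
}


def parse_level(raw: str) -> str:
    """Normalize the classifier's text; anything unrecognized falls back."""
    label = raw.strip().upper()
    return label if label in VALID_LEVELS else FALLBACK_LEVEL


async def classify_query(
    query: str, ask_classifier: Callable[[str], Awaitable[str]]
) -> str:
    """Ask the lightweight model for a level, escalating on any failure."""
    try:
        raw = await ask_classifier(query)
    except Exception:
        # Timeouts, connection errors, etc. must never route to the cheap tier.
        return FALLBACK_LEVEL
    return parse_level(raw)


def get_provider_for_level(level: str) -> Route:
    """Look up the route, again defaulting to the high-reasoning tier."""
    return ROUTES.get(level, ROUTES[FALLBACK_LEVEL])
```

The classifier is passed in as a callable so the fallback path can be exercised with a stub that raises or returns garbage, without a running model.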

3. Projecting the ROI with The Judge

While we built the Accountant (the router), we haven’t yet run a full-scale economic audit in this repository. However, the architecture is designed to scientifically measure this trade-off using the Judge Agent (from our last post).

In an enterprise environment, a Director would use this framework to benchmark a representative sample of historical queries. In tiered-intelligence deployments, it is common to find that most “forensic” requests are really simple metadata lookups. If the Judge confirms that a local SLM (Phi-4 or Llama 3.2) scores comparably to a frontier cloud model on those lookups, routing them locally removes the per-token API cost for that share of traffic.
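
As a rough sketch of what that benchmark decision could look like, the function below picks the cheapest tier whose average Judge score clears a threshold. Everything here is an assumption for illustration: the 0-to-1 score scale, the threshold, the tier ordering, and the function name are not taken from the repository.

```python
from statistics import mean
from typing import Dict, List, Sequence

# Assumed ordering: cheapest provider first, most capable last.
TIERS_CHEAPEST_FIRST = ("ollama", "anthropic")


def cheapest_passing_tier(
    scores_by_tier: Dict[str, List[float]],
    threshold: float,
    tiers: Sequence[str] = TIERS_CHEAPEST_FIRST,
) -> str:
    """Return the first (cheapest) tier whose mean Judge score meets the bar.

    If no tier has data or none passes, fall back to the last, most capable
    tier, mirroring the router's "accuracy first" default.
    """
    for tier in tiers:
        tier_scores = scores_by_tier.get(tier, [])
        if tier_scores and mean(tier_scores) >= threshold:
            return tier
    return tiers[-1]
```

Run once per task category over the sampled queries, this would tell you which categories are safe to route to Level 1 and which should stay on the cloud tier.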

The Theoretical Savings (100k Calls/Month):

  • Current Cost (Frontier Cloud for 100% of tasks): ~$7,600/month
  • Projected Cost (90/10 Routed Split): ~$1,800/month
  • Total Savings: ~76% reduction in inference costs.

| Task Category | Estimated Volume | “Status Quo” Cost (Frontier Cloud) | “Routed” Cost (Accountant/SLM) |
| --- | --- | --- | --- |
| Level 1 (Standard Lookup/Formatting) | 90% (90k calls) | ~$4,500 | ~$0 (Local/Self-Hosted) |
| Level 2 (Deep Forensic Analysis) | 10% (10k calls) | ~$3,100 | ~$1,800* |
| Total Cognitive Budget | 100% | ~$7,600 | ~$1,800 |

* Note: Level 2 “Routed” costs are lower here because the Accountant ensures only the most complex 10% of tokens hit the high-cost provider, whereas the “Status Quo” assumes a higher average cost across all 100k calls due to the lack of optimization.
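
The arithmetic behind the projection is easy to check. The snippet below only restates the estimated figures from this section; the dictionary and function names are illustrative, and the dollar amounts are the projections above, not measured results.

```python
# Estimated monthly costs in USD, copied from the projection above.
STATUS_QUO = {"LEVEL_1": 4500, "LEVEL_2": 3100}
ROUTED = {"LEVEL_1": 0, "LEVEL_2": 1800}


def total(costs: dict) -> int:
    """Sum the monthly cost across all task categories."""
    return sum(costs.values())


def savings_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


before = total(STATUS_QUO)  # 7600
after = total(ROUTED)       # 1800
print(f"Status quo: ${before:,}/month")
print(f"Routed:     ${after:,}/month")
print(f"Savings:    {savings_pct(before, after):.1f}%")  # 76.3%
```

Swapping in your own per-category estimates gives a quick sanity check before committing to a full benchmark run.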

Cognitive Budgeting Insights

As a Director, my responsibility is to build Sustainable Intelligence. If 80% of an AI workload can be moved to local infrastructure or cheaper “Flash” models without dropping our reliability score, I’m not just a developer; I’m a profit center. Semantic routing allows us to scale AI horizontally without the cloud bill scaling vertically.

🛠️ Step into the Clean-Room

The Accountant logic is now live in the repository. You can test the routing logic yourself by running the local orchestrator with the --use-accountant flag.

Explore the Code: MCP Forensic Analyzer on GitHub

(If this architecture helps your team justify their AI spend, consider dropping a ⭐ on the repo!)

The Production-Grade AI Series

  • Post 1: The Judge Agent: Who Audits the Auditors? (Reliability)
  • Post 2: The Accountant: Optimizing AI Costs with Semantic Routing (Sustainability) – You’re Here
  • Post 3: The Guardian: Human-in-the-Loop Governance (Safety) – Coming Soon

Looking for the foundation? Check out my previous series: The Zero-Glue AI Mesh with MCP.


From Cloud to Laptop: Running MCP Agents with Small Language Models

Large Models Build Systems. Small Models Run Them.

For most developers, modern AI systems feel locked behind massive infrastructure.

We’ve been conditioned to believe that “Intelligence” is a service we rent from a data center—a luxury that requires GPU clusters, $10,000 hardware, and ever-climbing cloud inference bills.

Last week, when we built our Multi-Agent Forensic Team, you likely assumed that coordinating a Supervisor, a Librarian, and an Analyst required the reasoning horsepower of a 400B+ parameter model.

Today, we’re cutting the cord. We are moving the entire Forensic Team—the agents, the orchestration, and the data—onto a standard laptop. No cloud. No API costs. No data leaving your local network.

This is the power of Edge AI combined with the Model Context Protocol (MCP).

The Pivot: The “Forensic Clean-Room”

In the world of rare book forensics, data sovereignty isn’t a “nice-to-have.” When you are auditing high-value archival records or sensitive provenance data, the “Clean-Room” approach is the gold standard. You want the data isolated.

By moving our stack to the Edge, we transform a laptop into a portable forensic lab.

The Edge Architecture

Architecture diagram showing an MCP-based multi-agent system running locally with small language models where a supervisor and specialist agents interact with an MCP server and local archive database on a laptop.
Running MCP agents locally: small language models power the supervisor and specialist agents while the MCP server provides structured tool access to local data.

Notice that the architecture we built in Post 2 doesn’t change. Because we used MCP as our “USB-C” interface, we don’t have to rewrite our tools or our agents. We only swap the Inference Engine.

Why SLMs Love MCP

Small language models struggle when tasks are open-ended.

However, MCP dramatically reduces the search space.

Instead of inventing answers, the model interacts with structured primitives:

  • tools
  • resources
  • prompts

Each defined with strict schemas.
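
To see why strict schemas help a small model, consider what the runtime can do with a tool call before it ever executes. MCP tools declare their arguments as a JSON Schema under `inputSchema`; the sketch below checks a model-proposed call against such a definition. The tool name `get_book_metadata` and its fields are hypothetical, and the validator is a deliberately tiny hand-rolled check, not a full JSON Schema implementation.

```python
from typing import Any, Dict, List

# Hypothetical tool definition, shaped like an MCP tool listing entry.
BOOK_LOOKUP_TOOL = {
    "name": "get_book_metadata",
    "description": "Look up bibliographic metadata in the local archive.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
        },
        "required": ["title"],
    },
}

# Only the JSON types this sketch needs.
_JSON_TO_PYTHON = {"string": str, "integer": int, "boolean": bool}


def validate_arguments(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Return a list of problems with a proposed tool call (empty if valid)."""
    schema = tool["inputSchema"]
    properties = schema.get("properties", {})
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = _JSON_TO_PYTHON[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            errors.append(f"argument {name} should be of type {spec['type']}")
    return errors
```

A malformed call can be rejected and retried with the error list fed back to the model, which is a much narrower problem for a 14B model than producing a free-form answer.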

The Thesis: Large models are great for designing the system and writing the initial code. Small models are the perfect runtime engines for executing those standardized tasks.

The “How-To”: Swapping the Engine

In our updated orchestrator.py, we’ve introduced a provider flag. Instead of hitting a remote API, the Python supervisor now talks to a local inference server (like Ollama or LM Studio).

# [Post 3 - Edge AI] Swapping the Inference Provider
if args.provider == "ollama":
    # Pointing to the local SLM engine
    client = OllamaClient(base_url="http://localhost:11434")
    model = "phi4"
else:
    # Standard Cloud Provider
    client = AnthropicClient()
    model = "claude-3-5-sonnet"
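
For readers wiring this up themselves, here is one way the provider flag could be parsed and resolved into a configuration object. This is a sketch under assumptions, not the repository's orchestrator.py: the `InferenceConfig` dataclass, the function names, and the choice of "anthropic" as the default are all hypothetical. The URL is Ollama's default local endpoint, as in the snippet above.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical description of where inference requests should go."""
    provider: str
    model: str
    base_url: Optional[str]  # None means "use the cloud SDK's default"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the Forensic Team")
    parser.add_argument(
        "--provider",
        choices=["ollama", "anthropic"],
        default="anthropic",  # assumed default, matching the else-branch above
        help="Inference engine to use",
    )
    return parser


def resolve_config(provider: str) -> InferenceConfig:
    """Translate the flag into concrete connection settings."""
    if provider == "ollama":
        # Local SLM engine on Ollama's default port.
        return InferenceConfig("ollama", "phi4", "http://localhost:11434")
    return InferenceConfig("anthropic", "claude-3-5-sonnet", None)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(resolve_config(args.provider))
```

Keeping the decision in a small, pure function means the rest of the orchestrator never needs to know which engine is behind the client.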

Because our TypeScript MCP Server is running locally via stdio, the latency is nearly zero. The “Librarian” fetches metadata from the local database, and the “Analyst” runs the audit—all without a single packet hitting the open web.

Benchmarking the Forensic Team: Cloud vs. Edge

Does a 14B model perform as well as a 400B model for forensics? When constrained by MCP schemas, the results are surprising.

| Criteria | Cloud (Claude/GPT-4) | Edge (Phi-4/Mistral) |
| --- | --- | --- |
| Reasoning Depth | Extremely High | High (with MCP Tool Constraints) |
| Latency | 1.5s – 3s (Network Dependent) | < 500ms (Local Inference) |
| Cost | Per-token billing | $0.00 |
| Privacy | Data processed externally | 100% Data Sovereignty |
| Scalability | Infinite | Limited by local RAM/NPU |

The Reveal: Same System, New Home

If you look at the latest update to the repository, you’ll see that the orchestration logic is nearly identical. The architecture stack from earlier posts remains unchanged.

Comparison diagram showing cloud-based AI architecture using large models and remote inference versus edge AI architecture using small language models and local MCP tool servers.
Edge AI architecture replaces cloud inference with local small language models while retaining MCP-based tool access.

Nothing about the agents changed.

Nothing about the tools changed.

Only the inference engine moved.

The “Zero-Glue” promise is realized here.

We didn’t build a cloud app; we built a protocol-driven system. The fact that it can live on a server or a laptop is simply a deployment choice.

What’s Next?

We’ve built the server. We’ve orchestrated the team. We’ve moved it to the edge.
In the final post of this series, we tackle the “Final Boss” of AI systems: Enterprise Governance. We’ll explore how to take this forensic lab and scale it across an organization using Oracle 26ai, ensuring that every audit is secure, permissioned, and defensible.

Ready to go local?

Check out the orchestrator.py update and try running the Forensic Team on your own machine.
👉 MCP Forensic Analyzer – Edge AI Example

