The Accountant: Optimizing AI Costs with Semantic Routing

We’ve solved the Reliability problem with The Judge. We have a system that can scientifically prove whether our Forensic Team is accurate. But there’s a new problem that keeps Directors and CFOs up at night: Sustainability.

In an enterprise environment, using a massive, high-reasoning model (like Claude 3.5 or GPT-4o) for every single bibliography lookup is a “Cognitive Budget” disaster. It’s like hiring a Senior Architect to fix a broken link.

Today, we introduce The Accountant: A Semantic Router that classifies task complexity and routes requests to the cheapest model capable of passing the Judge’s rubric.

1. The Concept of “Tiered Intelligence”

Not all forensic tasks require the same level of “gray matter.” To scale effectively, we must categorize our workload:

  • LEVEL 1 (Operational): “Find the standard page count for the 1925 edition of Gatsby.” This is a lookup and retrieval task. Local SLMs (Small Language Models) like Phi-4 or Llama 3.2 excel here.
  • LEVEL 2 (Forensic): “Compare the binding grain and typography inconsistencies between two suspected forgeries.” This requires high-dimensional analysis and deep reasoning. This is a job for the Cloud.
[Figure: The Semantic Router Architecture, implementing Tiered Intelligence to optimize cognitive budget and reduce inference costs. A user request enters the router (the Accountant), which classifies it as Level 1 (simple/metadata) or Level 2 (complex forensic). Level 1 routes to a local Tier 1 SLM such as Phi-4 or Llama 3.2; Level 2 routes to a Tier 2 frontier cloud model such as Claude 3.5. Both paths converge to produce the final Audit Report.]
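To make the tiers concrete, here is one way they might be expressed as configuration. The model names and provider keys mirror the examples above; the dictionary structure and the `get_provider_for_level` helper are an illustrative sketch, not the repository's actual code.

```python
# Illustrative tier configuration; model and provider names follow the
# post's examples, the structure is an assumption about router.py.
TIERS = {
    "LEVEL_1": {  # Operational: lookup and retrieval
        "provider": "ollama",
        "models": ["phi-4", "llama3.2"],
    },
    "LEVEL_2": {  # Forensic: high-dimensional analysis, deep reasoning
        "provider": "anthropic",
        "models": ["claude-3-5-sonnet"],
    },
}

def get_provider_for_level(level: str) -> str:
    """Map a classified level to a provider, failing safe to LEVEL_2."""
    return TIERS.get(level, TIERS["LEVEL_2"])["provider"]
```

Note the fallback: an unknown or failed classification lands on the expensive, accurate tier, never the cheap one.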

2. Implementing the Router (The Gatekeeper Pattern)

We’ve added router.py to our repository. The logic acts as a gatekeeper.
1. Classification: A lightweight model (the Accountant) reviews the user’s query against our config/prompts.yaml.
2. Economic Decision: If the query is “Level 1”, we trigger the ollama provider. If it’s “Level 2,” we escalate to the anthropic provider.

# The Accountant's Decision Engine in router.py
try:
    level = await classify_query(query)
except Exception:
    # Fail safe: if classification errors out, assume the task is complex.
    level = "LEVEL_2"

provider = get_provider_for_level(level)

if level == "LEVEL_1":
    print("Accountant Decision: LEVEL_1 - Routing to Local SLM to save budget")
else:
    print("Accountant Decision: LEVEL_2 - Routing to High-Reasoning Cloud Model")

By defaulting to LEVEL_2 when classification fails, we never sacrifice accuracy for cost: we only save money when we are certain a task is simple.

3. Projecting the ROI with The Judge

While we built the Accountant (the router), we haven’t yet run a full-scale economic audit in this repository. However, the architecture is designed to scientifically measure this trade-off using the Judge Agent (from our last post).

In an enterprise environment, a Director would use this framework to benchmark a representative sample of historical queries. A typical analysis for tiered intelligence systems shows that the vast majority of “forensic” requests are actually simple metadata lookups. By routing those to a local SLM (Phi-4 or Llama 3.2), we can achieve comparable reliability scores to a frontier cloud model while zeroing out the marginal cost of those specific tokens.

The Theoretical Savings (100k Calls/Month):

  • Current Cost (Frontier Cloud for 100% of tasks): ~$7,600/month
  • Projected Cost (90/10 Routed Split): ~$1,800/month
  • Total Savings: ~76% reduction in inference costs.
Task Category                        | Estimated Volume | “Status Quo” Cost (Frontier Cloud) | “Routed” Cost (Accountant/SLM)
Level 1 (Standard Lookup/Formatting) | 90% (90k calls)  | ~$4,500                            | ~$0 (Local/Self-Hosted)
Level 2 (Deep Forensic Analysis)     | 10% (10k calls)  | ~$3,100                            | ~$1,800*
Total Cognitive Budget               | 100%             | ~$7,600                            | ~$1,800

* Note: Level 2 “Routed” costs are lower here because the Accountant ensures only the most complex 10% of tokens hit the high-cost provider, whereas the “Status Quo” assumes a higher average cost across all 100k calls due to the lack of optimization.
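The table's arithmetic can be reproduced with a small cost model. The per-call rates below are hypothetical values chosen to match the projections above, not measured API prices.

```python
# Hypothetical per-call costs, chosen to reproduce the projection above.
CALLS_PER_MONTH = 100_000
LEVEL_1_SHARE = 0.90

STATUS_QUO_COST_PER_CALL = 0.076   # frontier cloud, no optimization
LOCAL_COST_PER_CALL = 0.0          # self-hosted SLM (zero marginal token cost)
ROUTED_CLOUD_COST_PER_CALL = 0.18  # frontier cloud, only the complex tokens

def status_quo_cost(calls: int) -> float:
    """Every call hits the frontier cloud model."""
    return calls * STATUS_QUO_COST_PER_CALL

def routed_cost(calls: int, level_1_share: float) -> float:
    """Level 1 goes local for free; only Level 2 pays cloud rates."""
    level_1 = calls * level_1_share
    level_2 = calls - level_1
    return level_1 * LOCAL_COST_PER_CALL + level_2 * ROUTED_CLOUD_COST_PER_CALL

before = status_quo_cost(CALLS_PER_MONTH)            # ~$7,600/month
after = routed_cost(CALLS_PER_MONTH, LEVEL_1_SHARE)  # ~$1,800/month
savings = 1 - after / before                         # ~76% reduction
```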

Cognitive Budgeting Insights

As a Director, my responsibility is to build Sustainable Intelligence. If 80% of an AI workload can be moved to local infrastructure or cheaper “Flash” models without dropping our reliability score, the AI team stops being a cost center and becomes a profit center. Semantic routing lets us scale AI horizontally without the cloud bill scaling vertically.

🛠️ Step into the Clean-Room

The Accountant logic is now live in the repository. You can test the routing logic yourself by running the local orchestrator with the --use-accountant flag.

Explore the Code: MCP Forensic Analyzer on GitHub

(If this architecture helps your team justify their AI spend, consider dropping a ⭐ on the repo!)

The Production-Grade AI Series

  • Post 1: The Judge Agent: Who Audits the Auditors? (Reliability)
  • Post 2: The Accountant: Optimizing AI Costs with Semantic Routing (Sustainability) – You’re Here
  • Post 3: The Guardian: Human-in-the-Loop Governance (Safety) – Coming Soon

Looking for the foundation? Check out my previous series: The Zero-Glue AI Mesh with MCP.


The Backyard Quarry, Part 6: Scaling the Quarry

So far, the Backyard Quarry system has worked well.

We have:

  • a schema
  • a capture process
  • stored assets
  • searchable data
  • digital twins

For a small dataset, everything feels manageable.

A few rocks here and there.

A handful of records.

It’s easy to reason about the system.

When the Dataset Grows

The moment the dataset starts to grow, the assumptions change.

Instead of a few rocks, imagine:

  • hundreds
  • thousands
  • eventually, many thousands

At that point, a few new questions appear:

  • How do we process incoming data efficiently?
  • Where do we store large assets?
  • How do we keep queries fast?
  • What happens when processing takes longer than capture?

These are the same questions that show up in any system dealing with real-world data.

The Pipeline Becomes the System

At small scale, the pipeline is implicit.

You take a photo.

You upload it.

You update a record.

At larger scale, that approach breaks down.

The pipeline becomes explicit.

[Figure: A scalable data pipeline for physical objects (capture, ingestion queue, processing workers, storage, indexing). At scale, simple data flows evolve into multi-stage pipelines with decoupled processing and storage.]

Each stage now has a role:

  • capture generates raw input
  • ingestion buffers incoming data
  • processing transforms it
  • storage persists it
  • indexing makes it usable

What used to be a simple flow becomes a system of components.
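The five roles can be sketched as explicit stages. Every function here is a stand-in for a real component, not code from the Quarry itself:

```python
# Illustrative pipeline stages; each function is a placeholder component.
def capture(photo_path):            # capture generates raw input
    return {"path": photo_path}

def ingest(item, buffer):           # ingestion buffers incoming data
    buffer.append(item)

def process(item):                  # processing transforms it
    item["features"] = ["gray", "rounded"]
    return item

def store(item, db):                # storage persists it
    db[item["path"]] = item

def index(item, idx):               # indexing makes it usable
    for feature in item["features"]:
        idx.setdefault(feature, []).append(item["path"])

buffer, db, idx = [], {}, {}
ingest(capture("rock_001.jpg"), buffer)
for item in buffer:
    processed = process(item)
    store(processed, db)
    index(processed, idx)
```

Each stage only knows about its own input and output, which is what makes the later decoupling possible.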

Decoupling the System

One of the first things that happens at scale is decoupling.

Instead of doing everything at once, we separate concerns:

  • capture does not block processing
  • processing does not block storage
  • storage does not block indexing

This introduces queues and asynchronous work.

Instead of:

take photo → process → store → done

we now have:

take photo → enqueue → process later → update system

This improves resilience.

It also introduces complexity.
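A minimal sketch of that decoupling, using a worker thread and an in-process queue. A real system would use a durable queue, and `process_photo` is a placeholder for the slow step:

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
processed = []

def process_photo(path: str) -> None:
    # Placeholder for the slow step (resizing, feature extraction, ...).
    processed.append(path)

def worker() -> None:
    while True:
        path = jobs.get()
        if path is None:         # sentinel value shuts the worker down
            jobs.task_done()
            break
        process_photo(path)      # slow work happens off the capture path
        jobs.task_done()

def capture(photo_path: str) -> None:
    """Enqueue and return immediately; capture never blocks on processing."""
    jobs.put(photo_path)

threading.Thread(target=worker, daemon=True).start()
capture("rock_001.jpg")
capture("rock_002.jpg")
jobs.put(None)                   # signal shutdown
jobs.join()                      # wait until everything is processed
```

Capture finishes in microseconds regardless of how slow processing is; the queue absorbs the difference.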

Storage Starts to Matter

At small scale, storage decisions are easy.

At larger scale, they matter.

We now have different types of data:

  • metadata (small, structured)
  • images (large, unstructured)
  • 3D models (larger, computationally expensive to generate)

These tend to be stored differently:

  • database for structured data
  • object storage for assets
  • references connecting the two

This separation becomes critical for performance and cost.
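A sketch of that separation, using SQLite for the structured side. The asset itself would live in object storage (an S3-style bucket); the database only keeps a reference key. Schema and names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rocks (
        id INTEGER PRIMARY KEY,
        name TEXT,
        image_key TEXT   -- points into object storage, not stored inline
    )
""")
# The large image is NOT inserted here; only its object-store key is.
conn.execute(
    "INSERT INTO rocks (name, image_key) VALUES (?, ?)",
    ("granite_042", "assets/images/granite_042.jpg"),
)
row = conn.execute("SELECT name, image_key FROM rocks").fetchone()
```

Queries stay fast because the database rows remain tiny, and the expensive bytes live on storage priced for bulk.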

Processing Becomes a Bottleneck

Not all steps in the pipeline are equal.

Some are fast:

  • inserting metadata
  • updating records

Others are slow:

  • generating 3D models
  • running image processing
  • extracting features

As the dataset grows, these slower steps become bottlenecks.

Which leads to another pattern:

Parallelization.

Instead of one process handling everything, we distribute the work.

Multiple workers.

Multiple jobs.

Multiple stages running simultaneously.
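A sketch of distributing a slow stage across a pool of workers, assuming the jobs are independent. `generate_model` stands in for real 3D model generation:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_model(rock_id: int) -> str:
    # Stand-in for a slow, independent step like 3D model generation.
    return f"model_{rock_id}.glb"

rock_ids = range(8)
# Four workers chew through the backlog concurrently; map preserves order.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(generate_model, rock_ids))
```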

Indexing at Scale

Search also changes at scale.

At small scale:

  • simple queries are fast
  • no special indexing required

At larger scale:

  • indexes must be built and maintained
  • similarity search requires preprocessing
  • updates must propagate through the system

Search becomes an active part of the pipeline, not just a query on top of it.
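A small illustration of the shift, in SQLite: at scale the index is created and maintained explicitly, and the query planner confirms it is being used. Table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rocks (id INTEGER PRIMARY KEY, rock_type TEXT)")
conn.executemany(
    "INSERT INTO rocks (rock_type) VALUES (?)",
    [("granite",), ("basalt",), ("granite",)],
)
# Without this, every lookup by type scans the whole table.
conn.execute("CREATE INDEX idx_rock_type ON rocks (rock_type)")
# The planner's detail column shows the index is actually used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM rocks WHERE rock_type = ?", ("granite",)
).fetchone()
```

Building the index is now a pipeline step in its own right: every insert and update has to keep it current.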

Failure Becomes Normal

At small scale, failures are rare and easy to fix.

At larger scale, failures are expected.

Examples:

  • missing images
  • failed processing jobs
  • incomplete models
  • inconsistent metadata

The system must tolerate these failures.

Not eliminate them.

This leads to:

  • retries
  • partial results
  • eventual consistency

In other words, the system becomes more realistic.
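The retry pattern at the heart of that tolerance can be sketched in a few lines: exhausting all attempts yields a partial result instead of crashing the pipeline. Function names here are illustrative:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Run fn, retrying with exponential backoff; None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying
    return None  # partial result: the caller tolerates the gap

calls = {"n": 0}
def flaky_job():
    # Simulates a job that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky_job)
```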

A Familiar Architecture

At this point, the Backyard Quarry starts to resemble a typical data platform.

[Figure: Layered architecture showing physical-world input flowing through capture, ingestion, processing, storage, indexing, and application layers. A common architectural pattern for systems that transform physical inputs into digital data.]

Different domains implement this differently.

But the structure is remarkably consistent.

The Tradeoff

Scaling introduces tradeoffs.

We gain:

  • throughput
  • flexibility
  • resilience

We lose:

  • simplicity
  • immediacy
  • ease of reasoning

What was once a straightforward system becomes a collection of interacting parts.

The Real Shift

The most important change isn’t technical.

It’s conceptual.

At small scale, you think about individual objects.

At larger scale, you think about systems.

You stop asking:

How do I store this rock?

And start asking:

How does the system handle many rocks over time?

That shift is what turns a project into a platform.

What Comes Next

At this point, the Backyard Quarry is no longer just a small experiment.

It’s a miniature version of a data platform.

And the patterns we’ve seen — schema design, pipelines, indexing, scaling — show up in many places.

In the next post, we’ll zoom out even further.

Because once you start recognizing these patterns, you begin to see them everywhere.

Not just in rock piles.

But in systems across industries.

And somewhere along the way, the Quarry stopped being about rocks.

It became about how systems grow.

