Who Audits the Auditors? Building an LLM-as-a-Judge for Agentic Reliability

We’ve built a powerful Forensic Team. They can find books, analyze metadata, and spot discrepancies using MCP.

But in the enterprise, ‘it seems to work’ isn’t a metric. If an agent misidentifies a $50,000 first edition, the liability is real.

Today, we move from Subjective Trust to Quantitative Reliability. We are building The Judge—a high-reasoning evaluator that audits our Forensic Team against a ‘Golden Dataset’ of ground-truth facts.

Before You Begin

Prerequisites: You should have an existing agentic workflow (see my MCP Forensic Series) and a high-reasoning model (e.g., Claude 3 Opus or GPT-4o) to act as the Judge.

1. The “Golden Dataset”

Before we can grade the agents, we need an Answer Key. We’re creating tests/golden_dataset.json. This file contains the “Ground Truth”—scenarios where we know there are errors.

Example Entry:

{
"test_id": "TC-001",
"input": "The Great Gatsby, 1925",
"expected_finding": "Page count mismatch: Observed 218, Standard 210",
"severity": "high"
}
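A minimal loader sketch for this file. The `load_golden_dataset` function and `REQUIRED_KEYS` schema check are illustrative names I'm introducing here, assuming each entry follows the shape of TC-001 above:

```python
import json

# Keys every golden-dataset entry must carry (assumed schema,
# matching the TC-001 example above).
REQUIRED_KEYS = {"test_id", "input", "expected_finding", "severity"}

def load_golden_dataset(path):
    """Load the answer key and fail fast on malformed entries."""
    with open(path) as f:
        cases = json.load(f)
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"{case.get('test_id', '?')} is missing {missing}")
    return cases
```

Failing fast here matters: a silently malformed answer key corrupts every downstream reliability score.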

Director’s Note: In an enterprise setting, “Reliability” is the precursor to “Permission”. You will not get the budget to scale agents until you can prove they won’t hallucinate $50k errors. This framework provides the data you need for that internal sell.

2. The Judge’s Rubric

A good Judge needs a rubric. We aren’t just looking for “Yes/No.” We want to grade on:

  • Precision: Did it find only the real errors?
  • Recall: Did it find all the real errors?
  • Reasoning: Did it explain why it flagged the record?

3. Refactoring for Resilience

Before building the Judge, we had to address a common “Senior-level” trap: hardcoding agent logic. Based on architectural reviews, we moved our system prompts from the Python client into a dedicated config/prompts.yaml.

This isn’t just about clean code; it’s about Observability. By decoupling the “Instructions” from the “Execution,” we can now A/B test different prompt versions against the Judge to see which one yields the highest accuracy for specific models.
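A sketch of what that decoupling enables, assuming PyYAML is installed; the YAML shape and the `get_prompt` helper are hypothetical, not the repo's actual file:

```python
import yaml  # assumes PyYAML is installed (pip install pyyaml)

# Hypothetical shape for config/prompts.yaml -- the real file lives in the repo.
EXAMPLE_YAML = """
forensic_agent:
  v1: "You are a rare-book forensic analyst. Flag metadata discrepancies."
  v2: "You are a rare-book forensic analyst. Flag discrepancies and cite evidence."
"""

def get_prompt(prompts, agent, version):
    """Select a prompt version so variants can be A/B tested against the Judge."""
    return prompts[agent][version]

prompts = yaml.safe_load(EXAMPLE_YAML)
```

Because the version is just a key, running the full evaluation suite once per version gives you an accuracy number for each prompt, per model.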

4. The Implementation: The Evaluation Loop

We’ve added evaluator.py to the repo. It doesn’t just run the agents; it monitors their “vital signs.”

  • Error Transparency: We replaced “swallowed” exceptions with structured logging. If a provider fails, the system logs the incident for diagnosis instead of failing silently.
  • The Handshake: The loop runs the Forensic Team, collects their logs, and submits the whole package to a high-reasoning Judge Agent.
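The loop itself can be sketched in a few lines. This is a simplified stand-in for `evaluator.py`, with the agent runner and the Judge injected as callables so they can be swapped or stubbed:

```python
def evaluate(cases, run_agents, judge):
    """Run the Forensic Team on each golden case, then hand the full
    package (output + logs) to the Judge for scoring."""
    report = []
    for case in cases:
        try:
            output, logs = run_agents(case["input"])
        except Exception as exc:
            # Structured incident record instead of a swallowed exception.
            report.append({"test_id": case["test_id"], "error": repr(exc)})
            continue
        verdict = judge(output=output, logs=logs,
                       expected=case["expected_finding"])
        report.append({"test_id": case["test_id"], **verdict})
    return report
```

Note that a provider failure becomes a row in the report, not a crash: the reliability score should reflect flaky infrastructure, not hide it.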

The Evaluator-Optimizer Blueprint

This diagram represents our move from “Does the code run?” to “Does the intelligence meet the quality bar?” This closed-loop system is required before we can start the cost optimization of routing simpler tasks to smaller models.

[Diagram: the Evaluator-Optimizer loop. A Golden Dataset feeds an Agent Execution layer, which passes outputs and logs to a Judge Agent for scoring against a rubric. The final Reliability Report feeds back into prompt tuning and iterative improvement.]
The Evaluator-Optimizer Loop: moving from manual vibe-checks to automated, quantitative reliability scoring.

Director-Level Insight: The “Accuracy vs. Cost” Curve

As a Director, I don’t just care about “cost per token.” I care about Defensibility. If a forensic audit is challenged, I need to show a historical accuracy rating. By implementing this Evaluator, we move from “Vibe-checking” to a Quantitative Reliability Score. This allows us to set a “Minimum Quality Bar” for deployment. If a model update or a prompt change drops our accuracy by 2%, the Judge blocks the deployment.
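The deployment gate described above is a one-liner once you have the Judge's score. A sketch with hypothetical numbers (the baseline and tolerance are placeholders you'd calibrate from your own history):

```python
BASELINE_ACCURACY = 0.94  # hypothetical historical accuracy from past Judge runs
MAX_REGRESSION = 0.02     # block deploys that regress accuracy by more than 2 points

def deployment_allowed(new_accuracy, baseline=BASELINE_ACCURACY,
                       tolerance=MAX_REGRESSION):
    """The Judge's gate: a model or prompt change ships only if it
    stays within tolerance of the historical baseline."""
    return new_accuracy >= baseline - tolerance
```

Wire this into CI the same way you'd gate on failing unit tests, and "prompt change dropped accuracy" becomes a blocked merge instead of a production incident.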

The Production-Grade AI Series

  • Post 1: The Judge Agent — You are here
  • Post 2: The Accountant (Cognitive Budgeting & Model Routing) — Coming Soon
  • Post 3: The Guardian (Human-in-the-Loop Handshakes) — Coming Soon

Looking for the foundation? Check out my previous series: The Zero-Glue AI Mesh with MCP.


The Backyard Quarry, Part 5: Digital Twins for Physical Objects

At this point in the Backyard Quarry project, something subtle has happened.

We started with a pile of rocks.

We now have:

  • a schema
  • a capture process
  • stored images
  • searchable metadata
  • classification
  • lifecycle states

Each rock has a record.

Each record represents something in the physical world.

And that leads to a useful observation.

We’re no longer just cataloging rocks.

We’re building digital representations of them.

What Is a Digital Twin?

In simple terms, a digital twin is:

A structured digital representation of a physical object.

That representation can include:

  • identity
  • properties
  • visual data
  • state
  • history

In the context of the Quarry, a rock’s digital twin might look like:

rock_id: QRY-042
weight_lb: 12.3
dimensions_cm: 18 x 10 x 7
color: gray
rock_type: granite
status: for_sale
images: [rock_042_1.jpg, rock_042_2.jpg]
model: rock_042.obj

It’s not the rock itself.

But it’s a useful abstraction of it.
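The record above translates almost directly into a typed structure. A minimal sketch (the `RockTwin` class and its field names are my illustration of the schema, not code from the project):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RockTwin:
    """A rock's digital twin: identity, properties, visual data, and state."""
    rock_id: str
    weight_lb: float
    dimensions_cm: str
    color: str
    rock_type: str
    status: str = "collected"
    images: List[str] = field(default_factory=list)
    model: Optional[str] = None  # path to a 3D scan, if one exists
```

The defaults encode the lifecycle's starting point: a new twin begins life as "collected" with no assets attached yet.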

More Than Just Metadata

At first glance, a digital twin might look like a simple database record.

But there’s an important difference.

A well-designed digital twin combines multiple types of data:

  • structured metadata (easy to query)
  • unstructured assets (images, models)
  • derived attributes (classification, embeddings)
  • state over time

It’s not just describing the object.

It’s enabling interaction with it through software.

The Time Dimension

One of the most important aspects of a digital twin is that it can change over time.

Even a rock — which is about as static as objects get — has a lifecycle in the system:

collected → cataloged → listed_for_sale → sold

Each transition adds context.

Now we’re not just storing a snapshot.

We’re tracking a history.
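The lifecycle above can be sketched as a small transition table plus an append-only history (names are illustrative, not from the project's code):

```python
# Legal lifecycle transitions from the Quarry's state diagram.
TRANSITIONS = {
    "collected": {"cataloged"},
    "cataloged": {"listed_for_sale"},
    "listed_for_sale": {"sold"},
    "sold": set(),  # terminal state
}

def advance(history, new_status):
    """Record a state change only if it is a legal transition,
    returning the full history rather than overwriting a snapshot."""
    current = history[-1]
    if new_status not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    return history + [new_status]
```

Returning the whole history, instead of mutating a single `status` field, is exactly the snapshot-to-history shift described above.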

This becomes much more important in other domains.

Where This Shows Up

The interesting part is that this pattern isn’t unique to rocks.

It appears in many different systems.

Manufacturing

  • digital twins of machine parts
  • tracking condition and usage
  • linking physical components to system data

Museums and Archives

  • artifacts with metadata, images, provenance
  • digitized collections
  • searchable historical records

Agriculture

  • crops tracked over time
  • environmental data
  • growth and yield metrics

Healthcare and Motion

  • human movement captured as data
  • gait analysis
  • rehabilitation tracking

This last one starts to look a lot like something else entirely.

From Objects to Systems

What the Backyard Quarry demonstrates, in a small way, is that once you:

  • represent objects as data
  • capture their properties
  • store and index them

you’ve created the foundation for a larger system.

The digital twin becomes a building block.

And systems are built from collections of these building blocks.

The Abstraction Layer

A useful way to think about digital twins is as an abstraction layer.

They sit between the physical objects and the applications that use them.

[Diagram: physical objects are captured and represented as digital twins, with metadata, asset, and application layers.]
Digital twins act as a bridge between physical objects and software systems.

Applications don’t interact with rocks directly.

They interact with the representation of rocks.

That layer enables:

  • search
  • analytics
  • visualization
  • automation

Without it, everything remains manual and unstructured.
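The search capability, for instance, falls out of the layer almost for free. A sketch, with twins reduced to plain dicts for brevity (the `search` helper and the sample inventory are hypothetical):

```python
def search(twins, **criteria):
    """Query the representation layer; applications never touch the rocks."""
    return [t for t in twins
            if all(t.get(k) == v for k, v in criteria.items())]

inventory = [
    {"rock_id": "QRY-042", "rock_type": "granite", "status": "for_sale"},
    {"rock_id": "QRY-043", "rock_type": "quartz", "status": "collected"},
]
```

Analytics, visualization, and automation follow the same pattern: they are all just functions over the collection of twins.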

The Limits of the Model

Of course, digital twins are not perfect representations.

They are approximations.

Some properties are easy to capture.

Others are difficult or impossible.

Even in the Quarry:

  • weight is approximate
  • dimensions are imprecise
  • visual data depends on lighting
  • 3D models may be incomplete

The goal isn’t perfect fidelity.

It’s usefulness.

The Real Insight

At this point, the Backyard Quarry starts to feel less like a joke and more like a small version of a much larger idea.

Many modern systems are built around digital twins.

Not because the concept is new.

But because we now have the tools to make it practical:

  • cheap sensors
  • high-resolution cameras
  • scalable storage
  • machine learning

The pattern has existed for a long time.

The difference is that we can now implement it at scale.

What Comes Next

So far, the Quarry system works at a small scale.

A handful of rocks.

A manageable dataset.

But what happens when the number of objects grows?

When the dataset becomes:

  • hundreds
  • thousands
  • or millions

The next post explores that question.

Because designing a system for a small dataset is one thing.

Designing a system that scales is something else entirely.

And somewhere along the way, it becomes clear that a pile of rocks is enough to illustrate ideas that show up across entire industries.

Yet another surprise in this Backyard Quarry journey.

The Rock Quarry Series
