Who Audits the Auditors? Building an LLM-as-a-Judge for Agentic Reliability

We’ve built a powerful Forensic Team. They can find books, analyze metadata, and spot discrepancies using MCP.

But in the enterprise, ‘it seems to work’ isn’t a metric. If an agent misidentifies a $50,000 first edition, the liability is real.

Today, we move from Subjective Trust to Quantitative Reliability. We are building The Judge—a high-reasoning evaluator that audits our Forensic Team against a ‘Golden Dataset’ of ground-truth facts.

Before You Begin

Prerequisites: You should have an existing agentic workflow (see my MCP Forensic Series) and a high-reasoning model (e.g., Claude 3 Opus or GPT-4o) to act as the Judge.

1. The “Golden Dataset”

Before we can grade the agents, we need an Answer Key. We’re creating tests/golden_dataset.json. This file contains the “Ground Truth”—scenarios where we know there are errors.

Example Entry:

{
  "test_id": "TC-001",
  "input": "The Great Gatsby, 1925",
  "expected_finding": "Page count mismatch: Observed 218, Standard 210",
  "severity": "high"
}
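
In evaluator code, the answer key needs to be loaded and typed before any grading happens. Here is a minimal sketch, assuming the file holds a JSON array of entries like the one above; the GoldenCase dataclass and loader below are illustrative helpers, not part of the repo:

import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GoldenCase:
    test_id: str
    input: str
    expected_finding: str
    severity: str

def load_golden_dataset(path: str = "tests/golden_dataset.json") -> list[GoldenCase]:
    """Read the answer key and turn each entry into a typed test case."""
    entries = json.loads(Path(path).read_text())
    return [GoldenCase(**entry) for entry in entries]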

Director’s Note: In an enterprise setting, “Reliability” is the precursor to “Permission”. You will not get the budget to scale agents until you can prove they won’t hallucinate $50k errors. This framework provides the data you need for that internal sell.

2. The Judge’s Rubric

A good Judge needs a rubric. We aren’t just looking for “Yes/No.” We want to grade on:

  • Precision: Did it find only the real errors?
  • Recall: Did it find all the real errors?
  • Reasoning: Did it explain why it flagged the record?
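
Precision and recall here are the standard definitions applied to findings; the Reasoning dimension is scored by the Judge model itself rather than computed. A minimal sketch of the arithmetic, using a hypothetical findings_precision_recall helper and short finding labels for readability:

def findings_precision_recall(expected: set[str], found: set[str]) -> tuple[float, float]:
    """Precision: share of flagged findings that are real errors.
    Recall: share of real errors that were flagged."""
    true_positives = len(expected & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Example: the agent flags two findings, only one of which is in the answer key.
p, r = findings_precision_recall(
    expected={"page_count_mismatch"},
    found={"page_count_mismatch", "publisher_typo"},
)
# p == 0.5 (one false positive), r == 1.0 (nothing missed)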

3. Refactoring for Resilience

Before building the Judge, we had to address a common “Senior-level” trap: hardcoding agent logic. Based on architectural reviews, we moved our system prompts from the Python client into a dedicated config/prompts.yaml.

This isn’t just about clean code; it’s about Observability. By decoupling the “Instructions” from the “Execution,” we can now A/B test different prompt versions against the Judge to see which one yields the highest accuracy for specific models.
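
As a sketch of what that decoupling can look like, here is an illustrative prompt loader; the file layout and keys shown in the comments are assumptions, not the repo's actual schema:

import yaml  # pip install pyyaml

def load_prompt(agent: str, version: str = "v1", path: str = "config/prompts.yaml") -> str:
    """Fetch a system prompt by agent name and version, so prompt changes
    never require touching the Python client."""
    with open(path) as f:
        prompts = yaml.safe_load(f)
    # Assumed layout of config/prompts.yaml:
    # forensic_analyst:
    #   v1: "You are a rare-book forensic analyst..."
    #   v2: "You are a meticulous bibliographic auditor..."
    return prompts[agent][version]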

4. The Implementation: The Evaluation Loop

We’ve added evaluator.py to the repo. It doesn’t just run the agents; it monitors their “vital signs.”

  • Error Transparency: We replaced “swallowed” exceptions with structured logging. If a provider fails, the system logs the incident for diagnosis instead of failing silently.
  • The Handshake: The loop runs the Forensic Team, collects their logs, and submits the whole package to a high-reasoning Judge Agent.
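
Conceptually, the loop looks like the sketch below. The forensic_team.analyze and judge.score calls are stand-ins for your own agent and Judge interfaces; the actual evaluator.py in the repo differs in detail:

import logging

logger = logging.getLogger("evaluator")

def run_evaluation(golden_cases, forensic_team, judge):
    """Run every golden case through the agents, then hand the evidence to the Judge."""
    results = []
    for case in golden_cases:
        try:
            finding, logs = forensic_team.analyze(case.input)
        except Exception:
            # Error transparency: log the incident for diagnosis instead of failing silently.
            logger.exception("Provider failure on %s", case.test_id)
            continue
        # The handshake: the Judge sees the expected answer, the agent's answer,
        # and the execution logs, and returns rubric scores.
        verdict = judge.score(expected=case.expected_finding, actual=finding, logs=logs)
        results.append({"test_id": case.test_id, **verdict})
    return results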

The Evaluator-Optimizer Blueprint

This diagram represents our move from “Does the code run?” to “Does the intelligence meet the quality bar?” This closed-loop system is required before we can start the fiscal optimization of routing simpler tasks to smaller models.

Architectural diagram of an AI Evaluator-Optimizer loop. It shows a Golden Dataset feeding into an Agent Execution layer, which then passes outputs and logs to a Judge Agent for scoring against a rubric. The final Reliability Report provides a feedback loop for prompt tuning and iterative improvement.
The Evaluator-Optimizer Loop: moving from manual vibe-checks to automated, quantitative reliability scoring.

Director-Level Insight: The “Accuracy vs. Cost” Curve

As a Director, I don’t just care about “cost per token.” I care about Defensibility. If a forensic audit is challenged, I need to show a historical accuracy rating. By implementing this Evaluator, we move from “Vibe-checking” to a Quantitative Reliability Score. This allows us to set a “Minimum Quality Bar” for deployment. If a model update or a prompt change drops our accuracy by 2%, the Judge blocks the deployment.
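
In CI terms, the “Minimum Quality Bar” can be a simple gate. The numbers below are illustrative, not real measurements; the sketch only shows the shape of the check:

BASELINE_ACCURACY = 0.94   # assumed historical score from previous Judge runs
MAX_REGRESSION = 0.02      # the "2% drop" rule described above

def quality_gate(current_accuracy: float, baseline: float = BASELINE_ACCURACY) -> bool:
    """Return True only if the new prompt or model version may ship."""
    return (baseline - current_accuracy) <= MAX_REGRESSION

if not quality_gate(current_accuracy=0.91):
    raise SystemExit("Judge blocked deployment: accuracy regression exceeds 2%")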

The Production-Grade AI Series

  • Post 1: The Judge Agent — You are here
  • Post 2: The Accountant (Cognitive Budgeting & Model Routing) — Coming Soon
  • Post 3: The Guardian (Human-in-the-Loop Handshakes) — Coming Soon

Looking for the foundation? Check out my previous series: The Zero-Glue AI Mesh with MCP.


Building Your First MCP Server: TypeScript vs. Python

The 5-Minute “Hello World” Comparison

We’ve spent the last month talking about the End of Glue Code and the Enterprise AI Mesh. But if you’re a developer, you don’t just want to see the blueprint—you want to hold the tools.

Whether you are a TypeScript veteran or a Python enthusiast, building an MCP server is surprisingly simple. Today, we’re going to build the same “Hello World” tool in both languages to show you exactly how the protocol abstracts away the complexity.

1. The TypeScript Approach (Node.js)

TypeScript is the “native” language of the Model Context Protocol, and the @modelcontextprotocol/sdk is exceptionally robust for high-performance enterprise tools.

Prerequisites:

npm install @modelcontextprotocol/sdk zod

The Code:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// McpServer is the high-level API that exposes the tool() helper used below.
const server = new McpServer({
  name: "hello-world-server",
  version: "1.0.0",
});

// Define a simple greeting tool
server.tool(
  "greet_user",
  { name: z.string().describe("The name of the person to greet") },
  async ({ name }) => {
    return {
      content: [{ type: "text", text: `Hello, ${name}! Welcome to the MCP Mesh.` }]
    };
  }
);

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
}

main().catch(console.error);

2. The Python Approach

For data scientists and AI engineers, the Python SDK offers a beautifully concise, decorator-based approach. It feels more “agent-native” and integrates seamlessly with existing AI libraries.

Prerequisites:

pip install mcp

The Code:

from mcp.server.fastmcp import FastMCP

# Initialize FastMCP - the "Quick Start" wrapper
mcp = FastMCP("HelloWorld")

@mcp.tool()
async def greet_user(name: str) -> str:
    """Greets a user by name."""
    return f"Hello, {name}! Welcome to the MCP Mesh."

if __name__ == "__main__":
    mcp.run(transport='stdio')

Side-by-Side: Which Should You Choose?

Feature       TypeScript (Standard SDK)            Python (FastMCP)
Best For      High-performance, Type-safe tools    Rapid prototyping, AI logic
Validation    Zod (Explicit & Strict)              Pydantic / Type Hints (Implicit)
Verbosity     Moderate (Structured)                Minimal (Decorator-based)
Transport     STDIO, SSE, Custom                   STDIO, SSE

How to Test Your Server

Once you’ve saved your code, you don’t need a complex frontend to test it. Use the MCP Inspector:

# For TypeScript
npx @modelcontextprotocol/inspector node build/index.js

# For Python
npx @modelcontextprotocol/inspector python your_script.py

This will launch a local web interface where you can perform the “Protocol Handshake” and trigger your tools manually. It’s the best way to verify your “Zero-Glue” infrastructure before connecting it to an agent.

Conclusion

The “Zero-Glue” architecture isn’t about which language you use—it’s about the Protocol. As you can see, the logic for the “Hello World” tool is nearly identical in both versions. The Model Context Protocol ensures that no matter how you build your tools, your agents can discover and use them in a standardized way.

Ready to build your own?

Check out the reference repo for more complex examples, including Notion and Oracle 26ai integrations.

MCP Forensic Analyzer Repository

The “Zero-Glue” Series

What’s Next?

The Mesh is built.
The agents are ready.
But can you trust them?

In my next series, we explore the ‘Science of Reliability’—building the evaluators that turn AI experiments into production-grade systems.
