The Backyard Quarry, Part 6: Scaling the Quarry

So far, the Backyard Quarry system has worked well.

We have:

  • a schema
  • a capture process
  • stored assets
  • searchable data
  • digital twins

For a small dataset, everything feels manageable.

A few rocks here and there.

A handful of records.

It’s easy to reason about the system.

When the Dataset Grows

The moment the dataset starts to grow, the assumptions change.

Instead of a few rocks, imagine:

  • hundreds
  • thousands
  • eventually, many thousands

At that point, a few new questions appear:

  • How do we process incoming data efficiently?
  • Where do we store large assets?
  • How do we keep queries fast?
  • What happens when processing takes longer than capture?

These are the same questions that show up in any system dealing with real-world data.

The Pipeline Becomes the System

At small scale, the pipeline is implicit.

You take a photo.

You upload it.

You update a record.

At larger scale, that approach breaks down.

The pipeline becomes explicit.

Diagram showing a scalable data pipeline for physical objects including capture, ingestion queue, processing workers, storage, and indexing.
At scale, simple data flows evolve into multi-stage pipelines with decoupled processing and storage.

Each stage now has a role:

  • capture generates raw input
  • ingestion buffers incoming data
  • processing transforms it
  • storage persists it
  • indexing makes it usable

What used to be a simple flow becomes a system of components.

Decoupling the System

One of the first things that happens at scale is decoupling.

Instead of doing everything at once, we separate concerns:

  • capture does not block processing
  • processing does not block storage
  • storage does not block indexing

This introduces queues and asynchronous work.

Instead of:

take photo → process → store → done

we now have:

take photo → enqueue → process later → update system

This improves resilience.

It also introduces complexity.
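The enqueue-then-process-later flow can be sketched in a few lines. This is a minimal illustration, assuming an in-process queue stands in for a real message broker, and the names (capture, worker, ingest_queue) are illustrative, not from the Quarry codebase:

```python
import queue
import threading

ingest_queue = queue.Queue()
processed = []

def capture(photo_id):
    ingest_queue.put(photo_id)        # capture only enqueues; it never waits

def worker():
    while True:
        photo_id = ingest_queue.get()
        if photo_id is None:          # sentinel: shut the worker down
            break
        processed.append(f"processed {photo_id}")  # slow work happens here
        ingest_queue.task_done()

t = threading.Thread(target=worker)
t.start()
for pid in ("rock-001", "rock-002"):
    capture(pid)                      # returns immediately
ingest_queue.join()                   # wait for the backlog to drain
ingest_queue.put(None)
t.join()
```

The key property: capture returns immediately, and the slow work drains from the queue on its own schedule. That is exactly where the resilience (and the complexity) comes from.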

Storage Starts to Matter

At small scale, storage decisions are easy.

At larger scale, they matter.

We now have different types of data:

  • metadata (small, structured)
  • images (large, unstructured)
  • 3D models (larger, computationally expensive to generate)

These tend to be stored differently:

  • database for structured data
  • object storage for assets
  • references connecting the two

This separation becomes critical for performance and cost.
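The database-plus-object-storage split can be sketched like this. A dict stands in for a real object store (such as S3), and the table and key names are illustrative assumptions:

```python
import sqlite3

# Structured metadata lives in a database; the large asset lives in
# object storage; the row holds only a reference (a key), not the bytes.
object_store = {}

def put_asset(key, data):
    object_store[key] = data
    return key

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE rocks (id TEXT PRIMARY KEY, kind TEXT, image_key TEXT)")

image_key = put_asset("images/rock-001.jpg", b"\xff\xd8...")
db.execute("INSERT INTO rocks VALUES (?, ?, ?)", ("rock-001", "granite", image_key))

# Query metadata cheaply; fetch the heavy asset only when actually needed.
row = db.execute("SELECT image_key FROM rocks WHERE id = ?", ("rock-001",)).fetchone()
image = object_store[row[0]]
```

Metadata queries stay fast because they never touch the large binaries, and storage costs stay sane because the binaries live where bulk storage is cheap.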

Processing Becomes a Bottleneck

Not all steps in the pipeline are equal.

Some are fast:

  • inserting metadata
  • updating records

Others are slow:

  • generating 3D models
  • running image processing
  • extracting features

As the dataset grows, these slower steps become bottlenecks.

Which leads to another pattern:

Parallelization.

Instead of one process handling everything, we distribute the work.

Multiple workers.

Multiple jobs.

Multiple stages running simultaneously.
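The worker-pool pattern can be sketched with the standard library. generate_model is a stand-in for a genuinely slow step such as 3D model generation:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_model(rock_id):
    # Stand-in for minutes of compute (photogrammetry, feature extraction, ...)
    return f"{rock_id}.glb"

rock_ids = [f"rock-{i:03d}" for i in range(1, 6)]

# Distribute the slow jobs across workers instead of one sequential loop.
with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(generate_model, rock_ids))
```

For CPU-bound work a process pool (or a fleet of separate worker machines pulling from the ingestion queue) would replace the thread pool, but the shape of the code is the same.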

Indexing at Scale

Search also changes at scale.

At small scale:

  • simple queries are fast
  • no special indexing required

At larger scale:

  • indexes must be built and maintained
  • similarity search requires preprocessing
  • updates must propagate through the system

Search becomes an active part of the pipeline, not just a query on top of it.
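The "preprocessing" step behind similarity search can be sketched with toy feature vectors. Real systems would precompute image embeddings and use an approximate-nearest-neighbor index; the three-dimensional vectors here are illustrative stand-ins:

```python
import math

# Precomputed feature vectors — the output of the indexing stage.
features = {
    "rock-001": [0.9, 0.1, 0.0],
    "rock-002": [0.8, 0.2, 0.1],
    "rock-003": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_id):
    query = features[query_id]
    others = (rid for rid in features if rid != query_id)
    return max(others, key=lambda rid: cosine(features[rid], query))
```

When a new rock arrives, its vector must be computed and added before it is findable, which is why search becomes a pipeline stage rather than just a query.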

Failure Becomes Normal

At small scale, failures are rare and easy to fix.

At larger scale, failures are expected.

Examples:

  • missing images
  • failed processing jobs
  • incomplete models
  • inconsistent metadata

The system must tolerate these failures.

Not eliminate them.

This leads to:

  • retries
  • partial results
  • eventual consistency

In other words, the system becomes more realistic.
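The retry pattern can be sketched as a small wrapper with exponential backoff. process_job is a stand-in that fails transiently twice before succeeding:

```python
import time

attempts = {"count": 0}

def process_job(job_id):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return f"{job_id}: done"

def with_retries(fn, job_id, max_tries=5, base_delay=0.01):
    for attempt in range(max_tries):
        try:
            return fn(job_id)
        except RuntimeError:
            if attempt == max_tries - 1:
                raise                       # give up; leave the job for later
            time.sleep(base_delay * 2 ** attempt)  # back off between tries

result = with_retries(process_job, "rock-001")
```

A job that exhausts its retries isn't a crisis: it lands in a dead-letter state, the rest of the pipeline keeps moving, and the record simply stays partial until someone or something reprocesses it.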

A Familiar Architecture

At this point, the Backyard Quarry starts to resemble a typical data platform.

Layered architecture diagram showing physical world input flowing through capture, ingestion, processing, storage, indexing, and application layers.
A common architectural pattern for systems that transform physical inputs into digital data.

Different domains implement this differently.

But the structure is remarkably consistent.

The Tradeoff

Scaling introduces tradeoffs.

We gain:

  • throughput
  • flexibility
  • resilience

We lose:

  • simplicity
  • immediacy
  • ease of reasoning

What was once a straightforward system becomes a collection of interacting parts.

The Real Shift

The most important change isn’t technical.

It’s conceptual.

At small scale, you think about individual objects.

At larger scale, you think about systems.

You stop asking:

How do I store this rock?

And start asking:

How does the system handle many rocks over time?

That shift is what turns a project into a platform.

What Comes Next

At this point, the Backyard Quarry is no longer just a small experiment.

It’s a miniature version of a data platform.

And the patterns we’ve seen — schema design, pipelines, indexing, scaling — show up in many places.

In the next post, we’ll zoom out even further.

Because once you start recognizing these patterns, you begin to see them everywhere.

Not just in rock piles.

But in systems across industries.

And somewhere along the way, the Quarry stopped being about rocks.

It became about how systems grow.

The Rock Quarry Series


Who Audits the Auditors? Building an LLM-as-a-Judge for Agentic Reliability

We’ve built a powerful Forensic Team. They can find books, analyze metadata, and spot discrepancies using MCP.

But in the enterprise, ‘it seems to work’ isn’t a metric. If an agent misidentifies a $50,000 first edition, the liability is real.

Today, we move from Subjective Trust to Quantitative Reliability. We are building The Judge—a high-reasoning evaluator that audits our Forensic Team against a ‘Golden Dataset’ of ground-truth facts.

Before you Begin

Prerequisites: You should have an existing agentic workflow (see my MCP Forensic Series) and a high-reasoning model (e.g. Claude 3 Opus or GPT-4o) to act as the Judge.

1. The “Golden Dataset”

Before we can grade the agents, we need an Answer Key. We’re creating tests/golden_dataset.json. This file contains the “Ground Truth”—scenarios where we know there are errors.

Example Entry:

{
  "test_id": "TC-001",
  "input": "The Great Gatsby, 1925",
  "expected_finding": "Page count mismatch: Observed 218, Standard 210",
  "severity": "high"
}
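Loading the dataset and indexing it by test_id makes lookups trivial during evaluation. A minimal sketch, with the JSON inlined here for self-containment (the real file is tests/golden_dataset.json):

```python
import json

golden_raw = """[
  {"test_id": "TC-001",
   "input": "The Great Gatsby, 1925",
   "expected_finding": "Page count mismatch: Observed 218, Standard 210",
   "severity": "high"}
]"""

# Index ground truth by test_id so the Judge can look up each case by key.
golden = {case["test_id"]: case for case in json.loads(golden_raw)}
```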

Director’s Note: In an enterprise setting, “Reliability” is the precursor to “Permission”. You will not get the budget to scale agents until you can prove they won’t hallucinate $50k errors. This framework provides the data you need for that internal sell.

2. The Judge’s Rubric

A good Judge needs a rubric. We aren’t just looking for “Yes/No.” We want to grade on:

  • Precision: Did it find only the real errors?
  • Recall: Did it find all the real errors?
  • Reasoning: Did it explain why it flagged the record?
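Precision and recall fall out of simple set arithmetic once findings are matched to ground truth. In this sketch findings are matched as plain strings; a real Judge would match them semantically:

```python
def precision_recall(found, expected):
    found, expected = set(found), set(expected)
    true_pos = len(found & expected)
    precision = true_pos / len(found) if found else 1.0   # only real errors?
    recall = true_pos / len(expected) if expected else 1.0  # all real errors?
    return precision, recall

p, r = precision_recall(
    found=["page count mismatch", "wrong publisher"],
    expected=["page count mismatch"],
)
```

Here precision is 0.5 (one of the two flags was real) while recall is 1.0 (the one real error was found) — exactly the distinction the rubric is built to capture. Reasoning quality, the third axis, can't be reduced to a formula; that's what the high-reasoning Judge grades.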

3. Refactoring for Resilience

Before building the Judge, we had to address a common “Senior-level” trap: hardcoding agent logic. Based on architectural reviews, we moved our system prompts from the Python client into a dedicated config/prompts.yaml.

This isn’t just about clean code; it’s about Observability. By decoupling the “Instructions” from the “Execution,” we can now A/B test different prompt versions against the Judge to see which one yields the highest accuracy for specific models.
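A hypothetical shape for config/prompts.yaml — the keys, version labels, and wording below are illustrative assumptions, not the repo's actual contents:

```yaml
# config/prompts.yaml — illustrative layout
forensic_agent:
  version: "v2"
  system: |
    You are a forensic book analyst. Flag metadata discrepancies
    and explain the reasoning behind every flag.
judge_agent:
  version: "v1"
  system: |
    Grade the forensic findings against the ground truth on
    precision, recall, and quality of reasoning.
```

With prompts versioned in config, swapping "v2" for "v3" and re-running the Judge is a one-line change — which is what makes A/B testing prompt variants practical.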

4. The Implementation: The Evaluation Loop

We’ve added evaluator.py to the repo. It doesn’t just run the agents; it monitors their “vital signs.”

  • Error Transparency: We replaced “swallowed” exceptions with structured logging. If a provider fails, the system logs the incident for diagnosis instead of failing silently.
  • The Handshake: The loop runs the Forensic Team, collects their logs, and submits the whole package to a high-reasoning Judge Agent.
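The shape of the evaluation loop can be sketched with the LLM calls stubbed out. run_agent and judge are hypothetical stand-ins for the Forensic Team and the Judge Agent, and the cases here are toy data:

```python
golden = [
    {"test_id": "TC-001", "expected_finding": "Page count mismatch"},
    {"test_id": "TC-002", "expected_finding": "Wrong first-edition date"},
]

def run_agent(case):
    # Stand-in for the Forensic Team: returns a finding plus its logs.
    return {"finding": case["expected_finding"], "logs": ["checked metadata"]}

def judge(case, output):
    # Stand-in for the Judge Agent: grade output against ground truth.
    passed = output["finding"] == case["expected_finding"]
    return {"test_id": case["test_id"], "passed": passed}

# The handshake: run the team, collect outputs and logs, submit to the Judge.
report = [judge(case, run_agent(case)) for case in golden]
accuracy = sum(r["passed"] for r in report) / len(report)
```

The stubs pass trivially; the point is the structure — every run produces a per-case report and an aggregate accuracy number that can be tracked over time.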

The Evaluator-Optimizer Blueprint

This diagram represents our move from “Does the code run?” to “Does the intelligence meet the quality bar?” This closed-loop system is required before we can start the fiscal optimization of choosing smaller models to handle simpler tasks.

Architectural diagram of an AI Evaluator-Optimizer loop. It shows a Golden Dataset feeding into an Agent Execution layer, which then passes outputs and logs to a Judge Agent for scoring against a rubric. The final Reliability Report provides a feedback loop for prompt tuning and iterative improvement.
The Evaluator-Optimizer Loop: moving from manual vibe-checks to automated, quantitative reliability scoring.

Director-Level Insight: The “Accuracy vs. Cost” Curve

As a Director, I don’t just care about “cost per token.” I care about Defensibility. If a forensic audit is challenged, I need to show a historical accuracy rating. By implementing this Evaluator, we move from “Vibe-checking” to a Quantitative Reliability Score. This allows us to set a “Minimum Quality Bar” for deployment. If a model update or a prompt change drops our accuracy by 2%, the Judge blocks the deployment.

The Production-Grade AI Series

  • Post 1: The Judge Agent — You are here
  • Post 2: The Accountant (Cognitive Budgeting & Model Routing) — Coming Soon
  • Post 3: The Guardian (Human-in-the-Loop Handshakes) — Coming Soon

Looking for the foundation? Check out my previous series: The Zero-Glue AI Mesh with MCP.
