The Backyard Quarry, Part 6: Scaling the Quarry

So far, the Backyard Quarry system has worked well.

We have:

  • a schema
  • a capture process
  • stored assets
  • searchable data
  • digital twins

For a small dataset, everything feels manageable.

A few rocks here and there.

A handful of records.

It’s easy to reason about the system.

When the Dataset Grows

The moment the dataset starts to grow, the assumptions change.

Instead of a few rocks, imagine:

  • hundreds
  • thousands
  • eventually, many thousands

At that point, a few new questions appear:

  • How do we process incoming data efficiently?
  • Where do we store large assets?
  • How do we keep queries fast?
  • What happens when processing takes longer than capture?

These are the same questions that show up in any system dealing with real-world data.

The Pipeline Becomes the System

At small scale, the pipeline is implicit.

You take a photo.

You upload it.

You update a record.

At larger scale, that approach breaks down.

The pipeline becomes explicit.

Diagram showing a scalable data pipeline for physical objects including capture, ingestion queue, processing workers, storage, and indexing.
At scale, simple data flows evolve into multi-stage pipelines with decoupled processing and storage.

Each stage now has a role:

  • capture generates raw input
  • ingestion buffers incoming data
  • processing transforms it
  • storage persists it
  • indexing makes it usable

What used to be a simple flow becomes a system of components.

Decoupling the System

One of the first things that happens at scale is decoupling.

Instead of doing everything at once, we separate concerns:

  • capture does not block processing
  • processing does not block storage
  • storage does not block indexing

This introduces queues and asynchronous work.

Instead of:

take photo → process → store → done

we now have:

take photo → enqueue → process later → update system

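This improves resilience: a slow or failed processing step no longer stalls capture.

It also introduces complexity: work can now be pending, in progress, or failed, and the system has to track which.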
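
As a minimal sketch of the enqueue-then-process-later flow, the Python below uses the standard library's queue.Queue and a single worker thread. The names capture, worker, run_demo, and the in-memory processed dict are hypothetical stand-ins for illustration, not the Quarry's actual code.

```python
import queue
import threading

# Hypothetical in-memory stand-ins for the ingestion queue and the record store.
ingest_queue = queue.Queue()
processed = {}


def capture(rock_id: str) -> None:
    """Capture only enqueues; it never waits for processing."""
    ingest_queue.put(rock_id)


def worker() -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        rock_id = ingest_queue.get()
        if rock_id is None:
            ingest_queue.task_done()
            break
        processed[rock_id] = "processed"  # placeholder for the real processing step
        ingest_queue.task_done()


def run_demo(rock_ids: list[str]) -> dict[str, str]:
    """Start one worker, enqueue every rock, then wait for the worker to finish."""
    thread = threading.Thread(target=worker)
    thread.start()
    for rock_id in rock_ids:
        capture(rock_id)
    ingest_queue.put(None)  # sentinel: tell the worker to stop
    thread.join()
    return dict(processed)
```

Calling run_demo(["rock-1", "rock-2"]) enqueues both rocks immediately and lets the worker catch up on its own schedule. A real deployment would likely swap the in-process queue for a durable one, so queued work survives a restart.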

Storage Starts to Matter

At small scale, storage decisions are easy.

At larger scale, they matter.

We now have different types of data:

  • metadata (small, structured)
  • images (large, unstructured)
  • 3D models (larger, computationally expensive to generate)

These tend to be stored differently:

  • database for structured data
  • object storage for assets
  • references connecting the two

This separation becomes critical for performance and cost.
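
One way to picture the split is the sketch below. It assumes SQLite for the structured data and a plain directory standing in for object storage. The table layout, the store_rock helper, and the images/ key format are all assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical rocks table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE rocks (rock_id TEXT PRIMARY KEY, weight_kg REAL, image_key TEXT)"
    )
    return db


def store_rock(db: sqlite3.Connection, asset_root: Path, rock_id: str,
               weight_kg: float, image_bytes: bytes) -> str:
    """Write the image to the asset store and keep only its key in the database."""
    asset_key = f"images/{rock_id}.jpg"
    asset_path = asset_root / asset_key
    asset_path.parent.mkdir(parents=True, exist_ok=True)
    asset_path.write_bytes(image_bytes)  # large, unstructured data goes here
    db.execute(
        "INSERT INTO rocks (rock_id, weight_kg, image_key) VALUES (?, ?, ?)",
        (rock_id, weight_kg, asset_key),  # small, structured data plus a reference
    )
    return asset_key
```

The database row stays small no matter how large the image is, and the asset can move to cheaper storage without touching the schema, as long as the key still resolves.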

Processing Becomes a Bottleneck

Not all steps in the pipeline are equal.

Some are fast:

  • inserting metadata
  • updating records

Others are slow:

  • generating 3D models
  • running image processing
  • extracting features

As the dataset grows, these slower steps become bottlenecks.

Which leads to another pattern:

Parallelization.

Instead of one process handling everything, we distribute the work.

Multiple workers.

Multiple jobs.

Multiple stages running simultaneously.
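
A rough sketch of distributing a slow step across workers, using the standard library's concurrent.futures. The build_model function is a hypothetical placeholder for something like 3D model generation, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def build_model(rock_id: str) -> tuple[str, str]:
    """Hypothetical stand-in for a slow step such as 3D model generation."""
    return rock_id, f"models/{rock_id}.obj"


def build_models(rock_ids: list[str], max_workers: int = 4) -> dict[str, str]:
    """Run the slow step for many rocks across a pool of workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with rock_ids.
        return dict(pool.map(build_model, rock_ids))
```

Threads fit best when the slow step mostly waits on I/O or on an external tool. For CPU-heavy work written in Python itself, a process pool or separate worker machines would probably be the better fit.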

Indexing at Scale

Search also changes at scale.

At small scale:

  • simple queries are fast
  • no special indexing required

At larger scale:

  • indexes must be built and maintained
  • similarity search requires preprocessing
  • updates must propagate through the system

Search becomes an active part of the pipeline, not just a query on top of it.
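
To make "indexes must be maintained" concrete, here is a toy inverted index that is updated on every write rather than rebuilt at query time. The TagIndex class and its tag-based lookup are assumptions for illustration. The Quarry's real search could work quite differently.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: tag -> set of rock ids, kept in sync on every update."""

    def __init__(self) -> None:
        self._by_tag: dict[str, set[str]] = defaultdict(set)
        self._tags_of: dict[str, set[str]] = {}

    def upsert(self, rock_id: str, tags: set[str]) -> None:
        """Insert or update a rock, removing stale index entries first."""
        for old_tag in self._tags_of.get(rock_id, set()) - tags:
            self._by_tag[old_tag].discard(rock_id)
        for tag in tags:
            self._by_tag[tag].add(rock_id)
        self._tags_of[rock_id] = set(tags)

    def search(self, tag: str) -> set[str]:
        """Return a copy of the ids currently indexed under this tag."""
        return set(self._by_tag.get(tag, set()))
```

The cost of a lookup stays low because the work was already done at write time. The price is that every metadata update now has a second place it must reach.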

Failure Becomes Normal

At small scale, failures are rare and easy to fix.

At larger scale, failures are expected.

Examples:

  • missing images
  • failed processing jobs
  • incomplete models
  • inconsistent metadata

The system must tolerate these failures.

Not eliminate them.

This leads to:

  • retries
  • partial results
  • eventual consistency

In other words, the system becomes more realistic.
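
Retries and partial results can be sketched in a few lines. The helper names with_retries and process_rock, the attempt count, and the status strings below are all hypothetical. The backoff is a simple exponential delay.

```python
import time
from typing import Callable, Optional


def with_retries(job: Callable[[], str], attempts: int = 3,
                 base_delay: float = 0.1) -> Optional[str]:
    """Run job up to `attempts` times; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return job()
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return None


def process_rock(rock_id: str, make_model: Callable[[], str],
                 base_delay: float = 0.1) -> dict:
    """Keep the record even when the model step fails: a partial result."""
    model_key = with_retries(make_model, base_delay=base_delay)
    return {
        "rock_id": rock_id,
        "model_key": model_key,
        "status": "complete" if model_key else "partial",
    }
```

A rock whose model never builds still gets a record, marked "partial", so a later job can pick it up again. That is the small-scale version of eventual consistency.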

A Familiar Architecture

At this point, the Backyard Quarry starts to resemble a typical data platform.

Layered architecture diagram showing physical world input flowing through capture, ingestion, processing, storage, indexing, and application layers.
A common architectural pattern for systems that transform physical inputs into digital data.

Different domains implement this differently.

But the structure is remarkably consistent.

The Tradeoff

Scaling introduces tradeoffs.

We gain:

  • throughput
  • flexibility
  • resilience

We lose:

  • simplicity
  • immediacy
  • ease of reasoning

What was once a straightforward system becomes a collection of interacting parts.

The Real Shift

The most important change isn’t technical.

It’s conceptual.

At small scale, you think about individual objects.

At larger scale, you think about systems.

You stop asking:

How do I store this rock?

And start asking:

How does the system handle many rocks over time?

That shift is what turns a project into a platform.

What Comes Next

At this point, the Backyard Quarry is no longer just a small experiment.

It’s a miniature version of a data platform.

And the patterns we’ve seen — schema design, pipelines, indexing, scaling — show up in many places.

In the next post, we’ll zoom out even further.

Because once you start recognizing these patterns, you begin to see them everywhere.

Not just in rock piles.

But in systems across industries.

And somewhere along the way, the Quarry stopped being about rocks.

It became about how systems grow.

The Rock Quarry Series

The Backyard Quarry, Part 3: Capturing the Physical World

In the previous post, we designed a schema for representing rocks as structured data.

On paper, everything looked clean.

Each rock would have:

  • an identifier
  • dimensions
  • weight
  • metadata
  • possibly images or even a 3D model

The structure made sense.

The problem was getting the data.

From Schema to Reality

Designing a schema is straightforward.

You can sit down with a notebook or a whiteboard and define exactly what you want the system to store.

Capturing real-world data is a different problem entirely.

The moment you step outside, a few complications become obvious.

Lighting changes.

Objects aren’t uniform.

Measurements are approximate.

And perhaps most importantly:

The dataset doesn’t behave consistently.

The Scale Problem

The Backyard Quarry dataset spans a wide range of sizes:

  • pea-sized
  • hand-sized
  • wheelbarrow-sized
  • engine-block-sized

That variability immediately affects how data can be captured.

Small rocks can be photographed on a table.

Medium rocks might need to be placed on the ground with careful framing.

Large rocks don’t move easily at all.

Each category introduces different constraints.

This is a pattern that shows up in many real-world systems.

The same pipeline rarely works for every object.

Image Capture

The simplest form of data capture is photography.

Take a few images of each rock from different angles.

Store them.

Attach them to the record.

Even this introduces decisions:

  • how many images per object?
  • what angles?
  • what lighting conditions?
  • what background?

Inconsistent capture leads to inconsistent data.

And inconsistent data leads to unreliable systems.
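
One way to keep capture consistent is to write the decisions down and check each session against them. The sketch below assumes a hypothetical CaptureProtocol. The minimum image count and the angle names are illustrative values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureProtocol:
    """Hypothetical capture rules; the numbers and angle names are illustrative."""
    min_images: int = 4
    required_angles: frozenset = frozenset({"top", "front", "side"})


def check_session(angles: list[str],
                  protocol: CaptureProtocol = CaptureProtocol()) -> list[str]:
    """Return a list of problems; an empty list means the session meets the protocol."""
    problems = []
    if len(angles) < protocol.min_images:
        problems.append(
            f"need at least {protocol.min_images} images, got {len(angles)}"
        )
    for angle in sorted(protocol.required_angles - set(angles)):
        problems.append(f"missing angle: {angle}")
    return problems
```

Lighting and background are harder to check automatically, but even this much catches a session that stopped two photos short.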

Introducing Photogrammetry

If we take the idea a step further, we can generate a 3D model of each rock.

Photogrammetry works by combining multiple images to reconstruct the shape of an object.

Conceptually:

  • take overlapping photos
  • feed them into a processing tool
  • generate a 3D mesh

This produces a much richer representation than a single image.

But it also introduces:

  • processing time
  • storage requirements
  • failure cases

Not every rock will produce a clean model.

The Capture Pipeline

At this point, the process starts to look like a pipeline.

Diagram showing a data pipeline for capturing physical objects, including image capture, photogrammetry processing, metadata extraction, and storage.
A simplified pipeline for turning a physical object into structured data and associated assets.

Each step transforms the data in some way.

The output of one stage becomes the input of the next.

This is a common pattern in data engineering.

The difference here is that the input isn’t a clean dataset.

It’s the physical world.
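
The "output of one stage becomes the input of the next" idea can be sketched as a list of functions applied in order. The stage names mirror the diagram, but their bodies are placeholders. A real photogrammetry step would call an external tool rather than invent a filename.

```python
from functools import reduce
from typing import Callable

Record = dict
Stage = Callable[[Record], Record]


def capture_images(record: Record) -> Record:
    """Placeholder: pretend three photos were taken of the rock."""
    images = [f"{record['rock_id']}_{i}.jpg" for i in range(3)]
    return {**record, "images": images}


def run_photogrammetry(record: Record) -> Record:
    """Placeholder: a real step would hand the images to a photogrammetry tool."""
    return {**record, "mesh": f"{record['rock_id']}.obj"}


def extract_metadata(record: Record) -> Record:
    """Derive simple metadata from what earlier stages produced."""
    return {**record, "image_count": len(record["images"])}


def run_pipeline(record: Record, stages: list[Stage]) -> Record:
    """Feed each stage's output into the next stage."""
    return reduce(lambda acc, stage: stage(acc), stages, record)
```

Running run_pipeline({"rock_id": "r1"}, [capture_images, run_photogrammetry, extract_metadata]) returns a single record that has accumulated the images, the mesh reference, and the derived metadata.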

Imperfect Data

No matter how carefully you design the pipeline, real-world data introduces imperfections.

Examples:

  • missing images
  • inconsistent lighting
  • partially occluded objects
  • measurement errors

A rock might be:

  • too reflective
  • too uniform in texture
  • partially buried
  • awkwardly shaped

All of these affect the output.

This means the system has to tolerate incomplete or imperfect data.

Which leads to an important realization:

Data systems are rarely about perfect data.
They are about handling imperfect data gracefully.
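
In code, handling imperfect data gracefully often starts with letting fields be absent. The sketch below assumes a hypothetical RockRecord. The field names are illustrative and may not match the schema from the earlier post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RockRecord:
    """A record that can exist before every measurement or asset is available."""
    rock_id: str
    weight_kg: Optional[float] = None
    images: list[str] = field(default_factory=list)
    mesh_key: Optional[str] = None

    def missing(self) -> list[str]:
        """List which parts were not captured, instead of rejecting the record."""
        gaps = []
        if self.weight_kg is None:
            gaps.append("weight_kg")
        if not self.images:
            gaps.append("images")
        if self.mesh_key is None:
            gaps.append("mesh_key")
        return gaps
```

A partially buried rock with one usable photo still gets a record. The missing() list tells later stages what remains to be filled in.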

Storage Considerations

Once data is captured, it needs to be stored.

Different types of data behave differently:

  • metadata → small, structured, easy to query
  • images → larger, unstructured
  • 3D models → even larger, more complex

This reinforces a pattern introduced earlier:

Separate structured data from large assets.

Store references rather than embedding everything directly.

A Familiar Pattern

At this point, the Backyard Quarry pipeline looks surprisingly familiar.

It resembles systems used for:

  • scanning historical artifacts
  • capturing industrial parts
  • generating 3D models for manufacturing
  • building datasets for computer vision

The specifics change.

The pattern remains the same.

What Comes Next

Once data is captured and stored, the next problem emerges.

How do we find anything?

A dataset of a few rocks is manageable.

A dataset of hundreds or thousands quickly becomes difficult to navigate without structure.

In the next post, we’ll look at how to index and search the dataset — and how even a pile of rocks benefits from thoughtful retrieval systems.

And somewhere along the way, it becomes clear that the hard part isn’t designing the schema.

It’s building systems that can reliably turn messy reality into usable data.
