From Cloud to Laptop: Running MCP Agents with Small Language Models

Large Models Build Systems. Small Models Run Them.

For most developers, modern AI systems feel locked behind massive infrastructure.

We’ve been conditioned to believe that “Intelligence” is a service we rent from a data center—a luxury that requires GPU clusters, $10,000 hardware, and ever-climbing cloud inference bills.

Last week, when we built our Multi-Agent Forensic Team, you likely assumed that coordinating a Supervisor, a Librarian, and an Analyst required the reasoning horsepower of a 400B+ parameter model.

Today, we’re cutting the cord. We are moving the entire Forensic Team—the agents, the orchestration, and the data—onto a standard laptop. No cloud. No API costs. No data leaving your local network.

This is the power of Edge AI combined with the Model Context Protocol (MCP).

The Pivot: The “Forensic Clean-Room”

In the world of rare book forensics, data sovereignty isn’t a “nice-to-have.” When you are auditing high-value archival records or sensitive provenance data, the “Clean-Room” approach is the gold standard. You want the data isolated.

By moving our stack to the Edge, we transform a laptop into a portable forensic lab.

The Edge Architecture

Running MCP agents locally: small language models power the supervisor and specialist agents while the MCP server provides structured tool access to local data.

Notice that the architecture we built in Post 2 doesn’t change. Because we used MCP as our “USB-C” interface, we don’t have to rewrite our tools or our agents. We only swap the Inference Engine.

Why SLMs Love MCP

Small language models struggle when tasks are open-ended.

However, MCP dramatically reduces the search space.

Instead of inventing answers, the model interacts with structured primitives:

  • tools
  • resources
  • prompts

Each primitive is defined with a strict schema.
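Concretely, every tool the server advertises carries a machine-readable schema. Here is a minimal sketch of what a tool definition looks like on the wire for the bibliography tool used later in this series; the top-level field names follow the MCP specification, while the specific properties shown are illustrative assumptions:

```python
# Sketch of an MCP tool definition as advertised via tools/list.
# Field names (name, description, inputSchema) follow the MCP spec;
# the properties for this particular tool are illustrative.
tool_definition = {
    "name": "find_book_in_master_bibliography",
    "description": "Look up a book's archival metadata by title and author.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
        },
        "required": ["title", "author"],
    },
}
```

Because the schema names every required argument, a small model doesn't have to guess what the tool wants; it only has to fill in two typed fields.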

The Thesis: Large models are great for designing the system and writing the initial code. Small models are the perfect runtime engines for executing those standardized tasks.

The “How-To”: Swapping the Engine

In our updated orchestrator.py, we’ve introduced a provider flag. Instead of hitting a remote API, the Python supervisor now talks to a local inference server (like Ollama or LM Studio).

# [Post 3 - Edge AI] Swapping the Inference Provider
if args.provider == "ollama":
    # Pointing to the local SLM engine
    client = OllamaClient(base_url="http://localhost:11434")
    model = "phi4"
else:
    # Standard cloud provider
    client = AnthropicClient()
    model = "claude-3-5-sonnet"

Because our TypeScript MCP Server is running locally via stdio, the latency is nearly zero. The “Librarian” fetches metadata from the local database, and the “Analyst” runs the audit—all without a single packet hitting the open web.
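If you don't have a client wrapper handy, the local engine is just an HTTP endpoint. A minimal sketch using only the standard library, assuming Ollama is running on its default port with the `phi4` model already pulled (the request and response fields match Ollama's `/api/generate` API):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "phi4") -> urllib.request.Request:
    """Construct the HTTP request Ollama's /api/generate endpoint expects."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask_local_model(prompt: str, model: str = "phi4") -> str:
    """Send the prompt to the local engine and return the model's text reply."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Swapping engines really is this shallow: the orchestration code above it never changes, only the URL it talks to.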

Benchmarking the Forensic Team: Cloud vs. Edge

Does a 14B model perform as well as a 400B model for forensics? When constrained by MCP schemas, the results are surprising.

| Criteria | Cloud (Claude/GPT-4) | Edge (Phi-4/Mistral) |
| --- | --- | --- |
| Reasoning Depth | Extremely high | High (with MCP tool constraints) |
| Latency | 1.5s–3s (network dependent) | < 500ms (local inference) |
| Cost | Per-token billing | $0.00 |
| Privacy | Data processed externally | 100% data sovereignty |
| Scalability | Infinite | Limited by local RAM/NPU |

The Reveal: Same System, New Home

If you look at the latest update to the repository, you’ll see that the orchestration logic is nearly identical. The architecture stack from earlier posts remains unchanged.

Edge AI architecture replaces cloud inference with local small language models while retaining MCP-based tool access.

Nothing about the agents changed.

Nothing about the tools changed.

Only the inference engine moved.

The “Zero-Glue” promise is realized here.

We didn’t build a cloud app; we built a protocol-driven system. The fact that it can live on a server or a laptop is simply a deployment choice.

What’s Next?

We’ve built the server. We’ve orchestrated the team. We’ve moved it to the edge.
In the final post of this series, we tackle the “Final Boss” of AI systems: Enterprise Governance. We’ll explore how to take this forensic lab and scale it across an organization using Oracle 26ai, ensuring that every audit is secure, permissioned, and defensible.

Ready to go local?

Check out the orchestrator.py update and try running the Forensic Team on your own machine.
👉 MCP Forensic Analyzer – Edge AI Example

The “Zero-Glue” Series
– Post 1: The End of Glue Code: Why MCP is the USB-C Moment for AI
– Post 2: The Forensic Team: Architecting Multi-Agent Handoffs
– Post 3: From Cloud to Laptop: Running MCP Agents with SLMs – You are here
– Post 4: Enterprise Governance: Scaling MCP with Oracle 26ai – Coming Soon

The Forensic Team: Architecting Multi-Agent Handoffs with MCP

Why One LLM Isn’t Enough—And How to Build a Specialized Agentic Workforce

In my last post, we explored the “Zero-Glue” architecture of the Model Context Protocol (MCP). We established that standardizing how AI “talks” to data via an MCP Server is the “USB-C moment” for AI infrastructure.

But once you have the pipes, how do you build the engine?

In 2026, the answer is no longer “one giant system prompt.” Instead, it’s Functional Specialization. Today, we’re building a Multi-Agent Forensic Team: a group of specialized Python agents that use our TypeScript MCP Server to perform deep-dive archival audits.

The “Context Fatigue” Problem

Early agent architectures relied on a single LLM handling everything:

  • retrieve data
  • reason about it
  • run tools
  • write the final output

Even with large context windows, this approach quickly hits a reasoning ceiling.

A single agent juggling too many tools often suffers from:

  1. Tool Confusion
    Choosing the wrong function when multiple tools are available.
  2. Logic Drift
    Losing track of the objective during multi-step reasoning.
  3. Latency and Cost
    Sequential reasoning loops increase response time and token usage.

The solution is functional specialization.

Instead of one overloaded agent, we build a team of focused agents coordinated by a supervisor: each agent sees only the tools and context it needs.
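The coordination pattern is simple enough to sketch in a few lines. The agent names mirror the Forensic Team we build below; the agent bodies here are stand-ins (the real agents call MCP tools), but the shape of the supervisor loop is the point:

```python
# Sketch of supervisor-coordinated specialization: each agent owns one
# narrow job, and the supervisor chains their structured outputs.
# Agent internals are placeholders for real MCP tool calls.
import asyncio

async def librarian(task: dict) -> dict:
    # Focused agent: only gathers archival metadata.
    return {"metadata": f"records for {task['title']}"}

async def analyst(task: dict) -> dict:
    # Focused agent: only audits observed data against the standard.
    match = task["observed"] == task["expected"]
    return {"verdict": "consistent" if match else "flagged"}

async def supervisor(title: str, observed: str, expected: str) -> dict:
    # The supervisor owns state and hands structured results between agents.
    facts = await librarian({"title": title})
    audit = await analyst({"observed": observed, "expected": expected})
    return {**facts, **audit}

result = asyncio.run(supervisor("First Folio", "laid paper", "laid paper"))
```

Because each agent returns a small structured result, the supervisor composes them with a dictionary merge rather than prompt surgery.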

Before diving into the multi-agent design, it helps to understand where the agents live in the MCP stack.

Figure 1. The MCP architecture stack: agents reason about tasks while MCP standardizes access to tools, resources, and enterprise data.


The Architecture: A Polyglot Powerhouse

One of MCP’s strengths is that it decouples tools from orchestration.

This allows each layer of the system to use the language best suited for the job.

In our case:

  • The “Hands” (TypeScript)
    Our MCP server handles data access and tool execution with strong typing.
  • The “Brain” (Python)
    A Python orchestrator manages reasoning and agent coordination using frameworks like LangGraph or PydanticAI.

Because both layers communicate through MCP, the language boundary disappears.

Multi-Agent MCP Architecture

Multi-agent MCP architecture: a Python supervisor coordinates specialized agents that access tools through a shared MCP server.

Each agent communicates with tools through the MCP server, not directly with the data source.

The Forensic Team Roles:

| Role | Agent Identity | Primary Responsibility | MCP Tools Used |
| --- | --- | --- | --- |
| Supervisor | The Orchestrator | Receives the request, manages state, and handles handoffs | list_tools, list_resources |
| Librarian | The Researcher | Gathers historical facts and archival metadata | find_book_in_master_bibliography |
| Analyst | The Forensic Tech | Compares observed data against metadata to find flaws | audit_artifact_consistency |

How It Works: Glue-Free Agent Handoffs

The beauty of MCP is the Transport Layer. Our Python client connects to the TypeScript server via stdio. It doesn’t care that the server is written in Node.js; it only cares about the protocol.

  1. Spawning the Sub-process
    In our orchestrator.py, we define how to “wake up” the TypeScript server. Notice how we point Python directly at the Node.js build:
from mcp import StdioServerParameters

def get_server_params() -> StdioServerParameters:
    # This is the bridge: Python spawning a Node.js process
    return StdioServerParameters(
        command="node",
        args=[str(SERVER_ENTRY)],  # Points to our TS /build/index.js
        cwd=str(PROJECT_ROOT),
    )
  2. The Functional Handoff
    Because MCP tools expose strict schemas, the agents can pass structured results between each other without custom translation layers.

The Supervisor doesn’t manually parse JSON or remap fields.

Instead it simply chains the outputs:

# 1. Librarian: pull book details
librarian_result = await librarian_agent(session, title, author)

# 2. Analyst: audit for discrepancies (using Librarian's data)
analyst_result = await analyst_agent(
    session, book_page_id, book_standard, observed
)
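Under the hood, the stdio transport is carrying newline-delimited JSON-RPC 2.0 messages. As a sketch of what the Analyst's session actually sends over that pipe, here is the framing of a tools/call request; the method and params shape follow the MCP specification, while the argument values are illustrative:

```python
# Sketch of the JSON-RPC 2.0 message behind an MCP tool invocation.
# "tools/call" and the params shape come from the MCP spec; the
# argument values shown here are illustrative.
import json

def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Frame an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

message = make_tool_call(
    1,
    "audit_artifact_consistency",
    {"book_page_id": "p42", "observed": "modern wove paper"},
)
```

Neither side hand-writes these messages in practice; the SDKs do the framing. But seeing the wire format makes it clear why the language boundary disappears: Node.js and Python are just exchanging JSON.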

Why This Wins in the Enterprise:

Auditability

You can track exactly what each agent saw and what conclusions it produced.

Security

Agent permissions can be scoped by tool access.
The Librarian may only read archives, while the Analyst writes forensic reports.
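Scoping by tool access can be enforced with an allow-list checked against what the server advertises. A minimal sketch: the tool names come from the Forensic Team above, while the allow-list mapping itself is an illustrative assumption, not part of MCP:

```python
# Sketch: per-agent permission scoping over MCP tools.
# Tool names match the Forensic Team; the scope mapping is illustrative.
AGENT_TOOL_SCOPES = {
    "librarian": {"find_book_in_master_bibliography"},
    "analyst": {"audit_artifact_consistency"},
}

def allowed_tools(agent: str, server_tools: list) -> list:
    """Return only the MCP tools this agent is permitted to call."""
    scope = AGENT_TOOL_SCOPES.get(agent, set())
    return [t for t in server_tools if t in scope]

# In practice server_tools would come from the session's tools/list call.
server_tools = ["find_book_in_master_bibliography", "audit_artifact_consistency"]
```

An agent outside the mapping gets an empty tool list, so a misconfigured or compromised agent simply has nothing to call.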

Maintainability

Each agent owns a single responsibility.
If the forensic logic changes, only the Analyst agent needs to be updated.

Scaling to the “AI Mesh”

By using MCP as the backbone, you’ve built more than an app; you’ve built a System of Intelligence. Any new tool you add to your TypeScript server is instantly “discoverable” by your Python team. You are no longer writing “Glue Code”; you are orchestrating a digital workforce.

The MCP server becomes the shared capability layer for your entire AI system.

📚 The “Zero-Glue” Series
– Post 1: The End of Glue Code: Why MCP is the USB-C Moment for AI
– Post 2: The Forensic Team: Architecting Multi-Agent Handoffs – You are here
– Post 3: From Cloud to Laptop: Running MCP Agents with SLMs – Coming Soon
– Post 4: Enterprise Governance: Scaling MCP with Oracle 26ai – Coming Soon

Explore the Code:

The full multi-agent orchestrator is now live in the /examples folder of the repo:
👉 MCP Forensic Analyzer – Multi-Agent Example

Up Next in the Series:

Next week, we go small. We’re moving the “Forensic Team” out of the cloud and onto your laptop. We’ll explore Edge AI and how to run this entire stack using Small Language Models (SLMs) like Phi-4—no $10,000 GPU required.
