Sovereign Synapse: The Context Cleaner

(Curation is Sovereignty)

Sovereign Synapse Series | Post 2

AI is polite by design. It prefaces its answers with “Certainly! I’d be happy to help” and closes with “I hope this information is useful.” In a casual chat, these conversational “handshakes” are harmless. In a Cognitive Estate—a permanent, local archive of your thoughts—they are a Prose Tax.

Last time, we successfully evacuated our intellectual history from the cloud. But once the data landed on local silicon, the reality of “raw” data set in. To turn a disorganized data dump into a high-fidelity archive, we must move from ingestion to Forensic Curation.

🛠️ Builder’s Note: The Roundtable Pivot

When I published Part 1, the community exploded with architectural feedback. While discussing the code, an engineer named WAB raised a critical long-term systems question: As a local memory store grows, multiple autonomous local agents will eventually read, write, and refactor these synapses. How does an agent running six months from now know that a specific memory chunk is a high-fidelity historical insight rather than a corrupted file or an adversarial local injection?

The solution was elegant: don’t just clean the data—sign it. By integrating an Ed25519 cryptographic layer at the moment of distillation, we move from simple file cleanup to establishing an immutable Chain of Custody for our thoughts.

But pushing a zero-trust cryptographic layer into a production pipeline meant surviving a rigorous multi-round systems audit. We didn’t just merge naive code. We engineered a canonical sorted-JSON payload structure to prevent newline field-injection attacks, enforced continuous POSIX owner-only permission validations to neutralize local forgery vectors, and ensured our verification paths were strictly side-effect free—guaranteeing that read operations never accidentally mutate disk state by generating blank keys. We subjected our architecture to enterprise-grade rigor before allowing a single byte to hit local silicon.
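
To make the field-injection point concrete, here is a minimal sketch comparing a naive newline-joined payload with a canonical sorted-JSON one. The field set mirrors the arguments of verify_signature in the excerpt further down, but both function names and bodies are assumptions for illustration, not the project's actual helper.

```python
import json
from datetime import datetime, timezone


def naive_payload(receipt_id: str, structural_signal: str, user_text: str) -> bytes:
    # Newline-joined fields: a newline inside one field can pose as a field boundary.
    return "\n".join([receipt_id, structural_signal, user_text]).encode("utf-8")


def canonical_payload(receipt_id: str, structural_signal: str,
                      user_text: str, timestamp: datetime) -> bytes:
    # Hypothetical stand-in for a signing-payload helper (assumed, not the real one).
    # Expects a timezone-aware timestamp.
    record = {
        "receipt_id": receipt_id,
        "structural_signal": structural_signal,
        "user_text": user_text,
        "timestamp": timestamp.astimezone(timezone.utc).isoformat(),
    }
    # Sorted keys and fixed separators give one byte sequence per logical record.
    # JSON escaping keeps embedded newlines inside their own field.
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True).encode("utf-8")
```

With the naive form, ("a", "b\nc", "d") and ("a", "b", "c\nd") serialize to the same bytes, so one signature would cover two different records. The canonical form keeps them distinct.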

The Problem: Ghost Nodes and Corporate Boilerplate

OpenAI exports are not linear files; they are complex branching trees. A naive extractor often trips over “ghost nodes”—dangling references or messages with missing timestamps that cause standard scripts to crash. Our updated adapter now uses defensive null-guards to ensure these broken links don’t halt the evacuation.

Even when the extraction is stable, the result is cluttered. When you have thousands of files in your vault, you don’t want your local semantic search results polluted by generic AI pleasantries. You want the signal: the technical reasoning, the code, the breakthrough. If you don’t strip the prose at the edge, you pay an Interpretation Tax in downstream inference costs every single time an agent reads that memory.

The Build: The Structural Sieve & Signer

To solve this without destroying the original record, we built a Context-Cleaner that acts as a structural sieve. We pattern-match on the layout to separate the Preamble (the intro) from the Postamble (the outro).
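
As an illustration of the sieve idea, the sketch below splits a reply into paragraphs and only classifies the first and last ones. The pattern lists, the SieveResult type, and the sieve function are all assumptions made for this example; a real cleaner would need a broader, tested pattern set.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumed, not the project's list).
_PREAMBLE = re.compile(r"^(certainly|sure|of course|absolutely|great question)\b",
                       re.IGNORECASE)
_POSTAMBLE = re.compile(r"^(i hope this|let me know if|feel free to|is there anything else)\b",
                        re.IGNORECASE)


@dataclass(frozen=True)
class SieveResult:
    preamble: str
    signal: str
    postamble: str


def sieve(text: str) -> SieveResult:
    """Split a reply into preamble, signal, and postamble by paragraph position."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    preamble = postamble = ""
    # Only peel an edge paragraph when something else remains as the signal.
    if len(paragraphs) > 1 and _PREAMBLE.match(paragraphs[0]):
        preamble = paragraphs.pop(0)
    if len(paragraphs) > 1 and _POSTAMBLE.match(paragraphs[-1]):
        postamble = paragraphs.pop()
    return SieveResult(preamble, "\n\n".join(paragraphs), postamble)
```

Because the three parts are returned separately rather than discarded, the original record can still be reassembled, and a single-paragraph reply is never reduced to nothing.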

Once the text is stripped of its corporate residue, we run it through our Zero-Trust Signer to seal the contract before it hits local storage.

# core/context_cleaner.py
import os
import re
import logging
import tempfile
from pathlib import Path
from datetime import datetime
from cryptography.hazmat.primitives.asymmetric import ed25519

_CORE_DIR = os.path.dirname(os.path.abspath(__file__))
_REPO_ROOT = os.path.abspath(os.path.join(_CORE_DIR, os.pardir))
DEFAULT_KEYS_DIR = os.path.abspath(os.path.join(_REPO_ROOT, "vault", "keys"))
_logger = logging.getLogger(__name__)

def _atomic_write_bytes(path: Path, data: bytes) -> None:
    """Writes data to path atomically via a temp file in the same directory.

    Guarantees os.replace stays on one filesystem to avoid cross-device EXDEV errors.
    """
    directory = path.parent
    directory.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(prefix=f".{path.name}.", suffix=".tmp", dir=str(directory))
    tmp = Path(tmp_path)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp, path)
    except Exception:
        tmp.unlink(missing_ok=True)
        raise

class ContextCleaner:
    """Heuristic-based scanner to identify and flag AI conversational noise."""

    @classmethod
    def verify_signature(
        cls,
        signature_hex: str,
        *,
        receipt_id: str,
        structural_signal: str,
        user_text: str,
        timestamp: datetime,
        keys_dir: Path | None = None,
    ) -> bool:
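        """Return True only when the signature verifies; fail closed on permission or system errors."""
        from cryptography.exceptions import InvalidSignature
        from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

        # resolve_keys_dir, _load_public_key_bytes, and _signing_payload are
        # module-level helpers that are not shown in this excerpt.
        directory = resolve_keys_dir(keys_dir)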
        try:
            public_key = Ed25519PublicKey.from_public_bytes(_load_public_key_bytes(directory))
            payload = _signing_payload(receipt_id, structural_signal, user_text, timestamp)
            public_key.verify(bytes.fromhex(signature_hex), payload)
            return True
        except (PermissionError, FileNotFoundError, RuntimeError) as exc:
            _logger.warning(
                "Cannot verify Sovereign Synapse signature: public signing key "
                "unavailable or inaccessible (%s). Ensure vault/keys/ is readable "
                "by this process or set SYNAPSE_KEYS_DIR with correct permissions.",
                exc,
            )
            return False
        except (InvalidSignature, ValueError, OSError):
            return False # Strictly fail closed
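
The PermissionError branch above implies a key loader that refuses loosely permissioned files. A minimal sketch of such an owner-only check, assuming a POSIX system; assert_owner_only is a hypothetical name and not part of the excerpt.

```python
import os
import stat
from pathlib import Path


def assert_owner_only(path: Path) -> None:
    """Raise PermissionError unless path is ours and has no group/other bits set."""
    info = path.stat()
    if info.st_uid != os.getuid():                # POSIX-only call
        raise PermissionError(f"{path} is not owned by the current user")
    if stat.S_IMODE(info.st_mode) & 0o077:        # any group or other permission bit
        raise PermissionError(f"{path} is accessible to group or others")
```

Calling a check like this on every load, rather than once at key creation, is one way to read the "continuous" validation described earlier: a key file that is later chmod-ed to 0644 would be rejected on the next read.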

Defensive Engineering: Identity & Integrity

In our initial design, we used deterministic uuid5 hashing to solve idempotency and prevent duplicate files. Now, our deterministic asset ID is directly tied to our cryptographic provenance. Key paths are resolved from the module's own location rather than from the current working directory, and key serialization is strictly atomic, so a mid-process crash or a change of launch directory cannot corrupt or orphan our signed data.

By using the SHA-256 hash of the signed payload as our primary URN, our files don’t just have a repeatable name; they possess an unalterable Forensic Trace. If a rogue local process or a misconfigured local agent silently modifies a synapse file in your vault, signature validation fails the next time that file is verified. The knowledge base becomes entirely self-verifying.
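
Deriving such an identifier takes only a few lines. In this sketch the "urn:sha256:" prefix and the payload_urn name are illustrative assumptions, not a registered URN namespace or the project's actual format.

```python
import hashlib


def payload_urn(payload: bytes) -> str:
    """Derive a repeatable identifier from the exact bytes that were signed."""
    # Prefix is an assumption chosen for readability.
    return "urn:sha256:" + hashlib.sha256(payload).hexdigest()
```

The same payload bytes always map to the same URN, so re-ingesting a conversation overwrites rather than duplicates, while any change to the payload yields a different name.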

The Result: Signed Signal over Sentiment

By implementing defensive guards to handle “ghost nodes” and using the cryptographic Context-Cleaner, our Sovereign Synapse transitions from a text dump to a high-integrity reasoning ledger.

| Feature | Phase 1 (Raw Ingest) | Phase 2 (Curated Estate) |
| --- | --- | --- |
| Prose Tax | Paid in Full | Redacted & Audited |
| File Identity | Random (uuid4) | Deterministic SHA-256 URN |
| Data Integrity | Crash-prone / Fragile | Resilient (Null-guarded) |
| Provenance Gate | Unverified Text | Ed25519 Cryptographically Signed |

The 2024 conversation in my vault regarding Movesense Medical and MetaMotion R sensors is no longer just a text file. It is a permanent, cryptographically secured asset. It is a part of my own intellectual history: entirely under my sovereign control, stripped of corporate residue, and ready for the local network.

Is your local AI memory running on trusted, signed contracts—or are you still paying a Prose Tax on corporate fluff?

Join the Architecture Discussion

The frameworks we are using to eliminate the Prose Tax and secure our cognitive estates are being formalized into an open-source standard.

The Sovereign Systems Specification & Glossary is now live under the MIT License on GitHub.

If you are building in the local-first or sovereign RAG space and want to propose updates, refine boundaries, or add new architectural vectors, check out the repository and open a Pull Request. Let’s map out the constraints of this discipline together.

The Sovereign Synapse Series

  • The Great Export
  • The Context Cleaner – This Post
  • The Local Brain – Coming 9 June 2026
  • The View from the Summit – Coming 16 June 2026
  • The Synapse Navigator – Coming 30 June 2026
  • The Analog Bridge – Coming 7 July 2026
  • The Temporal Mirror – Coming 14 July 2026
  • The Unbroken Voice – Coming 21 July 2026

The Local Eye (Sovereign Vision)

We’ve built a system that is Reliable, Affordable, and Governed. But until now, our Forensic Team has been “blind.” It could only reconcile text-based metadata.

In the world of rare book forensics, the text is only half the story. The typography, paper grain, and binding texture are the true “fingerprints.” However, sending high-resolution, proprietary scans of a $50,000 asset to a cloud-based LLM is a Data Sovereignty nightmare.

Today, we introduce The Local Eye: Edge-based Multimodal Vision that processes pixels without letting them leak into the cloud.

The Sovereignty Gap in Multimodal AI

Most multimodal implementations send raw images directly to frontier models (like GPT-4o). For an enterprise, this is a liability.

  1. Intellectual Property: Who owns the training data rights to the scan?
  2. Privacy: Does the image contain metadata or background information that violates NDAs?
  3. Cost: Sending 10MB 4K images for every query is an “Accountant’s” nightmare.

Implementing “Feature Extraction” at the Edge

Instead of sending the image to the cloud, we use Llama 3.2 Vision running locally via Ollama. Our MCP server acts as an “Airlock.”

The Handshake:

  1. Normalization: The sharp library resizes and standardizes the forensic scan locally.
  2. Local Inference: The Vision SLM analyzes the image and generates a text-based “Feature Map.”
  3. Metadata Egress: Only the textual description is passed to the reasoning agents. Even if The Accountant routes the task to a Cloud model for deep analysis, the cloud only sees our description, never the pixels.

Architectural diagram of the 'Local Eye' workflow. An artifact image is processed locally using the Sharp library and Llama 3.2 Vision. Only the resulting text metadata is allowed to pass through the security airlock to cloud-based reasoning models, ensuring the original pixels never leave the local environment.
The Sovereign Vision Workflow—Extracting intelligence at the edge to prevent data leakage.


In code, this might look something like the following:

// From src/index.ts: The Vision Airlock
import sharp from 'sharp';
import ollama from 'ollama';

async function analyzeArtifactVision(imagePath: string, focus: string): Promise<string> {
  // Downscale locally; fit 'inside' preserves the aspect ratio so page margins are not cropped away.
  const processedImage = await sharp(imagePath)
    .resize(512, 512, { fit: 'inside' })
    .toBuffer();

  // Local-only call to Ollama
  const result = await ollama.generate({
    model: 'llama3.2-vision',
    prompt: `Analyze the ${focus} of this artifact.`,
    images: [processedImage.toString('base64')]
  });

  return result.response; // Pixels stay here. Only text leaves.
}

The “Zero-Pixel” Policy

The goal is to maximize Intelligence while minimizing Exposure. By implementing Local Vision, we treat the cloud as a “Reasoning Utility,” not a “Data Store.” We send it the logic puzzle, but we never give it the raw forensic evidence. We gain the power of frontier-model reasoning without the risk of data harvesting.

Developer Lessons: The “Latency of Locality”

In building the Sovereign Vault, we learned that ‘Data Sovereignty’ has a physical cost: Time.

While a cloud-based API might analyze a 4K image in seconds, running a deep-dive OCR and visual analysis on local consumer hardware using Llama 3.2 Vision takes significantly longer. We had to tune our “Airlock” timeouts, raising the ceiling from 120 seconds to 300 seconds, to give the local “Eye” enough time to process complex handwriting on a standard CPU.

Additionally, we realized that our error logs were a potential privacy leak. We implemented Log Truncation to ensure that even our failures respect the Sovereign Vault’s privacy mandate.
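
The post does not show how Log Truncation is implemented, so the following is only a sketch of the idea, written in Python rather than the server's TypeScript. The TruncatingFilter name and the 200-character default are assumptions.

```python
import logging


class TruncatingFilter(logging.Filter):
    """Caps every formatted log message so payloads cannot leak in full."""

    def __init__(self, max_chars: int = 200) -> None:
        super().__init__()
        self.max_chars = max_chars

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()            # merge msg % args before measuring
        if len(message) > self.max_chars:
            message = message[: self.max_chars] + "...[truncated]"
        # Store the final string and drop args so it is not formatted twice.
        record.msg, record.args = message, None
        return True
```

Attached with logger.addFilter(TruncatingFilter()), a filter like this bounds how much of a transcription or model response can end up in an error log, whatever the call site passes in.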

The “Zero-Glue” Discovery

In a traditional setup, adding vision would require rewriting the orchestrator’s core logic. Because we use the Model Context Protocol, the orchestrator simply asked the server: “What can you do?” The server replied with the analyze_artifact_vision manifest. The agent then dynamically decided to use this new “Eye” to investigate the Gatsby image. No new glue code was written to connect the vision model to the reasoning brain.
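
For readers new to MCP, the exchange behind that "What can you do?" question is a pair of JSON-RPC messages. The sketch below shows their shape in Python; the tool description and input schema are illustrative assumptions, since the post only names the tool, and discover and build_call are hypothetical helpers.

```python
import json

# JSON-RPC 2.0 request an MCP client sends to enumerate a server's tools.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Illustrative response; description and schema are assumed for this example.
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"tools": [{
        "name": "analyze_artifact_vision",
        "description": "Describe an artifact image using a local vision model.",
        "inputSchema": {
            "type": "object",
            "properties": {"imagePath": {"type": "string"}, "focus": {"type": "string"}},
            "required": ["imagePath", "focus"],
        },
    }]},
}


def discover(response: dict) -> dict[str, dict]:
    """Index advertised tools by name so an agent can pick one at runtime."""
    return {tool["name"]: tool for tool in response.get("result", {}).get("tools", [])}


def build_call(name: str, arguments: dict, request_id: int = 2) -> str:
    """Serialize a tools/call request for a tool found during discovery."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/call",
                       "params": {"name": name, "arguments": arguments}})
```

The orchestrator never needs to know the tool exists ahead of time: it reads the advertised schema, fills in the arguments, and sends a tools/call request.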

Case Study: The Gatsby Inscription

To test our Sovereign Vault, we ran a forensic audit on a high-value first edition of The Great Gatsby. Our local Vision Agent detected something anomalous on the title page: a cursive, multi-line inscription.

An image of The Great Gatsby copyright page
Image credit: [University of Southern Mississippi Special Collections](https://lib.usm.edu/spcol/exhibitions/item_of_the_month/iotm_june_2021.html) (June 2021 Item of the Month)

The Sovereign Trace

When we ran the analyze_artifact_vision tool, the local Llama 3.2 Vision model performed a deep scan and returned a fascinating finding:

**Visual Findings: Handwritten Inscription**
* Location: Right-hand margin of title page
* Medium: Faint pencil, cursive script
* Transcribed Content: "Then we are not alone at all when we remember that we have in our hearts that something so precious..."

Why this matters: Notice that the model didn’t just see “scribbles.” It attempted to transcribe a 40-word passage. Crucially, the Forensic Analyst (Claude) recognized that this text does not exist in any canonical version of The Great Gatsby.

This is a massive forensic win. The “Eye” identified a potential fabricated provenance or a non-standard owner intervention. Because this happened inside our “Airlock,” the specific handwriting and the non-canonical text were captured without ever touching a cloud API.

The Architect’s Trade-off: The Reasoning Gap

While our local Llama 3.2 Vision is an incredible “Eye,” it occasionally faces a Reasoning Gap. In certain runs, it may identify a note as “illegible” or produce repetitive output due to CPU thermal throttling or model constraints.

Instead of hallucinating a “clean” signature, our system is designed to Safe-Fail. It flags the finding as “Indeterminate” and triggers a High-Severity Human Authorization request.
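
One way to express that Safe-Fail rule is a small classifier that sits between the vision model and the reasoning agents. This is a sketch in Python under stated assumptions: the VisionVerdict type, the classify_transcription name, and the repetition thresholds are all hypothetical and would need tuning against real output.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionVerdict:
    status: str        # "ok" or "indeterminate"
    needs_human: bool  # True triggers a human authorization request
    reason: str


def classify_transcription(text: str, max_repeat_ratio: float = 0.5) -> VisionVerdict:
    """Flag unusable vision output instead of passing it downstream as fact."""
    words = text.lower().split()
    if not words or "illegible" in text.lower():
        return VisionVerdict("indeterminate", True, "empty or illegible")
    # Degenerate loops tend to show up as one token dominating the output.
    top_count = Counter(words).most_common(1)[0][1]
    if len(words) >= 8 and top_count / len(words) > max_repeat_ratio:
        return VisionVerdict("indeterminate", True, "repetitive output")
    return VisionVerdict("ok", False, "")
```

The point of the design is that an indeterminate result is a first-class outcome: the pipeline escalates to a person instead of letting a downstream model paper over the gap.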

The Governance Challenge: We now have a transcribed inscription that might contain a previous owner’s private thoughts or names. If we simply passed this output to an LLM for summarization, we would have leaked a private message to a third-party server. This discovery sets the stage for our next architectural layer: The Redactor.
