Here's a document classification task: given a section of the MCP Transport Specification, determine which parts are mandatory requirements, which are security risks, and which are informational filler your agent can skip.
An LLM can do this. You can prompt Claude or GPT-4 to read the text and classify each section. It will take 2-10 seconds, cost $0.003-0.02 per call, and give you slightly different answers every time you run it.
Or you can do it with regex in 3.78 milliseconds. Deterministically. Offline. For free.
That's what Decompose does. It's a Python library that splits text into classified semantic units — no LLM, no API key, no GPU. One function call. Here's the full output from processing the MCP transport spec:
The input
1,786 characters of specification text. Five sections plus a security warning: a transports overview, stdio, SSE (deprecated), streamable HTTP, and security requirements. The kind of document every MCP implementation needs to read carefully.
The output
9 units. Each one has an authority level, risk category, attention score, and actionability flag. No LLM was consulted. Here are the three that matter:
And the units that don't matter:
An agent using these scores would read 2 out of 9 units. It would know the security requirements are mandatory (MUST) and the security warning is advisory (SHOULD). It would skip the overview, the stdio description, and the SSE deprecation notice. It would save 78% of its context window on this document alone.
How it works
No magic. The classification runs on three things:
1. RFC 2119 keyword detection
"MUST", "SHALL", "MUST NOT" → mandatory or prohibitive. "SHOULD", "RECOMMENDED" → directive. "MAY" → permissive. No keywords → informational.
This isn't an opinion. It's the actual standard. RFC 2119 was written in 1997 specifically to make these words unambiguous in specifications. An LLM has to figure this out from its training data. Regex just matches the word.
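To make that concrete, here's a minimal sketch of the idea in plain Python. The pattern list, level names, and function are illustrative, not Decompose's actual source:

```python
import re

# Illustrative RFC 2119 keyword detection -- a sketch, not Decompose's code.
# Check the compound phrases ("MUST NOT") before their prefixes ("MUST").
AUTHORITY_PATTERNS = [
    ("prohibitive", re.compile(r"\b(?:MUST NOT|SHALL NOT)\b")),
    ("mandatory",   re.compile(r"\b(?:MUST|SHALL|REQUIRED)\b")),
    ("directive",   re.compile(r"\b(?:SHOULD|RECOMMENDED)\b")),
    ("permissive",  re.compile(r"\b(?:MAY|OPTIONAL)\b")),
]

def classify_authority(text: str) -> str:
    for level, pattern in AUTHORITY_PATTERNS:
        if pattern.search(text):
            return level
    return "informational"  # no RFC 2119 keywords found
```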
2. Risk category patterns
Words like "attack", "untrusted", "authentication", "HTTPS" → security. Dollar amounts and financial terms → financial. "OSHA", "safety-critical", "load-bearing" → safety-critical. "compliance", "violation", "regulation" → compliance.
These patterns are deterministic. They don't vary between runs. They don't hallucinate risk where there is none.
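Same idea for risk, again as a sketch: the word lists below are placeholders built from the examples above, not the library's full pattern set.

```python
import re

# Illustrative risk-category patterns -- placeholders, not Decompose's rule set.
# First match wins; anything unmatched falls through to "informational".
RISK_PATTERNS = [
    ("security",        re.compile(r"\b(?:attack|untrusted|authentication|https)\b", re.IGNORECASE)),
    ("financial",       re.compile(r"\$\s*\d|\b(?:invoice|payment|fee)\b", re.IGNORECASE)),
    ("safety_critical", re.compile(r"\b(?:OSHA|safety-critical|load-bearing)\b", re.IGNORECASE)),
    ("compliance",      re.compile(r"\b(?:compliance|violation|regulation)\b", re.IGNORECASE)),
]

def classify_risk(text: str) -> str:
    for category, pattern in RISK_PATTERNS:
        if pattern.search(text):
            return category
    return "informational"
```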
3. Attention scoring
A simple formula: authority weight × risk multiplier. Mandatory + security = high score. Informational + informational = 0.0. The numbers aren't arbitrary — they're calibrated to put genuinely critical content at the top of the reading list.
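A sketch of that calculation, with placeholder weights (the calibrated values live in the library, not here):

```python
# Sketch of attention scoring: authority weight x risk multiplier.
# These numbers are placeholders, not Decompose's calibrated values.
AUTHORITY_WEIGHT = {
    "prohibitive": 1.0,
    "mandatory": 1.0,
    "directive": 0.6,
    "permissive": 0.3,
    "informational": 0.0,
}

RISK_MULTIPLIER = {
    "security": 2.0,
    "safety_critical": 2.0,
    "financial": 1.5,
    "compliance": 1.5,
    "informational": 1.0,
}

def attention_score(authority: str, risk: str) -> float:
    return AUTHORITY_WEIGHT[authority] * RISK_MULTIPLIER[risk]

# mandatory + security -> 2.0, top of the reading list
# informational + informational -> 0.0, safe to skip
```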
That's it. Three regex systems, a header-aware Markdown splitter, and an attention calculator. Total code: ~2,000 lines of Python. Total external dependencies: zero.
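Of those pieces, the only one not sketched above is the splitter. Here's an illustrative version, assuming the job is to track a heading path while accumulating the text under each heading; it is not the library's actual code:

```python
import re

# Sketch of a header-aware Markdown splitter -- illustrative, not Decompose's code.
# Each unit keeps the heading path it lives under, so downstream scoring
# knows where the text came from.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headings(markdown: str):
    units, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            units.append({"heading_path": list(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2).strip()
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(title)
        else:
            buffer.append(line)
    flush()
    return units
```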
When does this actually beat an LLM?
Not always. Let me be specific about the tradeoffs.
Regex wins when:
- You need consistent, reproducible classification across documents
- You're preprocessing hundreds or thousands of documents before they hit a model
- You're running locally / air-gapped / ITAR-controlled
- You need an audit trail that explains exactly why a section was flagged
- You can't afford $0.01 per document at scale
- You need the answer in 4ms, not 4 seconds
LLMs win when:
- You need to understand nuance, implication, or cross-document reasoning
- The document uses domain-specific language that doesn't match standard patterns
- You're classifying intent, not structure
- You have one document, not a thousand
The insight is that these aren't mutually exclusive. Decompose runs before your LLM does. It's a preprocessor. Your agent reads 9 units of metadata instead of 1,786 characters of raw text. It decides which 2 units to send to the model for deeper analysis. The LLM still does the hard work — it just does less of the easy work.
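Here's roughly what that pipeline looks like from the agent's side. The `decompose()` call, the dict-style unit fields, and `llm.analyze()` are assumptions for illustration, not the library's confirmed API:

```python
# Hypothetical preprocessing step: classify cheaply, escalate selectively.
# The function names and unit fields below are assumptions for illustration.
def select_for_llm(units, threshold=1.0):
    """Keep only units whose attention score justifies a model call."""
    return [u for u in units if u["attention"] >= threshold]

# units = decompose(spec_text)                          # 9 classified units, no LLM
# critical = select_for_llm(units)                      # e.g. the 2 worth deeper analysis
# answer = llm.analyze([u["text"] for u in critical])   # model sees only what matters
```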
The cost math
Let's say you have 10,000 specification documents, averaging 5,000 characters each. Classify every one with an LLM at the per-call prices quoted at the top and the bill comes to roughly $135.
After decompose, your agent might send 20% of units to the LLM for deeper analysis. Now your LLM cost is $27 instead of $135, and the model sees pre-classified, structurally annotated text instead of raw blobs.
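Spelled out, assuming a per-document cost of about $0.0135, inside the range quoted at the top:

```python
docs = 10_000
cost_per_doc = 0.0135             # USD per full-document LLM call (assumed midpoint)

baseline = docs * cost_per_doc    # 135.0 -> ~$135 to classify everything with the LLM
after = baseline * 0.20           # 27.0  -> ~$27 when only 20% is escalated
```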
Try it
Every unit comes back with: authority, risk, attention, actionable, irreducible, entities, dates, financial, heading_path. No API key. No setup. Runs on a Raspberry Pi.
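A usage sketch, with a caveat: the import path and `decompose()` entry point below are assumptions for illustration (check the project's README for the real call). The per-unit fields are the ones listed above.

```python
# Hypothetical usage -- function name and import path are assumed, not confirmed.
from decompose import decompose

with open("mcp-transport-spec.md") as f:
    units = decompose(f.read())

for u in units:
    # Fields as described above: authority, risk, attention, heading_path, ...
    print(u["attention"], u["authority"], u["risk"], u["heading_path"])
```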
What we're building
Decompose is the open-source primitive. AECai is the product — a local-first document intelligence platform for architecture, engineering, and construction firms. It uses Decompose as its chunking and classification layer, then adds verification against building codes, cross-referencing against jurisdictional standards, and vector search across entire project libraries.
Both are built by Echology. Both run on your hardware. Neither sends data to anyone's cloud.
If you're building agents that read documents, let's talk. If you want to see what Decompose finds in your documents, send them over.