Here's a document classification task: given a section of the MCP Transport Specification, determine which parts are mandatory requirements, which are security risks, and which are informational filler your agent can skip.
An LLM can do this. You can prompt Claude or GPT-4 to read the text and classify each section. It will take 2-10 seconds, cost $0.003-0.02 per call, and give you slightly different answers every time you run it.
Or you can do it with regex in 3.78 milliseconds. Deterministically. Offline. For free.
That's what Decompose does. It's a Python library that splits text into classified semantic units — no LLM, no API key, no GPU. One function call. Here's the full output from processing the MCP transport spec:
The input
1,786 characters of specification text. Five sections plus a security warning: a transports overview, stdio, SSE (deprecated), streamable HTTP, and security requirements. The kind of document every MCP implementation needs to read carefully.
The output
9 units. Each one has an authority level, risk category, attention score, and actionability flag. No LLM was consulted. Here are the three that matter:
And the units that don't matter:
An agent using these scores would read 2 out of 9 units. It would know the security requirements are mandatory (MUST) and the security warning is advisory (SHOULD). It would skip the overview, the stdio description, and the SSE deprecation notice. It would save 78% of its context window on this document alone.
How it works
No magic. The classification runs on three things:
1. RFC 2119 keyword detection
"MUST", "SHALL", "MUST NOT" → mandatory or prohibitive. "SHOULD", "RECOMMENDED" → directive. "MAY" → permissive. No keywords → informational.
This isn't an opinion. It's the actual standard. RFC 2119 was written in 1997 specifically to make these words unambiguous in specifications. An LLM has to figure this out from its training data. Regex just matches the word.
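To make that concrete, here's a minimal sketch of the idea in plain Python. The pattern list, level names, and function are illustrative, not Decompose's actual source:

```python
import re

# Illustrative RFC 2119 keyword detection -- a sketch, not Decompose's code.
# Check the compound phrases ("MUST NOT") before their prefixes ("MUST").
AUTHORITY_PATTERNS = [
    ("prohibitive", re.compile(r"\b(?:MUST NOT|SHALL NOT)\b")),
    ("mandatory",   re.compile(r"\b(?:MUST|SHALL|REQUIRED)\b")),
    ("directive",   re.compile(r"\b(?:SHOULD|RECOMMENDED)\b")),
    ("permissive",  re.compile(r"\b(?:MAY|OPTIONAL)\b")),
]

def classify_authority(text: str) -> str:
    for level, pattern in AUTHORITY_PATTERNS:
        if pattern.search(text):
            return level
    return "informational"  # no RFC 2119 keywords found
```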
2. Risk category patterns
Words like "attack", "untrusted", "authentication", "HTTPS" → security. Dollar amounts and financial terms → financial. "OSHA", "safety-critical", "load-bearing" → safety-critical. "compliance", "violation", "regulation" → compliance.
These patterns are deterministic. They don't vary between runs. They don't hallucinate risk where there is none.
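Same idea for risk, again as a sketch: the word lists below are placeholders built from the examples above, not the library's full pattern set.

```python
import re

# Illustrative risk-category patterns -- placeholders, not Decompose's rule set.
# First match wins; anything unmatched falls through to "informational".
RISK_PATTERNS = [
    ("security",        re.compile(r"\b(?:attack|untrusted|authentication|https)\b", re.IGNORECASE)),
    ("financial",       re.compile(r"\$\s*\d|\b(?:invoice|payment|fee)\b", re.IGNORECASE)),
    ("safety_critical", re.compile(r"\b(?:OSHA|safety-critical|load-bearing)\b", re.IGNORECASE)),
    ("compliance",      re.compile(r"\b(?:compliance|violation|regulation)\b", re.IGNORECASE)),
]

def classify_risk(text: str) -> str:
    for category, pattern in RISK_PATTERNS:
        if pattern.search(text):
            return category
    return "informational"
```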
3. Attention scoring
A simple formula: authority weight × risk multiplier. Mandatory + security = high score. Informational + informational = 0.0. The numbers aren't arbitrary — they're calibrated to put genuinely critical content at the top of the reading list.
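A sketch of that calculation, with placeholder weights (the calibrated values live in the library, not here):

```python
# Sketch of attention scoring: authority weight x risk multiplier.
# These numbers are placeholders, not Decompose's calibrated values.
AUTHORITY_WEIGHT = {
    "prohibitive": 1.0,
    "mandatory": 1.0,
    "directive": 0.6,
    "permissive": 0.3,
    "informational": 0.0,
}

RISK_MULTIPLIER = {
    "security": 2.0,
    "safety_critical": 2.0,
    "financial": 1.5,
    "compliance": 1.5,
    "informational": 1.0,
}

def attention_score(authority: str, risk: str) -> float:
    return AUTHORITY_WEIGHT[authority] * RISK_MULTIPLIER[risk]

# mandatory + security -> 2.0, top of the reading list
# informational + informational -> 0.0, safe to skip
```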
That's it. Three regex systems, a header-aware Markdown splitter, and an attention calculator. Total code: ~2,000 lines of Python. Total external dependencies: zero.
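Of those pieces, the only one not sketched above is the splitter. Here's an illustrative version, assuming the job is to track a heading path while accumulating the text under each heading; it is not the library's actual code:

```python
import re

# Sketch of a header-aware Markdown splitter -- illustrative, not Decompose's code.
# Each unit keeps the heading path it lives under, so downstream scoring
# knows where the text came from.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headings(markdown: str):
    units, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            units.append({"heading_path": list(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2).strip()
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(title)
        else:
            buffer.append(line)
    flush()
    return units
```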
When does this actually beat an LLM?
Not always. Let me be specific about the tradeoffs.
Regex wins when:
- You need consistent, reproducible classification across documents
- You're preprocessing hundreds or thousands of documents before they hit a model
- You're running locally / air-gapped / ITAR-controlled
- You need an audit trail that explains exactly why a section was flagged
- You can't afford $0.01 per document at scale
- You need the answer in 4ms, not 4 seconds
LLMs win when:
- You need to understand nuance, implication, or cross-document reasoning
- The document uses domain-specific language that doesn't match standard patterns
- You're classifying intent, not structure
- You have one document, not a thousand
The insight is that these aren't mutually exclusive. Decompose runs before your LLM does. It's a preprocessor. Your agent reads 9 units of metadata instead of 1,786 characters of raw text. It decides which 2 units to send to the model for deeper analysis. The LLM still does the hard work — it just does less of the easy work.
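Here's roughly what that pipeline looks like from the agent's side. The `decompose()` call, the dict-style unit fields, and `llm.analyze()` are assumptions for illustration, not the library's confirmed API:

```python
# Hypothetical preprocessing step: classify cheaply, escalate selectively.
# The function names and unit fields below are assumptions for illustration.
def select_for_llm(units, threshold=1.0):
    """Keep only units whose attention score justifies a model call."""
    return [u for u in units if u["attention"] >= threshold]

# units = decompose(spec_text)                          # 9 classified units, no LLM
# critical = select_for_llm(units)                      # e.g. the 2 worth deeper analysis
# answer = llm.analyze([u["text"] for u in critical])   # model sees only what matters
```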
The cost math
Let's say you have 10,000 specification documents, averaging 5,000 characters each. Classify every one with an LLM at the per-call prices quoted at the top and the bill comes to roughly $135.
After decompose, your agent might send 20% of units to the LLM for deeper analysis. Now your LLM cost is $27 instead of $135, and the model sees pre-classified, structurally annotated text instead of raw blobs.
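Spelled out, assuming a per-document cost of about $0.0135, inside the range quoted at the top:

```python
docs = 10_000
cost_per_doc = 0.0135             # USD per full-document LLM call (assumed midpoint)

baseline = docs * cost_per_doc    # 135.0 -> ~$135 to classify everything with the LLM
after = baseline * 0.20           # 27.0  -> ~$27 when only 20% is escalated
```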
Try it
Every unit comes back with: authority, risk, attention, actionable, irreducible, entities, dates, financial, heading_path. No API key. No setup. Runs on a Raspberry Pi.
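A usage sketch, with a caveat: the import path and `decompose()` entry point below are assumptions for illustration (check the project's README for the real call). The per-unit fields are the ones listed above.

```python
# Hypothetical usage -- function name and import path are assumed, not confirmed.
from decompose import decompose

with open("mcp-transport-spec.md") as f:
    units = decompose(f.read())

for u in units:
    # Fields as described above: authority, risk, attention, heading_path, ...
    print(u["attention"], u["authority"], u["risk"], u["heading_path"])
```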
What we're building
Decompose is the open-source primitive. AECai is the product — a local-first document intelligence platform for architecture, engineering, and construction firms. It uses Decompose as its chunking and classification layer, then adds verification against building codes, cross-referencing against jurisdictional standards, and vector search across entire project libraries.
Both are built by Echology. Both run on your hardware. Neither sends data to anyone's cloud.
If you're building agents that read documents, let's talk. If you want to see what Decompose finds in your documents, send them over.