If your AI system processes documents and someone asks "why did it flag this section as safety-critical?" — what do you answer?
"The model said so" is not an answer. Not in construction. Not in healthcare. Not in any industry where documents have legal weight and wrong answers have professional liability.
"Simulation-aware" is a design pattern that answers this question. Every classification is traceable to specific causes. Every output is verifiable against its inputs. Every anomaly is detected before it reaches a human. The pattern is built from real computer science primitives — Merkle trees, causal DAGs, attention budgets, parity checks — not from marketing.
The core idea
Treat every document as a small universe. It has its own rules (mandatory requirements, permissive options), its own physics (structural loads, financial thresholds), its own timeline (deadlines, milestones), and its own entities (standards bodies, parties, jurisdictions).
A simulation-aware system processes this universe with the same rigor a physics engine processes a game world:
- Every state is verifiable (you can prove a finding is consistent with the input)
- Every transition is causal (you can trace why one state led to another)
- Every anomaly is detectable (impossible states get flagged)
- Resources are budgeted (attention goes where it matters)
This is the architecture behind AECai. It's 17 systems distributed across three engine pillars. Here's what they actually do.
System 1: Causal Consistency Networks
Every finding in the pipeline has a causal chain explaining why it was made. Not a confidence score. Not a probability. A directed acyclic graph of specific, traceable causes.
This matters for errors-and-omissions (E&O) insurance defense. When a client asks "why did your AI say this was critical?" you point to the causal chain, not the model weights. The graph has a consistency checker that detects cycles, orphaned findings, and contradictions. If a classification has no causal chain, it's flagged as an orphaned finding — something the system produced but can't explain.
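A minimal sketch of what that checker could look like, assuming findings are nodes that point at their upstream causes. The `Finding` dataclass and `check_consistency` function are illustrative names, not AECai's internals:

```python
# Sketch of a causal consistency check: find orphaned findings and causal cycles.
from dataclasses import dataclass, field

@dataclass
class Finding:
    finding_id: str
    label: str                                   # e.g. "safety-critical"
    causes: list = field(default_factory=list)   # ids of upstream causes (signals or other findings)

def check_consistency(findings: dict) -> dict:
    """Return findings with no causal chain and findings that sit on a cycle."""
    orphans = [f.finding_id for f in findings.values() if not f.causes]

    def has_cycle(node_id, visiting, visited):
        if node_id in visiting:
            return True
        if node_id in visited or node_id not in findings:
            return False                         # raw signals terminate the chain
        visiting.add(node_id)
        cyclic = any(has_cycle(c, visiting, visited) for c in findings[node_id].causes)
        visiting.discard(node_id)
        visited.add(node_id)
        return cyclic

    cycles = [fid for fid in findings if has_cycle(fid, set(), set())]
    return {"orphaned": orphans, "cycles": cycles}

findings = {
    "F1": Finding("F1", "safety-critical", causes=["keyword:shall", "ref:OSHA-1926"]),
    "F2": Finding("F2", "deadline", causes=[]),  # produced but unexplained
}
print(check_consistency(findings))               # {'orphaned': ['F2'], 'cycles': []}
```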
System 2: Reality Anchors
Some facts in a document are externally verifiable. "ACI 318-19" is a real standard. "January 15, 2026" is a real date. "OSHA" is a real organization. These are anchors — known-true reference points that everything else is measured against.
The confidence model uses the geometric mean of anchor confidences. A finding with three verified anchors has confidence ~1.0. A finding with one invalidated anchor drops to ~0.0. A finding with no anchors at all gets a baseline of 0.5 — the system acknowledges uncertainty rather than guessing.
This is how the system handles a withdrawn standard, a retracted report, or an amended contract. Invalidate the anchor, and everything that depended on it cascades to "suspect" automatically.
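A rough sketch of the confidence math and the cascade. The `finding_confidence` function and the anchor table are illustrative assumptions, not AECai's data model:

```python
# Geometric-mean anchor confidence with cascading invalidation (illustrative sketch).
import math

anchors = {
    "ACI 318-19":   1.0,   # verified standard
    "2026-01-15":   1.0,   # verified date
    "ASTM C150-22": 1.0,   # will be invalidated below
}

def finding_confidence(anchor_ids, anchors, baseline=0.5):
    """Geometric mean of anchor confidences; baseline when a finding has no anchors."""
    if not anchor_ids:
        return baseline
    confidences = [anchors.get(a, 0.0) for a in anchor_ids]
    return math.prod(confidences) ** (1.0 / len(confidences))

print(finding_confidence(["ACI 318-19", "2026-01-15"], anchors))   # ~1.0
print(finding_confidence([], anchors))                             # 0.5

# Supersede a standard: one anchor update, and dependent findings drop toward 0.
anchors["ASTM C150-22"] = 0.0
print(finding_confidence(["ASTM C150-22", "ACI 318-19"], anchors))  # 0.0 -> mark suspect
```

Because the geometric mean multiplies before taking the root, a single dead anchor pulls the whole finding down instead of being averaged away.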
System 3: Temporal Merkle Trees
Every semantic unit the pipeline produces gets hashed into a Merkle tree. The root hash represents the entire output. Any single unit can be verified without downloading the full dataset.
This isn't blockchain. There's no distributed consensus, no mining, no chain. It's a standard Merkle tree — the same data structure git uses to verify commits. The difference is that it operates at the semantic unit level, so you can verify that a single paragraph of a 200-page spec hasn't been altered without re-processing the entire document.
Why this matters: when you deliver an AI-processed compliance report to a client, they need to know the output hasn't been modified after processing. The Merkle proof is that guarantee.
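Here is a compact sketch of that structure: hash each semantic unit, build the tree, and verify one unit against the root with a short proof path. The hashing scheme and function names are assumptions for illustration, not AECai's on-disk format:

```python
# Minimal Merkle tree over semantic units, with per-unit proof paths (sketch).
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(units):
    """Hash each unit, then pair-and-hash upward until one root remains."""
    level = [h(u.encode()) for u in units]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof_path(levels, index):
    """Sibling hashes needed to recompute the root for one unit."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2 == 0))
        index //= 2
    return path

def verify(unit, path, root):
    node = h(unit.encode())
    for sibling, node_is_left in path:
        node = h(node + sibling) if node_is_left else h(sibling + node)
    return node == root

units = ["Section 1 ...", "Section 14: anchor bolts shall ...", "Section 15 ..."]
levels = build_levels(units)
root = levels[-1][0]
print(verify(units[1], proof_path(levels, 1), root))          # True
print(verify("tampered text", proof_path(levels, 1), root))   # False
```

The proof for one unit is just a handful of sibling hashes, so verification cost grows with the log of the document size, not its length.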
System 4: Attention Budgets
The pipeline has a fixed attention budget of 100 units per document. Safety-critical content consumes more. Boilerplate consumes less. The budget prevents the system from spending equal compute on every section.
This is the same principle behind Decompose's attention scoring, but applied to the full pipeline. In Decompose, attention decides what your agent reads. In AECai, attention decides what processing depth each unit receives: deep analysis for safety-critical content, shallow pass for background, skip for boilerplate.
The budget is finite. When it runs out, remaining units get minimal processing. This is intentional — it forces the system to prioritize. A 200-page spec where every section gets "deep" analysis is a system that doesn't know what matters.
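A toy version of the allocation logic. The tier thresholds and per-tier costs below are chosen purely for illustration:

```python
# Fixed attention budget: spend it in priority order, leftovers get minimal processing.
TOTAL_BUDGET = 100
COSTS = {"deep": 10, "shallow": 3, "skip": 0}

def tier_for(score: float) -> str:
    if score >= 0.8:        # e.g. mandatory language plus a safety-standard reference
        return "deep"
    if score >= 0.4:
        return "shallow"
    return "skip"

def allocate(sections):
    remaining = TOTAL_BUDGET
    plan = {}
    for name, score in sorted(sections.items(), key=lambda kv: kv[1], reverse=True):
        tier = tier_for(score)
        if COSTS[tier] <= remaining:
            plan[name] = tier
            remaining -= COSTS[tier]
        else:
            plan[name] = "minimal"   # budget exhausted
    return plan

sections = {"03 30 00 Concrete": 0.95, "01 10 00 Summary": 0.5, "Boilerplate": 0.1}
print(allocate(sections))
# {'03 30 00 Concrete': 'deep', '01 10 00 Summary': 'shallow', 'Boilerplate': 'skip'}
```

Sorting by score first means that when the budget runs out, it is the low-priority tail that gets minimal processing, not whatever happened to appear late in the document.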
System 5: Multi-Channel Error Correction
The system runs multiple independent extraction channels on the same content. Where the channels agree, confidence is high. Where they disagree, the content is flagged for review.
This catches OCR errors, misclassifications, and edge cases that any single extraction method would miss. The correction is conservative: unanimous = high confidence, majority = corrected with note, split = flagged for human review. No auto-correction on uncertain data.
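A minimal sketch of that voting rule; the channel examples and return shape are illustrative assumptions:

```python
# Conservative reconciliation across independent extraction channels.
from collections import Counter

def reconcile(values):
    """Unanimous -> accept; majority -> correct with a note; split -> human review."""
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    if votes == len(values):
        return {"value": value, "status": "high_confidence"}
    if votes > len(values) / 2:
        return {"value": value, "status": "corrected", "note": f"minority read: {dict(counts)}"}
    return {"value": None, "status": "needs_review", "candidates": dict(counts)}

# Three independent reads of the same field (e.g. OCR, layout parser, regex extractor).
print(reconcile(["ASTM C150-22", "ASTM C150-22", "ASTM C150-22"]))   # unanimous
print(reconcile(["ASTM C150-22", "ASTM C150-22", "ASTM C150-23"]))   # majority, noted
print(reconcile(["3000 psi", "8000 psi", "5000 psi"]))               # split, flagged
```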
System 6: Anomaly Detection
Documents can contain contradictions, impossible dates, and circular references. The simulation escape detector flags these before they reach a human.
A date in 1847 is almost certainly an OCR error or copy-paste mistake. Two different versions of the same standard cited in the same spec are a real conflict that needs resolution. Both are "simulation escapes" — states that shouldn't exist given the document's internal rules.
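Two of those checks sketched in a few lines. The year window, the regex patterns, and the standard-numbering format are illustrative assumptions, not the detector's actual rules:

```python
# Two "simulation escape" checks: implausible years and conflicting standard editions.
import re
from datetime import date

def impossible_dates(text, earliest=1950, latest=date.today().year + 50):
    """Years outside a plausible project window are likely OCR or paste errors."""
    years = [int(y) for y in re.findall(r"\b(1[89]\d{2}|20\d{2})\b", text)]
    return [y for y in years if y < earliest or y > latest]

def conflicting_standards(text):
    """Same standard cited with two different editions in one document."""
    editions = {}
    for std, ed in re.findall(r"\b(ASTM [A-Z]\d+)-(\d{2})\b", text):
        editions.setdefault(std, set()).add(ed)
    return {std: sorted(eds) for std, eds in editions.items() if len(eds) > 1}

spec = "Concrete per ASTM C150-22 ... cement per ASTM C150-23 ... recorded 1847."
print(impossible_dates(spec))         # [1847]
print(conflicting_standards(spec))    # {'ASTM C150': ['22', '23']}
```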
The full inventory
Six systems explained above, eleven more running underneath. Here's the complete map, organized by which engine pillar owns each system:
Plus four systems in the torsion subsystem (lazy scheduling, spin-curvature field computation, vortex caching, chirality feedback) that handle the initial field physics computations before classification begins.
Why the simulation framing
Fair question. Why call it "simulation-aware" instead of "document processing pipeline"?
Because the framing changes how you design systems. If you think of a document as text to extract data from, you build a pipeline. If you think of a document as a universe to verify, you build something different:
- Pipelines extract data. Simulations verify consistency.
- Pipelines classify content. Simulations explain classifications.
- Pipelines process sequentially. Simulations detect anomalies.
- Pipelines produce output. Simulations prove output is correct.
The simulation framing led us to systems we wouldn't have built otherwise. Causal consistency networks exist because we asked "can we trace the causal chain for every finding?" Reality anchors exist because we asked "what are the known-true facts in this document, and what happens when one is wrong?" Merkle trees exist because we asked "can we verify a single paragraph without re-processing 200 pages?"
These questions don't arise from a data extraction mindset. They arise from treating the document as a system with internal rules that can be checked.
What this enables
Three capabilities that a standard document AI can't provide:
1. Audit defense
When a client or regulator asks "why did your system flag this section as safety-critical?", you show the causal chain: keyword "shall" (mandatory authority) + reference to OSHA 1926 (safety standard) + structural discipline context = safety-critical classification. Each link in the chain is a specific, verifiable signal. Not a model confidence score.
2. Incremental verification
A 500-page spec was processed six months ago. Today, section 14 needs re-verification. The Merkle tree provides a proof path for section 14 without re-processing sections 1-13 and 15-500. If the proof validates, section 14 hasn't been tampered with. If it fails, something changed.
3. Cascading trust
ASTM C150-22 gets superseded by C150-23. One anchor invalidation, and every finding in every document that referenced the old standard gets flagged as "suspect" with a clear trail: "This finding was anchored to ASTM C150-22, which has been superseded." No re-processing needed — just an anchor update that cascades through the dependency graph.
What we open-sourced
Decompose is the open-source version of two of these systems: the attention scorer (system 4) and the irreducibility detector (system 5). It runs on pure regex, processes documents in ~14ms on average, and gives any agent the ability to prioritize what matters.
The remaining 15 systems are part of AECai, which runs locally on your hardware and processes AEC documents with the full simulation-aware architecture.
Both are built by Echology. If you're building document intelligence for an industry where wrong answers have consequences, let's talk.