How the Entity Scanner Finds Every Character in Your 200K-Word Manuscript
By Rob Chipman
I'm not a computer scientist. I don't have a CS degree. I'm a software engineer who builds things — and when I designed the entity scanner for AxiomWeaver, I knew exactly what it needed to do without knowing how to make it fast enough.
The requirement was simple: detect every entity name in the author's prose, in real time, as they type. Characters, items, locations, factions — anything in the Codex. When you type "Borin Ironforge," the scanner should recognize it instantly and link it to Borin's entity card.
The catch: LitRPG novels are big. 200,000 words. 50+ named characters, each with aliases. A brute-force search — checking every entity name against every paragraph — would choke at scale.
The algorithm I'd never heard of
I described the problem to Claude: "I need to search for hundreds of string patterns simultaneously in a large body of text, and it needs to be fast enough to run on every keystroke."
The answer came back immediately: Aho-Corasick.
It's an algorithm from 1975 — older than I am. Alfred Aho and Margaret Corasick designed it for the original fgrep Unix tool. The core idea is elegant:
- Take all your search patterns (every entity name and alias) and build them into a tree structure — a trie.
- Add shortcut links between branches so that when a partial match fails, the algorithm knows exactly where to jump instead of starting over.
- Walk through the text exactly once. One pass. No matter how many patterns you're searching for.
That's the key insight: whether you have 10 entity names or 10,000, the algorithm reads through your manuscript once. Adding patterns makes the automaton bigger to build, but the scan itself stays a single pass.
Claude didn't just hand me the algorithm — it walked me through why it works. The trie structure, the failure links, why it's O(n + m + z) where n is the text length, m is the total pattern length, and z is the number of matches. I understood it before I wrote a line of code.
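To make the idea concrete, here's a toy version in plain Rust: build the trie, wire up the failure links with a breadth-first pass, then scan the text in a single sweep. This is an illustrative sketch, not AxiomWeaver's code (the real scanner uses the aho-corasick crate), and the entity names are invented:

```rust
use std::collections::{HashMap, VecDeque};

// Toy Aho-Corasick automaton: a trie of patterns plus failure links,
// scanning the text in one pass no matter how many patterns there are.
struct Automaton {
    next: Vec<HashMap<char, usize>>, // trie edges per node
    fail: Vec<usize>,                // where to jump when a match breaks
    out: Vec<Vec<usize>>,            // pattern ids that end at this node
    lens: Vec<usize>,                // pattern lengths in chars
}

impl Automaton {
    fn new(patterns: &[&str]) -> Self {
        let mut a = Automaton {
            next: vec![HashMap::new()],
            fail: vec![0],
            out: vec![Vec::new()],
            lens: patterns.iter().map(|p| p.chars().count()).collect(),
        };
        // Step 1: insert every pattern into the trie.
        for (id, pat) in patterns.iter().enumerate() {
            let mut node = 0;
            for ch in pat.chars() {
                node = match a.next[node].get(&ch).copied() {
                    Some(n) => n,
                    None => {
                        a.next.push(HashMap::new());
                        a.fail.push(0);
                        a.out.push(Vec::new());
                        let n = a.next.len() - 1;
                        a.next[node].insert(ch, n);
                        n
                    }
                };
            }
            a.out[node].push(id);
        }
        // Step 2: BFS to wire up failure links (the "shortcut" edges).
        let mut queue: VecDeque<usize> = a.next[0].values().copied().collect();
        while let Some(node) = queue.pop_front() {
            let edges: Vec<(char, usize)> =
                a.next[node].iter().map(|(&c, &n)| (c, n)).collect();
            for (ch, child) in edges {
                let mut f = a.fail[node];
                while f != 0 && !a.next[f].contains_key(&ch) {
                    f = a.fail[f];
                }
                a.fail[child] = a.next[f].get(&ch).copied().unwrap_or(0);
                // A match ending here also ends every match its failure
                // node reports (a shorter name inside a longer one).
                let inherited = a.out[a.fail[child]].clone();
                a.out[child].extend(inherited);
                queue.push_back(child);
            }
        }
        a
    }

    // One pass over the text: returns (pattern_id, start_char_index).
    fn find(&self, text: &str) -> Vec<(usize, usize)> {
        let mut hits = Vec::new();
        let mut state = 0;
        for (i, ch) in text.chars().enumerate() {
            while state != 0 && !self.next[state].contains_key(&ch) {
                state = self.fail[state];
            }
            state = self.next[state].get(&ch).copied().unwrap_or(0);
            for &id in &self.out[state] {
                hits.push((id, i + 1 - self.lens[id]));
            }
        }
        hits
    }
}

fn main() {
    let ac = Automaton::new(&["borin", "borin ironforge", "ember"]);
    println!("{:?}", ac.find("borin ironforge lit an ember"));
}
```

Run against "borin ironforge lit an ember", it reports all three patterns, including the shorter "borin" inside the longer name; that overlap is exactly why the production scanner asks for leftmost-longest matching.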
Building it in Rust
The implementation uses the aho-corasick crate in Rust, which provides a battle-tested, optimized version of the algorithm. My job was everything around it:
Pattern preparation. Every entity name and alias gets lowercased for case-insensitive matching. Single-character patterns get skipped — too many false positives. The algorithm uses "leftmost longest" matching, so "Borin Ironforge" wins over just "Borin" when the full name appears.
Word boundary filtering. Raw matches get post-filtered: the character before and after each match must be non-alphanumeric. This prevents "art" from matching inside "artist" or "the" from matching inside "together." Unicode-aware, because curly quotes and em-dashes are real punctuation that authors use.
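The filter itself is small. A sketch in std Rust, with an assumed `on_word_boundary` helper name and byte-offset arguments:

```rust
// Sketch of the post-filter (assumed function shape, not the app's code):
// a raw match survives only if the characters on either side are
// non-alphanumeric. `start`/`end` are byte offsets on char boundaries,
// which is what a matcher over valid UTF-8 patterns hands back.
fn on_word_boundary(text: &str, start: usize, end: usize) -> bool {
    let before_ok = text[..start]
        .chars()
        .next_back()
        .map_or(true, |c| !c.is_alphanumeric());
    let after_ok = text[end..]
        .chars()
        .next()
        .map_or(true, |c| !c.is_alphanumeric());
    before_ok && after_ok
}

fn main() {
    // "art" inside "artist" is rejected: an 'i' follows the match.
    println!("{}", on_word_boundary("artist", 0, 3));
    // Curly quotes are punctuation, so a quoted name still matches.
    println!("{}", on_word_boundary("“Borin”", 3, 8));
}
```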
Offset conversion. The Rust crate returns byte offsets. The editor (ProseMirror) needs character positions. Multi-byte Unicode characters — curly quotes, em-dashes — mean these aren't the same thing. A dedicated conversion step handles this, with a test specifically for the edge case.
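The conversion itself is one line of std Rust, assuming a hypothetical helper name:

```rust
// Sketch of byte-to-character offset conversion (assumed shape, not the
// app's exact code). On pure ASCII the two offsets are identical; one
// curly quote or em-dash before the match makes them diverge.
fn byte_to_char_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].chars().count()
}

fn main() {
    let text = "“Borin” — hi";
    // 'h' starts at byte 16 but is only character index 10.
    println!("{}", byte_to_char_offset(text, 16));
}
```

Each conversion is linear in the prefix length, which is negligible for block-sized texts; longer spans would want a single pass that converts all of a block's matches together.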
19 unit tests. 23 end-to-end tests. 85 documented test scenarios across 9 categories. I don't ship features that aren't tested.
The performance numbers
Target: sub-millisecond for a single text block. Under 50ms for an entire chapter. Under 2 seconds for a full 100-chapter project scan.
In practice, single blocks scan so fast the timer rounds to zero. The Rust implementation processes hundreds of megabytes per second. For a writing tool, that's absurd overkill — and it means the scanner can run on every keystroke via a 500ms debounce without the author ever noticing.
Then I hit the wall
Everything worked perfectly in development. Small documents, a few entities, instant scans. I was feeling good.
Then I imported a 100,000-word manuscript and created a new entity.
The app froze.
When you create a new entity, the scanner has to rescan the entire document: every block of prose gets checked for the new name. On a normal document that takes milliseconds. At 100,000 words, the Rust scan was still fast, but the frontend work of applying marks to every matched entity across thousands of ProseMirror blocks locked the browser's main thread.
The fix: chunked rescanning
The solution was architectural, not algorithmic. The Rust scanner was already fast. The problem was the frontend trying to apply all the results at once.
I built a chunked rescan engine: split the mark operations into batches of 40 blocks. Between each batch, yield back to the browser via requestAnimationFrame. If the author types during the rescan, abort and restart after they pause.
The result: the scanner still processes the full document, but the UI stays responsive throughout. On a 100K-word manuscript, the rescan happens in the background while you keep writing. You see entity highlights appearing progressively down the page, like a wave.
Why this matters for authors
The scanner isn't a feature you interact with directly. It's infrastructure — the foundation for everything intelligent the app does.
When you hover over a character name and see their stat card, that's the scanner knowing the name is there. When the Codex shows "41 mentions across 41 blocks" for Borin Ironforge, that's the scanner's data. When a future version highlights continuity errors or tracks which characters appear in which scenes — that's all built on the scanner.
For the author, it means the Codex builds itself as you write. You don't tag entities manually. You don't maintain a separate list of "characters in this chapter." You just write, and the engine understands what you're writing about.
What I learned
I didn't know what Aho-Corasick was before I built this feature. Now I can explain it at a whiteboard. That's what building with AI looks like when it works well — not "AI wrote my code," but "AI was the collaborator who knew the algorithm I needed and helped me understand it before I implemented it."
The scanner shipped on February 17th. The chunked rescan fix shipped on the 28th. 19 unit tests, 23 E2E tests, handles 200K+ words in real time. It's one of the features I'm most proud of in the entire app — and it started with me not knowing how to do it.
That's the thing about building solo. You don't need to know everything upfront. You need to know what you want to build, and be willing to learn the rest along the way.