How I Read Rulebooks
Last updated: March 24, 2026
From PDF to searchable knowledge
When a publisher adds a rulebook, I do not immediately chew through the whole thing. Text extraction happens right away, but the real work -- chunking and embedding -- is on-demand. The first time someone asks a question about that game, I process the PDF and store the result. Every query after that is instant.
Here is the full processing pipeline:
(Diagram: PDF upload → Tika text extraction → [first query] → chunking → embedding → pgvector index)
The key design decision: I do not batch-process 3,300+ rulebooks speculatively. That would waste compute on PDFs nobody ever asks about.
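The lazy, process-on-first-query flow can be sketched roughly like this. All names here (chunk_store, extract_text, chunk_and_embed, answer_query) are illustrative stand-ins, not the real implementation, and a plain dict stands in for the database:

```python
chunk_store: dict[str, list] = {}   # stand-in for the pgvector-backed table

def extract_text(pdf_path: str) -> str:
    """Stand-in for the Tika extraction that runs at upload time."""
    return f"rules text from {pdf_path}"

def chunk_and_embed(text: str) -> list:
    """Stand-in for chunking + embedding (the expensive on-demand step)."""
    return [text[i:i + 40] for i in range(0, len(text), 40)]

def answer_query(game_id: str, pdf_path: str, question: str) -> list:
    # The first query for a game pays the processing cost;
    # every later query hits the cached chunks.
    if game_id not in chunk_store:
        chunk_store[game_id] = chunk_and_embed(extract_text(pdf_path))
    return chunk_store[game_id]
```

The point of the pattern is the cache check: processing cost is paid at most once per rulebook, and only for rulebooks someone actually asks about.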
What Apache Tika does
Tika is the extraction engine. It handles the usual PDF quirks -- multi-column layouts, scanned pages (given an OCR backend), embedded fonts -- and produces clean plain text. It also extracts the pdf:charsPerPage metadata I use to map character offsets back to page numbers.
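Since pdf:charsPerPage is a per-page character count, a running sum of it gives the offset at which each page ends, and a binary search maps any character offset in the extracted text back to a 1-based page number. A minimal sketch (function names are my own):

```python
from bisect import bisect_right
from itertools import accumulate

def page_boundaries(chars_per_page: list[int]) -> list[int]:
    """Cumulative character offsets at which each page ends.

    chars_per_page is the per-page count from Tika's
    pdf:charsPerPage metadata.
    """
    return list(accumulate(chars_per_page))

def page_for_offset(boundaries: list[int], offset: int) -> int:
    """Map a character offset in the extracted text to a 1-based page."""
    return bisect_right(boundaries, offset) + 1
```

For example, pages with 100, 200, and 50 characters give boundaries [100, 300, 350]; offset 99 falls on page 1 and offset 100 on page 2.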
Chunking strategy
I split the extracted text into overlapping chunks of roughly 500 tokens. The 10% overlap ensures that a sentence split across a chunk boundary is still fully represented in at least one chunk. Each chunk gets a page_start and page_end estimate derived from the cumulative character-per-page data.
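The chunking step can be sketched as a sliding window with a stride of 90% of the chunk size. Here "tokens" are approximated by whitespace-split words for simplicity (the real tokenizer would differ), and page_ends is the cumulative charsPerPage boundary list described above:

```python
from bisect import bisect_right

CHUNK_TOKENS = 500   # target chunk size, in (approximate) tokens
OVERLAP = 50         # 10% overlap between consecutive chunks

def chunk_text(text: str, page_ends: list[int]) -> list[dict]:
    """Split text into overlapping ~CHUNK_TOKENS-word chunks and attach
    page_start/page_end estimates. page_ends holds the cumulative
    character offset at which each page ends (from pdf:charsPerPage)."""
    # Record each token's starting character offset so chunks map to pages.
    tokens, offsets, pos = [], [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append(word)
        offsets.append(start)
        pos = start + len(word)

    def page_of(offset: int) -> int:
        return bisect_right(page_ends, offset) + 1

    chunks, step = [], CHUNK_TOKENS - OVERLAP
    for i in range(0, len(tokens), step):
        window = tokens[i:i + CHUNK_TOKENS]
        chunks.append({
            "text": " ".join(window),
            "page_start": page_of(offsets[i]),
            "page_end": page_of(offsets[i + len(window) - 1]),
        })
        if i + CHUNK_TOKENS >= len(tokens):
            break
    return chunks
```

Because the stride is 450 tokens, the last 50 tokens of each chunk reappear at the start of the next one, so a sentence straddling a boundary survives intact in at least one chunk.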
Why 768 dimensions?
I use jina-v2-small-en from the Jina AI embedding model family. 768-dimensional vectors hit a sweet spot: high enough semantic resolution to distinguish "when can I interrupt a player action" from "how do player actions work", yet small enough for fast HNSW indexing.
Numbers at a glance
| Metric | Value |
|---|---|
| Rulebooks in library | 3,300+ |
| PDFs pending (waiting for first query) | ~3,000 (normal) |
| Vector dimensions | 768 |
| Chunk size | ~500 tokens |
| Embedding model | jina-v2-small-en |
| Storage backend | PostgreSQL + pgvector |