How I Read Rulebooks
Last updated: March 24, 2026
From PDF to searchable knowledge
When a publisher adds a rulebook, I do not immediately chew through the whole thing. Text extraction happens right away, but the real work -- chunking and embedding -- is on-demand. The first time someone asks a question about that game, I process the PDF and store the result. Every query after that is instant.
Here is the full processing pipeline:
(Diagram: PDF upload → Tika text extraction → [first query] → chunking → embedding → pgvector index)
The key design decision: I do not batch-process 3,300+ rulebooks speculatively. That would waste compute on PDFs nobody ever asks about.
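The lazy, process-on-first-query flow can be sketched roughly like this. All names here (chunk_store, extract_text, chunk_and_embed, answer_query) are illustrative stand-ins, not the real implementation, and a plain dict stands in for the database:

```python
chunk_store: dict[str, list] = {}   # stand-in for the pgvector-backed table

def extract_text(pdf_path: str) -> str:
    """Stand-in for the Tika extraction that runs at upload time."""
    return f"rules text from {pdf_path}"

def chunk_and_embed(text: str) -> list:
    """Stand-in for chunking + embedding (the expensive on-demand step)."""
    return [text[i:i + 40] for i in range(0, len(text), 40)]

def answer_query(game_id: str, pdf_path: str, question: str) -> list:
    # The first query for a game pays the processing cost;
    # every later query hits the cached chunks.
    if game_id not in chunk_store:
        chunk_store[game_id] = chunk_and_embed(extract_text(pdf_path))
    return chunk_store[game_id]
```

The point of the pattern is the cache check: processing cost is paid at most once per rulebook, and only for rulebooks someone actually asks about.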
What Apache Tika does
Tika is the extraction engine. It handles the usual PDF quirks -- multi-column layouts, scanned pages (given an OCR backend), embedded fonts -- and produces clean plain text. It also extracts the pdf:charsPerPage metadata I use to map character offsets back to page numbers.
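Since pdf:charsPerPage is a per-page character count, a running sum of it gives the offset at which each page ends, and a binary search maps any character offset in the extracted text back to a 1-based page number. A minimal sketch (function names are my own):

```python
from bisect import bisect_right
from itertools import accumulate

def page_boundaries(chars_per_page: list[int]) -> list[int]:
    """Cumulative character offsets at which each page ends.

    chars_per_page is the per-page count from Tika's
    pdf:charsPerPage metadata.
    """
    return list(accumulate(chars_per_page))

def page_for_offset(boundaries: list[int], offset: int) -> int:
    """Map a character offset in the extracted text to a 1-based page."""
    return bisect_right(boundaries, offset) + 1
```

For example, pages with 100, 200, and 50 characters give boundaries [100, 300, 350]; offset 99 falls on page 1 and offset 100 on page 2.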
Chunking strategy
I split the extracted text into overlapping chunks of roughly 500 tokens. The 10% overlap ensures that a sentence split across a chunk boundary is still fully represented in at least one chunk. Each chunk gets a page_start and page_end estimate derived from the cumulative character-per-page data.
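The chunking step can be sketched as a sliding window with a stride of 90% of the chunk size. Here "tokens" are approximated by whitespace-split words for simplicity (the real tokenizer would differ), and page_ends is the cumulative charsPerPage boundary list described above:

```python
from bisect import bisect_right

CHUNK_TOKENS = 500   # target chunk size, in (approximate) tokens
OVERLAP = 50         # 10% overlap between consecutive chunks

def chunk_text(text: str, page_ends: list[int]) -> list[dict]:
    """Split text into overlapping ~CHUNK_TOKENS-word chunks and attach
    page_start/page_end estimates. page_ends holds the cumulative
    character offset at which each page ends (from pdf:charsPerPage)."""
    # Record each token's starting character offset so chunks map to pages.
    tokens, offsets, pos = [], [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append(word)
        offsets.append(start)
        pos = start + len(word)

    def page_of(offset: int) -> int:
        return bisect_right(page_ends, offset) + 1

    chunks, step = [], CHUNK_TOKENS - OVERLAP
    for i in range(0, len(tokens), step):
        window = tokens[i:i + CHUNK_TOKENS]
        chunks.append({
            "text": " ".join(window),
            "page_start": page_of(offsets[i]),
            "page_end": page_of(offsets[i + len(window) - 1]),
        })
        if i + CHUNK_TOKENS >= len(tokens):
            break
    return chunks
```

Because the stride is 450 tokens, the last 50 tokens of each chunk reappear at the start of the next one, so a sentence straddling a boundary survives intact in at least one chunk.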
Why 768 dimensions?
I use jina-v2-small-en from the Jina AI embedding model family. 768-dimensional vectors hit a sweet spot: high enough semantic resolution to distinguish "when can I interrupt a player action" from "how do player actions work", yet small enough for fast HNSW indexing.
Numbers at a glance
| Metric | Value |
|---|---|
| Rulebooks in library | 3,300+ |
| PDFs pending (waiting for first query) | ~3,000 (normal) |
| Vector dimensions | 768 |
| Chunk size | ~500 tokens |
| Embedding model | jina-v2-small-en |
| Storage backend | PostgreSQL + pgvector |