From PDF to answer: how rulebooks become knowledge
Last updated: March 23, 2026
When you ask "Can I interrupt an attack with a reaction?" the system doesn't search a PDF. It searches a compressed, enriched, vectorised representation of that PDF — one that was built the first time anyone ever asked about that game.
The four stages
Getting from a publisher's PDF to a searchable knowledge base involves four distinct stages: text extraction, chunking, AI enrichment, and embedding. Each stage transforms the content into something more machine-readable, and each introduces its own failure modes.
The stages run in sequence, but not all at the same time — and that's a deliberate architectural choice.
Stage 1: text extraction
When an admin uploads a PDF, Apache Tika runs immediately. Tika is an open-source content analysis toolkit that pulls raw character data out of documents. It's fast, it handles a wide range of PDF variants, and it produces plain text.
Plain text. That's the operative phrase. Tika doesn't render pages. It doesn't understand layout. It doesn't see images, diagrams, tables formatted with borders, or any visual element at all. It reads the character stream embedded in the PDF's internal structure and outputs it in the order it encounters it.
For a well-structured rulebook — one where the text flows linearly and the PDF encodes reading order correctly — Tika produces something quite usable. For a rulebook where the designer used sidebars, multi-column layouts, or floating callout boxes, Tika produces something that looks like it was run through a blender. The system stores whatever Tika produces as extracted_text, marks extraction_status as completed, and waits.
At this point, there are no chunks. No embeddings. No AI enrichment. Just a long string of text that may or may not be coherent.
Stage 2: chunking — cutting the book into pieces
A 200-page rulebook might contain 150,000 characters of text. You can't hand that to a language model and ask a question. Context windows have limits, and even where they don't, drowning a model in irrelevant text hurts answer quality. You need to find the relevant passages first — and that's what vector search does. But vector search needs chunks, not one enormous blob.
The system cuts the extracted text into segments of roughly 300 tokens, with overlap between adjacent chunks. The overlap matters: it ensures a passage near a chunk boundary doesn't lose its context on either side. A rule explained across two pages won't get silently amputated.
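A minimal sketch of that splitting step, assuming roughly 4 characters per token (so 300 tokens ≈ 1,200 characters); the function name and exact sizes are illustrative, not the production values:

```python
def chunk_text(text, chunk_chars=1200, overlap_chars=200):
    """Split text into overlapping chunks, keeping character offsets.

    ~1,200 chars approximates 300 tokens at ~4 chars/token; both
    numbers here are illustrative, not the production values.
    """
    chunks = []
    step = chunk_chars - overlap_chars
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        chunks.append({
            "id": len(chunks),       # sequential chunk ID
            "start": start,          # character offset in extracted_text
            "end": end,
            "text": text[start:end],
        })
        if end == len(text):
            break
        start += step                # advance by less than a full chunk,
                                     # so adjacent chunks overlap
    return chunks
```

Because each step advances by `chunk_chars - overlap_chars`, the tail of one chunk is repeated at the head of the next, which is what keeps a rule near a boundary intact on at least one side.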
Each chunk gets a sequential ID, a reference back to its source PDF, and metadata tracking which character positions in the original text it came from. That character-position data is used later to estimate page numbers — since Tika doesn't preserve page boundaries, the system uses Tika's pdf:charsPerPage metadata (character counts per page) to do the math. It's an estimate. It's usually accurate to within a page or two.
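The page math itself is just a running sum over Tika's per-page character counts. A sketch (helper name hypothetical):

```python
def estimate_page(char_offset, chars_per_page):
    """Map a character offset in the concatenated extracted text to a
    1-based page number, using per-page character counts of the kind
    Tika reports in its pdf:charsPerPage metadata."""
    cumulative = 0
    for page_number, count in enumerate(chars_per_page, start=1):
        cumulative += count
        if char_offset < cumulative:
            return page_number
    return len(chars_per_page)  # offset past the end: clamp to last page
```

The estimate inherits whatever error the extraction introduced: if Tika's character counts drift from what a human would call "page 23", the estimate drifts with them.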
Chunking is computationally cheap. The expensive parts come next.
Stage 3: AI enrichment — teaching the system what each piece means
Raw text chunks aren't great search targets on their own. A chunk that reads "If the active player has no cards in hand, they must pass their turn unless..." doesn't tell you what game mechanic this describes, what related concepts exist, or what the authoritative summary of this rule is.
AI enrichment addresses that. Each chunk gets sent to a language model (Claude or GPT-4o via OpenRouter), which generates three structured additions:
key_topics — the mechanical concepts the chunk actually covers. Combat resolution, resource generation, end-game scoring.
related_concepts — adjacent concepts that might be relevant. If a chunk discusses action points, related concepts might include turn structure, player order, and speed tokens.
authoritative_summary — a clean, precise restatement of what the chunk says, written for clarity rather than fidelity to the original prose.
This enrichment data doesn't replace the chunk text. It sits alongside it. When a user asks a question, the system can match against the original text and against the enriched concepts — which means it can find relevant chunks even when the user's vocabulary doesn't match the rulebook's vocabulary. A player asking about "reaction cards" might find the chunk that talks about "interrupt actions" because the AI enrichment bridged that gap.
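The shape of the enrichment request and response can be sketched as follows. The prompt wording and the `parse_enrichment` helper are illustrative, not the production prompt; only the three field names match what the system stores:

```python
import json

# Illustrative prompt shape -- the production prompt is not shown here.
ENRICH_PROMPT = """You are annotating a board-game rulebook excerpt.
Return JSON with exactly these keys:
  key_topics: list of mechanical concepts the excerpt covers
  related_concepts: list of adjacent concepts a player might search for
  authoritative_summary: one clear restatement of the rule

Excerpt:
{chunk_text}"""

def parse_enrichment(raw_response):
    """Validate a model's JSON reply into the three fields the
    pipeline stores alongside each chunk."""
    data = json.loads(raw_response)
    return {
        "key_topics": list(data["key_topics"]),
        "related_concepts": list(data["related_concepts"]),
        "authoritative_summary": str(data["authoritative_summary"]),
    }
```

Validating the reply matters in practice: a model that returns malformed JSON for one chunk shouldn't poison the whole batch.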
There's a cost. Enrichment takes a few seconds per chunk, and a 200-page rulebook might produce several hundred chunks. It runs in the background and it hits a paid API for each chunk.
Stage 4: embedding — converting meaning into numbers
Once a chunk has its text and AI-enriched metadata, the system generates embeddings. An embedding is a fixed-length vector — 768 numbers — that encodes the semantic meaning of a piece of text. Two chunks about the same concept will have vectors that point in similar directions in 768-dimensional space. Two chunks about unrelated topics won't.
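"Point in similar directions" has a precise meaning: cosine similarity, the standard measure for comparing embeddings. A minimal pure-Python version:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means
    same direction (same meaning), ~0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The same formula applies in 768 dimensions as in 3; the database just computes it far faster than a Python loop would.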
The embedding model is jina-v2-small-en, running via sentence-transformers on a local Python/Flask service. It generates three separate vectors per chunk:
- A main content embedding (from the raw chunk text)
- A concepts embedding (from the AI-enriched topics and related concepts)
- A topic keywords embedding
Three vectors per chunk, because different queries benefit from different representations. A vague question like "how does combat work?" might match best against the concepts embedding. A specific technical question might match best against the raw content.
These vectors get stored in PostgreSQL alongside the chunk text, using the pgvector extension with an HNSW index. HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbour algorithm that trades a tiny bit of recall for a large gain in query speed. Query latency is under 100ms for most games.
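A sketch of what such a query looks like. The `<=>` cosine-distance operator and the `USING hnsw` index syntax are real pgvector features; the table and column names here are hypothetical stand-ins for the actual schema:

```python
def knn_sql(vector_column, limit=8):
    """Build a pgvector nearest-neighbour query. `<=>` is pgvector's
    cosine-distance operator, served by an HNSW index when one exists
    on the column. Table and column names are hypothetical."""
    return (
        f"SELECT id, chunk_text, {vector_column} <=> %(query_vec)s AS distance "
        f"FROM pdf_chunks ORDER BY distance LIMIT {limit}"
    )

# The matching index, created once per vector column (pgvector syntax):
#   CREATE INDEX ON pdf_chunks USING hnsw (content_embedding vector_cosine_ops);
```

Ordering by the distance operator is what lets PostgreSQL use the HNSW index instead of scanning every row.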
The on-demand trigger
Here's the architectural choice that surprises most people: chunking, enrichment, and embedding don't happen when a PDF is uploaded. They happen when the first user asks a question about that game.
There are thousands of PDFs in the system. Most games have never been asked about. Running full enrichment and embedding pipelines on every uploaded PDF would cost significant compute and API money, mostly on games nobody queries. The on-demand architecture means that work only happens when there's actual demand.
When a question comes in for a game that hasn't been processed yet, the orchestrator calls triggerBackgroundPdfExtraction(). This fires off the chunking → enrichment → embedding pipeline asynchronously. A typical rulebook takes roughly 5–10 minutes.
Meanwhile, the user's question still needs an answer.
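The trigger-plus-fallback flow can be sketched like this. All names are illustrative stand-ins (the real trigger is triggerBackgroundPdfExtraction()), and the real system tracks processing state in the database rather than in memory:

```python
import threading

_status = {}                     # game_id -> "processing" | "ready"
_lock = threading.Lock()

def answer_question(game_id, question, run_pipeline, raw_fallback, vector_search):
    """First question for a game kicks off the pipeline in the
    background and gets answered from raw text; once the pipeline
    finishes, questions go through vector search."""
    with _lock:
        state = _status.get(game_id)
        if state is None:
            _status[game_id] = "processing"
    if state == "ready":
        return vector_search(game_id, question)
    if state is None:
        def worker():
            run_pipeline(game_id)          # chunking -> enrichment -> embedding
            with _lock:
                _status[game_id] = "ready"
        threading.Thread(target=worker, daemon=True).start()
    return raw_fallback(game_id, question)  # answer now, from raw text
```

The key property: the caller never blocks on the 5–10 minute pipeline. Whoever asks first pays a quality cost, not a latency cost.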
The first question: RAW-FIRST fallback
The first time anyone asks about a game, the chunks don't exist yet. The system can't do vector search, so it falls back to reading extracted_text directly — the raw Tika output — and sending relevant portions straight to the language model.
This is the RAW-FIRST fallback. It works, but it's less precise than vector search. The system can't find the most relevant passages through semantic similarity; it has to make do with simpler text retrieval. For complex rules questions, the answer quality is noticeably lower than what vector search produces.
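"Simpler text retrieval" means something like the following: score overlapping windows of the raw text by how many question words they contain, and keep the best few. This is a sketch of the kind of lexical matching the fallback relies on, not the exact production logic:

```python
def raw_first_passages(raw_text, question, window=400, max_passages=3):
    """Crude lexical retrieval over raw Tika output: slide a fixed-size
    window across the text, count question-word hits in each window,
    return the top-scoring passages. No semantics, just substrings."""
    words = {w.strip("?.,!:;").lower() for w in question.split()}
    words = {w for w in words if len(w) > 3}   # drop short stopword-ish terms
    scored = []
    step = window // 2
    for start in range(0, max(1, len(raw_text) - window + 1), step):
        passage = raw_text[start:start + window]
        lowered = passage.lower()
        score = sum(1 for w in words if w in lowered)
        if score:
            scored.append((score, start, passage))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:max_passages]]
```

The weakness is visible in the code: a question about "reaction cards" scores zero against a passage about "interrupt actions", exactly the vocabulary gap that enrichment and embeddings exist to bridge.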
After the background pipeline completes, every subsequent question about that game uses proper vector search. The RAW-FIRST response is a one-time cost. Users who ask a second question, or who ask the day after someone else first triggered processing, get full vector search quality.
There's no user-visible indicator that a game is being processed. It just happens in the background.
What the system actually stores
After the full pipeline runs, a typical rulebook chunk ends up with the following in the database:
- The raw chunk text (~300 tokens of extracted content)
- Character position offsets (start and end positions in the original text)
- Estimated page range (e.g., "Pages 23–24")
- AI-generated key_topics, related_concepts, and authoritative_summary
- Three 768-dimensional vectors (main, concepts, keywords)
- Metadata linking back to the source PDF, the game, and the partner who uploaded it
When a question comes in, the system embeds the question using the same model, runs three parallel vector similarity searches — one against each vector type — and combines the results. The top-scoring chunks go into the context window for the language model, which synthesises an answer and generates citations pointing back to the source PDF and estimated page numbers.
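One simple way to combine three parallel searches is to keep each chunk's best score across the three vector types, then rank. Max-merge is one reasonable strategy; the production combiner may weight the three searches differently:

```python
def combine_scores(main_hits, concept_hits, keyword_hits, top_k=5):
    """Merge three similarity-search result sets (chunk_id -> score)
    by taking each chunk's best score across the three vector types,
    then return the top_k chunks by that score."""
    best = {}
    for hits in (main_hits, concept_hits, keyword_hits):
        for chunk_id, score in hits.items():
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return sorted(best.items(), key=lambda kv: -kv[1])[:top_k]
```

A chunk that scores well on any one representation survives, which is the point of storing three vectors in the first place.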
The page numbers are calculated from character offsets and Tika's character-count metadata. They're usually right. They're occasionally off by one.
Common questions
How long does first-time processing take? Usually 5–10 minutes for a typical rulebook. A very large rulebook (300+ pages) can take longer. The user's first question still gets answered via RAW-FIRST fallback during this window.
Can I force reprocessing? Yes — admins can trigger a reprocess from the admin PDF panel. This re-runs chunking and embedding from the existing extracted text. It doesn't re-extract from the PDF itself unless the file is re-uploaded.
Does the system learn from corrections? Not in the machine-learning sense. There's no fine-tuning happening on question-answer pairs. Corrections to the rulebook require re-uploading the PDF or editing the extracted text.
Why PostgreSQL and not a dedicated vector database? pgvector with HNSW indexing handles the current scale (hundreds of games, millions of chunks) without needing an additional infrastructure component. It's not the right choice at every scale, but it's the right choice for this one.
What happens if Tika fails? Extraction failures are logged and the PDF is marked with extraction_status='failed'. Admins can retry. The system won't attempt chunking or embedding on a failed extraction.
Does the system handle multiple language editions of the same game? Partially. Each PDF is stored and processed separately. If you upload a French edition and an English edition, the system maintains two separate knowledge bases for the same game. Queries are answered from whatever PDFs are associated with that game in the library.