What the system can't see: limits of PDF extraction
Last updated: March 23, 2026
The system works by reading text. Rulebooks communicate through text, images, diagrams, colour, spatial layout, and visual examples. That gap between what the system reads and what a rulebook actually contains is the most important thing to understand before deciding how to use it — or whether to integrate with it.
This isn't a solvable problem with more compute. It's structural.
Text is not a rulebook
Apache Tika extracts the character stream from a PDF. Characters. Not layout, not images, not the spatial relationship between elements on a page. When Tika finishes, you have a long string of text — and everything that wasn't text is simply gone.
Modern board game rulebooks aren't designed as text documents. They're designed as visual artefacts. A good rulebook designer uses diagrams to show component placement, colour-coded examples to illustrate rule interactions, icons to replace repetitive prose, and progressive visual layouts to walk players through setup. That's not stylistic decoration. It's often load-bearing communication.
The text extraction process doesn't fail silently — it succeeds completely at what it does, and what it does is extract text. The limitation isn't a bug; it's the boundary of the approach.
Image-heavy manuals
Some rulebooks are more image than text. Games with rich visual production use frequent diagrams and annotated illustrations to explain mechanical flow. Wargames might use three-panel diagrams to show a complete combat sequence, where the prose just says "resolve combat as shown."
When Tika processes these rulebooks, the images don't become descriptions. They disappear. The prose remains. If that prose is self-contained — if it explains the rule completely without relying on the adjacent diagram — the system works fine. If the prose says "as illustrated above" or "see the diagram on the opposite page," you've got a reference to something that no longer exists.
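Dangling references of this kind can be flagged mechanically. A minimal sketch in Python (the phrase list and function name are illustrative, not part of the system):

```python
import re

# Phrases that typically point at visual content lost during extraction.
# The list is illustrative; a real deployment would tune it per rulebook language.
DANGLING_PATTERNS = [
    r"as (?:illustrated|shown|depicted)\s+(?:above|below|here)",
    r"see the (?:diagram|illustration|image|figure)",
    r"(?:diagram|figure) on the (?:opposite|previous|next|following) page",
]

def find_dangling_references(text: str) -> list[str]:
    """Return phrases that reference visuals the extraction dropped."""
    hits = []
    for pattern in DANGLING_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE))
    return hits

chunk = ("Place the tokens as shown above. "
         "For combat, see the diagram on the opposite page.")
print(find_dangling_references(chunk))
```

A scan like this can't recover the lost diagram, but it could mark chunks whose answers are likely to be incomplete.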
The practical impact scales with image density. A rulebook that's 20% images loses relatively little. A rulebook that's 60% images — common in modern production-heavy games — loses a significant fraction of its communicative content. The system doesn't know what it's missing. It answers from whatever text remains, and that answer might be incomplete without any indication that it's incomplete.
Publishers who want accurate coverage from this system should consider their PDF design choices. Dense prose-based rulebooks (many older Euros, wargames, RPG-adjacent games) perform well. Visually driven rulebooks designed for quick casual learning perform poorly.
Card games and card-heavy games
This is where the limitation gets sharp. In most card games — and in many deck-builders, engine-builders, and games with expansive card libraries — the rules live on the cards, not in the rulebook. The rulebook explains the general structure. The cards explain what actually happens.
Card text isn't in the rulebook PDF. It's on printed cardstock. Tika can't read cardstock.
A player asking "what does the Alchemist ability do?" is asking about card text. The system doesn't have it. The answer will be something like "I don't have information about that specific card" — which is honest but frustrating if the user expects full coverage.
Games with fixed card sets and published card databases are a better target for integration than living card games or games where the card pool is large and frequently updated. Worth knowing before you build a support workflow around this.
Symbol and icon systems
Many games replace repeated words with icons. Resource symbols, action types, terrain markers, status conditions — designers use visual shorthand because it's faster to read at the table. The icon appears in the rules wherever the concept appears.
Tika handles this in one of two ways: it skips the icon entirely, leaving a gap in the sentence, or it substitutes a Unicode character that may or may not be meaningful. A rule like "pay ⚙️ to activate this ability" might become "pay to activate this ability" or "pay ? to activate this ability" or, in some PDF encodings, something stranger.
The downstream effect: the AI enrichment step tries to make sense of incomplete sentences, and usually does a reasonable job inferring context. But when a rule is built around the icon — when the icon is the subject of the sentence — inference can fail. The system might not know that a sword symbol means combat, so a chunk that says "⚔ resolves before movement" might not match well against a question about combat resolution.
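One rough way to spot this damage in extracted text is to count replacement characters and sentences that appear to have lost their subject. A heuristic sketch (the verb list is invented for illustration):

```python
# U+FFFD is the Unicode replacement character; PDF extractors often emit it
# (or drop the glyph entirely) when an icon has no usable text mapping.
REPLACEMENT = "\ufffd"

def icon_damage_signals(chunk: str) -> dict:
    """Count symptoms of lost icons in an extracted chunk."""
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    return {
        "replacement_chars": chunk.count(REPLACEMENT),
        # A sentence starting with a verb like "resolves" suggests the
        # icon that was its subject has vanished entirely.
        "headless_sentences": sum(
            1 for s in sentences
            if s.split()[0].lower() in {"resolves", "costs", "triggers", "activates"}
        ),
    }

damaged = "resolves before movement. Pay \ufffd to activate this ability."
print(icon_damage_signals(damaged))
```

Neither signal tells you what the icon meant; they only suggest that a chunk should be treated as lower-confidence.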
Icon-heavy games include most worker placement games, most deck-builders, and most mid-weight Euros published in the last decade. This isn't a rare edge case.
Layout and visual context
PDFs use a variety of layout techniques that Tika's linearisation destroys. Sidebars. Multi-column layouts. Callout boxes with rules summaries. Tables with bordered cells. Two-page spreads where the left page states the rule and the right page provides an annotated example.
Tika reads the PDF's internal character stream. That stream may or may not reflect reading order — it depends on how the PDF was constructed. A publisher who used InDesign with proper tagged PDF output gets clean linearisation. A publisher who exported from Illustrator or assembled the PDF in Acrobat Pro from individual page exports might produce a character stream that interleaves columns, reads right-to-left, or jumbles sidebar text into the middle of unrelated paragraphs.
The chunking step cuts this linearised text into ~300-token pieces. If the linearisation was incorrect, the chunks contain incoherent mixtures of rules. The AI enrichment can partially compensate — it's good at identifying that a chunk covers "action economy" even if the sentences are out of order — but it can't fully fix a meaningless input.
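The chunking step can be pictured as a fixed-size cut over the linearised stream. A simplified sketch, using whitespace splitting as a stand-in for real tokenisation:

```python
def chunk_tokens(text: str, size: int = 300) -> list[str]:
    """Cut a linearised text stream into fixed-size token windows.
    Whitespace splitting approximates a real tokeniser here."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

# If column linearisation interleaved two rules, every downstream chunk
# inherits the damage: the chunker cannot tell that reading order was lost.
interleaved = ("On your turn, Combat is resolved take one action "
               "before movement, from the list below.")
print(chunk_tokens(interleaved, size=8))
```

The point of the sketch: chunking is purely positional. Garbage in the character stream becomes garbage in every chunk that covers it.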
There's no automated quality signal that tells the system "this extraction went wrong." It just stores whatever Tika produced.
What this means in practice
If you're a publisher evaluating this system for your game:
A rulebook that's mostly prose, with images used for illustrations rather than rule delivery, will be well-served. A rulebook where diagrams are integral to understanding combat or setup — not just decorative — will be partially served. A game that relies heavily on card text for its strategic depth won't be well-served for card-specific questions.
If you're a user wondering why an answer seems incomplete or wrong:
The system answered from what's in the text. If the correct answer to your question involves a diagram, a card, an icon, or a visual example, the system might not have that information. It's not making a mistake based on bad reasoning — it's working correctly from incomplete data.
If you're a developer building on top of this:
Extraction quality varies per title. There's no single "coverage score" for a rulebook, because quality depends on the specific questions being asked. A rulebook might answer combat questions perfectly and fail on setup questions because setup was explained entirely through diagrams. Test with your actual target questions, not general accuracy metrics.
What we're working toward
Text extraction, with or without OCR, will always have these limits. Solving them completely requires approaches that don't exist in production yet — multimodal models that can reason about PDF pages as images, structured card databases linked to rulebook references, community-contributed rule summaries that fill gaps the PDF can't.
Some of these are tractable near-term. Card data from game databases and publisher APIs could supplement rulebook text for card-heavy games. Community forum threads are already indexed and searchable, which partially compensates for rulebook gaps (the community has often answered the questions that diagrams would have answered).
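Card-data supplementation could be as simple as indexing card text alongside rulebook chunks. A sketch with a hypothetical card table (the names, structure, and card text below are invented, not from any real database):

```python
# Hypothetical card data; a real integration would pull this from a
# publisher API or community card database.
CARDS = {
    "Alchemist": "When played, draw two cards, then discard one.",
}

def build_corpus(rulebook_chunks: list[str]) -> list[dict]:
    """Index rulebook chunks and card texts as one searchable corpus,
    tagging each entry with its source so answers can cite it."""
    corpus = [{"source": "rulebook", "text": c} for c in rulebook_chunks]
    corpus += [{"source": f"card:{name}", "text": text}
               for name, text in CARDS.items()]
    return corpus

corpus = build_corpus(["Players take turns in clockwise order."])
print([entry["source"] for entry in corpus])
```

Tagging the source matters: a card-specific answer should cite the card entry, not a rulebook page range.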
Some aren't tractable without fundamental changes to how the system processes documents. Diagram understanding would require visual models on every PDF chunk, at considerable cost and added complexity. Not ruled out. Not happening soon.
The current state: text-extractable content answers well, visual content answers poorly or not at all, and the system doesn't always know the difference. That's the baseline for any deployment decision.
Common questions
The system gave me a wrong answer about a rule — is the rulebook wrong? Not necessarily. The extracted text might be garbled from a complex layout, the relevant explanation might have been in a diagram, or the answer might have been in a sidebar that got mixed in with adjacent text. Check the PDF yourself and compare to the cited page range.
Can I upload a corrected or annotated version of the PDF? Yes. Admins can re-upload a PDF to trigger fresh extraction. If you have a version of the rulebook with better layout (plain text, single column, minimal images), that version will extract more cleanly than a visually-designed edition.
Does the system know it doesn't have information about a card? Usually not. It knows what's in its chunks. If the card text isn't there, the system doesn't have an absence marker — it just can't find relevant content and will say so, or will answer from general rules context. It won't always know that the answer requires card data specifically.
Why not just photograph the pages and use vision AI? Cost and latency. Running a vision model over hundreds of pages per rulebook, at query time or at ingestion time, is expensive. It's a viable direction. It isn't the current architecture.