Autonomous Quality Optimisation
Last updated: March 31, 2026
The self-improving pipeline
Board Game Librarian includes a background service, the `quality-optimizer`, which evaluates the Q&A pipeline daily, proposes improvements to the prompt templates that drive synthesis, and applies validated improvements either automatically or after admin approval.
The goal is to catch systematic prompt issues before they accumulate and to fix them without requiring manual prompt engineering every time a pattern is reported.
Phase 1: Triage
The triage step runs a SQL query against `unified_interactions_v2` to classify every interaction:
| Classification | Meaning |
|---|---|
| `below_threshold` | Unusable: error, empty response, system/test message, unsupported language |
| `needs_attention` | Real question, real answer; send to LLM-as-judge for scoring |
| `evaluated` | Already scored by the judge in a previous run |
Triage is incremental: new interactions are classified on each run. Previously scored interactions are not re-evaluated.
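The classification rules amount to a short decision procedure. A minimal Python sketch, assuming illustrative field names (`eval_score`, `error`, `response`, `is_system`, `language`) rather than the actual `unified_interactions_v2` schema:

```python
def triage(interaction: dict) -> str:
    """Classify one interaction, mirroring the triage table above.
    Field names are illustrative assumptions, not the real schema."""
    SUPPORTED_LANGUAGES = {"en"}  # assumption: supported-language set

    # Already scored by the judge in a previous run: skip re-evaluation
    if interaction.get("eval_score") is not None:
        return "evaluated"
    # Unusable interactions never reach the judge
    if (interaction.get("error")
            or not interaction.get("response")
            or interaction.get("is_system")
            or interaction.get("language") not in SUPPORTED_LANGUAGES):
        return "below_threshold"
    # Real question, real answer: queue for LLM-as-judge scoring
    return "needs_attention"
```

Because the check for a prior `eval_score` comes first, re-running triage over an already-processed window is a no-op, which is what makes the incremental behavior cheap.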
Phase 2: LLM-as-Judge (CC agent)
A Claude Code agent (the "CC agent") picks up any `quality_runs` row with `status = pending_cc` and begins scoring.
For each flagged interaction, the judge evaluates:
| Dimension | What it checks |
|---|---|
| Accuracy | Does the answer correctly represent what the rulebook says? |
| Completeness | Does it cover all parts of the question without omitting key details? |
| Format | Is the structure appropriate for the question type (YES_NO vs PROCEDURAL vs EDGE_CASE)? |
| Relevance | Does the answer stay on topic and avoid tangential filler? |
Scores are stored in `quality_interaction_evals.eval_dimensions` (JSONB). The overall `eval_score` is a weighted average (0.0–1.0).
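As a sketch, the weighted average might be computed as follows. The per-dimension weights here are assumptions for illustration; the actual weighting is not documented on this page:

```python
# Hypothetical dimension weights -- the real values are not documented here.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "format": 0.15, "relevance": 0.15}

def overall_score(dimensions: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each 0.0-1.0),
    matching the shape stored in eval_dimensions (JSONB)."""
    total = sum(WEIGHTS[d] * dimensions[d] for d in WEIGHTS)
    return round(total / sum(WEIGHTS.values()), 3)
```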
The judge also identifies:
- Which synthesis template file was used (from `question_category` + tier)
- What specific issue, if any, occurred
- Whether a pattern exists across multiple interactions for the same template
Phase 3: Proposal generation
If a pattern is identified — e.g. 15 of the last 20 EDGE_CASE interactions scored below 0.70 on the conditionality dimension — the CC agent generates a YAML diff proposal targeting the specific template.
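The 15-of-20 example suggests a simple windowed check over recent per-template scores. A minimal sketch; the window size, cutoff, and hit count are illustrative defaults taken from that example, not documented configuration:

```python
def has_pattern(scores: list[float], cutoff: float = 0.70,
                window: int = 20, min_hits: int = 15) -> bool:
    """Pattern check sketch: True when at least `min_hits` of the last
    `window` scores for one template fall below `cutoff` (the 15-of-20
    example above). All thresholds are illustrative assumptions."""
    recent = scores[-window:]
    return sum(s < cutoff for s in recent) >= min_hits
```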
The proposal is stored in `quality_prompt_proposals`:
| Field | Content |
|---|---|
| `template_file` | e.g. `synthesis-tier1-edge_case-normal.yml` |
| `question_category` | e.g. `EDGE_CASE` |
| `issue_summary` | Plain-language description of the problem |
| `diff_before` | Original YAML content |
| `diff_after` | Proposed YAML content |
Phase 4: Test battery
Before any proposal is applied, it is validated against a representative set of questions (`quality_test_results`). The orchestrator's `/api/internal/quality/test-synthesis` endpoint processes each question using the proposed template in isolation, without affecting live traffic.
The test battery contains 40 questions sampled across question categories and games, with known expected answer characteristics. The score before and after the proposal is compared.
Decision thresholds
| Delta | Decision |
|---|---|
| >= 8% improvement, <= 2 regressions | `auto_apply`: applied immediately |
| 3–8% improvement | `pending_approval`: admin decides at /admin/quality |
| < 3% improvement or > 2 regressions | `blocked`: proposal discarded |
Thresholds are configurable via environment variables: `QUALITY_AUTO_APPLY_DELTA`, `QUALITY_PENDING_DELTA`, and `QUALITY_AUTO_APPLY_MAX_DEGRADED`.
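Putting the table together, the decision logic can be sketched as follows. The delta is interpreted here as an absolute score-point difference averaged over the battery, which is an assumption; the env var names come from above, and the defaults mirror the documented 8% / 3% / 2-regression thresholds:

```python
import os

def decide(before: list[float], after: list[float]) -> str:
    """Decision sketch for the thresholds above. `before`/`after` are
    per-question battery scores (0.0-1.0) under the old and proposed
    template; the delta interpretation is an assumption."""
    auto_delta = float(os.environ.get("QUALITY_AUTO_APPLY_DELTA", 0.08))
    pending_delta = float(os.environ.get("QUALITY_PENDING_DELTA", 0.03))
    max_degraded = int(os.environ.get("QUALITY_AUTO_APPLY_MAX_DEGRADED", 2))

    delta = (sum(after) - sum(before)) / len(before)       # mean score change
    regressions = sum(a < b for a, b in zip(after, before))  # questions that got worse

    if delta >= auto_delta and regressions <= max_degraded:
        return "auto_apply"        # applied immediately
    if delta >= pending_delta and regressions <= max_degraded:
        return "pending_approval"  # admin decides at /admin/quality
    return "blocked"               # proposal discarded
```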
Safe deployment
When a proposal is applied (auto or admin-approved):
- The YAML template file is overwritten with `diff_after`
- The Redis prompt cache (DB 2) is flushed: `redis-cli -n 2 FLUSHDB`
- The orchestrator restarts: `pm2 restart rules-orchestrator --update-env`
- A git commit records the change
If something goes wrong after deployment, the admin can roll back via the /admin/quality interface. Rollback overwrites the template with `diff_before` and repeats the flush-and-restart sequence.
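The apply sequence above can be sketched as a small deployment helper. The shell commands are the ones documented here; the file handling, the git invocation details, and the injectable `run` parameter (which makes the sequence testable without Redis or pm2) are illustrative assumptions. Rollback is the same sequence with `diff_before` as the new content:

```python
import subprocess
from pathlib import Path

def apply_template(template_path: str, new_yaml: str, message: str,
                   run=subprocess.run) -> None:
    """Deployment sketch for the apply sequence above."""
    # 1. Overwrite the template file with the proposed YAML (diff_after)
    Path(template_path).write_text(new_yaml)
    # 2. Flush the Redis prompt cache (DB 2)
    run(["redis-cli", "-n", "2", "FLUSHDB"], check=True)
    # 3. Restart the orchestrator so it reloads templates
    run(["pm2", "restart", "rules-orchestrator", "--update-env"], check=True)
    # 4. Record the change in git (invocation is an assumption)
    run(["git", "add", template_path], check=True)
    run(["git", "commit", "-m", message], check=True)
```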
Admin interface
The admin dashboard at /admin/quality shows all runs with:
- Run status (`pending_cc`, `running`, `pending_approval`, `auto_applied`, `approved`, `blocked`, `rolled_back`)
- Number of interactions triaged and flagged
- Test battery question count
- Score before and after (delta as a percentage)
- Decision and decision reason
- Per-run detail page showing the proposals and their diffs
Manual runs can be triggered from the dashboard. The default cron schedule is daily at 02:00 UTC.
Service details
- Service name: `quality-optimizer`
- Port: 3482
- Cron schedule: `0 2 * * *` (configurable via `QUALITY_CRON_SCHEDULE`)
- Default sample size: 120 interactions per run
- Battery size: 40 questions per proposal
Related pages
- Inside the prompt architecture — the YAML templates the optimizer modifies
- The Question & Answer Pipeline — the pipeline the optimizer monitors
- The RAG Pipeline in Detail — retrieval layer the optimizer works above