Evaluation Framework¶
A repeatable eval harness for measuring retrieval quality. Run evals before and after changes to chunking, embedding models, or prompts.
Quick start¶
Run `make eval`. This ingests the eval corpus, runs the test set, and prints a summary table. No API key is needed.
Manual usage¶
```shell
# Ingest the eval corpus into a fresh database
kb ingest --source eval/corpus --db eval.db

# Run evaluation
kb eval --db eval.db

# Run with custom options
kb eval --db eval.db --testset eval/testset.json --limit 10 --json

# Ingest and eval in one step
kb eval --db eval.db --corpus eval/corpus --ingest
```
Flags¶
| Flag | Default | Description |
|---|---|---|
| `--db` | `kb.db` | Database path |
| `--testset` | `eval/testset.json` | Path to query/answer test set |
| `--corpus` | `eval/corpus` | Path to eval corpus directory |
| `--limit` | `20` | K value for retrieval (top-K fragments) |
| `--ingest` | `false` | Ingest the corpus before running eval |
| `--json` | `false` | Output structured JSON instead of a table |
Metrics¶
Retrieval metrics (reported at K=5, K=10, K=20)¶
- Hit@K — Was at least one expected source file found in the top-K results?
- Recall@K — What fraction of the expected source files appeared in the top-K retrieved fragments?
- Precision@K — What fraction of top-K fragments came from expected source files?
- MRR — Mean reciprocal rank of the first relevant fragment, averaged across queries.
- Avg confidence — Mean confidence score across retrieved fragments.
- Avg freshness — Mean freshness score across retrieved fragments.
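The retrieval metrics above can be sketched in a few lines. This is an illustrative computation over per-query lists of retrieved source files, not the harness's actual implementation; the function names are hypothetical.

```python
def hit_at_k(retrieved, expected, k):
    """Hit@K: at least one expected source file appears in the top-K results."""
    return any(src in expected for src in retrieved[:k])

def recall_at_k(retrieved, expected, k):
    """Recall@K: fraction of expected source files found in the top-K results."""
    found = {src for src in retrieved[:k] if src in expected}
    return len(found) / len(expected)

def precision_at_k(retrieved, expected, k):
    """Precision@K: fraction of top-K fragments that came from expected sources."""
    top = retrieved[:k]
    return sum(1 for src in top if src in expected) / len(top)

def reciprocal_rank(retrieved, expected):
    """RR of the first relevant fragment; MRR averages this across queries."""
    for rank, src in enumerate(retrieved, start=1):
        if src in expected:
            return 1.0 / rank
    return 0.0

# One query's retrieved fragment source files, in rank order:
retrieved = ["api.go", "config.go", "README.md", "architecture.md"]
expected = {"config.go", "architecture.md"}

print(hit_at_k(retrieved, expected, 3))        # True
print(recall_at_k(retrieved, expected, 3))     # 0.5  (1 of 2 expected files)
print(precision_at_k(retrieved, expected, 4))  # 0.5  (2 of 4 fragments relevant)
print(reciprocal_rank(retrieved, expected))    # 0.5  (first hit at rank 2)
```

Note that Hit@K and MRR reward finding any relevant fragment early, while Recall@K and Precision@K measure coverage and noise in the whole top-K window.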
Chunking stats¶
- Total fragment count
- Fragments per file
- Mean, median, and P95 token length (whitespace-approximated)
These track chunking quality over time, so a change that produces 3x more fragments or halves average length is immediately visible.
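A minimal sketch of the whitespace approximation, assuming fragments are plain strings; the helper below is hypothetical, not part of the `kb` CLI.

```python
import statistics

def chunk_stats(fragments):
    """Whitespace-approximated token length per fragment, plus summary stats.

    P95 uses the nearest-rank method on the sorted lengths.
    """
    lengths = sorted(len(f.split()) for f in fragments)
    p95_idx = max(0, round(0.95 * len(lengths)) - 1)
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p95": lengths[p95_idx],
    }

fragments = ["one two three", "four five", "six seven eight nine"]
print(chunk_stats(fragments))
```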
Eval corpus¶
Located in `eval/corpus/`. A set of fictional files about an "Acme Widget Service" designed to exercise different retrieval scenarios:
| File | Purpose |
|---|---|
| `README.md` | Project overview, features, quick start |
| `config.go` | Go config structs and validation |
| `api.go` | HTTP API handlers |
| `architecture.md` | System design (intentionally contradicts README on some details) |
| `runbook.md` | Operational procedures and troubleshooting |
| `CHANGELOG.md` | Version history and release notes |
| `CONTRIBUTING.md` | Contribution guidelines |
| `deploy/kubernetes.yaml` | Kubernetes deployment manifests |
| `design-review.md` | Design review discussion |
| `incident-review.md` | Incident post-mortem |
The corpus is checked into the repo and should not change between eval runs unless you're intentionally updating it.
Test set¶
Located in `eval/testset.json`. Each entry has:
```json
{
  "id": "q01",
  "query": "What database does the widget service use?",
  "expected_sources": ["config.go", "architecture.md"],
  "reference_answer": "PostgreSQL, configured via DATABASE_URL.",
  "category": "direct_extraction"
}
```
Categories:

- direct_extraction — single-source factual lookup
- cross_document — requires information from multiple files
- knowledge_update — tests whether newer information is preferred
- abstention — no good answer exists in the corpus
- pronoun_resolution — requires resolving references across context
- vocabulary_mismatch — query uses different terms than the corpus
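Based on the entry schema above, a small validator for the test set might look like this. It is a sketch that assumes the fields and categories listed here are the complete set.

```python
import json

VALID_CATEGORIES = {
    "direct_extraction", "cross_document", "knowledge_update",
    "abstention", "pronoun_resolution", "vocabulary_mismatch",
}
REQUIRED_FIELDS = {"id", "query", "expected_sources", "reference_answer", "category"}

def validate_testset(raw):
    """Parse testset JSON and check each entry for required fields and a known category."""
    entries = json.loads(raw)
    for entry in entries:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('id', '?')}: missing fields {sorted(missing)}")
        if entry["category"] not in VALID_CATEGORIES:
            raise ValueError(f"{entry['id']}: unknown category {entry['category']!r}")
    return entries

raw = """[{
  "id": "q01",
  "query": "What database does the widget service use?",
  "expected_sources": ["config.go", "architecture.md"],
  "reference_answer": "PostgreSQL, configured via DATABASE_URL.",
  "category": "direct_extraction"
}]"""
entries = validate_testset(raw)
print(len(entries))  # 1
```

Running something like this before an eval catches typos in new entries (a misspelled category, a missing field) instead of silently skewing the metrics.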
Extending the eval¶
**Adding queries:** Edit `eval/testset.json`. Include `expected_sources` (filenames that should appear in results) and a `reference_answer` for human comparison.

**Adding corpus files:** Add files to `eval/corpus/` and write questions that reference them. Re-run `make eval` to see the impact.

**Comparing configurations:** Run eval with different embedding models or chunk sizes by changing `KB_EMBEDDING_MODEL` or `KB_MAX_CHUNK_SIZE`, then compare the summary tables.