- System
- BM25 → dense (MiniLM L6) → cross-encoder rerank → T5-small generation
- Dataset
- MS MARCO passage, 8.8 M docs, 6 980 paired queries (dev/small)
- Main result
- Reranking lifts Token-F1 from 0.197 to 0.368 (Δ +0.171, 95 % CI [+0.163, +0.178], paired bootstrap)
- Reproducibility
- Schema-v2 manifest contract; make reproduce-baseline from clean clone
- Status
- Active — CPU-only, single-machine, 371 tests passing
Problem
Surface generation metrics (Token-F1, ROUGE-L) on retrieve → rerank → generate pipelines do not, by themselves, tell you whether the generator’s output is actually supported by the retrieved evidence. Reranking the retrieval stage can simultaneously raise surface metrics and reduce grounding on the same fixed generator. Quantifying this requires a pipeline where every component is exchangeable, every run is reproducible to a bit-level manifest, and statistical comparisons are paired and confidence-bounded.
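To make the distinction concrete, here is a minimal sketch of how a surface metric and a lexical grounding score can diverge on the same output. It assumes lowercase whitespace tokenization and no stop-word filtering; `tokenize`, `token_f1`, and `lexical_support` are illustrative names, not the project's functions, and the project's content-token grounding metric may be defined differently.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; the project's tokenizer may differ.
    return text.lower().split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a single reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lexical_support(prediction: str, passages: list[str]) -> float:
    """Fraction of prediction tokens that occur anywhere in the retrieved passages."""
    pred = tokenize(prediction)
    if not pred:
        return 0.0
    evidence = {tok for passage in passages for tok in tokenize(passage)}
    return sum(tok in evidence for tok in pred) / len(pred)


prediction = "paris is the capital"
reference = "the capital is paris"
passages = ["lyon is a city in france"]

surface = token_f1(prediction, reference)        # compares against the gold answer
grounding = lexical_support(prediction, passages)  # compares against the evidence
```

In this toy case the prediction matches the reference token for token, yet only one of its four tokens appears in the retrieved passage. The two scores measure different things, which is why they can move in opposite directions.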
What I built
Four entry-point scripts under experiments/ — one per pipeline stage — sharing a single source-of-truth config and writing a schema-versioned manifest.json next to every metrics.json. The contract is enforced at write time: a run that fails to capture the six required reproducibility fields cannot land an artifact.
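A write-time contract of this kind could look roughly like the sketch below. The six field names are the ones this page lists for the manifest contract. Everything else is an assumption made for illustration: the nested-dict layout implied by the dotted names, the helper names `missing_fields` and `write_run`, and the choice to raise `ValueError`. None of it is the project's actual code.

```python
import json
from pathlib import Path

# Field names come from the manifest contract; treating them as dotted paths
# into a nested dict is an assumption.
REQUIRED_FIELDS = (
    "git.commit",
    "git.dirty",
    "extra.seed",
    "extra.resolved_config_hash",
    "extra.data_fingerprint",
    "extra.env_fingerprint",
)


def missing_fields(manifest: dict) -> list[str]:
    """Return the required dotted paths that are absent or set to None."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = manifest
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        # `False` (e.g. git.dirty == False) is a valid value; only None counts as missing.
        if node is None:
            missing.append(path)
    return missing


def write_run(out_dir: Path, metrics: dict, manifest: dict) -> None:
    """Hypothetical writer: validate first, so an incomplete run writes nothing."""
    missing = missing_fields(manifest)
    if missing:
        raise ValueError(f"manifest missing required fields: {missing}")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
```

Validating before any file is created is the design point: no metrics.json can exist without a complete manifest.json beside it.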
Headline result
Generation × retrieval source, full MS MARCO dev/small, 6 980 paired queries, paired bootstrap (N = 10 000).
Retrieval-only, BM25 on the full 8.8 M-passage corpus (6 980 queries):
Technical components
- Sparse retrieval
- BM25, k₁ = 1.5, b = 0.75, top-k = 1000
- Dense retrieval
- sentence-transformers/all-MiniLM-L6-v2 + FAISS flat IP, qrels-anchored 50 000-passage sample
- Reranking
- cross-encoder/ms-marco-MiniLM-L-6-v2 over dense top-100
- Generation
- t5-small, frozen, max_new_tokens = 64
- Retrieval metrics
- MRR@k, Recall@k, nDCG@k
- Generation metrics
- Token-F1, Exact-Match, ROUGE-L, BLEU (best-of-N references)
- Grounding
- Lexical content-token, 3-gram, NLI entailment via cross-encoder/nli-deberta-v3-small
- Statistical core
- Paired bootstrap, N = 10 000 resamples, seed 42
- Manifest contract
- Required at write time: git.commit, git.dirty, extra.seed, extra.resolved_config_hash, extra.data_fingerprint, extra.env_fingerprint
- CI
- GitHub Actions, Python 3.10, CPU-only torch; pytest -q + ruff check on every push
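The paired bootstrap in the statistical core can be sketched as follows. This is a minimal standard-library version under stated assumptions: a percentile interval, resampling whole queries with replacement, and the illustrative name `paired_bootstrap_ci`. The project's own implementation may use a different interval type or a vectorized library. The defaults mirror the resample count and seed listed above.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, seed=42, alpha=0.05):
    """Mean difference (b - a) with a percentile bootstrap CI over paired queries.

    scores_a[i] and scores_b[i] must be the metric for the same query under
    the two systems being compared.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("scores must be non-empty and of equal length")
    # Pairing: reduce to one difference per query, then resample the differences.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    # Percentile interval: read the alpha/2 and 1 - alpha/2 order statistics.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return fmean(diffs), lo, hi


# Usage with made-up per-query scores (not results from this project):
baseline = [0.10, 0.25, 0.00, 0.40, 0.30]
reranked = [0.30, 0.20, 0.10, 0.60, 0.50]
delta, ci_low, ci_high = paired_bootstrap_ci(baseline, reranked, n_resamples=1_000)
```

Resampling per-query differences rather than each system's scores independently is what makes the comparison paired. Query difficulty is shared by both systems, so it cancels out of the difference and the interval is tighter.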
Reproduce
Clean clone, CPU-only laptop:
make install
make reproduce-baseline
make reproduce-baseline runs the BM25 stage under a clean-tree checkpoint and then validates the produced manifest against committed fingerprints via scripts/verify_reproduction.py.
Current status
Active. Schema-v2 manifest contract closed in the reproducibility-protocol release, then extended with per-task NLI profile fields (backbone, score formula, threshold, label-index map, premise→hypothesis direction). The upcoming research/metric-robustness round runs the full factorial — multiple NLI backbones × score formulas × thresholds × seeds — with paired bootstrap CIs, a length covariate, and a failure taxonomy as a versioned data product.