msmarco-genqa — Gioia Zheng

System: BM25 → dense (MiniLM L6) → cross-encoder rerank → T5-small generation
Dataset: MS MARCO passage, 8.8 M docs, 6 980 paired queries (dev/small)
Main result: Reranking lifts Token-F1 from 0.197 to 0.368 (Δ +0.171, 95 % CI [+0.163, +0.178], paired bootstrap)
Reproducibility: Schema-v2 manifest contract; make reproduce-baseline from clean clone
Status: Active — CPU-only, single-machine, 371 tests passing

Problem

Surface generation metrics (Token-F1, ROUGE-L) on retrieve → rerank → generate pipelines do not, by themselves, show whether the generator’s output is supported by the retrieved evidence. With the generator held fixed, reranking the retrieval stage can raise surface metrics and reduce grounding at the same time. Quantifying this requires a pipeline in which every component is exchangeable, every run is reproducible to a bit-level manifest, and statistical comparisons are paired and confidence-bounded.

What I built

Four entry-point scripts under experiments/ — one per pipeline stage — sharing a single source-of-truth config and writing a schema-versioned manifest.json next to every metrics.json. The contract is enforced at write time: a run that fails to capture the six required reproducibility fields cannot land an artifact.
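
A minimal sketch of how a write-time contract of this kind could be enforced, assuming the manifest is a nested dict and using the six dotted field names listed under Technical components. The helper names missing_fields and write_artifact are hypothetical and are not taken from the repository.

```python
import json
import tempfile
from pathlib import Path

# The six required fields named by the manifest contract (dotted paths).
REQUIRED_FIELDS = (
    "git.commit",
    "git.dirty",
    "extra.seed",
    "extra.resolved_config_hash",
    "extra.data_fingerprint",
    "extra.env_fingerprint",
)


def missing_fields(manifest: dict) -> list[str]:
    """Return the required dotted paths that are absent or None."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = manifest
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        # `is None` on purpose: git.dirty == False is a valid captured value.
        if node is None:
            missing.append(path)
    return missing


def write_artifact(out_dir: Path, metrics: dict, manifest: dict) -> None:
    """Write manifest.json and metrics.json together, or nothing at all."""
    missing = missing_fields(manifest)
    if missing:
        # Validation happens before any file or directory is created.
        raise ValueError(f"manifest is missing required fields: {missing}")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
```

Validating before the first write is the design point: a run with an incomplete manifest raises and leaves no metrics.json behind.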

Headline result

Generation × retrieval source, full MS MARCO dev/small, 6 980 paired queries, paired bootstrap (N = 10 000).

Token-F1: 0.368 (Δ +0.171 vs 0.197, CI [+0.163, +0.178])
ROUGE-L: 0.368 (Δ +0.174 vs 0.193, CI [+0.166, +0.181])
BLEU: Δ +0.221 vs BM25 (CI strictly > 0)
Exact Match: Δ +0.047 vs BM25 (CI strictly > 0)
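
The Δ and CI figures above come from a paired bootstrap. A minimal sketch of that procedure follows. It assumes per-query scores for the two systems are available as equal-length lists, and it uses a percentile interval, which is an assumption: the source states only the resample count (10 000) and the seed (42). The function name paired_bootstrap_ci is hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile CI for mean(scores_b - scores_a) over paired per-query scores.

    Returns (observed_delta, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("need two non-empty, equally long score lists")
    # Pairing: resample per-query differences, so both systems are always
    # evaluated on the same resampled set of queries.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, low, high
```

Calling paired_bootstrap_ci(bm25_f1, reranked_f1) on two per-query Token-F1 lists (hypothetical variable names) would return the observed delta and its interval. A "CI strictly > 0" result corresponds to the lower bound being positive.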

Retrieval-only, BM25 on the full 8.8 M-passage corpus (6 980 queries):

MRR@10: 0.170
Recall@100: 0.621
Recall@1000: 0.815
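
For reference, the retrieval metrics above can be computed as in the following sketch. It assumes a run is a dict mapping each query id to a ranked list of passage ids, and qrels is a dict mapping each query id to its set of relevant passage ids. These data shapes and the function names are illustrative assumptions, not the repository's interfaces.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant passages that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_over_queries(metric, runs, qrels, k):
    """Average a per-query metric over every judged query.

    A judged query with no retrieved list scores 0 rather than being skipped.
    """
    scores = [metric(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging over the judged queries, rather than over the queries that happen to have results, keeps the denominator fixed when comparing retrieval sources.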

Technical components

Sparse retrieval: BM25, k₁ = 1.5, b = 0.75, top-k = 1000
Dense retrieval: sentence-transformers/all-MiniLM-L6-v2 + FAISS flat IP, qrels-anchored 50 000-passage sample
Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 over dense top-100
Generation: t5-small, frozen, max_new_tokens = 64
Retrieval metrics: MRR@k, Recall@k, nDCG@k
Generation metrics: Token-F1, Exact-Match, ROUGE-L, BLEU (best-of-N references)
Grounding: Lexical content-token, 3-gram, NLI entailment via cross-encoder/nli-deberta-v3-small
Statistical core: Paired bootstrap, N = 10 000 resamples, seed 42
Manifest contract: Required at write time: git.commit, git.dirty, extra.seed, extra.resolved_config_hash, extra.data_fingerprint, extra.env_fingerprint
CI: GitHub Actions, Python 3.10, CPU-only torch — pytest -q + ruff check on every push
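
As an illustration of the generation metrics listed above, here is a sketch of Token-F1 with best-of-N references. The normalization (lowercasing and keeping alphanumeric runs) is an assumption in the style of SQuAD evaluation. The source does not specify the tokenizer, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed normalization: lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against one reference."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Two empty strings match exactly; one empty side scores 0.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_of_n_token_f1(prediction, references):
    """Score against every reference and keep the best match."""
    return max((token_f1(prediction, r) for r in references), default=0.0)
```

For example, best_of_n_token_f1("Paris", ["London", "Paris, France"]) scores 0 against the first reference and about 0.667 against the second (precision 1, recall 0.5), so the best-of-N score is about 0.667.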

Reproduce

Clean clone, CPU-only laptop:

make install
make reproduce-baseline

make reproduce-baseline runs the BM25 stage under a clean-tree checkpoint and then validates the produced manifest against committed fingerprints via scripts/verify_reproduction.py.
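
The validation step could look roughly like the sketch below. This is not the contents of scripts/verify_reproduction.py. Which manifest fields are compared, and the flat dict of committed expected values keyed by dotted path, are both assumptions made for illustration.

```python
# Assumption: these are the fields checked against committed values.
COMPARED_FIELDS = (
    "extra.seed",
    "extra.resolved_config_hash",
    "extra.data_fingerprint",
)


def lookup(manifest, dotted_path):
    """Follow a dotted path such as 'extra.seed' through nested dicts."""
    node = manifest
    for key in dotted_path.split("."):
        node = node[key]
    return node


def verify_reproduction(produced, expected):
    """Return a list of mismatch messages; an empty list means reproduced."""
    problems = []
    # A clean-tree run should record git.dirty as False.
    if lookup(produced, "git.dirty"):
        problems.append("git.dirty: working tree was not clean")
    for path in COMPARED_FIELDS:
        got, want = lookup(produced, path), expected[path]
        if got != want:
            problems.append(f"{path}: expected {want!r}, got {got!r}")
    return problems
```

Returning every mismatch rather than stopping at the first makes a failed reproduction easier to diagnose, since a changed config and a changed dataset show up together.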

Current status

Active. Schema-v2 manifest contract closed in the reproducibility-protocol release, then extended with per-task NLI profile fields (backbone, score formula, threshold, label-index map, premise→hypothesis direction). The upcoming research/metric-robustness round runs the full factorial — multiple NLI backbones × score formulas × thresholds × seeds — with paired bootstrap CIs, a length covariate, and a failure taxonomy as a versioned data product.

Repo

github.com/GioiaZheng/msmarco-genqa