
System: BM25 → dense (MiniLM L6) → cross-encoder rerank → T5-small generation
Dataset: MS MARCO passage, 8.8 M docs, 6 980 paired queries (dev/small)
Main result: Reranking lifts Token-F1 from 0.197 to 0.368 (Δ +0.171, 95 % CI [+0.163, +0.178], paired bootstrap)
Reproducibility: Schema-v2 manifest contract; make reproduce-baseline runs from a clean clone
Status: Active; CPU-only, single-machine, 371 tests passing

Problem

Surface generation metrics (Token-F1, ROUGE-L) on retrieve → rerank → generate pipelines do not, by themselves, show whether the generator's output is supported by the retrieved evidence. With the generator held fixed, reranking the retrieved passages can raise surface metrics and reduce grounding at the same time. Quantifying this requires a pipeline in which every component is exchangeable, every run is reproducible from a recorded manifest, and statistical comparisons are paired and confidence-bounded.

What I built

Four entry-point scripts under experiments/, one per pipeline stage, share a single source-of-truth config and write a schema-versioned manifest.json next to every metrics.json. The contract is enforced at write time: a run that fails to capture the six required reproducibility fields cannot write an artifact.
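A write-time contract of this kind can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names (lookup, missing_fields, write_run) are hypothetical, and the only details taken from this page are the six dotted field names listed under "Manifest contract" and the two file names.

```python
import json
from pathlib import Path

# The six required fields, as dotted paths into the manifest dictionary.
REQUIRED_FIELDS = (
    "git.commit",
    "git.dirty",
    "extra.seed",
    "extra.resolved_config_hash",
    "extra.data_fingerprint",
    "extra.env_fingerprint",
)

_MISSING = object()  # sentinel, so that False and 0 still count as present


def lookup(manifest, dotted):
    """Follow a dotted path through nested dicts; return _MISSING if absent."""
    node = manifest
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def missing_fields(manifest):
    """Return the required fields that are absent or None."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = lookup(manifest, field)
        if value is _MISSING or value is None:
            missing.append(field)
    return missing


def write_run(out_dir, metrics, manifest):
    """Write metrics.json and manifest.json only if the contract holds."""
    problems = missing_fields(manifest)
    if problems:
        # Fail before touching the filesystem, so no partial artifact appears.
        raise ValueError(f"manifest is missing required fields: {problems}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
```

Validating before any file is created is the design choice that matters here: a failed run leaves the output directory untouched, so a metrics.json can never exist without a complete manifest beside it.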

Headline result

Generation × retrieval source, full MS MARCO dev/small, 6 980 paired queries, paired bootstrap (N = 10 000).

Token-F1: 0.368 (Δ +0.171 vs 0.197, CI [+0.163, +0.178])
ROUGE-L: 0.368 (Δ +0.174 vs 0.193, CI [+0.166, +0.181])
BLEU: Δ +0.221 vs BM25 (CI strictly > 0)
Exact Match: Δ +0.047 vs BM25 (CI strictly > 0)
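For readers unfamiliar with the method, a paired bootstrap over per-query scores can be sketched as below. This is an illustration under stated assumptions: it uses a percentile interval and the standard-library random module, and the function name paired_bootstrap_ci is hypothetical. The page states only that the bootstrap is paired, uses 10 000 resamples, and is seeded with 42.

```python
import random


def paired_bootstrap_ci(baseline, treatment, n_resamples=10_000, seed=42, alpha=0.05):
    """Percentile CI for the mean per-query difference (treatment - baseline).

    baseline[i] and treatment[i] must be scores for the same query, so that
    resampling query indices keeps each pair together.
    """
    if len(baseline) != len(treatment):
        raise ValueError("baseline and treatment must cover the same queries")
    if not baseline or n_resamples < 1:
        raise ValueError("need at least one query and one resample")

    # Pairing: work on per-query differences rather than two separate samples.
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    rng = random.Random(seed)

    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # draw n queries with replacement
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]
```

Called as paired_bootstrap_ci(bm25_scores, reranked_scores), where both arguments are hypothetical lists of per-query metric values in the same query order, it returns the observed mean difference and the lower and upper interval bounds. A "CI strictly > 0" claim corresponds to a lower bound above zero.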

Retrieval-only, BM25 on the full 8.8 M-passage corpus (6 980 queries):

MRR@10: 0.170 (BM25, full corpus)
Recall@100: 0.621 (BM25, full corpus)
Recall@1000: 0.815 (BM25, full corpus)
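The retrieval metrics reported here can be computed along the following lines. This sketch assumes the conventional definitions (reciprocal rank of the first relevant passage within the top k, and the fraction of relevant passages found within the top k); the function names and the run/qrels dictionary layout are illustrative and not taken from the repository.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant passage in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_over_queries(metric, run, qrels, k):
    """Average a per-query metric over every query that has judgments.

    run:   query id -> ranked list of passage ids (best first)
    qrels: query id -> set of relevant passage ids
    Queries missing from the run score 0 rather than being skipped.
    """
    scores = [metric(run.get(qid, []), qrels[qid], k) for qid in qrels]
    return sum(scores) / len(scores)
```

For example, mean_over_queries(reciprocal_rank_at_k, run, qrels, 10) gives MRR@10, and the same call with recall_at_k and k = 100 or 1000 gives the two recall figures.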

Technical components

Sparse retrieval: BM25, k₁ = 1.5, b = 0.75, top-k = 1000
Dense retrieval: sentence-transformers/all-MiniLM-L6-v2 + FAISS flat IP, qrels-anchored 50 000-passage sample
Reranking: cross-encoder/ms-marco-MiniLM-L-6-v2 over dense top-100
Generation: t5-small, frozen, max_new_tokens = 64
Retrieval metrics: MRR@k, Recall@k, nDCG@k
Generation metrics: Token-F1, Exact-Match, ROUGE-L, BLEU (best-of-N references)
Grounding: Lexical content-token, 3-gram, NLI entailment via cross-encoder/nli-deberta-v3-small
Statistical core: Paired bootstrap, N = 10 000 resamples, seed 42
Manifest contract: Required at write time: git.commit, git.dirty, extra.seed, extra.resolved_config_hash, extra.data_fingerprint, extra.env_fingerprint
CI: GitHub Actions, Python 3.10, CPU-only torch; pytest -q and ruff check run on every push
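As an illustration of the "best-of-N references" scoring named under generation metrics, Token-F1 against multiple references might look like the sketch below. The tokenization (lowercasing and splitting on word characters) is an assumption, since the page does not specify the normalization used, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed normalization: lowercase, keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, with multiset overlap."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Two empty strings agree; one empty string against text does not.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_of_n(metric, prediction, references):
    """Score against every reference and keep the highest value."""
    return max(metric(prediction, ref) for ref in references)
```

Taking the maximum over references means a prediction is not penalized for matching one valid answer phrasing rather than another, which matters on MS MARCO, where a query can have several reference answers.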

Reproduce

Clean clone, CPU-only laptop:

make install
make reproduce-baseline

make reproduce-baseline runs the BM25 stage under a clean-tree checkpoint and then validates the produced manifest against committed fingerprints via scripts/verify_reproduction.py.
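The validation step can be pictured roughly as follows. This is a hypothetical sketch, not the contents of scripts/verify_reproduction.py: the function names, the use of SHA-256 over canonical JSON for the config hash, and the layout of the committed fingerprints are all assumptions made for illustration.

```python
import hashlib
import json


def config_hash(resolved_config):
    """Hash a resolved config in a way that ignores key order (assumed scheme)."""
    canonical = json.dumps(resolved_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fingerprint_mismatches(manifest, committed):
    """Compare a produced manifest against committed expectations.

    committed maps field names under manifest["extra"] to expected values.
    Returns {field: (expected, actual)} for every field that differs.
    """
    extra = manifest.get("extra", {})
    return {
        field: (expected, extra.get(field))
        for field, expected in committed.items()
        if extra.get(field) != expected
    }


def verify(manifest, committed):
    """Raise if any committed fingerprint is not reproduced."""
    mismatches = fingerprint_mismatches(manifest, committed)
    if mismatches:
        raise SystemExit(f"reproduction failed: {mismatches}")
```

Hashing a canonical serialization rather than the raw config file is one way to keep the fingerprint stable when keys are reordered or whitespace changes.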

Current status

Active. The schema-v2 manifest contract was closed in the reproducibility-protocol release, then extended with per-task NLI profile fields (backbone, score formula, threshold, label-index map, premise→hypothesis direction). The upcoming research/metric-robustness round runs the full factorial (multiple NLI backbones × score formulas × thresholds × seeds) with paired bootstrap CIs, a length covariate, and a failure taxonomy as a versioned data product.

Repo

github.com/GioiaZheng/msmarco-genqa