Retrieval systems · evaluation · automation

Gioia Zheng

Information Retrieval · RAG Evaluation · Reproducible ML · AI Infrastructure

B.Sc. student at Sapienza University of Rome building reproducible retrieve → rerank → generate systems and the evaluation machinery around large-scale QA pipelines.

Featured project · All projects
msmarco-genqa
active

Reproducible retrieve → rerank → generate pipeline on MS MARCO with paired-bootstrap evaluation and a manifest-enforced reproducibility contract.

retrieval · rag · evaluation · reproducibility
Token-F1 Δ +0.171
95% CI [+0.163, +0.178]
Queries 6,980 paired
updated 2026-05-29 · View project
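
For readers unfamiliar with the figures on the card, here is a minimal sketch of how a paired-bootstrap confidence interval on a Token-F1 delta can be computed. It is an illustration under stated assumptions, not the project's actual code: the function names `token_f1` and `paired_bootstrap_ci`, the whitespace tokenization, the percentile method, and the resample count are all hypothetical choices.

```python
import random
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings (assumes lowercase whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(baseline, system, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-query delta (system - baseline) with a percentile bootstrap CI.

    Both lists hold one score per query, in the same query order. Resampling
    the per-query deltas keeps each query's two scores paired.
    """
    if len(baseline) != len(system) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [s - b for b, s in zip(baseline, system)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=n)  # resample queries with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(deltas) / n, lower, upper
```

Given per-query scores for two pipeline variants, `paired_bootstrap_ci(baseline_scores, system_scores)` returns the mean delta and the bounds of the interval. Pairing matters because both variants answer the same queries, so resampling queries rather than individual scores removes per-query difficulty from the comparison.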
Recent writing
Failure analysis in retrieval-augmented generation
2026-05-28

A single aggregate score is the wrong unit for evaluating a RAG pipeline. Reporting per-category failure rates exposes regressions that aggregates hide.

Read note
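
The note's central claim can be illustrated with a small sketch. Everything here is assumed for illustration: the category names, the counts, and the helper names `failure_rates` and `aggregate_failure_rate` are hypothetical and are not taken from the note or the project.

```python
from collections import defaultdict


def failure_rates(records):
    """Per-category failure rate from (category, failed) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, failed in records:
        totals[category] += 1
        failures[category] += int(failed)
    return {category: failures[category] / totals[category] for category in totals}


def aggregate_failure_rate(records):
    """Single failure rate over all records, ignoring category."""
    return sum(int(failed) for _, failed in records) / len(records)


def make_records(counts):
    """Expand {category: (n_failed, n_total)} into (category, failed) pairs."""
    records = []
    for category, (n_failed, n_total) in counts.items():
        records += [(category, True)] * n_failed
        records += [(category, False)] * (n_total - n_failed)
    return records


# Hypothetical counts: two query categories, 100 queries each.
before = make_records({"numeric": (20, 100), "entity": (20, 100)})
after = make_records({"numeric": (10, 100), "entity": (30, 100)})

for label, records in (("before", before), ("after", after)):
    print(label, aggregate_failure_rate(records), failure_rates(records))
```

In this made-up data the aggregate failure rate is identical before and after the change, while the per-category view shows one category improving and the other regressing by the same amount.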