Short technical notes. New notes land alongside the experimental round that produced them.
A single aggregate score is the wrong unit for evaluating a RAG pipeline. Per-category failure rates expose regressions that an aggregate hides.
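
A minimal sketch of the point, assuming each evaluation result is a (category, passed) pair. The function name `failure_rates`, the category labels `factoid` and `multi_hop`, and the pass/fail counts are all hypothetical and chosen for illustration; they are not measurements from any real run.

```python
from collections import defaultdict


def failure_rates(results):
    """Return (aggregate_failure_rate, {category: failure_rate}).

    `results` is an iterable of (category, passed) pairs.
    """
    totals = defaultdict(int)
    fails = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if not passed:
            fails[category] += 1
    n = sum(totals.values())
    aggregate = sum(fails.values()) / n if n else 0.0
    per_category = {c: fails[c] / totals[c] for c in totals}
    return aggregate, per_category


# Illustrative data only: 10 queries per category in each run.
baseline = (
    [("factoid", True)] * 8 + [("factoid", False)] * 2
    + [("multi_hop", True)] * 8 + [("multi_hop", False)] * 2
)
candidate = (
    [("factoid", True)] * 10
    + [("multi_hop", True)] * 6 + [("multi_hop", False)] * 4
)

for name, run in (("baseline", baseline), ("candidate", candidate)):
    aggregate, per_category = failure_rates(run)
    print(name, aggregate, per_category)
```

In this made-up data both runs fail 4 of 20 queries, so the aggregate failure rate is the same, while the `multi_hop` failure rate doubles from 0.2 to 0.4 and the `factoid` improvement masks it.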