Add rag_luciole citation-aware grounded QA benchmark#7
Open
Matteovanypersele wants to merge 4 commits into
Conversation
litellm.completion expects an int, not a (N,) tuple.
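The fix can be sketched as a small coercion before the litellm call; the helper name and exact shape are hypothetical, illustrating the kind of unwrapping the patch applies:

```python
def coerce_max_tokens(value):
    """litellm.completion expects max_tokens as a plain int, so unwrap a
    one-element tuple like (512,) before passing it through.
    Hypothetical helper illustrating the shape of the fix."""
    if value is None:
        return None
    if isinstance(value, (tuple, list)):
        value = value[0]
    return int(value)
```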
Current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. Opt-in flag keeps all existing tasks unchanged.
squad_v2 was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. Replace the filter with an explicit "unanswerable" choice.
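The change amounts to mapping answerless rows to an explicit refusal choice instead of dropping them; a minimal sketch with hypothetical names, not the task's actual code:

```python
def gold_answers(row, unanswerable_label="unanswerable"):
    """Return the gold answers for a squad_v2-style row.
    Instead of filtering out rows with no answer (the half of the dataset
    that tests refusal behavior), point them at an explicit refusal choice.
    Hypothetical helper; the real task code may differ."""
    answers = row.get("answers", {}).get("text", [])
    return answers if answers else [unanswerable_label]
```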
7 subsets (hotpotqa, hotpotqa_fr, tatqa, tatqav2, piaf, newsquadfr,
squad2_fr_pragnakalp). Rows with empty supporting_facts_titles are
unanswerable: refusing is the correct behavior.
Per-sample metrics (answerability-conditional):
- answer_em, answer_em_fuzzy: substring / token-recall match of the
reference answer in the model response
- citation_{precision,recall,f1}_{strict,fuzzy}: citation accuracy
against gold supporting-facts titles, with strict and fuzzy
(substring) title matching
- distractor_citation_rate: fraction of cited titles that are not gold
- refusal_recall, refusal_precision, false_refusal_rate: refusal
behavior on unanswerable vs answerable rows
Optional LLM-as-judge (RAG_LUCIOLE_USE_JUDGE=1):
- factual_judge_accuracy_ge_5, factual_judge_accuracy_gt_4: 1-5
faithfulness rating on answerable rows, thresholded
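The citation metrics above can be sketched as follows: strict matching compares normalized titles exactly, while fuzzy matching accepts substring containment. This is an illustrative implementation, not the benchmark's exact code:

```python
def citation_scores(cited, gold, fuzzy=False):
    """Citation precision/recall/F1 against gold supporting-facts titles.
    Strict: normalized titles must match exactly.
    Fuzzy: a cited title counts if it contains, or is contained in, a gold title.
    Hypothetical sketch of the metric definitions described in the PR."""
    def match(c, g):
        c, g = c.lower().strip(), g.lower().strip()
        return (c in g or g in c) if fuzzy else c == g

    tp = sum(any(match(c, g) for g in gold) for c in cited)
    precision = tp / len(cited) if cited else 0.0
    recall = (sum(any(match(c, g) for c in cited) for g in gold) / len(gold)
              if gold else 0.0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this sketch, distractor_citation_rate would simply be `1 - precision` under strict matching.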
Force-pushed from 4ffcaf2 to dc1af87
Summary
This PR adds rag_luciole, a community benchmark for grounded QA with citations. The benchmark evaluates whether a model can answer from retrieved documents, cite the supporting sources, and refuse to answer when the documents do not contain enough information. It covers the luciole_rag SFT datasets.
Rows with an empty supporting_facts_titles list are treated as unanswerable. For those examples, refusing to answer is the expected behavior.
Metrics
For answerable examples, the benchmark reports:
For unanswerable examples, it reports refusal behavior:
An optional LLM-as-judge path can also be enabled with RAG_LUCIOLE_USE_JUDGE=1 to score factual grounding on answerable examples (the judge is given the correct chunks and the gold answer). Scores and standard errors are computed on each metric's applicable subset only.
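Restricting each metric to its applicable subset amounts to averaging over non-missing values only; a minimal sketch, assuming rows outside a metric's subset are marked with None:

```python
import math

def subset_score(values):
    """Mean and standard error over only the rows where the metric applies
    (values holds None for rows outside the metric's subset).
    Hypothetical helper; the actual aggregation lives in the harness."""
    applicable = [v for v in values if v is not None]
    n = len(applicable)
    if n == 0:
        return None, None
    mean = sum(applicable) / n
    var = (sum((v - mean) ** 2 for v in applicable) / (n - 1)) if n > 1 else 0.0
    return mean, math.sqrt(var / n)
```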
Supporting changes
This PR also includes a small prompt-manager change that lets a task request that its instruction be emitted as a system message through Doc.specific["instruction_as_system"]. rag_luciole needs this because the retrieved documents are provided in the system message, matching the SFT format. It also fixes a small max_tokens issue in the JudgeLM litellm path, which surfaced while testing the optional factual judge.
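A minimal sketch of what the opt-in flag changes in message construction (the Doc.specific key is from the PR; the surrounding function and names are illustrative):

```python
def build_messages(user_query, instruction, instruction_as_system=False):
    """Build chat messages for a task. With instruction_as_system=True,
    the instruction (carrying the row-specific retrieved documents) goes
    in the system role, matching the luciole_rag SFT format; otherwise it
    is prepended to the user query, as for all existing tasks.
    Illustrative sketch, not the prompt manager's actual code."""
    if instruction_as_system:
        return [
            {"role": "system", "content": instruction},
            {"role": "user", "content": user_query},
        ]
    return [{"role": "user", "content": f"{instruction}\n\n{user_query}"}]
```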