
Add rag_luciole citation-aware grounded QA benchmark#7

Open
Matteovanypersele wants to merge 4 commits into OpenLLM-France:main from Matteovanypersele:add-rag-luciole

Conversation

Matteovanypersele commented Apr 29, 2026

Summary

This PR adds rag_luciole, a community benchmark for grounded QA with citations.

The benchmark evaluates whether a model can answer from retrieved documents, cite the supporting sources, and refuse to answer when the documents do not contain enough information. It covers the luciole_rag SFT datasets.

Rows with an empty supporting_facts_titles list are treated as unanswerable. For those examples, refusing to answer is the expected behavior.
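A minimal sketch of that rule, for concreteness (the helper name is mine; only the supporting_facts_titles field comes from the dataset):

```python
# Hypothetical helper illustrating the unanswerability rule above.
def is_answerable(row: dict) -> bool:
    """A row is answerable iff supporting_facts_titles is non-empty."""
    return bool(row.get("supporting_facts_titles"))

assert is_answerable({"supporting_facts_titles": ["Doc A"]})
assert not is_answerable({"supporting_facts_titles": []})
```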

Metrics

For answerable examples, the benchmark reports:

  • exact and fuzzy answer matching (gold answer extracted from the gold assistant generation)
  • citation precision, recall, and F1, with both strict and fuzzy title matching (see the sketch after this list)
  • distractor citation rate
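A sketch of the strict citation metrics per row (function and variable names are mine; the benchmark's title normalization may differ, and the fuzzy variants use substring matching on titles instead of set equality):

```python
# Illustrative per-row strict citation metrics and distractor rate.
def citation_scores(cited: set[str], gold: set[str]) -> dict[str, float]:
    tp = len(cited & gold)
    precision = tp / len(cited) if cited else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    distractor_rate = len(cited - gold) / len(cited) if cited else 0.0
    return {
        "citation_precision_strict": precision,
        "citation_recall_strict": recall,
        "citation_f1_strict": f1,
        "distractor_citation_rate": distractor_rate,
    }
```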

For unanswerable examples, it reports refusal behavior:

  • refusal recall
  • refusal precision
  • false refusal rate
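These can be read as corpus-level ratios over per-row refusal flags; a sketch of that reading (the aggregation is my interpretation of the metric names, not the benchmark's code):

```python
# Refusal behavior as ratios over per-row booleans; names are mine.
def refusal_scores(refused: list[bool], unanswerable: list[bool]) -> dict[str, float]:
    true_ref = sum(r and u for r, u in zip(refused, unanswerable))
    false_ref = sum(r and not u for r, u in zip(refused, unanswerable))
    n_unans = sum(unanswerable)
    n_ans = len(unanswerable) - n_unans
    n_ref = sum(refused)
    return {
        "refusal_recall": true_ref / n_unans if n_unans else 0.0,
        "refusal_precision": true_ref / n_ref if n_ref else 0.0,
        "false_refusal_rate": false_ref / n_ans if n_ans else 0.0,
    }
```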

An optional LLM-as-judge path can also be enabled with RAG_LUCIOLE_USE_JUDGE=1 to score factual grounding on answerable examples (the judge is given the correct chunks and the gold answer).

Scores and standard errors are computed on each metric's applicable subset only (for example, citation F1 over answerable rows and refusal recall over unanswerable rows).
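A small sketch of that subset-conditional aggregation (the helper is mine):

```python
import math

def mean_and_sem(values: list[float]) -> tuple[float, float]:
    """Mean and standard error over a metric's applicable subset only."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return mean, math.sqrt(var / n)
```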

Supporting changes

This PR also includes a small prompt-manager change that lets a task request that its instruction be emitted as a system message via Doc.specific["instruction_as_system"]. rag_luciole needs this because the retrieved documents are provided in the system message, matching the SFT format.
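A sketch of what the opt-in could look like from a task's prompt function, assuming a lighteval-style Doc with a free-form specific dict (everything here except the instruction_as_system key is illustrative):

```python
from lighteval.tasks.requests import Doc  # import path may differ

def rag_luciole_prompt(line: dict, task_name: str) -> Doc:
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[line["answer"]],
        gold_index=0,
        # Row-specific retrieved documents; the opt-in flag below asks the
        # prompt manager to emit them as the system message (SFT format).
        instruction=line["retrieved_documents"],
        specific={"instruction_as_system": True},
    )
```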

It also fixes a small max_tokens issue in the JudgeLM litellm path, surfaced while testing the optional factual judge: litellm.completion expects an int for max_tokens, not a (N,) tuple.
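The rough shape of the fix (variable names are illustrative):

```python
# Before: max_tokens arrived as a length-1 sequence, e.g. (512,),
# which litellm.completion rejects. Unwrap to a plain int first.
max_new_tokens = (512,)
max_tokens = (
    int(max_new_tokens[0])
    if isinstance(max_new_tokens, (tuple, list))
    else int(max_new_tokens)
)
```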
Per-commit notes:

  • Prompt manager: current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. The opt-in flag keeps all existing tasks unchanged.
  • squad_v2: the task was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. The filter is replaced with an explicit "unanswerable" choice.
  • rag_luciole: 7 subsets (hotpotqa, hotpotqa_fr, tatqa, tatqav2, piaf, newsquadfr, squad2_fr_pragnakalp). Rows with empty supporting_facts_titles are unanswerable: refusing is the correct behavior.

Per-sample metrics (answerability-conditional):
- answer_em, answer_em_fuzzy: substring / token-recall match of the
  reference answer in the model response
- citation_{precision,recall,f1}_{strict,fuzzy}: citation accuracy
  against gold supporting-facts titles, with strict and fuzzy
  (substring) title matching
- distractor_citation_rate: fraction of cited titles that are not gold
- refusal_recall, refusal_precision, false_refusal_rate: refusal
  behavior on unanswerable vs answerable rows

Optional LLM-as-judge (RAG_LUCIOLE_USE_JUDGE=1):
- factual_judge_accuracy_ge_5, factual_judge_accuracy_gt_4: 1-5
  faithfulness rating on answerable rows, thresholded
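A sketch of the thresholding on the judge's 1-5 rating (metric names from above; how the rating is extracted from the judge output is out of scope here):

```python
# Threshold a judge rating into the two accuracy metrics listed above.
def judge_thresholds(rating: float) -> dict[str, int]:
    return {
        "factual_judge_accuracy_ge_5": int(rating >= 5),
        "factual_judge_accuracy_gt_4": int(rating > 4),
    }

assert judge_thresholds(5) == {"factual_judge_accuracy_ge_5": 1, "factual_judge_accuracy_gt_4": 1}
assert judge_thresholds(4)["factual_judge_accuracy_ge_5"] == 0
```

On integer ratings the two thresholds coincide; they would differ only if the rating can be fractional, e.g. averaged over multiple judge calls.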
