
Add rag_luciole citation-aware grounded QA benchmark#7

Open
Matteovanypersele wants to merge 4 commits into OpenLLM-France:main from Matteovanypersele:add-rag-luciole

Conversation

Matteovanypersele commented Apr 29, 2026

Summary

This PR adds rag_luciole, a community benchmark for grounded QA with citations.

The benchmark evaluates whether a model can answer from retrieved documents, cite the supporting sources, and refuse to answer when the documents do not contain enough information. It covers the luciole_rag SFT datasets.

Rows with an empty supporting_facts_titles list are treated as unanswerable. For those examples, refusing to answer is the expected behavior.
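A minimal sketch of that rule, for concreteness (the helper name is mine; only the supporting_facts_titles field comes from the dataset):

```python
# Hypothetical helper illustrating the unanswerability rule above.
def is_answerable(row: dict) -> bool:
    """A row is answerable iff supporting_facts_titles is non-empty."""
    return bool(row.get("supporting_facts_titles"))

assert is_answerable({"supporting_facts_titles": ["Doc A"]})
assert not is_answerable({"supporting_facts_titles": []})
```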

Metrics

For answerable examples, the benchmark reports:

  • exact and fuzzy answer matching (gold answer extracted from the gold assistant generation)
  • citation precision, recall, and F1, with both strict and fuzzy title matching (see the sketch after this list)
  • distractor citation rate
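A sketch of the strict citation metrics per row (function and variable names are mine; the benchmark's title normalization may differ, and the fuzzy variants use substring matching on titles instead of set equality):

```python
# Illustrative per-row strict citation metrics and distractor rate.
def citation_scores(cited: set[str], gold: set[str]) -> dict[str, float]:
    tp = len(cited & gold)
    precision = tp / len(cited) if cited else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    distractor_rate = len(cited - gold) / len(cited) if cited else 0.0
    return {
        "citation_precision_strict": precision,
        "citation_recall_strict": recall,
        "citation_f1_strict": f1,
        "distractor_citation_rate": distractor_rate,
    }
```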

For unanswerable examples, it reports refusal behavior:

  • refusal recall
  • refusal precision
  • false refusal rate
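These can be read as corpus-level ratios over per-row refusal flags; a sketch of that reading (the aggregation is my interpretation of the metric names, not the benchmark's code):

```python
# Refusal behavior as ratios over per-row booleans; names are mine.
def refusal_scores(refused: list[bool], unanswerable: list[bool]) -> dict[str, float]:
    true_ref = sum(r and u for r, u in zip(refused, unanswerable))
    false_ref = sum(r and not u for r, u in zip(refused, unanswerable))
    n_unans = sum(unanswerable)
    n_ans = len(unanswerable) - n_unans
    n_ref = sum(refused)
    return {
        "refusal_recall": true_ref / n_unans if n_unans else 0.0,
        "refusal_precision": true_ref / n_ref if n_ref else 0.0,
        "false_refusal_rate": false_ref / n_ans if n_ans else 0.0,
    }
```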

An optional LLM-as-judge path can also be enabled with RAG_LUCIOLE_USE_JUDGE=1 to score factual grounding on answerable examples (the judge is given the correct chunks and the gold answer).

Scores and standard errors are computed on each metric's applicable subset only (for example, citation F1 over answerable rows and refusal recall over unanswerable rows).
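A small sketch of that subset-conditional aggregation (the helper is mine):

```python
import math

def mean_and_sem(values: list[float]) -> tuple[float, float]:
    """Mean and standard error over a metric's applicable subset only."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return mean, math.sqrt(var / n)
```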

Supporting changes

This PR also includes a small prompt-manager change that lets a task request that its instruction be emitted as a system message via Doc.specific["instruction_as_system"]. rag_luciole needs this because the retrieved documents are provided in the system message, matching the SFT format.
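A sketch of what the opt-in could look like from a task's prompt function, assuming a lighteval-style Doc with a free-form specific dict (everything here except the instruction_as_system key is illustrative):

```python
from lighteval.tasks.requests import Doc  # import path may differ

def rag_luciole_prompt(line: dict, task_name: str) -> Doc:
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[line["answer"]],
        gold_index=0,
        # Row-specific retrieved documents; the opt-in flag below asks the
        # prompt manager to emit them as the system message (SFT format).
        instruction=line["retrieved_documents"],
        specific={"instruction_as_system": True},
    )
```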

It also fixes a small max_tokens issue in the JudgeLM litellm path, surfaced while testing the optional factual judge: litellm.completion expects an int for max_tokens, not a (N,) tuple.
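The rough shape of the fix (variable names are illustrative):

```python
# Before: max_tokens arrived as a length-1 sequence, e.g. (512,),
# which litellm.completion rejects. Unwrap to a plain int first.
max_new_tokens = (512,)
max_tokens = (
    int(max_new_tokens[0])
    if isinstance(max_new_tokens, (tuple, list))
    else int(max_new_tokens)
)
```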
Per-commit notes:

  • Prompt manager: current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. The opt-in flag keeps all existing tasks unchanged.
  • squad_v2: the task was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. The filter is replaced with an explicit "unanswerable" choice.
  • rag_luciole: 7 subsets (hotpotqa, hotpotqa_fr, tatqa, tatqav2, piaf, newsquadfr, squad2_fr_pragnakalp). Rows with empty supporting_facts_titles are unanswerable: refusing is the correct behavior.

Per-sample metrics (answerability-conditional):
- answer_em, answer_em_fuzzy: substring / token-recall match of the
  reference answer in the model response
- citation_{precision,recall,f1}_{strict,fuzzy}: citation accuracy
  against gold supporting-facts titles, with strict and fuzzy
  (substring) title matching
- distractor_citation_rate: fraction of cited titles that are not gold
- refusal_recall, refusal_precision, false_refusal_rate: refusal
  behavior on unanswerable vs answerable rows

Optional LLM-as-judge (RAG_LUCIOLE_USE_JUDGE=1):
- factual_judge_accuracy_ge_5, factual_judge_accuracy_gt_4: 1-5
  faithfulness rating on answerable rows, thresholded
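A sketch of the thresholding on the judge's 1-5 rating (metric names from above; how the rating is extracted from the judge output is out of scope here):

```python
# Threshold a judge rating into the two accuracy metrics listed above.
def judge_thresholds(rating: float) -> dict[str, int]:
    return {
        "factual_judge_accuracy_ge_5": int(rating >= 5),
        "factual_judge_accuracy_gt_4": int(rating > 4),
    }

assert judge_thresholds(5) == {"factual_judge_accuracy_ge_5": 1, "factual_judge_accuracy_gt_4": 1}
assert judge_thresholds(4)["factual_judge_accuracy_ge_5"] == 0
```

On integer ratings the two thresholds coincide; they would differ only if the rating can be fractional, e.g. averaged over multiple judge calls.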
