fix: transpose references before passing to sacrebleu in CorpusLevelTranslationMetric by jaydenC88 · Pull Request #1234 · huggingface/lighteval

jaydenC88 · 2026-05-08T01:04:33Z

Summary

CorpusLevelTranslationMetric.compute_corpus() was passing references to sacrebleu in the wrong shape. References were collected as [sent_id][ref_id], but sacrebleu's corpus_score expects [ref_id][sent_id]. Because sacrebleu internally does zip(*references) to transpose the streams, the wrong orientation meant only the first hypothesis was ever scored, and it was matched against the references of all sentences pooled together. This inflated scores (e.g. 100.0 for a half-correct corpus).
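To make the failure mode concrete, here is a minimal pure-Python sketch of the pairing step. It does not call sacrebleu; it only mimics the zip(*references) transpose described above, and all variable names are illustrative rather than taken from the lighteval or sacrebleu source.

```python
# Hypothetical data mirroring the reproducer: one reference per sentence.
hypotheses = ["GOOD", "PRED2"]
golds = [["GOOD"], ["REF2"]]  # collected as [sent_id][ref_id]

# Buggy orientation: transposing the per-sentence list yields a single
# group that pools the references of both sentences.
wrong_groups = list(zip(*golds))
wrong_pairs = list(zip(hypotheses, wrong_groups))

# Expected orientation: one stream per reference index, [ref_id][sent_id].
ref_streams = [["GOOD", "REF2"]]
right_groups = list(zip(*ref_streams))
right_pairs = list(zip(hypotheses, right_groups))

print(len(wrong_pairs), wrong_pairs)
print(len(right_pairs), right_pairs)
```

With the per-sentence orientation only one (hypothesis, references) pair survives the zip, which is consistent with the stats=1 observation in the reproducer; with the per-stream orientation both hypotheses are paired with their own reference.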

Changes:

src/lighteval/metrics/metrics_corpus.py: add from itertools import zip_longest and replace the per-metric special-case logic with a single zip_longest(*golds) transpose applied uniformly before corpus_score(). This fixes chrF, chrF++, and TER, and also lets BLEU use all available references instead of silently dropping to gold[0].
tests/unit/metrics/test_cases/chrf.json, chrf_plus.json, ter.json: update expected values that were computed with the buggy code.
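The transpose itself can be sketched as follows. This is an illustration of the zip_longest(*golds) idea, not the exact code in the diff: transpose_references is a hypothetical helper name, and the assumption that None is an acceptable filler for a missing reference should be checked against the sacrebleu version in use.

```python
from itertools import zip_longest


def transpose_references(golds):
    """Turn [sent_id][ref_id] into [ref_id][sent_id].

    Sentences with fewer references than the maximum are padded with
    None (assumed here to mean "no reference in this stream").
    """
    return [list(stream) for stream in zip_longest(*golds)]


# Hypothetical input: the first sentence has two references, the second has one.
golds = [["a1", "a2"], ["b1"]]
streams = transpose_references(golds)
print(streams)
```

Using zip_longest rather than zip keeps the second reference of the first sentence; a plain zip would truncate every sentence to the smallest reference count.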

Reproducer (from issue)

from lighteval.metrics.metrics_corpus import CorpusLevelTranslationMetric
from lighteval.metrics.sample_preparator import GenerativeCorpusMetricInput
from lighteval.utils.utils import as_list

items = [
    GenerativeCorpusMetricInput(golds=["GOOD"], preds=["GOOD"]),
    GenerativeCorpusMetricInput(golds=["REF2"], preds=["PRED2"]),
]
metric = CorpusLevelTranslationMetric("chrf++")

# Before fix: stats=1, score=100.0 (only first hyp scored against all refs)
# After fix:  stats=2, score=58.7  (each hyp scored against its own ref)
print(metric.compute_corpus(items))

Test plan

Reproducer from issue confirms len(stats) == len(hypotheses) after fix
All 8 chrf / chrf++ / ter / bleu test cases pass (pytest tests/unit/metrics/test_automated_metrics_pytest.py -k "chrf or ter or bleu")

…ranslationMetric

sacrebleu expects references in [ref_id][sent_id] shape, but compute_corpus() was collecting golds as [sent_id][ref_id]. This caused sacrebleu's internal zip(*references) to transpose the already-per-sample list, resulting in only the first hypothesis being scored against a pooled set of all references.

Fixed by transposing golds with zip_longest(*golds) before calling corpus_score(), so each inner list is one reference stream across all hypotheses. Also removes the special-case BLEU branch (gold[0]-only), allowing BLEU to benefit from multi-reference inputs the same way.

Updated chrf, chrf++, and TER test fixtures to reflect the now-correct scores.

Fixes huggingface#1112

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jaydenC88 wants to merge 1 commit into huggingface:main from jaydenC88:fix/corpus-translation-metrics-reference-orientation
