fix: transpose references before passing to sacrebleu in CorpusLevelTranslationMetric #1234

Open
jaydenC88 wants to merge 1 commit into huggingface:main from jaydenC88:fix/corpus-translation-metrics-reference-orientation
@jaydenC88
Summary

Fixes #1112

CorpusLevelTranslationMetric.compute_corpus() was passing references to sacrebleu in the wrong shape. References were collected as [sent_id][ref_id], but sacrebleu's corpus_score expects [ref_id][sent_id]. Because sacrebleu internally does zip(*references) to transpose the streams, the wrong orientation meant only the first hypothesis was ever scored, and it was matched against the references of all sentences pooled together. The result was inflated scores (e.g. 100.0 for a half-correct corpus).
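The truncation described above can be reproduced without sacrebleu. The sketch below is a simplified model of the pairing step, not sacrebleu's actual code: `pair_up` is a hypothetical helper that transposes the references with zip(*references) and then zips the result with the hypotheses.

```python
def pair_up(hypotheses, references):
    """Hypothetical stand-in for the pairing step: transpose the reference
    streams with zip(*references), then pair each hypothesis with its refs."""
    return list(zip(hypotheses, zip(*references)))


hypotheses = ["GOOD", "PRED2"]

# Shape collected before the fix: [sent_id][ref_id]
per_sentence = [["GOOD"], ["REF2"]]
# Shape sacrebleu expects: [ref_id][sent_id]
per_stream = [["GOOD", "REF2"]]

wrong = pair_up(hypotheses, per_sentence)
right = pair_up(hypotheses, per_stream)

# Wrong shape: zip stops at the shortest input, so one pair survives,
# and the first hypothesis sees the references of both sentences.
print(wrong)  # [('GOOD', ('GOOD', 'REF2'))]

# Right shape: one pair per hypothesis, each with its own reference.
print(right)  # [('GOOD', ('GOOD',)), ('PRED2', ('REF2',))]
```

With the wrong orientation the second hypothesis never reaches the scorer, which is consistent with the stats=1 and 100.0 result in the reproducer.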

Changes:

  • src/lighteval/metrics/metrics_corpus.py: add from itertools import zip_longest and replace the per-metric special-case logic with a single zip_longest(*golds) transpose applied uniformly before corpus_score(). This fixes chrF, chrF++, and TER, and also lets BLEU use all available references instead of silently dropping to gold[0].
  • tests/unit/metrics/test_cases/chrf.json, chrf_plus.json, ter.json: update expected values that were computed with the buggy code.
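As a rough illustration of the transpose in the first bullet, the sketch below applies zip_longest(*golds) to a small ragged input. `transpose_references` is a hypothetical name, not the function in metrics_corpus.py, and the sample strings are made up. zip_longest pads missing references with None by default; how the PR or sacrebleu treats those padded entries is not shown here.

```python
from itertools import zip_longest


def transpose_references(golds):
    """Turn [sent_id][ref_id] into [ref_id][sent_id].

    Sentences with fewer references than the maximum are padded with
    None, which is the default fill value of zip_longest.
    """
    return [list(stream) for stream in zip_longest(*golds)]


# Two sentences: the first has two references, the second has one.
golds = [["ref A1", "ref A2"], ["ref B1"]]
streams = transpose_references(golds)

# Each inner list is now one reference stream across all sentences.
print(streams)  # [['ref A1', 'ref B1'], ['ref A2', None]]
```

Using zip_longest rather than zip keeps the extra references of sentences that have more than the minimum number, at the cost of padded entries in the shorter streams.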

Reproducer (from issue)

from lighteval.metrics.metrics_corpus import CorpusLevelTranslationMetric
from lighteval.metrics.sample_preparator import GenerativeCorpusMetricInput
from lighteval.utils.utils import as_list

items = [
    GenerativeCorpusMetricInput(golds=["GOOD"], preds=["GOOD"]),
    GenerativeCorpusMetricInput(golds=["REF2"], preds=["PRED2"]),
]
metric = CorpusLevelTranslationMetric("chrf++")

# Before fix: stats=1, score=100.0 (only first hyp scored against all refs)
# After fix:  stats=2, score=58.7  (each hyp scored against its own ref)
print(metric.compute_corpus(items))

Test plan

  • Reproducer from issue confirms len(stats) == len(hypotheses) after fix
  • All 8 chrf / chrf++ / ter / bleu test cases pass (pytest tests/unit/metrics/test_automated_metrics_pytest.py -k "chrf or ter or bleu")

…ranslationMetric

sacrebleu expects references in [ref_id][sent_id] shape, but compute_corpus()
was collecting golds as [sent_id][ref_id]. This caused sacrebleu's internal
zip(*references) to transpose the already-per-sample list, resulting in only
the first hypothesis being scored against a pooled set of all references.

Fixed by transposing golds with zip_longest(*golds) before calling
corpus_score(), so each inner list is one reference stream across all
hypotheses. Also removes the special-case BLEU branch (gold[0]-only), allowing
BLEU to benefit from multi-reference inputs the same way.

Updated chrf, chrf++, and TER test fixtures to reflect the now-correct scores.

Fixes huggingface#1112

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[BUG] chrF++/chrF/TER metrics receive references in wrong format, causing incorrect corpus-level scoring

1 participant