Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) by jiafatom · Pull Request #2476 · microsoft/Olive

jiafatom · 2026-05-27T20:25:34Z

Summary

Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).

New Metrics

| Metric | Task Type | Suitable Benchmarks |
| --- | --- | --- |
| `exact_match` | `vision-vqa` | AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS |
| `relaxed_accuracy` | `vision-chart-qa` | ChartQA (±5% numeric tolerance) |
| `word_sort_ratio` | `vision-ocr` | OCR benchmarks (word-level overlap) |
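
To make the three comparisons concrete, here is a minimal standalone sketch of how each metric could score a single prediction against a single ground-truth string. It is not the code from `olive/evaluator/accuracy.py`: the function names, the whitespace trimming, the zero-target handling, and in particular the use of `difflib.SequenceMatcher` over sorted words for `word_sort_ratio` are assumptions made for illustration. The PR only states "case-insensitive string equality", "±5% numeric tolerance", and "word-level overlap ratio".

```python
from difflib import SequenceMatcher
from typing import Optional


def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive equality. Trimming whitespace is an assumption."""
    return prediction.strip().lower() == target.strip().lower()


def _parse_number(text: str) -> Optional[float]:
    """Return the text as a float, or None if it is not a plain number."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers pass within a relative tolerance; others fall back to exact match."""
    pred_num = _parse_number(prediction)
    target_num = _parse_number(target)
    if pred_num is None or target_num is None:
        return exact_match(prediction, target)
    if target_num == 0:
        # Relative error is undefined for a zero target (assumed handling).
        return pred_num == 0
    return abs(pred_num - target_num) <= tolerance * abs(target_num)


def word_sort_ratio(prediction: str, target: str) -> float:
    """Order-insensitive word similarity in [0, 1] (one plausible reading of the metric)."""
    pred_words = sorted(prediction.lower().split())
    target_words = sorted(target.lower().split())
    return SequenceMatcher(None, pred_words, target_words).ratio()
```

Under these assumptions, `relaxed_accuracy("104", "100")` passes while `relaxed_accuracy("106", "100")` does not, and `word_sort_ratio` ignores word order because both word lists are sorted before comparison.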

Public HuggingFace Datasets

These metrics are designed to work with publicly available datasets:

| Metric | Recommended Dataset | HuggingFace ID |
| --- | --- | --- |
| `exact_match` | TextVQA | `facebook/textvqa` |
| `relaxed_accuracy` | ChartQA | `HuggingFaceM4/ChartQA` |
| `word_sort_ratio` | DocumentVQA | `HuggingFaceM4/DocumentVQA` |

Example configuration snippets are provided in docs/source/how-to/configure-workflows/metrics-configuration.md.

Changes

- olive/evaluator/metric.py: Adds EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
- olive/evaluator/accuracy.py: Implements the three metric classes with multi-answer support
- olive/evaluator/olive_evaluator.py: Adds vision inference path and task-metric validation
- olive/data/component/pre_process_data.py: Adds vision_vqa_pre_process component
- olive/data/component/dataloader.py: Adds vision_vqa_dataloader with custom collate_fn for PIL images
- olive/data/container/huggingface_container.py: Registers vision-vqa, vision-chart-qa, vision-ocr task types with appropriate dataloader
- olive/olive_config.json: Adds vision extras (pillow)
- docs/source/how-to/configure-workflows/metrics-configuration.md: Adds vision metrics documentation with public dataset examples
- test/evaluator/test_accuracy.py: Unit tests covering all new metrics

Design

- Vision metrics are text-based and task-dependent: each compares the predicted answer string to the ground truth.
- Multiple valid answers are supported via a `|` separator; a prediction counts as correct if it matches any valid answer.
- Task-metric validation raises ValueError for incompatible combinations.
- A custom vision_vqa_dataloader handles PIL images with a collate_fn that avoids PyTorch default collation issues.
- PyTorch path: the model processor handles images natively.
- ONNX path: a single forward pass is assumed (classification-style VQA); for autoregressive models, use the PyTorch evaluator with a generation loop.
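
The multi-answer rule in the list above can be sketched as follows. This is an illustrative standalone example, not the PR's implementation: `matches_any`, `accuracy`, and the default case-insensitive comparator are hypothetical names, and dropping empty candidates after splitting is an assumption.

```python
from typing import Callable, Sequence


def _same_text(prediction: str, candidate: str) -> bool:
    """Default comparator: case-insensitive equality (assumed for this sketch)."""
    return prediction.strip().lower() == candidate.strip().lower()


def matches_any(
    prediction: str,
    target: str,
    compare: Callable[[str, str], bool] = _same_text,
    separator: str = "|",
) -> bool:
    """Split the ground truth on the separator and accept a match with any candidate."""
    candidates = [part.strip() for part in target.split(separator) if part.strip()]
    return any(compare(prediction, candidate) for candidate in candidates)


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that match at least one valid answer."""
    if not predictions:
        return 0.0
    hits = sum(matches_any(p, t) for p, t in zip(predictions, targets))
    return hits / len(predictions)
```

For example, a ground truth of `"grey|gray"` accepts either spelling, so `accuracy(["gray", "cat"], ["grey|gray", "dog"])` evaluates to `0.5`. Passing the comparator as a parameter lets the same splitting logic serve a tolerance-based or ratio-based metric.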

…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:

- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document which datasets each vision metric is suitable for:

- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.

Changes:

- Add three new AccuracySubType values (exact_match, relaxed_accuracy, word_sort_ratio) and implement their metric logic.
- Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
- Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `olive/evaluator/metric.py` | Adds new vision accuracy sub-types to the enum. |
| `olive/evaluator/accuracy.py` | Implements `ExactMatch`, `RelaxedAccuracy`, and `WordSortRatio`. |
| `olive/evaluator/olive_evaluator.py` | Adds vision inference paths and task/metric validation for vision metrics. |
| `olive/data/component/pre_process_data.py` | Adds `vision_vqa_pre_process` that emits (image, question) inputs and string answers. |
| `olive/data/container/huggingface_container.py` | Registers new vision task types mapping to the vision pre-process component. |
| `olive/olive_config.json` | Adds a `vision` extra dependency (`pillow`). |
| `test/evaluator/test_accuracy.py` | Adds unit tests for all new metric classes. |

- Fix isinstance check to include tuple (shaahji)
- Support multiple valid answers via | separator instead of taking first only
- Add vision_vqa_dataloader with custom collate_fn for PIL images (Copilot)
- Register vision_vqa_dataloader for vision task types in HuggingfaceContainer
- Simplify relaxed_accuracy numeric comparison (shaahji)
- Add clarifying comment about single-pass vs generation models (jambayk)
- Add vision metrics documentation with public HuggingFace dataset examples (facebook/textvqa, HuggingFaceM4/ChartQA, HuggingFaceM4/DocumentVQA)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
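
The custom collate_fn mentioned above exists because PyTorch's default collation tries to stack batch elements into tensors, which fails for PIL image objects. A minimal sketch of the idea follows, assuming each sample is an `(inputs, answer)` pair where `inputs` is a dict with `image` and `question` keys. That sample layout and the name `collate_keep_objects` are hypothetical; the PR's actual `vision_vqa_dataloader` may be structured differently.

```python
from typing import Any, Dict, List, Sequence, Tuple

Sample = Tuple[Dict[str, Any], str]


def collate_keep_objects(batch: Sequence[Sample]) -> Tuple[Dict[str, List[Any]], List[str]]:
    """Group a batch field by field into plain lists, without stacking into tensors.

    Images stay as the original Python objects, so a model processor can
    convert them later.
    """
    inputs, answers = zip(*batch)
    collated = {key: [sample[key] for sample in inputs] for key in inputs[0]}
    return collated, list(answers)


# Stand-in objects take the place of PIL images so the sketch runs without Pillow.
image_a, image_b = object(), object()
batch = [
    ({"image": image_a, "question": "What color is the car?"}, "red"),
    ({"image": image_b, "question": "How many dogs?"}, "2|two"),
]
collated_inputs, collated_answers = collate_keep_objects(batch)
```

A function of this shape could be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`, which leaves batching logic entirely to the caller.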

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

+    Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
+    the model's own processor/tokenizer should be applied in the post_func or within
+    the model's forward method. For the ONNX evaluator, provide a custom pre-process
+    component that applies the appropriate processor/tokenizer to produce numeric
+    tensors matching the model's io_config.


+        # Extract task from pre_process_data_config params, which is how HuggingfaceContainer
+        # maps task types (e.g., "vision-vqa", "vision-chart-qa", "vision-ocr") to components.
+        pre_process_config = metric.data_config.pre_process_data_config
+        if pre_process_config and pre_process_config.params:
+            task_type = pre_process_config.params.get("task")
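
Once the task type has been extracted as in the excerpt above, the compatibility check could look roughly like the sketch below. It assumes the metric/task table in the PR description is a strict one-to-one mapping and that a missing or unrecognized task skips validation. Both are assumptions, and `ALLOWED_SUB_TYPES` and `validate_task_metric` are hypothetical names rather than Olive identifiers.

```python
from typing import Iterable, Optional

# Mapping taken from the "New Metrics" table; treating it as exhaustive is an assumption.
ALLOWED_SUB_TYPES = {
    "vision-vqa": {"exact_match"},
    "vision-chart-qa": {"relaxed_accuracy"},
    "vision-ocr": {"word_sort_ratio"},
}


def validate_task_metric(task_type: Optional[str], sub_types: Iterable[str]) -> None:
    """Raise ValueError if any sub-type is not allowed for the given vision task."""
    if task_type not in ALLOWED_SUB_TYPES:
        # Missing or non-vision task: nothing to validate in this sketch.
        return
    disallowed = sorted(set(sub_types) - ALLOWED_SUB_TYPES[task_type])
    if disallowed:
        raise ValueError(
            f"Sub-types {disallowed} are not compatible with task '{task_type}'; "
            f"expected one of {sorted(ALLOWED_SUB_TYPES[task_type])}."
        )
```

With this mapping, pairing `vision-ocr` with `exact_match` raises `ValueError`, while `vision-vqa` with `exact_match` passes silently.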


+def _is_vision_metric(metric: "Metric") -> bool:
+    """Check if metric uses vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio).
+
+    Raises ValueError if vision sub-types are mixed with non-vision sub-types,
+    as they require different inference paths.
+    """
+    if metric.type != MetricType.ACCURACY:
+        return False
+    vision_based = [sub.name in _VISION_ACCURACY_SUBTYPES for sub in metric.sub_types]
+    if any(vision_based) and not all(vision_based):
+        raise ValueError(
+            "Cannot mix vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio) "
+            "with other sub-types in the same metric. Please define them as separate metrics."
+        )
+    return all(vision_based)
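
The behavior of the check above can be exercised in isolation with stand-in objects. The sketch below mirrors the any/all logic of the excerpt but is not Olive code: `SimpleNamespace` objects replace `Metric` and its sub-types, the string `"accuracy"` replaces `MetricType.ACCURACY`, and the contents of `VISION_SUB_TYPES` are assumed from the docstring.

```python
from types import SimpleNamespace

# Assumed contents, based on the names listed in the docstring.
VISION_SUB_TYPES = {"exact_match", "relaxed_accuracy", "word_sort_ratio"}


def uses_vision_path(metric) -> bool:
    """Return True when every sub-type is a vision sub-type; reject mixtures."""
    if metric.type != "accuracy":
        return False
    flags = [sub.name in VISION_SUB_TYPES for sub in metric.sub_types]
    if any(flags) and not all(flags):
        raise ValueError("Cannot mix vision and non-vision accuracy sub-types.")
    return all(flags)


def make_metric(metric_type: str, *names: str) -> SimpleNamespace:
    """Build a stand-in metric whose sub-types expose only a .name attribute."""
    return SimpleNamespace(
        type=metric_type,
        sub_types=[SimpleNamespace(name=name) for name in names],
    )


vision_only = make_metric("accuracy", "exact_match", "relaxed_accuracy")
non_vision = make_metric("accuracy", "accuracy_score")
mixed = make_metric("accuracy", "exact_match", "accuracy_score")
```

Here `uses_vision_path(vision_only)` is `True`, `uses_vision_path(non_vision)` is `False`, and `uses_vision_path(mixed)` raises `ValueError`. Note that `all([])` is `True` in Python, so a metric with an empty sub-type list would also be classified as vision by this logic; whether that case can occur depends on validation elsewhere in Olive, which is not shown here.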


jiafatom and others added 4 commits May 27, 2026 18:13

Remove internal project references from comments

28d8110

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 27, 2026 20:25

Copilot started reviewing on behalf of jiafatom May 27, 2026 20:25

jiafatom mentioned this pull request May 27, 2026

Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2474

Closed

Copilot AI reviewed May 27, 2026

Comment thread olive/data/container/huggingface_container.py

Comment thread olive/data/component/pre_process_data.py

jambayk reviewed May 27, 2026

Comment thread olive/evaluator/olive_evaluator.py

shaahji requested changes May 27, 2026

Comment thread olive/data/component/pre_process_data.py Outdated

Comment thread olive/data/component/pre_process_data.py

Comment thread olive/evaluator/accuracy.py Outdated

jiafatom requested review from Copilot, jambayk and shaahji May 28, 2026 21:14

Copilot started reviewing on behalf of jiafatom May 28, 2026 21:14

Copilot AI reviewed May 28, 2026

jiafatom wants to merge 5 commits into main from jiafa/add-vision-eval-metrics

4 participants