Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2476
Open
jiafatom wants to merge 5 commits into
…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
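The three metric definitions in the commit message above can be illustrated with a minimal sketch. This is not the code in `olive/evaluator/accuracy.py`: the function names, the whitespace trimming, the handling of `%` and thousands separators, the fallback to string comparison for non-numeric answers, and the reading of "word-level overlap ratio" as a Dice coefficient over word counts are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive equality; trimming whitespace is an assumption."""
    return prediction.strip().lower() == target.strip().lower()


def _to_float(text: str) -> Optional[float]:
    """Parse a number, tolerating thousands separators and a trailing '%' (assumed)."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric match within a relative tolerance, else fall back to exact_match."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is None or target_num is None:
        return exact_match(prediction, target)
    if target_num == 0:
        return pred_num == 0  # relative error is undefined at zero
    return abs(pred_num - target_num) <= tolerance * abs(target_num)


def word_sort_ratio(prediction: str, target: str) -> float:
    """Order-insensitive word overlap in [0, 1] (assumed: Dice over word counts)."""
    pred_words = Counter(prediction.lower().split())
    target_words = Counter(target.lower().split())
    total = sum(pred_words.values()) + sum(target_words.values())
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    shared = sum((pred_words & target_words).values())
    return 2.0 * shared / total
```

Under these assumptions, `relaxed_accuracy("104", "100")` passes while `relaxed_accuracy("106", "100")` does not, and `word_sort_ratio` ignores word order, so a correctly recognized line whose words come back in a different order still scores 1.0.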
Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.
Changes:
- Add three new `AccuracySubType` values (`exact_match`, `relaxed_accuracy`, `word_sort_ratio`) and implement their metric logic.
- Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
- Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.

Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file

| File | Description |
|---|---|
| `olive/evaluator/metric.py` | Adds new vision accuracy sub-types to the enum. |
| `olive/evaluator/accuracy.py` | Implements `ExactMatch`, `RelaxedAccuracy`, and `WordSortRatio`. |
| `olive/evaluator/olive_evaluator.py` | Adds vision inference paths and task/metric validation for vision metrics. |
| `olive/data/component/pre_process_data.py` | Adds `vision_vqa_pre_process` that emits (image, question) inputs and string answers. |
| `olive/data/container/huggingface_container.py` | Registers new vision task types mapping to the vision pre-process component. |
| `olive/olive_config.json` | Adds a `vision` extra dependency (`pillow`). |
| `test/evaluator/test_accuracy.py` | Adds unit tests for all new metric classes. |
jambayk reviewed May 27, 2026
shaahji requested changes May 27, 2026
- Fix isinstance check to include tuple (shaahji)
- Support multiple valid answers via | separator instead of taking first only
- Add vision_vqa_dataloader with custom collate_fn for PIL images (Copilot)
- Register vision_vqa_dataloader for vision task types in HuggingfaceContainer
- Simplify relaxed_accuracy numeric comparison (shaahji)
- Add clarifying comment about single-pass vs generation models (jambayk)
- Add vision metrics documentation with public HuggingFace dataset examples (facebook/textvqa, HuggingFaceM4/ChartQA, HuggingFaceM4/DocumentVQA)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
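The multi-answer change in this commit (matching against any of several `|`-separated answers instead of only the first) could look roughly like the sketch below. The helper names `split_answers` and `matches_any` are hypothetical and do not come from the PR; dropping empty segments and trimming whitespace around each answer are also assumptions.

```python
from typing import List


def split_answers(target: str, separator: str = "|") -> List[str]:
    """Split a reference string into its valid answers, dropping empty segments."""
    return [part.strip() for part in target.split(separator) if part.strip()]


def matches_any(prediction: str, target: str) -> bool:
    """True if the prediction equals any valid answer, ignoring case."""
    normalized = prediction.strip().lower()
    return any(normalized == answer.lower() for answer in split_answers(target))
```

With a reference of `"gray | grey"`, both spellings are accepted, whereas taking only the first answer would have rejected `"grey"`.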
Comment on lines +400 to +404

    Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
    the model's own processor/tokenizer should be applied in the post_func or within
    the model's forward method. For the ONNX evaluator, provide a custom pre-process
    component that applies the appropriate processor/tokenizer to produce numeric
    tensors matching the model's io_config.
Comment on lines +126 to +130

    # Extract task from pre_process_data_config params, which is how HuggingfaceContainer
    # maps task types (e.g., "vision-vqa", "vision-chart-qa", "vision-ocr") to components.
    pre_process_config = metric.data_config.pre_process_data_config
    if pre_process_config and pre_process_config.params:
        task_type = pre_process_config.params.get("task")
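To make the task lookup above concrete, here is a self-contained sketch of how the extracted task could feed a task↔metric compatibility check. The PR does not show the rest of `_validate_vision_task_metric`, so everything beyond the extraction step is an assumption: the `EXPECTED_SUB_TYPE` mapping (taken from the metric/task pairing in the PR summary), the strict one-to-one check, the choice to skip validation when no task is set, and the `SimpleNamespace` stand-ins for Olive's config objects.

```python
from types import SimpleNamespace
from typing import Optional

# Assumed pairing, based on the metric/task table in the PR summary.
EXPECTED_SUB_TYPE = {
    "vision-vqa": "exact_match",
    "vision-chart-qa": "relaxed_accuracy",
    "vision-ocr": "word_sort_ratio",
}


def extract_task(data_config) -> Optional[str]:
    """Read the task from pre_process_data_config.params, as in the PR snippet."""
    pre_process_config = data_config.pre_process_data_config
    if pre_process_config and pre_process_config.params:
        return pre_process_config.params.get("task")
    return None


def validate_task_metric(data_config, sub_type_name: str) -> None:
    """Raise ValueError on a known task paired with a different sub-type (assumed policy)."""
    task_type = extract_task(data_config)
    expected = EXPECTED_SUB_TYPE.get(task_type)
    if expected is not None and expected != sub_type_name:
        raise ValueError(
            f"Task '{task_type}' expects sub-type '{expected}', got '{sub_type_name}'."
        )


def make_config(params):
    """Hypothetical stand-in for a DataConfig with a pre-process component."""
    return SimpleNamespace(pre_process_data_config=SimpleNamespace(params=params))
```

In this sketch a config without a `task` parameter passes silently; whether the real validator should warn or fail in that case is a design decision the excerpt does not settle.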
Comment on lines +99 to +113

    def _is_vision_metric(metric: "Metric") -> bool:
        """Check if metric uses vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio).

        Raises ValueError if vision sub-types are mixed with non-vision sub-types,
        as they require different inference paths.
        """
        if metric.type != MetricType.ACCURACY:
            return False
        vision_based = [sub.name in _VISION_ACCURACY_SUBTYPES for sub in metric.sub_types]
        if any(vision_based) and not all(vision_based):
            raise ValueError(
                "Cannot mix vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio) "
                "with other sub-types in the same metric. Please define them as separate metrics."
            )
        return all(vision_based)
Summary
Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).
New Metrics
| Metric | Task type |
|---|---|
| `exact_match` | `vision-vqa` |
| `relaxed_accuracy` | `vision-chart-qa` |
| `word_sort_ratio` | `vision-ocr` |

Public HuggingFace Datasets
These metrics are designed to work with publicly available datasets:

| Metric | Dataset |
|---|---|
| `exact_match` | `facebook/textvqa` |
| `relaxed_accuracy` | `HuggingFaceM4/ChartQA` |
| `word_sort_ratio` | `HuggingFaceM4/DocumentVQA` |

Example configuration snippets are provided in `docs/source/how-to/configure-workflows/metrics-configuration.md`.

Changes
- `olive/evaluator/metric.py`: Adds `EXACT_MATCH`, `RELAXED_ACCURACY`, `WORD_SORT_RATIO` to the `AccuracySubType` enum
- `olive/evaluator/accuracy.py`: Implements the three metric classes with multi-answer support
- `olive/evaluator/olive_evaluator.py`: Adds vision inference path and task-metric validation
- `olive/data/component/pre_process_data.py`: Adds `vision_vqa_pre_process` component
- `olive/data/component/dataloader.py`: Adds `vision_vqa_dataloader` with custom collate_fn for PIL images
- `olive/data/container/huggingface_container.py`: Registers `vision-vqa`, `vision-chart-qa`, `vision-ocr` task types with appropriate dataloader
- `olive/olive_config.json`: Adds `vision` extras (`pillow`)
- `docs/source/how-to/configure-workflows/metrics-configuration.md`: Adds vision metrics documentation with public dataset examples
- `test/evaluator/test_accuracy.py`: Unit tests covering all new metrics

Design
- Multi-answer support via `|` separator (metrics match against any valid answer)
- … `ValueError`
- `vision_vqa_dataloader` handles PIL images with a collate_fn that avoids PyTorch default collation issues
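The collation point in the design notes can be sketched without importing PyTorch or Pillow. PyTorch's default collate function tries to stack batch elements into tensors and does not know how to handle PIL images, so a custom `collate_fn` typically keeps them in plain Python lists. The function name `vision_vqa_collate` and the assumed sample layout `((image, question), answer)` are hypothetical; the PR's actual `vision_vqa_dataloader` may structure batches differently.

```python
from typing import Any, List, Sequence, Tuple

# Assumed sample layout: ((image, question), answer)
Sample = Tuple[Tuple[Any, str], str]


def vision_vqa_collate(
    batch: Sequence[Sample],
) -> Tuple[Tuple[List[Any], List[str]], List[str]]:
    """Group a batch into parallel lists without converting images to tensors."""
    images = [inputs[0] for inputs, _ in batch]
    questions = [inputs[1] for inputs, _ in batch]
    answers = [answer for _, answer in batch]
    return (images, questions), answers
```

A function of this shape would be passed as `torch.utils.data.DataLoader(dataset, batch_size=..., collate_fn=vision_vqa_collate)`, leaving image preprocessing to the model's own processor downstream.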