Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2476

Open
jiafatom wants to merge 5 commits into main from jiafa/add-vision-eval-metrics

jiafatom (Contributor) commented May 27, 2026

Summary

Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).

New Metrics

| Metric | Task Type | Suitable Benchmarks |
| --- | --- | --- |
| exact_match | vision-vqa | AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS |
| relaxed_accuracy | vision-chart-qa | ChartQA (±5% tolerance on numeric answers) |
| word_sort_ratio | vision-ocr | OCR benchmarks (word-level overlap) |
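
To make the relaxed_accuracy row concrete, here is a minimal sketch of a ±5% relative-tolerance comparison. The names `relaxed_match` and `_to_float`, the fallback to case-insensitive string equality for non-numeric answers, and the handling of a zero target are assumptions for illustration; they are not taken from the PR's implementation.

```python
from typing import Optional


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer, tolerating commas and a trailing percent sign."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Return True if the prediction is within `tolerance` (relative) of the target.

    Non-numeric answers fall back to case-insensitive equality (an assumption).
    """
    pred_num = _to_float(prediction)
    target_num = _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            # Relative error is undefined for a zero target; require an exact zero.
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()
```

With these assumptions, `relaxed_match("104", "100")` is accepted while `relaxed_match("106", "100")` is rejected.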

Public HuggingFace Datasets

These metrics are designed to work with publicly available datasets:

| Metric | Recommended Dataset | HuggingFace ID |
| --- | --- | --- |
| exact_match | TextVQA | facebook/textvqa |
| relaxed_accuracy | ChartQA | HuggingFaceM4/ChartQA |
| word_sort_ratio | DocumentVQA | HuggingFaceM4/DocumentVQA |

Example configuration snippets are provided in docs/source/how-to/configure-workflows/metrics-configuration.md.
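
As a rough illustration of what such a snippet might look like, the sketch below builds a workflow fragment as a Python dict. The dataset ID, task name, and sub-type name come from this PR description; the surrounding field layout (`data_configs`, `evaluators`, `params`), the config names, and the `split` value are assumptions, so the documentation file above remains the authoritative reference.

```python
import json


def build_vision_eval_config() -> dict:
    """Build a hypothetical evaluation config fragment for exact_match on TextVQA."""
    data_config = {
        "name": "textvqa_eval",  # hypothetical name
        "type": "HuggingfaceContainer",
        "load_dataset_config": {
            "params": {"data_name": "facebook/textvqa", "split": "validation"}
        },
        # The task type is what the task-metric validation inspects.
        "pre_process_data_config": {"params": {"task": "vision-vqa"}},
    }
    metric = {
        "name": "vqa_accuracy",  # hypothetical name
        "type": "accuracy",
        "sub_types": [{"name": "exact_match", "priority": 1}],
        "data_config": "textvqa_eval",
    }
    return {
        "data_configs": [data_config],
        "evaluators": {"common_evaluator": {"metrics": [metric]}},
    }


if __name__ == "__main__":
    print(json.dumps(build_vision_eval_config(), indent=2))
```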

Changes

  • olive/evaluator/metric.py: Adds EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
  • olive/evaluator/accuracy.py: Implements the three metric classes with multi-answer support
  • olive/evaluator/olive_evaluator.py: Adds vision inference path and task-metric validation
  • olive/data/component/pre_process_data.py: Adds vision_vqa_pre_process component
  • olive/data/component/dataloader.py: Adds vision_vqa_dataloader with custom collate_fn for PIL images
  • olive/data/container/huggingface_container.py: Registers vision-vqa, vision-chart-qa, vision-ocr task types with appropriate dataloader
  • olive/olive_config.json: Adds vision extras (pillow)
  • docs/source/how-to/configure-workflows/metrics-configuration.md: Adds vision metrics documentation with public dataset examples
  • test/evaluator/test_accuracy.py: Unit tests covering all new metrics
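
The custom collate_fn mentioned for the dataloader can be pictured with a small sketch. It assumes each sample is a tuple of an input dict (for example with `image` and `question` keys) and an answer string; the function name `vision_collate` and that sample layout are illustrative and may differ from what the PR ships. The point is that images stay in a plain Python list rather than being stacked into a tensor.

```python
from typing import Any


def vision_collate(
    batch: list[tuple[dict[str, Any], str]],
) -> tuple[dict[str, list[Any]], list[str]]:
    """Group per-sample inputs into lists, leaving image objects untouched."""
    inputs: dict[str, list[Any]] = {}
    labels: list[str] = []
    for sample_inputs, label in batch:
        for key, value in sample_inputs.items():
            # Append instead of stacking: PIL images may have different sizes.
            inputs.setdefault(key, []).append(value)
        labels.append(label)
    return inputs, labels
```

A function of this shape could be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`, which otherwise tries to convert each field with its default collation.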

Design

  • Vision metrics are text-based (they compare the predicted answer string to the ground truth) and task-dependent
  • Multiple valid answers supported via | separator (metrics match against any valid answer)
  • Task-metric validation ensures incompatible combinations raise ValueError
  • Custom vision_vqa_dataloader handles PIL images with a collate_fn that avoids PyTorch default collation issues
  • PyTorch path: model processor handles images natively
  • ONNX path: single forward pass assumed (classification-style VQA); for autoregressive models use PyTorch evaluator with generation loop
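
The multi-answer behavior in the list above can be sketched as follows. The helper names (`split_answers`, `matches_any`, `case_insensitive_equal`) are hypothetical, and stripping whitespace and dropping empty candidates around the `|` separator are assumptions about how such answers would be normalized.

```python
from typing import Callable


def split_answers(target: str, separator: str = "|") -> list[str]:
    """Split a ground-truth string into its valid answers, dropping empty parts."""
    candidates = [part.strip() for part in target.split(separator)]
    return [candidate for candidate in candidates if candidate]


def case_insensitive_equal(prediction: str, candidate: str) -> bool:
    """Simple per-answer comparison used here as the pluggable matcher."""
    return prediction.strip().lower() == candidate.strip().lower()


def matches_any(
    prediction: str,
    target: str,
    is_match: Callable[[str, str], bool] = case_insensitive_equal,
) -> bool:
    """Return True if the prediction matches at least one valid answer."""
    return any(is_match(prediction, candidate) for candidate in split_answers(target))
```

Keeping the per-answer comparison pluggable lets the same wrapper serve each of the three metrics.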

jiafatom and others added 4 commits May 27, 2026 18:13
…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling
VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR
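
For illustration, the first and last definitions above could look roughly like the sketch below. The `exact_match` function follows the stated case-insensitive equality; `word_sort_ratio` is only one plausible reading of "word-level overlap ratio" (sort the lowercased words, then compare the two word lists with `difflib.SequenceMatcher`), and the PR's actual formula may differ.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return prediction.strip().lower() == target.strip().lower()


def word_sort_ratio(prediction: str, target: str) -> float:
    """Similarity in [0, 1] between the sorted word lists of both strings.

    Sorting makes the score independent of word order (an assumed design).
    """
    pred_words = sorted(prediction.lower().split())
    target_words = sorted(target.lower().split())
    return SequenceMatcher(None, pred_words, target_words).ratio()
```

Under this reading, reordered but otherwise identical text scores 1.0, and strings with no words in common score 0.0.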

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix _validate_vision_task_metric to extract task from
  pre_process_data_config.params['task'] instead of non-existent
  DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.

Changes:

  • Add three new AccuracySubType values (exact_match, relaxed_accuracy, word_sort_ratio) and implement their metric logic.
  • Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
  • Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| olive/evaluator/metric.py | Adds new vision accuracy sub-types to the enum. |
| olive/evaluator/accuracy.py | Implements ExactMatch, RelaxedAccuracy, and WordSortRatio. |
| olive/evaluator/olive_evaluator.py | Adds vision inference paths and task/metric validation for vision metrics. |
| olive/data/component/pre_process_data.py | Adds vision_vqa_pre_process that emits (image, question) inputs and string answers. |
| olive/data/container/huggingface_container.py | Registers new vision task types mapping to the vision pre-process component. |
| olive/olive_config.json | Adds a vision extra dependency (pillow). |
| test/evaluator/test_accuracy.py | Adds unit tests for all new metric classes. |

Comment thread olive/data/container/huggingface_container.py
Comment thread olive/data/component/pre_process_data.py
Comment thread olive/evaluator/olive_evaluator.py
Comment thread olive/data/component/pre_process_data.py Outdated
Comment thread olive/data/component/pre_process_data.py
Comment thread olive/evaluator/accuracy.py Outdated
- Fix isinstance check to include tuple (shaahji)
- Support multiple valid answers via | separator instead of taking first only
- Add vision_vqa_dataloader with custom collate_fn for PIL images (Copilot)
- Register vision_vqa_dataloader for vision task types in HuggingfaceContainer
- Simplify relaxed_accuracy numeric comparison (shaahji)
- Add clarifying comment about single-pass vs generation models (jambayk)
- Add vision metrics documentation with public HuggingFace dataset examples
  (facebook/textvqa, HuggingFaceM4/ChartQA, HuggingFaceM4/DocumentVQA)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Comment on lines +400 to +404
Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
the model's own processor/tokenizer should be applied in the post_func or within
the model's forward method. For the ONNX evaluator, provide a custom pre-process
component that applies the appropriate processor/tokenizer to produce numeric
tensors matching the model's io_config.
Comment on lines +126 to +130
# Extract task from pre_process_data_config params, which is how HuggingfaceContainer
# maps task types (e.g., "vision-vqa", "vision-chart-qa", "vision-ocr") to components.
pre_process_config = metric.data_config.pre_process_data_config
if pre_process_config and pre_process_config.params:
    task_type = pre_process_config.params.get("task")
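
A self-contained sketch of how the extracted task type could feed the task-metric check is shown below. The mapping mirrors the New Metrics table in the PR description, but treating it as a strict one-to-one mapping, the name `validate_task_metric`, and the choice to skip validation when no known task is present are all assumptions rather than the PR's actual behavior.

```python
from typing import Optional

# Assumed mapping, taken from the "New Metrics" table in the PR description.
TASK_TO_SUBTYPES = {
    "vision-vqa": {"exact_match"},
    "vision-chart-qa": {"relaxed_accuracy"},
    "vision-ocr": {"word_sort_ratio"},
}


def validate_task_metric(params: Optional[dict], sub_type_names: list[str]) -> None:
    """Raise ValueError if a sub-type is not allowed for the configured task."""
    task_type = (params or {}).get("task")
    allowed = TASK_TO_SUBTYPES.get(task_type)
    if allowed is None:
        # Missing or unknown task: this sketch does not validate it.
        return
    unsupported = [name for name in sub_type_names if name not in allowed]
    if unsupported:
        raise ValueError(
            f"Sub-types {unsupported} are not compatible with task '{task_type}'; "
            f"expected one of {sorted(allowed)}."
        )
```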
Comment on lines +99 to +113
def _is_vision_metric(metric: "Metric") -> bool:
    """Check if metric uses vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio).

    Raises ValueError if vision sub-types are mixed with non-vision sub-types,
    as they require different inference paths.
    """
    if metric.type != MetricType.ACCURACY:
        return False
    vision_based = [sub.name in _VISION_ACCURACY_SUBTYPES for sub in metric.sub_types]
    if any(vision_based) and not all(vision_based):
        raise ValueError(
            "Cannot mix vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio) "
            "with other sub-types in the same metric. Please define them as separate metrics."
        )
    return all(vision_based)
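
To see how the helper above behaves on the three interesting inputs, the sketch below re-creates it against stand-in types. `MetricType`, `SubType`, `Metric`, and `classify` here are simplified, hypothetical stand-ins (sub-type names are plain strings), not Olive's real classes.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetricType(str, Enum):  # stand-in for Olive's MetricType
    ACCURACY = "accuracy"
    LATENCY = "latency"


@dataclass
class SubType:  # stand-in: only the name matters for this check
    name: str


@dataclass
class Metric:  # stand-in for Olive's Metric
    type: MetricType
    sub_types: list = field(default_factory=list)


_VISION_ACCURACY_SUBTYPES = {"exact_match", "relaxed_accuracy", "word_sort_ratio"}


def _is_vision_metric(metric: Metric) -> bool:
    if metric.type != MetricType.ACCURACY:
        return False
    vision_based = [sub.name in _VISION_ACCURACY_SUBTYPES for sub in metric.sub_types]
    if any(vision_based) and not all(vision_based):
        raise ValueError("Cannot mix vision accuracy sub-types with other sub-types.")
    return all(vision_based)


def classify(names: list, metric_type: MetricType = MetricType.ACCURACY) -> str:
    """Label a metric as 'vision', 'other', or 'mixed' using the helper above."""
    metric = Metric(type=metric_type, sub_types=[SubType(n) for n in names])
    try:
        return "vision" if _is_vision_metric(metric) else "other"
    except ValueError:
        return "mixed"
```

A metric that lists only vision sub-types takes the vision inference path, a metric with only other sub-types takes the regular path, and a mixed metric is rejected so the two paths never have to share one metric.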

4 participants