Rework textpage_ocr #4920

Merged
JorjMcKie merged 1 commit into main from harald-ocr
Mar 2, 2026

@JorjMcKie
Collaborator

No description provided.

Collaborator

@julian-smith-artifex-com left a comment

The new expected text tests/resources/test_3842_partial.txt seems worse than before?

The ordering of lines is now incorrect, and the ...... text now contains spurious incorrect chars like `c:ccccssccs`.

For partial OCR, we previously added text content from OCR'd images on the page.
We now redact the legible text and let the OCR engine recognize the remaining page content, which includes images as before but also vector graphics that simulate text.
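
The difference between the two approaches can be sketched with a toy model. This is only an illustration of which page content reaches the OCR engine under each strategy as described above. It is not PyMuPDF code: `PageItem`, `old_partial_ocr_targets` and `new_partial_ocr_targets` are hypothetical names, and the sample page is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageItem:
    """Hypothetical stand-in for one piece of page content."""
    kind: str   # "text" (legible, extractable), "image", or "vector"
    label: str


def old_partial_ocr_targets(items):
    """Previous approach as described: only images are sent to OCR."""
    return [it for it in items if it.kind == "image"]


def new_partial_ocr_targets(items):
    """New approach as described: legible text is redacted (hidden),
    and everything that remains on the page is sent to OCR."""
    return [it for it in items if it.kind != "text"]


# Made-up example page with one item of each kind.
page = [
    PageItem("text", "regular paragraph"),
    PageItem("image", "scanned stamp"),
    PageItem("vector", "glyphs drawn as vector paths"),
]

old_labels = [it.label for it in old_partial_ocr_targets(page)]
new_labels = [it.label for it in new_partial_ocr_targets(page)]

print("old:", old_labels)
print("new:", new_labels)
```

In this model the vector item is seen by the OCR engine only under the new approach, which is one plausible reason the expected test output changes.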
@JorjMcKie merged commit 447ff2c into main on Mar 2, 2026
3 checks passed
@JorjMcKie deleted the harald-ocr branch on March 2, 2026 17:42
@github-actions bot locked and limited conversation to collaborators on Mar 2, 2026