fix(pdf): recover whitespace when pdfminer collapses plain text #1733

Open
LaplaceYoung wants to merge 1 commit into microsoft:main from LaplaceYoung:fix/issue-120-pdf-whitespace-collapse

@LaplaceYoung

Summary

Fixes the whitespace collapse reported in issue #120 for plain-text PDFs, without regressing table/form extraction.

What changed

  • Added a collapse detector in PdfConverter to identify pathological extraction where words are concatenated with almost no spaces.
  • Improved plain-page extraction via pdfplumber with conservative spacing tolerances (x_tolerance=1, y_tolerance=3).
  • Kept the existing pdfminer fallback for plain PDFs. When pdfminer output is detected as collapsed, the converter now falls back to the already-collected pdfplumber plain text, provided that text is healthier.
  • Tightened the form/table column-width guard (adaptive_max_columns) to reduce false-positive table extraction on dense academic layouts.
  • Added regression test coverage for the collapse-recovery path.
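
For reviewers, the detect-then-fall-back idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names (looks_collapsed, choose_plain_text) and every threshold are hypothetical, chosen to show the shape of the heuristic rather than the values used in _pdf_converter.py.

```python
def looks_collapsed(
    text: str,
    min_chars: int = 200,           # assumed: skip very short pages
    max_space_ratio: float = 0.03,  # assumed: healthy prose is well above this
    long_token_len: int = 40,       # assumed: tokens this long are suspicious
    long_token_share: float = 0.5,  # assumed: share of chars in such tokens
) -> bool:
    """Heuristically flag text whose words were concatenated with almost no spaces."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # too little text to judge reliably

    whitespace = sum(1 for ch in stripped if ch.isspace())
    non_ws = len(stripped) - whitespace
    space_ratio = whitespace / len(stripped)

    # Count how much of the text sits inside implausibly long "words".
    long_chars = sum(len(tok) for tok in stripped.split() if len(tok) >= long_token_len)

    return space_ratio < max_space_ratio or long_chars / non_ws > long_token_share


def choose_plain_text(pdfminer_text: str, pdfplumber_text: str) -> str:
    """Prefer pdfminer output unless it looks collapsed and the alternative is healthier."""
    if (
        looks_collapsed(pdfminer_text)
        and pdfplumber_text.strip()
        and not looks_collapsed(pdfplumber_text)
    ):
        return pdfplumber_text
    return pdfminer_text
```

The fallback only triggers when the alternative text is both non-empty and not itself collapsed, so PDFs that pdfminer already handles well keep their current output.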

Files

  • packages/markitdown/src/markitdown/converters/_pdf_converter.py
  • packages/markitdown/tests/test_pdf_memory.py

Validation

  • python -m pytest packages/markitdown/tests/test_pdf_memory.py -q
  • python -m pytest packages/markitdown/tests/test_pdf_tables.py::TestPdfTableExtraction::test_borderless_table_extraction packages/markitdown/tests/test_pdf_tables.py::TestPdfFullOutputComparison::test_academic_paper_full_output -q

Issue

Closes #120

@LaplaceYoung
Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

Removal of all whitespaces during PDF conversion