fix(pdf): recover whitespace when pdfminer collapses plain text by LaplaceYoung · Pull Request #1733 · microsoft/markitdown

LaplaceYoung · 2026-04-13T02:56:16Z

Summary

Fixes whitespace-collapse behavior for plain-text PDFs in issue #120 without regressing table/form extraction.

Added a collapse detector in PdfConverter to identify pathological extraction where words are concatenated with almost no spaces.
Improved plain-page extraction via pdfplumber with conservative spacing tolerances (x_tolerance=1, y_tolerance=3).
Kept the existing pdfminer fallback for plain PDFs; when pdfminer output is detected as collapsed, the converter now falls back to the already-collected pdfplumber plain text if that text is healthier.
Tightened form/table column-width guard (adaptive_max_columns) to reduce false-positive table extraction on dense academic layouts.
Added regression test coverage for the collapse-recovery path.
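
For reviewers who want a feel for the collapse-recovery logic described above, here is a minimal sketch. It is not the code in this PR: the function names (space_ratio, looks_collapsed, choose_plain_text) and all thresholds (min_chars, min_space_ratio, max_token_len) are hypothetical and chosen only for illustration. It assumes "collapsed" means very few spaces combined with an implausibly long token, and "healthier" means a higher share of whitespace.

```python
def space_ratio(text: str) -> float:
    """Fraction of non-newline characters that are spaces or tabs."""
    chars = [c for c in text if c not in "\r\n"]
    if not chars:
        return 0.0
    return sum(c in " \t" for c in chars) / len(chars)


def looks_collapsed(
    text: str,
    min_chars: int = 200,           # hypothetical: skip very short texts
    min_space_ratio: float = 0.05,  # hypothetical threshold
    max_token_len: int = 40,        # hypothetical threshold
) -> bool:
    """Heuristic: words appear concatenated with almost no spaces."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False  # too little text to judge reliably
    longest = max(len(token) for token in stripped.split())
    return space_ratio(stripped) < min_space_ratio and longest > max_token_len


def choose_plain_text(pdfminer_text: str, pdfplumber_text: str) -> str:
    """Prefer pdfminer output unless it looks collapsed and the
    already-collected pdfplumber text is healthier."""
    if (
        looks_collapsed(pdfminer_text)
        and pdfplumber_text.strip()
        and not looks_collapsed(pdfplumber_text)
        and space_ratio(pdfplumber_text) > space_ratio(pdfminer_text)
    ):
        return pdfplumber_text
    return pdfminer_text


if __name__ == "__main__":
    collapsed = "Thequickbrownfoxjumpsoverthelazydog" * 10
    healthy = "The quick brown fox jumps over the lazy dog. " * 10
    print(looks_collapsed(collapsed), looks_collapsed(healthy))
    print(choose_plain_text(collapsed, healthy) == healthy)
```

Requiring both a low space ratio and an overlong token keeps the check conservative, so dense but legitimately spaced text (for example, tables of short numbers) is less likely to trigger the fallback.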

python -m pytest packages/markitdown/tests/test_pdf_memory.py -q
python -m pytest packages/markitdown/tests/test_pdf_tables.py::TestPdfTableExtraction::test_borderless_table_extraction packages/markitdown/tests/test_pdf_tables.py::TestPdfFullOutputComparison::test_academic_paper_full_output -q

Closes #120

LaplaceYoung · 2026-04-13T03:01:34Z

@microsoft-github-policy-service agree

fix(pdf): avoid collapsed whitespace fallback for plain pages

b8b4de4