[BugFix] Fix NOT TEXT_MATCH false positives on consuming segments by heng-kuang-777 · Pull Request #17880 · apache/pinot

heng-kuang-777 · 2026-03-13T20:14:49Z

Problem

On consuming segments, Lucene operates in near-realtime mode, so recently ingested documents may not be visible to the IndexSearcher until the next SearcherManager refresh. When evaluating NOT TEXT_MATCH, the filter inversion operated over [0, numDocs), the full segment doc count, so tail documents that were not yet indexed appeared as false positives.
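To make the failure mode concrete, here is a minimal standalone sketch using `java.util.BitSet`. It is not Pinot code: the class name, method names, and doc counts are hypothetical and chosen only for illustration. It shows how inverting a match set over the full doc count pulls in doc IDs the searcher has not seen yet.

```java
import java.util.BitSet;

// Hypothetical illustration only; none of these names exist in Pinot.
final class NotTextMatchTailDemo {

  /** NOT over [0, universe): every doc ID in range that is not in matches. */
  static BitSet invert(BitSet matches, int universe) {
    BitSet result = new BitSet(universe);
    result.set(0, universe);
    result.andNot(matches);
    return result;
  }

  /** Doc IDs in an inverted result that lie at or beyond the searchable doc count (assumed >= 0). */
  static BitSet unindexedTail(BitSet inverted, int searchableDocs) {
    BitSet tail = (BitSet) inverted.clone();
    tail.clear(0, searchableDocs);
    return tail;
  }

  public static void main(String[] args) {
    int numDocs = 6;        // docs ingested into the consuming segment (made-up value)
    int searchableDocs = 4; // docs visible to the searcher since the last refresh (made-up value)

    BitSet matches = new BitSet();
    matches.set(1);
    matches.set(3);

    BitSet inverted = invert(matches, numDocs);
    System.out.println(inverted);                                // {0, 2, 4, 5}
    System.out.println(unindexedTail(inverted, searchableDocs)); // {4, 5}
  }
}
```

Docs 4 and 5 land in the NOT result only because the text index has never been asked about them, not because they were checked and failed to match.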

Fix

The fix introduces getSearchableDocCount() on TextIndexReader. It returns the number of documents currently visible to the Lucene searcher on realtime indexes (updated on each refresh), or -1 for offline/sealed segments, where all docs are indexed. TextMatchFilterOperator now uses this count as the inversion universe instead of numDocs, so unindexed tail docs are excluded from NOT results.
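A rough sketch of how an operator could turn that return value into an inversion bound is below. It is an assumption-laden illustration, not the code in this PR: `InversionUniverse`, `resolve`, and the constant name are hypothetical, and clamping to numDocs is a defensive choice of this sketch rather than something the description states.

```java
// Hypothetical helper; the real logic lives in TextMatchFilterOperator and may differ.
final class InversionUniverse {

  /** Sentinel described in the PR: all docs are indexed (offline/sealed segment). */
  static final int ALL_DOCS_SEARCHABLE = -1;

  /**
   * Picks the exclusive upper bound for the NOT inversion.
   * A negative count falls back to numDocs; otherwise the searchable count is used,
   * clamped to numDocs (the clamp is an assumption of this sketch).
   */
  static int resolve(int searchableDocCount, int numDocs) {
    if (searchableDocCount < 0) {
      return numDocs;
    }
    return Math.min(searchableDocCount, numDocs);
  }

  public static void main(String[] args) {
    System.out.println(resolve(ALL_DOCS_SEARCHABLE, 1000)); // 1000
    System.out.println(resolve(940, 1000));                 // 940
  }
}
```

With a bound of 940, doc IDs 940 through 999 are left out of the NOT result until a later refresh makes them searchable.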

Testing

Deployed the change in an internal cluster and compared results from the same NOT TEXT_MATCH query before and after the fix. Validated that the false positives did not recur.

Before:

After:

Fixes #17809


codecov-commenter · 2026-03-13T21:53:34Z

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.26%. Comparing base (936196c) to head (892147d).

Files with missing lines	Patch %	Lines
...inot/segment/spi/index/reader/TextIndexReader.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17880      +/-   ##
============================================
+ Coverage     63.25%   63.26%   +0.01%     
  Complexity     1481     1481              
============================================
  Files          3190     3190              
  Lines        192257   192261       +4     
  Branches      29470    29471       +1     
============================================
+ Hits         121607   121632      +25     
+ Misses        61115    61093      -22     
- Partials       9535     9536       +1

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.21% <85.71%> (-0.01%)`	⬇️
java-21	`63.23% <85.71%> (+0.02%)`	⬆️
temurin	`63.26% <85.71%> (+0.01%)`	⬆️
unittests	`63.26% <85.71%> (+0.01%)`	⬆️
unittests1	`55.56% <57.14%> (+<0.01%)`	⬆️
unittests2	`34.28% <28.57%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

shauryachats

LGTM, thanks for the fix.

add new Lucene index unit tests

892147d

shauryachats approved these changes Mar 14, 2026

chenboat approved these changes Mar 16, 2026

chenboat merged commit 7f4d28e into apache:master Mar 16, 2026
16 checks passed

xiangfu0 added the text-search, bug, and real-time labels Mar 20, 2026

xiangfu0 mentioned this pull request Mar 23, 2026

Exception while using NOT on regexp_like function #5797

Closed

heng-kuang-777 mentioned this pull request Mar 27, 2026

[BugFix] Fix NOT TEXT_MATCH fence to exclude all docs when Lucene searcher see zero docs #18006

Merged

chenboat merged 2 commits into apache:master from heng-kuang-777:searchable-doc-fence-not-text-match

5 participants
