[BugFix] Fix NOT TEXT_MATCH false positives on consuming segments #17880
Merged
chenboat merged 2 commits into apache:master on Mar 16, 2026
On consuming segments, Lucene operates in near-realtime mode, so recently ingested documents may not be visible to the `IndexSearcher` until the next `SearcherManager` refresh. When evaluating `NOT TEXT_MATCH`, the filter inversion was operating over [0, numDocs), the full segment doc count, causing unindexed tail documents to appear as false positives.

Fix by introducing `getSearchableDocCount()` on `TextIndexReader`, which returns the number of documents currently visible to the Lucene searcher on realtime indexes (updated on each refresh), or -1 for offline/sealed segments where all docs are indexed. `TextMatchFilterOperator` now uses this count as the inversion universe instead of numDocs, so unindexed tail docs are excluded from NOT results.

Fixes apache#17809
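To make the inversion-universe idea concrete, here is a minimal sketch using `java.util.BitSet` in place of Pinot's doc-id bitmaps. The class `NotTextMatchSketch` and its methods are hypothetical and are not the code from this PR; only the -1 sentinel and the idea of inverting over the searchable doc count come from the description above. The `Math.min` clamp and the assumption that visible documents form the prefix [0, searchableDocCount) are illustrative simplifications.

```java
import java.util.BitSet;

/** Hypothetical illustration only; not Pinot's TextMatchFilterOperator. */
final class NotTextMatchSketch {

    /**
     * Picks the doc-id range to invert over.
     * searchableDocCount == -1 stands for an offline/sealed segment where all docs are indexed,
     * so the full numDocs is used. The clamp to numDocs is a defensive assumption of this sketch.
     */
    static int inversionUniverse(int numDocs, int searchableDocCount) {
        return searchableDocCount < 0 ? numDocs : Math.min(searchableDocCount, numDocs);
    }

    /** Computes NOT(matches) restricted to [0, universe). */
    static BitSet not(BitSet matches, int numDocs, int searchableDocCount) {
        int universe = inversionUniverse(numDocs, searchableDocCount);
        BitSet result = (BitSet) matches.clone();
        result.flip(0, universe);        // invert only the docs the searcher can see
        result.clear(universe, numDocs); // tail docs are never reported as NOT matches
        return result;
    }

    public static void main(String[] args) {
        int numDocs = 10;            // docs ingested into the consuming segment
        int searchableDocCount = 7;  // docs visible to the searcher after the last refresh
        BitSet matches = new BitSet();
        matches.set(1);
        matches.set(4);

        // Inverting over [0, numDocs): tail docs 7, 8, 9 show up as false positives.
        System.out.println(not(matches, numDocs, -1));                 // {0, 2, 3, 5, 6, 7, 8, 9}
        // Inverting over [0, searchableDocCount): tail docs are excluded.
        System.out.println(not(matches, numDocs, searchableDocCount)); // {0, 2, 3, 5, 6}
    }
}
```

With this shape, a tail document is absent from both the match set and its inversion until the next refresh makes it searchable.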
Codecov Report
❌ Patch coverage is

Additional details and impacted files

Coverage Diff: master vs. #17880

| Metric | master | #17880 | +/- |
|---|---|---|---|
| Coverage | 63.25% | 63.26% | +0.01% |
| Complexity | 1481 | 1481 | |
| Files | 3190 | 3190 | |
| Lines | 192257 | 192261 | +4 |
| Branches | 29470 | 29471 | +1 |
| Hits | 121607 | 121632 | +25 |
| Misses | 61115 | 61093 | -22 |
| Partials | 9535 | 9536 | +1 |
shauryachats (Collaborator) approved these changes on Mar 14, 2026:
LGTM, thanks for the fix.
chenboat approved these changes on Mar 16, 2026
xiangfu0 pushed a commit to xiangfu0/pinot that referenced this pull request on Mar 16, 2026:
…ache#17880)
* [BugFix] Fix NOT TEXT_MATCH false positives on consuming segments
* add new Lucene index unit tests
Testing
Deployed the change to an internal cluster and compared the results of the same NOT TEXT_MATCH query before and after the fix. Validated that the false positives did not recur.
Before: [screenshot]
After: [screenshot]
Fixes #17809