
[BugFix] Fix NOT TEXT_MATCH false positives on consuming segments #17880

Merged
chenboat merged 2 commits into apache:master from heng-kuang-777:searchable-doc-fence-not-text-match on Mar 16, 2026

@heng-kuang-777 heng-kuang-777 commented Mar 13, 2026

Problem

On consuming segments, Lucene operates in near-realtime mode and recently ingested documents may not yet be visible to the IndexSearcher until the next SearcherManager refresh. When evaluating NOT TEXT_MATCH, the filter inversion was operating over [0, numDocs) — the full segment doc count — causing unindexed tail documents to appear as false positives.
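
To make the failure mode concrete, here is a small illustrative sketch (not code from this PR). It models doc IDs as bits in a `java.util.BitSet`; the class name, the doc counts, and the matching doc IDs are all made up for the example.

```java
import java.util.BitSet;

// Illustrative only: models doc IDs as bits. This is not Pinot's actual data structure.
final class NotTextMatchProblemDemo {
  /** Returns the complement of {@code matches} within the doc ID range [0, universeSize). */
  static BitSet invert(BitSet matches, int universeSize) {
    BitSet result = (BitSet) matches.clone();
    result.flip(0, universeSize);
    return result;
  }

  public static void main(String[] args) {
    int numDocs = 6;        // docs ingested into the consuming segment (hypothetical)
    int searchableDocs = 4; // docs visible to the searcher since the last refresh (hypothetical)

    // The searcher can only report matches among docs 0..3; docs 4 and 5 are not visible yet.
    BitSet matches = new BitSet();
    matches.set(1);
    matches.set(3);

    // Inverting over [0, numDocs) pulls in docs 4 and 5, whether or not they contain the term.
    System.out.println(invert(matches, numDocs));        // {0, 2, 4, 5}
    // Inverting over the searchable range leaves them out.
    System.out.println(invert(matches, searchableDocs)); // {0, 2}
  }
}
```

If docs 4 and 5 actually contain the search term, the first result reports them as non-matching, which is the false positive described above.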

Fix

Fix by introducing getSearchableDocCount() on TextIndexReader, which returns the number of documents currently visible to the Lucene searcher on realtime indexes (updated on each refresh), or -1 for offline/sealed segments where all docs are indexed. TextMatchFilterOperator now uses this count as the inversion universe instead of numDocs, so unindexed tail docs are excluded from NOT results.
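
A minimal sketch of how the inversion universe could be chosen from this count. This is an assumption-laden illustration, not the merged code: the `SearchableDocCounter` interface is a hypothetical stand-in for `TextIndexReader`, the `int` return type is assumed, and `inversionUniverse` is a made-up helper name.

```java
// Sketch only: a hypothetical stand-in for the relevant part of TextIndexReader.
final class InversionUniverseSketch {
  /** Hypothetical interface; the real method lives on TextIndexReader per the description. */
  interface SearchableDocCounter {
    /** Docs visible to the searcher on realtime indexes, or -1 when all docs are indexed. */
    int getSearchableDocCount();
  }

  /**
   * Picks the upper bound of the doc ID range [0, bound) used to invert TEXT_MATCH results.
   * A negative count is treated as "every doc is indexed", so numDocs is used.
   */
  static int inversionUniverse(SearchableDocCounter reader, int numDocs) {
    int searchable = reader.getSearchableDocCount();
    return searchable < 0 ? numDocs : searchable;
  }

  public static void main(String[] args) {
    // Sealed segment: -1 means fall back to the full doc count.
    System.out.println(inversionUniverse(() -> -1, 100)); // 100
    // Consuming segment: only 97 of 100 docs are visible to the searcher.
    System.out.println(inversionUniverse(() -> 97, 100)); // 97
  }
}
```

With this bound, docs ingested after the last refresh fall outside the inverted range and cannot show up in NOT results until the searcher can see them.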

Testing

Deployed the change to an internal cluster and compared results from the same NOT TEXT_MATCH query before and after the fix. With the fix in place, the false positives no longer appeared.

Before: (screenshot, 2026-03-13 at 1:20:21 PM)

After: (screenshot, 2026-03-13 at 1:20:34 PM)

Fixes #17809

codecov-commenter commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 63.26%. Comparing base (936196c) to head (892147d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...inot/segment/spi/index/reader/TextIndexReader.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17880      +/-   ##
============================================
+ Coverage     63.25%   63.26%   +0.01%     
  Complexity     1481     1481              
============================================
  Files          3190     3190              
  Lines        192257   192261       +4     
  Branches      29470    29471       +1     
============================================
+ Hits         121607   121632      +25     
+ Misses        61115    61093      -22     
- Partials       9535     9536       +1     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.21% <85.71%> (-0.01%) ⬇️ |
| java-21 | 63.23% <85.71%> (+0.02%) ⬆️ |
| temurin | 63.26% <85.71%> (+0.01%) ⬆️ |
| unittests | 63.26% <85.71%> (+0.01%) ⬆️ |
| unittests1 | 55.56% <57.14%> (+<0.01%) ⬆️ |
| unittests2 | 34.28% <28.57%> (+0.01%) ⬆️ |


@shauryachats shauryachats left a comment


LGTM, thanks for the fix.

@chenboat chenboat merged commit 7f4d28e into apache:master Mar 16, 2026
16 checks passed
xiangfu0 pushed a commit to xiangfu0/pinot that referenced this pull request Mar 16, 2026
…ache#17880)

* [BugFix] Fix NOT TEXT_MATCH false positives on consuming segments


* add new Lucene index unit tests
@xiangfu0 xiangfu0 added the text-search, bug, and real-time labels Mar 20, 2026

Labels

bug: Something is not working as expected
real-time: Related to realtime table ingestion and serving
text-search: Related to text/Lucene indexing and search


Development

Successfully merging this pull request may close these issues.

NOT(TEXT_MATCH) on consuming segments returns false positives for documents not yet indexed by Lucene

5 participants