Add JsonIndexDistinctOperator for index-based SELECT DISTINCT on JSON columns #17820

Merged
xiangfu0 merged 1 commit into apache:master from raghavyadav01:json-index-distinct-operator
Mar 18, 2026

@raghavyadav01 raghavyadav01 commented Mar 5, 2026

Summary

Add JsonIndexDistinctOperator — an index-only execution path for SELECT DISTINCT jsonExtractIndex(...) queries on columns with a JSON index. Instead of scanning documents through the projection/transform pipeline, the operator reads distinct values directly from the JSON index's value→docId map, avoiding per-doc evaluation entirely.

The operator is disabled by default and opt-in via a query option.

Key changes

  • JsonIndexDistinctOperator (pinot-core): New operator that reads the JSON index value→docId map directly, intersects with the filter bitmap, and populates a typed DistinctTable (supports INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, STRING). Handles defaultValue semantics and nullHandlingEnabled for docs where the JSON path is absent.
  • DistinctPlanNode: Routes to JsonIndexDistinctOperator when the query option is enabled and the expression is eligible.
  • QueryOptionsUtils / CommonConstants: New query option useIndexBasedDistinctOperator.
  • JsonIndexReader.isPathIndexed(): New default method so the operator can check whether a path is indexed (always true for OSS JSON index; selective for composite JSON index).
  • Integration tests (JsonPathTest): Validates baseline vs optimized results match, with and without filters, for both SSE and MSE.
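
The index-only idea in the list above can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the code in this PR: `IndexDistinctSketch`, `distinct`, and `docs` are hypothetical names, `java.util.BitSet` stands in for the bitmaps Pinot uses, and a plain `Map<String, BitSet>` stands in for the JSON index's value→docId map. It shows how distinct values can be produced by intersecting each value's doc-id set with the filter's doc-id set, without reading any document.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy model of index-based DISTINCT, not Pinot's implementation.
final class IndexDistinctSketch {
  private IndexDistinctSketch() {
  }

  /**
   * Returns up to {@code limit} distinct values in ascending order that occur in at
   * least one document accepted by the filter.
   *
   * @param valueToDocIds hypothetical stand-in for the JSON index's value -> docId map
   * @param filter doc ids that pass the WHERE clause, or null when there is no filter
   */
  static List<String> distinct(Map<String, BitSet> valueToDocIds, BitSet filter, int limit) {
    // A TreeMap iterates keys in sorted order, which models ORDER BY ... ASC.
    TreeMap<String, BitSet> sorted = new TreeMap<>(valueToDocIds);
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, BitSet> entry : sorted.entrySet()) {
      if (result.size() >= limit) {
        break;
      }
      BitSet docIds = entry.getValue();
      // A value qualifies when its doc-id set overlaps the filter; no document is read.
      boolean matches = (filter == null) ? !docIds.isEmpty() : docIds.intersects(filter);
      if (matches) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  /** Convenience helper that builds a doc-id set from a list of ids. */
  static BitSet docs(int... docIds) {
    BitSet bits = new BitSet();
    for (int docId : docIds) {
      bits.set(docId);
    }
    return bits;
  }
}
```

The cost of this approach scales with the number of distinct values at the path rather than with the number of matching documents, which is the motivation given in the summary.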

Usage

Enable via query option:

SET useIndexBasedDistinctOperator = true;

SELECT DISTINCT jsonExtractIndex(myJsonCol, '$.path.to.field', 'STRING')
FROM myTable
ORDER BY jsonExtractIndex(myJsonCol, '$.path.to.field', 'STRING')
LIMIT 1000;

Or per-query via REST API:

POST /query/sql
{
  "sql": "SELECT DISTINCT jsonExtractIndex(myJsonCol, '$.name', 'STRING') FROM myTable",
  "queryOptions": "useIndexBasedDistinctOperator=true"
}
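
For callers that issue the REST request from Java, a minimal sketch using the JDK's `java.net.http` API (Java 11+) might look like the following. The class and method names are hypothetical, the broker address `http://localhost:8099` in the usage note is an assumption about the deployment, and the JSON escaping covers only backslashes and double quotes. The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: builds the POST /query/sql request shown above without sending it.
final class DistinctQueryRequestSketch {
  private DistinctQueryRequestSketch() {
  }

  /** Minimal JSON string escaping: backslashes and double quotes only. */
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  /** Builds the JSON body with the opt-in query option from this PR. */
  static String payload(String sql) {
    return "{\"sql\": \"" + escape(sql) + "\", "
        + "\"queryOptions\": \"useIndexBasedDistinctOperator=true\"}";
  }

  /** Builds the HTTP request; brokerUrl is whatever address the broker listens on. */
  static HttpRequest request(String brokerUrl, String sql) {
    return HttpRequest.newBuilder(URI.create(brokerUrl + "/query/sql"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(payload(sql)))
        .build();
  }
}
```

A caller would pass the result of `DistinctQueryRequestSketch.request("http://localhost:8099", sql)` to `HttpClient.newHttpClient().send(...)` with a string body handler.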

Prerequisites

  • The column must have a JSON index configured in the table config:
{
  "tableName": "myTable_OFFLINE",
  "tableType": "OFFLINE",
  "fieldConfigList": [
    {
      "name": "myJsonCol",
      "encodingType": "RAW",
      "indexTypes": ["JSON"]
    }
  ]
}

Or via the legacy shorthand:

{
  "tableIndexConfig": {
    "jsonIndexColumns": ["myJsonCol"]
  }
}

Supported query patterns

| Pattern | Supported |
| --- | --- |
| `SELECT DISTINCT jsonExtractIndex(col, '$.path', 'STRING')` | Yes |
| `SELECT DISTINCT jsonExtractIndex(col, '$.path', 'INT')` | Yes (all SV types) |
| `SELECT DISTINCT jsonExtractIndex(col, '$.path', 'STRING', 'default')` | Yes (defaultValue) |
| `SELECT DISTINCT jsonExtractIndex(col, '$.path', 'STRING', 'default', '$.filter')` | Yes (with filter JSON path) |
| With WHERE clause filters | Yes |
| With ORDER BY | Yes |
| Multi-value (STRING_ARRAY, etc.) | No (falls back to baseline) |
| Multiple columns in SELECT DISTINCT | No (falls back to baseline) |
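
The supported and unsupported patterns above amount to an eligibility check followed by a fallback. The sketch below is one way to express that decision; it is not the logic in `DistinctPlanNode`, and `DistinctEligibilitySketch`, `useIndexOperator`, and its parameters are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: models the "index operator or baseline" decision from the table.
final class DistinctEligibilitySketch {
  // Single-value result types listed as supported in the PR description.
  private static final Set<String> SUPPORTED_TYPES =
      new HashSet<>(Arrays.asList("INT", "LONG", "FLOAT", "DOUBLE", "BIG_DECIMAL", "STRING"));

  private DistinctEligibilitySketch() {
  }

  /**
   * @param optionEnabled whether useIndexBasedDistinctOperator=true was set on the query
   * @param selectFunctionNames function name of each SELECT DISTINCT expression
   * @param resultType the results type argument passed to jsonExtractIndex
   * @param pathIndexed whether the JSON index covers the requested path
   * @return true to use the index-based operator, false to fall back to the baseline
   */
  static boolean useIndexOperator(boolean optionEnabled, List<String> selectFunctionNames,
      String resultType, boolean pathIndexed) {
    // Opt-in only, and only for a single expression in SELECT DISTINCT.
    if (!optionEnabled || selectFunctionNames.size() != 1) {
      return false;
    }
    if (!"jsonExtractIndex".equalsIgnoreCase(selectFunctionNames.get(0))) {
      return false;
    }
    // Multi-value types such as STRING_ARRAY are not in the set, so they fall back.
    return pathIndexed && SUPPORTED_TYPES.contains(resultType.toUpperCase(Locale.ROOT));
  }
}
```

Returning `false` in every unsupported case keeps the optimization safe to enable: the query still runs through the existing scan-based path.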

Performance

For SSE queries, numEntriesScannedPostFilter = 0 — the operator reads entirely from the index without scanning any documents.

Test plan

  • Integration tests validate optimized results match baseline (with and without filters, SSE and MSE)
  • Integration tests verify numEntriesScannedPostFilter = 0 for SSE
  • Existing JsonExtractIndexTransformFunctionTest unit tests pass (31/31)
  • Verify defaultValue semantics with docs that have missing JSON paths
  • Verify null handling when SET enableNullHandling = true and JSON path is absent
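
The "optimized matches baseline" check in the test plan can be pictured as a set comparison, since DISTINCT without ORDER BY does not promise an order. The helper below is a hypothetical sketch of that comparison, not code from `JsonPathTest`; it also rejects duplicates on either side, because a DISTINCT result should contain none.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: order-insensitive comparison of two DISTINCT result columns.
final class ResultComparisonSketch {
  private ResultComparisonSketch() {
  }

  static boolean sameDistinctValues(List<String> baseline, List<String> optimized) {
    Set<String> baselineSet = new HashSet<>(baseline);
    Set<String> optimizedSet = new HashSet<>(optimized);
    // A size mismatch between list and set means the list contained duplicates.
    boolean noDuplicates =
        baselineSet.size() == baseline.size() && optimizedSet.size() == optimized.size();
    return noDuplicates && baselineSet.equals(optimizedSet);
  }
}
```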

🤖 Generated with Claude Code

@raghavyadav01 raghavyadav01 force-pushed the json-index-distinct-operator branch from 9fe3f76 to a986d91 Compare March 5, 2026 05:22
@raghavyadav01 raghavyadav01 changed the title from "Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator" to "[DRAFT]: Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator" Mar 5, 2026

codecov-commenter commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 0.41841% with 238 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.18%. Comparing base (00cd0e9) to head (c2dd0bb).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...core/operator/query/JsonIndexDistinctOperator.java | 0.00% | 232 Missing ⚠️ |
| ...a/org/apache/pinot/core/plan/DistinctPlanNode.java | 0.00% | 4 Missing and 1 partial ⚠️ |
| ...inot/segment/spi/index/reader/JsonIndexReader.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17820      +/-   ##
============================================
- Coverage     63.22%   63.18%   -0.05%     
  Complexity     1481     1481              
============================================
  Files          3190     3191       +1     
  Lines        192312   192551     +239     
  Branches      29475    29528      +53     
============================================
+ Hits         121591   121657      +66     
- Misses        61193    61365     +172     
- Partials       9528     9529       +1     
| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.16% <0.41%> (-0.04%) ⬇️ |
| java-21 | 63.13% <0.41%> (-0.08%) ⬇️ |
| temurin | 63.18% <0.41%> (-0.05%) ⬇️ |
| unittests | 63.17% <0.41%> (-0.05%) ⬇️ |
| unittests1 | 55.49% <0.41%> (-0.06%) ⬇️ |
| unittests2 | 34.22% <0.00%> (-0.04%) ⬇️ |

Copilot AI left a comment

Pull request overview

This PR adds two new index-based distinct operators (JsonIndexDistinctOperator and InvertedIndexDistinctOperator) that avoid the scan-based projection pipeline for SELECT DISTINCT queries by reading values directly from JSON or inverted indexes. Both operators are disabled by default and opt-in via query options (useJsonIndexDistinct, useInvertedIndexDistinct, or the umbrella useIndexBasedDistinctOperator).

Changes:

  • Two new operators (JsonIndexDistinctOperator, InvertedIndexDistinctOperator) and their integration into DistinctPlanNode's operator selection logic, plus query option plumbing in CommonConstants and QueryOptionsUtils.
  • Integration tests in JsonPathTest and OfflineClusterIntegrationTest validating correctness and index-only execution stats for both operators.
  • A new isPathIndexed default method on JsonIndexReader SPI interface, and an unrelated change to MultiStageWithoutStatsIntegrationTest.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| CommonConstants.java | Adds three new query option keys for index-based distinct |
| QueryOptionsUtils.java | Adds parsing methods for the new query options with umbrella fallback |
| JsonIndexReader.java | Adds isPathIndexed default method for path indexing check |
| DistinctPlanNode.java | Integrates new operators into the plan selection logic |
| JsonIndexDistinctOperator.java | New operator using JSON index value→docId map for DISTINCT |
| InvertedIndexDistinctOperator.java | New operator using inverted index dictId→docIds for DISTINCT |
| JsonPathTest.java | Integration tests for JsonIndexDistinctOperator |
| OfflineClusterIntegrationTest.java | Integration tests for InvertedIndexDistinctOperator |
| MultiStageWithoutStatsIntegrationTest.java | Unrelated change replacing enum reference with string literal |

@raghavyadav01 raghavyadav01 force-pushed the json-index-distinct-operator branch from bc0d213 to 57dfc5b Compare March 12, 2026 00:58
@xiangfu0 xiangfu0 force-pushed the json-index-distinct-operator branch from 57dfc5b to a69679e Compare March 12, 2026 19:41
@raghavyadav01 raghavyadav01 force-pushed the json-index-distinct-operator branch from a69679e to 92561b5 Compare March 14, 2026 18:53
@raghavyadav01 raghavyadav01 changed the title from "[DRAFT]: Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator" to "Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator" Mar 14, 2026
@xiangfu0 xiangfu0 force-pushed the json-index-distinct-operator branch 3 times, most recently from b301053 to 1930742 Compare March 18, 2026 00:36
@xiangfu0 xiangfu0 requested a review from Copilot March 18, 2026 00:38

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


@xiangfu0 xiangfu0 force-pushed the json-index-distinct-operator branch from 73bd3cf to f1fcf4a Compare March 18, 2026 03:25
@xiangfu0 xiangfu0 changed the title from "Adding JsonIndexDistinctOperator and InvertedIndexDistinctOperator" to "Add JsonIndexDistinctOperator for index-based SELECT DISTINCT on JSON columns" Mar 18, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.


@xiangfu0 xiangfu0 force-pushed the json-index-distinct-operator branch 2 times, most recently from 6b18472 to 0eb69d1 Compare March 18, 2026 03:51

@xiangfu0 xiangfu0 left a comment

lgtm

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


… columns

Introduce a new operator that resolves SELECT DISTINCT jsonExtractIndex(...)
queries directly from the JSON index's value-to-docId map, bypassing the
document scan and projection/transform pipeline entirely. This is opt-in
via the query option `SET useIndexBasedDistinctOperator=true`.

Key changes:
- JsonIndexDistinctOperator reads distinct values from the JSON index with
  support for typed distinct tables (INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL,
  STRING), ORDER BY, LIMIT, filter pushdown, defaultValue, and null handling
- DistinctPlanNode routes to JsonIndexDistinctOperator when the query option
  is enabled and a single jsonExtractIndex expression has a backing JSON index
- JsonIndexReader.isPathIndexed() default method for path availability checks
- QueryOptionsUtils helpers and USE_INDEX_BASED_DISTINCT_OPERATOR constant
- Integration tests in JsonPathTest verifying correctness against baseline,
  filter support, defaultValue handling, and zero numEntriesScannedPostFilter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@xiangfu0 xiangfu0 force-pushed the json-index-distinct-operator branch from 81a4816 to c2dd0bb Compare March 18, 2026 08:45

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@xiangfu0 xiangfu0 requested a review from Copilot March 18, 2026 19:07
@xiangfu0 xiangfu0 merged commit 19acc5d into apache:master Mar 18, 2026
18 checks passed
@xiangfu0 xiangfu0 deleted the json-index-distinct-operator branch March 18, 2026 19:15

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


Comment on lines +185 to +193
if (parsed._defaultValue != null) {
  addValueToDistinctTable(distinctTable, parsed._defaultValue, parsed._dataType, orderByExpression);
} else if (_queryContext.isNullHandlingEnabled()) {
  distinctTable.addNull();
} else {
  throw new RuntimeException(
      String.format("Illegal Json Path: [%s], for some docIds in segment [%s]",
          parsed._jsonPathString, _indexSegment.getSegmentName()));
}
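
The three-way branch in the excerpt above (default value, then null handling, then error) can be exercised in isolation with a small stand-alone sketch. Everything here is hypothetical scaffolding: `MissingPathSketch` and `addForMissingPath` are illustrative names, and a plain `Set<String>` that permits `null` stands in for the typed `DistinctTable`.

```java
import java.util.Set;

// Illustrative only: what to emit for documents where the JSON path is absent.
final class MissingPathSketch {
  private MissingPathSketch() {
  }

  /**
   * @param distinctValues stand-in for the distinct table; must permit null elements
   * @param defaultValue the defaultValue argument of jsonExtractIndex, or null if omitted
   * @param nullHandlingEnabled whether null handling is enabled for the query
   * @param jsonPath the JSON path, used only for the error message
   */
  static void addForMissingPath(Set<String> distinctValues, String defaultValue,
      boolean nullHandlingEnabled, String jsonPath) {
    if (defaultValue != null) {
      // An explicit default takes precedence over null handling.
      distinctValues.add(defaultValue);
    } else if (nullHandlingEnabled) {
      distinctValues.add(null);
    } else {
      // Neither a default nor null handling: mirror the excerpt and fail the query.
      throw new RuntimeException(
          String.format("Illegal Json Path: [%s], for some docIds", jsonPath));
    }
  }
}
```

Passing a `java.util.LinkedHashSet` works for all three branches, since it accepts a `null` element.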