feat: Add native support for max_by and min_by by pchintar · Pull Request #3969 · apache/datafusion-comet

pchintar · 2026-04-16T23:11:34Z

Which issue does this PR close?

Closes #3841.

Rationale for this change

This change adds native support for MAX_BY and MIN_BY.

These aggregates are commonly used in grouped queries. Without native support, they fall back to Spark, which prevents execution from staying within Comet’s aggregation pipeline. This change lets them run natively with behavior that matches Spark.

What changes are included in this PR?

- Added a native implementation for max_by and min_by (maxmin_by.rs) using a shared design
  - maintains the current best (value, ordering) pair per group
  - updates and merges state using ordering comparison
  - single-pass execution with constant state per group
- Implemented GroupsAccumulator support to integrate with Comet’s grouped aggregation path
  - avoids scalar accumulation and per-row overhead
  - includes specialized handling for primitive, byte/string, and struct ordering types, with a row-based fallback for general cases
  - enables execution through CometHashAggregate for grouped workloads
- Added serialization and planner wiring
  - proto definitions for MaxBy / MinBy
  - Spark-side serde (CometMaxBy, CometMinBy)
  - registration in QueryPlanSerde
  - planner support to construct the native aggregate
- Extended operator support
  - enabled execution under HashAggregate
  - added support for SortAggregateExec where selected by Spark
  - ensured both partial and final aggregation stages execute natively
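
To make the shared design above concrete, here is a minimal sketch of a per-group state that keeps the current best (value, ordering) pair and uses one comparison rule for both update and merge. It is an illustration only, not the code in maxmin_by.rs: the `BestPair` type, its `is_max` flag, and its method names are hypothetical, and the rule that rows with a NULL ordering are skipped is an assumption about the intended Spark-compatible behavior.

```rust
use std::cmp::Ordering;

/// Hypothetical per-group state: the best (value, ordering) pair seen so far.
/// `best == None` means no row with a non-null ordering has been seen yet.
#[derive(Debug, Clone)]
struct BestPair<V, O> {
    best: Option<(Option<V>, O)>,
    is_max: bool, // true for max_by, false for min_by
}

impl<V: Clone, O: Ord + Clone> BestPair<V, O> {
    fn new(is_max: bool) -> Self {
        Self { best: None, is_max }
    }

    /// Fold one input row into the state.
    /// Assumptions: a NULL ordering is skipped, and an equal ordering
    /// replaces the stored pair so that the later row wins ties.
    fn update(&mut self, value: Option<V>, ordering: Option<O>) {
        let Some(ordering) = ordering else { return };
        let replace = match &self.best {
            None => true,
            Some((_, current)) => match ordering.cmp(current) {
                Ordering::Equal => true,
                Ordering::Greater => self.is_max,
                Ordering::Less => !self.is_max,
            },
        };
        if replace {
            self.best = Some((value, ordering));
        }
    }

    /// Combine a partial state (for example from the partial aggregation
    /// stage) into this one, reusing the same rule as `update`.
    fn merge(&mut self, other: &Self) {
        if let Some((value, ordering)) = &other.best {
            self.update(value.clone(), Some(ordering.clone()));
        }
    }

    /// Final result: the value paired with the best ordering, which may be NULL.
    fn evaluate(&self) -> Option<V> {
        self.best.as_ref().and_then(|(value, _)| value.clone())
    }
}
```

Because `merge` goes through the same comparison as `update`, the partial and final stages cannot disagree on ties, and the state stays a single pair per group regardless of input size.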

How are these changes tested?

Validated against Spark for:

- grouped and non-grouped queries
- null handling (both ordering and value)
- tie behavior (equal ordering selects the latest value)
- struct ordering
- both hash and sort aggregate plans
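
The grouped, null, and tie cases listed above can be pictured with a small reference model of grouped max_by over flat columns. This is a sketch for reasoning about expected results, not the PR's test code: the function `max_by_grouped` and its signature are hypothetical, and skipping rows whose ordering is NULL is an assumption about the Spark semantics being matched.

```rust
/// Hypothetical reference model of grouped max_by.
/// `group_indices[i]` is the group that row `i` belongs to.
fn max_by_grouped(
    group_indices: &[usize],
    values: &[Option<i64>],
    orderings: &[Option<i64>],
    num_groups: usize,
) -> Vec<Option<i64>> {
    // One slot per group holding the (value, ordering) of the current best row.
    let mut state: Vec<Option<(Option<i64>, i64)>> = vec![None; num_groups];
    for ((&group, &value), &ordering) in group_indices.iter().zip(values).zip(orderings) {
        // Assumed: rows with a NULL ordering never become the best row.
        let Some(ordering) = ordering else { continue };
        let replace = match state[group] {
            None => true,
            // `>=` rather than `>` so that the later row wins on equal ordering.
            Some((_, best)) => ordering >= best,
        };
        if replace {
            state[group] = Some((value, ordering));
        }
    }
    // The result is the stored value, which may itself be NULL.
    state.into_iter().map(|slot| slot.and_then(|(value, _)| value)).collect()
}
```

With this model, a group whose two highest orderings are equal returns the value from the later row, a group whose best row has a NULL value returns NULL, and a group with only NULL orderings returns NULL.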

Results match Spark semantics, and supported queries execute without any fallback.

Note: I've made the two required changes: 1) updating the title to the correct format and 2) adding the license header to maxmin_by.rs.

pchintar · 2026-04-17T15:11:48Z

Hi, I understand the issue with the formatting of the Rust files, so I ran cargo fmt --all to address the lint failure. But I don't know what's causing the few remaining errors?

coderfender · 2026-04-19T16:32:48Z

@pchintar, you will want to download the logs, grep for FAILED to see which Spark tests failed, and fix the code accordingly.

pchintar added 2 commits April 16, 2026 18:39

Add native support for max_by and min_by (HashAggregate + SortAggregate)

c5a6965

chore: add license header for maxmin_by

4d356ef

pchintar changed the title ~~Add native support for max_by and min_by~~ feat: Add native support for max_by and min_by Apr 17, 2026

Merge branch 'main' into maxmin-by-native

c4e0e94

pchintar wants to merge 3 commits into apache:main from pchintar:maxmin-by-native
