feat: support `collect_set` by comphead · Pull Request #3954 · apache/datafusion-comet

comphead · 2026-04-15T23:46:48Z

Which issue does this PR close?

Closes #2525
Closes #3951

Rationale for this change

What changes are included in this PR?

How are these changes tested?

# Which issue does this PR close?

- Closes #NNN.

# Rationale for this change

Originally came from apache/datafusion-comet#3954. The error message hides the CAST target `data_type`:

```
Cause: org.apache.comet.CometNativeException: External error: Arrow error: Cast error: Cannot cast list to non-list data types
```

# What changes are included in this PR?

# Are these changes tested?

# Are there any user-facing changes?

mbutrovich · 2026-04-20T14:11:07Z

Thanks @comphead for adding this! Here's some feedback:

Schema patch (`adjustOutputForNativeState`)

The BinaryType-to-ArrayType schema correction for ObjectHashAggregateExec partial mode is interesting. If collect_list or another TypedImperativeAggregate gets added natively in the future, it'll need a case here too. Might be worth a comment so the next person knows to update this method.
The modes != Seq(Partial) early return assumes uniform modes across all aggregate expressions. A brief comment explaining that assumption would help readability.

NaN handling

The Incompatible marking for floating-point types with strictFloatingPoint=true looks correct. Comet's DistinctArrayAggAccumulator deduplicates NaN (since ScalarValue treats NaN == NaN) while Spark does not. The expect_fallback tests for float/double confirm this works.
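The NaN difference described above can be illustrated with a minimal Rust sketch. It is not DataFusion's or Spark's code: both function names are hypothetical, and hashing the bit pattern is just one assumed way to get "NaN equals NaN" behavior. The first function collapses all NaNs into one entry, the second uses IEEE 754 `==` and so keeps every NaN.

```rust
use std::collections::HashSet;

/// Deduplicate with "NaN equals NaN" semantics, the behavior the review
/// attributes to ScalarValue. Keying on the bit pattern is an assumption
/// made for this sketch, not DataFusion's implementation.
fn distinct_nan_equal(values: &[f64]) -> Vec<f64> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut out = Vec::new();
    for &v in values {
        // Collapse every NaN payload onto one canonical bit pattern.
        let key = if v.is_nan() { f64::NAN.to_bits() } else { v.to_bits() };
        if seen.insert(key) {
            out.push(v);
        }
    }
    out
}

/// Deduplicate with IEEE 754 `==`, under which NaN never equals NaN,
/// so every NaN in the input survives.
fn distinct_ieee(values: &[f64]) -> Vec<f64> {
    let mut out: Vec<f64> = Vec::new();
    for &v in values {
        if !out.iter().any(|&kept| kept == v) {
            out.push(v);
        }
    }
    out
}
```

For the input `[1.0, NaN, 2.0, NaN, 1.0]`, the first function keeps a single NaN and the second keeps both. The sketch makes no claim about how either engine treats `+0` versus `-0`; the two helpers happen to differ there as well.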

Docs

docs/source/user-guide/latest/expressions.md has an aggregate expressions table (line ~196) that lists all supported aggregates but doesn't include CollectSet yet. Would be good to add a row there.

Tests

Great coverage across types (bool, byte, short, int, bigint, float, double, string, binary, decimal, date, timestamp) plus NaN/Inf/+0/-0 edge cases and the dictionary encoding config matrix.

A couple of suggestions:

Maybe add a test for collect_set(DISTINCT col). It's semantically redundant but exercises a different planner path.
Could also consider a HAVING clause test, though less critical.

Benchmarks

The PR doesn't include benchmark results. Since the underlying DistinctArrayAggAccumulator does per-row ScalarValue::try_from_array and hashes into HashSet<ScalarValue>, it would be helpful to see numbers confirming native collect_set is faster than Spark's codegen fallback. Even a quick microbenchmark would give confidence.

Performance (not blocking, future opportunity)

The DistinctArrayAggAccumulator in DataFusion doesn't yet have a GroupsAccumulator implementation, so it takes the per-row accumulator path. Neil Conway has been doing a series of aggregate optimizations upstream (e.g., apache/datafusion#20504 making array_agg 190x faster via deferred materialization, apache/datafusion#20538 using hashbrown for array_distinct). Applying similar patterns to DistinctArrayAggAccumulator in DataFusion would benefit this code automatically. Worth filing an upstream issue if benchmarks show room for improvement.
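To make the distinction between the per-row accumulator path and a grouped path concrete, here is a rough Rust sketch. It uses only the standard library, and every type and function name in it is hypothetical. It does not reproduce DataFusion's `Accumulator` or `GroupsAccumulator` traits, and `i64` stands in for `ScalarValue`. The first variant keeps one accumulator object per group and feeds it one value at a time. The second keeps a single flat set keyed by `(group, value)` and consumes a whole batch in one call.

```rust
use std::collections::HashSet;

/// Hypothetical per-group accumulator: one HashSet per group.
#[derive(Default)]
struct PerGroupDistinct {
    seen: HashSet<i64>,
}

impl PerGroupDistinct {
    fn update_one(&mut self, value: i64) {
        self.seen.insert(value);
    }

    fn evaluate(self) -> Vec<i64> {
        let mut out: Vec<i64> = self.seen.into_iter().collect();
        out.sort_unstable(); // sorted only to make the output deterministic
        out
    }
}

/// Per-row path: look up the group's accumulator and update it row by row.
fn run_per_group(values: &[i64], groups: &[usize], num_groups: usize) -> Vec<Vec<i64>> {
    let mut accs: Vec<PerGroupDistinct> =
        (0..num_groups).map(|_| PerGroupDistinct::default()).collect();
    for (&g, &v) in groups.iter().zip(values) {
        accs[g].update_one(v);
    }
    accs.into_iter().map(|a| a.evaluate()).collect()
}

/// Hypothetical grouped accumulator: a single set keyed by (group, value).
#[derive(Default)]
struct GroupedDistinct {
    seen: HashSet<(usize, i64)>,
}

impl GroupedDistinct {
    /// Consume a whole batch of values and their group indices in one call.
    fn update_batch(&mut self, values: &[i64], groups: &[usize]) {
        for (&g, &v) in groups.iter().zip(values) {
            self.seen.insert((g, v));
        }
    }

    fn evaluate(&self, num_groups: usize) -> Vec<Vec<i64>> {
        let mut out = vec![Vec::new(); num_groups];
        for &(g, v) in &self.seen {
            out[g].push(v);
        }
        for list in &mut out {
            list.sort_unstable();
        }
        out
    }
}
```

Both variants return the same distinct values per group; the grouped form avoids allocating one hash set per group. Whether that helps in practice is what the benchmarks requested above would need to show.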

feat: support collect_set

9ef01cb

comphead mentioned this pull request Apr 17, 2026

chore: Refine the error message for List to non List cast apache/arrow-rs#9757

Merged

comphead force-pushed the native_datafusion branch from 6d7dae9 to 9ef01cb Compare April 17, 2026 20:59

feat: support collect_set

868fe02

comphead changed the title ~~feat: support collect_set WIP~~ feat: support collect_set Apr 18, 2026

comphead marked this pull request as ready for review April 18, 2026 21:48

comphead wants to merge 2 commits into apache:main from comphead:native_datafusion
