bench: Scale sort benchmarks to 1M rows to exercise merge path#21630

Merged
mbutrovich merged 3 commits into apache:main from mbutrovich:sort_benchmark
Apr 15, 2026

Conversation

@mbutrovich
Contributor

@mbutrovich mbutrovich commented Apr 14, 2026

Which issue does this PR close?

Rationale for this change

Current sort benchmarks use 100K rows across 8 partitions (~12.5K rows per partition, ~100KB for integers). This falls below the sort_in_place_threshold_bytes (1MB), so the "sort partitioned" benchmarks always take the concat-and-sort-in-place path and never exercise the sort-then-merge path that dominates real workloads.
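The partition-size arithmetic behind this rationale can be sketched as follows. This is a minimal illustration, not DataFusion code; the threshold value here assumes ~1 MB (the text's stated `sort_in_place_threshold_bytes` default, rounded to decimal for the comparison), and `per_partition_bytes` is a hypothetical helper:

```rust
// Assumed ~1 MB threshold; the exact default in DataFusion may differ slightly.
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1_000_000;

// Rough per-partition data size: rows per partition times bytes per row.
fn per_partition_bytes(total_rows: usize, partitions: usize, bytes_per_row: usize) -> usize {
    (total_rows / partitions) * bytes_per_row
}

fn main() {
    // 100K rows / 8 partitions * 8 bytes (i64) = ~100KB per partition:
    // below the threshold, so the concat-and-sort-in-place path is taken.
    let small = per_partition_bytes(100_000, 8, 8);
    assert!(small < SORT_IN_PLACE_THRESHOLD_BYTES);

    // 1M rows / 8 partitions * 8 bytes = ~1MB per partition:
    // at the threshold, so the sort-then-merge path is exercised.
    let large = per_partition_bytes(1_000_000, 8, 8);
    assert!(large >= SORT_IN_PLACE_THRESHOLD_BYTES);

    println!("small = {small} bytes, large = {large} bytes");
}
```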

What changes are included in this PR?

Parameterizes the sort benchmark on input size, running each case at both 100K rows (existing) and 1M rows (new). At 1M rows, each partition holds ~125K rows (~1MB for integers), which exercises the merge path.

  • INPUT_SIZE constant replaced with INPUT_SIZES array: [(100_000, "100k"), (1_000_000, "1M")]
  • DataGenerator takes input_size as a constructor parameter
  • All stream generator functions accept input_size
  • Benchmark names include size label (e.g. sort partitioned i64 100k, sort partitioned i64 1M)
  • Data distribution and cardinality ratios are preserved across sizes
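The parameterization described above might look roughly like this sketch. The `DataGenerator` and Criterion wiring are omitted, and `bench_name` is a hypothetical helper, not the actual benchmark source:

```rust
// Each (row count, label) pair drives one benchmark case.
const INPUT_SIZES: [(usize, &str); 2] = [(100_000, "100k"), (1_000_000, "1M")];

// Hypothetical helper: attach the size label to a benchmark name.
fn bench_name(op: &str, label: &str) -> String {
    format!("{op} {label}")
}

fn main() {
    for (input_size, label) in INPUT_SIZES {
        let name = bench_name("sort partitioned i64", label);
        // In the real benchmark, input_size would be passed to the
        // data generator and stream builders for this case.
        println!("{name}: {input_size} rows");
    }
}
```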

Are these changes tested?

Benchmark compiles and runs. No functional test changes.

Are there any user-facing changes?

No.

…ently takes the <1MB sort in-place path of the existing ExternalSorter. We want to see larger cases too.
@mbutrovich
Contributor Author

Running locally, each 10M iteration takes about 30 seconds. I have to use --sample-size 10 to keep it manageable. Not sure if we think that's a problem.

@mbutrovich
Contributor Author

Let me think about whether 1M would be a more reasonable size that still hits what we want.

@Dandandan
Contributor

Let me think about whether 1M would be a more reasonable size that still hits what we want.

Sounds good - let's try that first?

@mbutrovich mbutrovich changed the title bench: Scale sort benchmarks to 10M rows to exercise merge path bench: Scale sort benchmarks to 1M rows to exercise merge path Apr 15, 2026
@mbutrovich mbutrovich added this pull request to the merge queue Apr 15, 2026
Merged via the queue into apache:main with commit d0692b8 Apr 15, 2026
33 of 34 checks passed
@mbutrovich mbutrovich deleted the sort_benchmark branch April 15, 2026 14:55

Labels

core Core DataFusion crate

Projects

None yet

Development


2 participants