bench: Scale sort benchmarks to 1M rows to exercise merge path#21630

Merged
mbutrovich merged 3 commits into apache:main from mbutrovich:sort_benchmark
Apr 15, 2026

Conversation

@mbutrovich
Contributor

@mbutrovich mbutrovich commented Apr 14, 2026

Which issue does this PR close?

Rationale for this change

Current sort benchmarks use 100K rows across 8 partitions (~12.5K rows per partition, ~100KB for integers). This falls below the sort_in_place_threshold_bytes (1MB), so the "sort partitioned" benchmarks always take the concat-and-sort-in-place path and never exercise the sort-then-merge path that dominates real workloads.
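The partition-size arithmetic behind this rationale can be sketched as follows. This is a minimal illustration, not DataFusion code; the threshold value here assumes ~1 MB (the text's stated `sort_in_place_threshold_bytes` default, rounded to decimal for the comparison), and `per_partition_bytes` is a hypothetical helper:

```rust
// Assumed ~1 MB threshold; the exact default in DataFusion may differ slightly.
const SORT_IN_PLACE_THRESHOLD_BYTES: usize = 1_000_000;

// Rough per-partition data size: rows per partition times bytes per row.
fn per_partition_bytes(total_rows: usize, partitions: usize, bytes_per_row: usize) -> usize {
    (total_rows / partitions) * bytes_per_row
}

fn main() {
    // 100K rows / 8 partitions * 8 bytes (i64) = ~100KB per partition:
    // below the threshold, so the concat-and-sort-in-place path is taken.
    let small = per_partition_bytes(100_000, 8, 8);
    assert!(small < SORT_IN_PLACE_THRESHOLD_BYTES);

    // 1M rows / 8 partitions * 8 bytes = ~1MB per partition:
    // at the threshold, so the sort-then-merge path is exercised.
    let large = per_partition_bytes(1_000_000, 8, 8);
    assert!(large >= SORT_IN_PLACE_THRESHOLD_BYTES);

    println!("small = {small} bytes, large = {large} bytes");
}
```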

What changes are included in this PR?

Parameterizes the sort benchmark on input size, running each case at both 100K rows (existing) and 1M rows (new). At 1M rows, each partition holds ~125K rows (~1MB for integers), which exercises the merge path.

  • INPUT_SIZE constant replaced with INPUT_SIZES array: [(100_000, "100k"), (1_000_000, "1M")]
  • DataGenerator takes input_size as a constructor parameter
  • All stream generator functions accept input_size
  • Benchmark names include size label (e.g. sort partitioned i64 100k, sort partitioned i64 1M)
  • Data distribution and cardinality ratios are preserved across sizes
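The parameterization described above might look roughly like this sketch. The `DataGenerator` and Criterion wiring are omitted, and `bench_name` is a hypothetical helper, not the actual benchmark source:

```rust
// Each (row count, label) pair drives one benchmark case.
const INPUT_SIZES: [(usize, &str); 2] = [(100_000, "100k"), (1_000_000, "1M")];

// Hypothetical helper: attach the size label to a benchmark name.
fn bench_name(op: &str, label: &str) -> String {
    format!("{op} {label}")
}

fn main() {
    for (input_size, label) in INPUT_SIZES {
        let name = bench_name("sort partitioned i64", label);
        // In the real benchmark, input_size would be passed to the
        // data generator and stream builders for this case.
        println!("{name}: {input_size} rows");
    }
}
```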

Are these changes tested?

Benchmark compiles and runs. No functional test changes.

Are there any user-facing changes?

No.

…ently takes the <1MB sort in-place path of the existing ExternalSorter. We want to see larger cases too.
@mbutrovich
Contributor Author

Running locally, each 10M iteration takes about 30 seconds. I have to use --sample-size 10 to keep it manageable. Not sure if we think that's a problem.

@mbutrovich
Contributor Author

Let me think about whether 1M would be a more reasonable size that still hits what we want.

@Dandandan
Contributor

Let me think about whether 1M would be a more reasonable size that still hits what we want.

Sounds good - let's try that first?

@mbutrovich mbutrovich changed the title bench: Scale sort benchmarks to 10M rows to exercise merge path bench: Scale sort benchmarks to 1M rows to exercise merge path Apr 15, 2026
@mbutrovich mbutrovich added this pull request to the merge queue Apr 15, 2026
Merged via the queue into apache:main with commit d0692b8 Apr 15, 2026
33 of 34 checks passed
@mbutrovich mbutrovich deleted the sort_benchmark branch April 15, 2026 14:55

Labels

core Core DataFusion crate

Projects

None yet

Development


2 participants