fix: generate scrambled benchmark data with correct per-file RG ranges #21711
zhuqi-lucas wants to merge 1 commit into apache:main
Conversation
Pull request overview
Updates the sort pushdown inexact/overlap benchmark data generator so it actually produces parquet row groups that are both (a) overlapping and (b) out of order, making reorder_by_statistics do meaningful work (matching the streaming / delayed-chunk scenario described in #21580 and its follow-ups).
Changes:
- Changes overlap dataset generation to order by a deterministic chunk permutation key plus a jittered `l_orderkey`, producing scrambled + overlapping RGs.
- Updates benchmark script messaging/comments to reflect the new generation strategy.
Force-pushed: ec93c3b to 8fb395a
Addressed both Copilot comments. The ORDER BY + jitter/scramble approach was fundamentally flawed: the root cause was that the parquet writer merges rows from adjacent chunks into the same RG at chunk boundaries, widening RG ranges to ~3M (instead of ~100K), which made `reorder_by_statistics` a no-op. New approach: write a single sorted file with 100K-row RGs, then use pyarrow to split it into per-RG files with scrambled names. Each file has a narrow range and the files are in scrambled order, so reordering actually works. A local benchmark shows a ~21% improvement on Q1 with the corrected data.
Both the inexact and overlap benchmark data generation had the same fundamental problem: writing a single parquet file with ORDER BY scramble/jitter causes the parquet writer to merge rows from adjacent chunks into the same RG at chunk boundaries, widening RG ranges to span the full data range (~6M instead of ~100K). This makes `reorder_by_statistics` a no-op: there's nothing meaningful to reorder.

Fix both by using a two-step approach:
1. Write a single sorted file with small (100K-row) RGs.
2. Use pyarrow to split it into per-RG files with scrambled names.

Each file has a narrow l_orderkey range (~100K) but appears in scrambled alphabetical order, so `reorder_by_statistics` has real work to do.

Local benchmark results (feature branch vs upstream/main):

inexact:
- Q1 (DESC LIMIT 100): 4.99ms -> 4.27ms (-14%)
- Q2 (DESC LIMIT 1000): 2.54ms -> 2.26ms (-11%)

overlap:
- Q1 (DESC LIMIT 100): 5.03ms -> 3.97ms (-21%)
- Q4 (SELECT * LIMIT 1000): 5.25ms -> 4.90ms (-7%)
Force-pushed: 8fb395a to 6aa52df
Which issue does this PR close?
Related to #21580
Rationale for this change
Both the `inexact` and `overlap` benchmark data generation had the same fundamental problem: writing a single parquet file with `ORDER BY` scramble/jitter causes the parquet writer to merge rows from adjacent chunks into the same RG at chunk boundaries. This widens RG ranges to span the full data range (~6M instead of ~100K), making `reorder_by_statistics` a no-op: there's nothing meaningful to reorder.

For the overlap benchmark specifically, the jitter formula also preserved ascending RG order by min values, so there was nothing to reorder even without the boundary-merging issue.
What changes are included in this PR?
Replace the single-file `ORDER BY` approach for both `data_sort_pushdown_inexact()` datasets with a two-step strategy:
1. Write a single sorted file with small row groups, so each RG covers a narrow `l_orderkey` range.
2. Use pyarrow to split it into per-RG files with scrambled names, breaking `l_orderkey` order.

Each file has a narrow range (~100K) but appears in scrambled order, so `reorder_by_statistics` has real work to do.

Local benchmark results (reorder feature branch vs upstream/main):
Are these changes tested?

Benchmark data generation change only. Verified locally by regenerating the datasets and confirming that `reorder_by_statistics` now has meaningful work to do.

Are there any user-facing changes?
No. Adds pyarrow as a dependency for generating these specific benchmark datasets (`pip install pyarrow`).