(Test) Advanced adaptive filter selectivity evaluation #20363
adriangb wants to merge 14 commits into apache:main
Conversation
run benchmark tpcds

run benchmark clickbench_partitioned

run benchmark tpch

🤖 Benchmark completed Details
show benchmark queue

🤖 Hi @Dandandan, you asked to view the benchmark queue (#20363 (comment)).

show benchmark queue

🤖 Hi @Dandandan, you asked to view the benchmark queue (#20363 (comment)).
Hm, it seems stuck again.

FYI @alamb

@Dandandan this is mostly vibe coded; I'm only 50% confident it even makes sense without reviewing the code, FWIW.
Force-push: e0240af to 09cdb0b
show benchmark queue

🤖 Hi @adriangb, you asked to view the benchmark queue (#20363 (comment)).
Wonder if I'm infinite-looping it or something :(

Yes, I think previously it got stuck during infinite loops / extremely long-running tasks.

My bad, I'll try to add a PR to have timeouts and a cancel command.
show benchmark queue

🤖 Hi @adriangb, you asked to view the benchmark queue (#20363 (comment)).

run benchmark tpch

🤖 Hi @Dandandan, you asked to view the benchmark queue (#20363 (comment)).
It does seem to have gotten stuck again. I'm working on a system that can run benchmarks in parallel and won't get borked like this. I think it's almost ready.
@Dandandan I deleted your comment to stop the spam. I honestly don't know what's wrong with this PR or the runner. I think I should close it and open a new one.

Thanks... Perhaps it has to do with the comment / commit window?
I think this PR gets OOM-killed -- I'll remove it.

Thanks. Sorry for triggering it again. It's hard to debug what's going on, and I'm surprised this PR causes an OOM when others don't, but it does seem to be especially problematic. Wonder if it has a memory leak or something.

For context: I tried to run the ClickBench benchmark on a Claude Cloud runner (21 GB RAM / 16 cores); there it also got stuck (without any changes).
…tion

Replace the single `RwLock<SelectivityTrackerInner>` guarding all state with two independent locks:

- `filter_stats: RwLock<HashMap<FilterId, Mutex<SelectivityStats>>>` -- the hot `update()` path takes a shared read lock, then a per-filter `Mutex`. Different filters never contend; same-filter contention is ~100ns on the cheap inner `Mutex`.
- `inner: Mutex<SelectivityTrackerInner>` -- the cold `partition_filters()` path (once per file open) takes this for state-machine transitions. `update()` never touches it.

Measured on ClickBench Q10 (24 threads, 100 partitioned files):

- `partition_filters()` lock acquire: 313µs avg → ~120ns (2600x faster)
- `update()` waits >10µs: 265 calls → 2-4 calls (99% reduction)
- Total cumulative lock wait: ~50ms → <0.1ms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
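The two-lock layout above can be sketched in plain std Rust. This is an illustrative reconstruction, not the PR's code: the stats fields and method bodies are invented for the demo; only the lock structure mirrors the commit message.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

type FilterId = usize;

// Invented demo stats; the real SelectivityStats tracks Welford state.
#[derive(Default, Debug)]
struct SelectivityStats {
    batches: u64,
    rows_in: u64,
    rows_out: u64,
}

#[derive(Default)]
struct SelectivityTrackerInner {
    // Cold state-machine data, touched once per file open.
    generation: u64,
}

#[derive(Default)]
struct SelectivityTracker {
    // Hot path: shared read lock + per-filter Mutex, so different
    // filters never contend with each other.
    filter_stats: RwLock<HashMap<FilterId, Mutex<SelectivityStats>>>,
    // Cold path: taken only by partition_filters(), never by update().
    inner: Mutex<SelectivityTrackerInner>,
}

impl SelectivityTracker {
    fn update(&self, id: FilterId, rows_in: u64, rows_out: u64) {
        // Fast path: read lock on the map, Mutex on this filter only.
        if let Some(slot) = self.filter_stats.read().unwrap().get(&id) {
            let mut s = slot.lock().unwrap();
            s.batches += 1;
            s.rows_in += rows_in;
            s.rows_out += rows_out;
            return;
        }
        // Slow path (first sighting of this filter): write lock to insert.
        let mut map = self.filter_stats.write().unwrap();
        let mut s = map.entry(id).or_default().lock().unwrap();
        s.batches += 1;
        s.rows_in += rows_in;
        s.rows_out += rows_out;
    }

    fn partition_filters(&self) -> u64 {
        // Cold path: takes the state-machine lock; update() never blocks on it.
        let mut inner = self.inner.lock().unwrap();
        inner.generation += 1;
        inner.generation
    }
}
```

Because `update()` only ever takes the map's read lock plus one filter's `Mutex`, concurrent updates to different filters proceed without any shared contention, which is the point of the split.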
Which issue does this PR close?
Related to filter pushdown performance optimization work.
Rationale for this change
Currently, when `pushdown_filters = true`, DataFusion pushes all filter predicates into the Parquet reader as row-level filters (`ArrowPredicate`s) unconditionally. This is suboptimal because:

- The `reorder_filters` heuristic was static. It used compressed column size as a proxy for cost and sorted filters by that metric, but never measured actual runtime selectivity or evaluation cost. It could not adapt to data skew or runtime conditions.
- Dynamic filters (e.g. from `HashJoinExec`) cannot be dropped even when they provide no benefit. Without a way to mark filters as optional, the system was forced to always evaluate them.

This PR introduces an adaptive filter selectivity tracking system that observes filter behavior at runtime and makes data-driven decisions about whether each filter should be pushed down as a row-level predicate or applied post-scan.
What changes are included in this PR?

1. New module: `selectivity.rs` (1,554 lines)

The core of this PR. Introduces `SelectivityTracker`, a shared, lock-guarded structure that:

- Maintains a per-filter state machine (`New -> RowFilter | PostScan -> (promoted/demoted/dropped)`), with transitions based on:
  - a byte-ratio heuristic (`filter_bytes / projection_bytes`) to cheaply decide whether a new filter starts as a row filter or post-scan filter;
  - measured effectiveness compared against `filter_pushdown_min_bytes_per_sec`.
- Filters wrapped in `OptionalFilterPhysicalExpr` can be dropped entirely when ineffective.
- Tracks predicate generations via `snapshot_generation()`, resetting statistics when a filter's predicate changes (e.g., when a `DynamicFilterPhysicalExpr` from a hash join updates its value set).

Key types:

- `SelectivityTracker` -- cross-file tracker shared by all `ParquetOpener` instances
- `TrackerConfig` -- immutable configuration (built from `ParquetOptions`)
- `SelectivityStats` -- per-filter Welford statistics with confidence interval methods
- `FilterState` -- `RowFilter | PostScan | Dropped` enum
- `PartitionedFilters` -- output of `partition_filters()`, consumed by the opener
- `FilterId` -- stable `usize` identifier assigned by `ParquetSource::with_predicate`
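A minimal sketch of the Welford running statistics plus confidence interval that the description attributes to `SelectivityStats`. The struct and method names here are illustrative, not the PR's; only the algorithm and the role of the z value (cf. `filter_confidence_z`) come from the text.

```rust
/// Running mean/variance via Welford's algorithm over per-batch
/// selectivity samples (illustrative sketch, not DataFusion's struct).
#[derive(Default, Debug)]
struct Welford {
    n: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl Welford {
    fn push(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        // Uses the *updated* mean, per Welford's recurrence.
        self.m2 += delta * (x - self.mean);
    }

    /// Sample variance (Bessel-corrected); 0 until two samples exist.
    fn variance(&self) -> f64 {
        if self.n < 2 { 0.0 } else { self.m2 / (self.n - 1) as f64 }
    }

    /// Two-sided interval on the mean: mean ± z * s / sqrt(n).
    /// `z` plays the role of the new `filter_confidence_z` option.
    fn confidence_interval(&self, z: f64) -> (f64, f64) {
        let half = z * (self.variance() / self.n.max(1) as f64).sqrt();
        (self.mean - half, self.mean + half)
    }
}
```

Welford's recurrence lets the tracker fold in one sample per batch in O(1) with no stored history, which is why it suits a hot per-batch `update()` path.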
2. New wrapper: `OptionalFilterPhysicalExpr` (in `physical_expr_common`)

A transparent `PhysicalExpr` wrapper that marks a filter as optional -- droppable without affecting query correctness. All `PhysicalExpr` trait methods delegate to the inner expression. The selectivity tracker detects this via `downcast_ref::<OptionalFilterPhysicalExpr>()` and can drop the filter entirely when it is ineffective, rather than demoting it to post-scan.

`HashJoinExec` now wraps its dynamic join filters in `OptionalFilterPhysicalExpr` before pushing them down. This is why plan output now shows `Optional(DynamicFilter [...])` instead of `DynamicFilter [...]`.
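The wrapper-plus-downcast pattern can be illustrated with a toy trait standing in for `PhysicalExpr` (std only; aside from the pattern itself, none of these names are from the PR):

```rust
use std::any::Any;
use std::sync::Arc;

// Toy stand-in for PhysicalExpr: a boolean predicate over one value.
trait Expr: Any {
    fn evaluate(&self, value: i64) -> bool;
    fn as_any(&self) -> &dyn Any;
}

// A concrete "real" filter.
struct GtZero;
impl Expr for GtZero {
    fn evaluate(&self, value: i64) -> bool { value > 0 }
    fn as_any(&self) -> &dyn Any { self }
}

/// Transparent wrapper marking a filter as droppable: every trait
/// method delegates to the inner expression, so wrapping never changes
/// results -- only how the planner is allowed to treat the filter.
struct OptionalFilter {
    inner: Arc<dyn Expr>,
}
impl Expr for OptionalFilter {
    fn evaluate(&self, value: i64) -> bool { self.inner.evaluate(value) }
    fn as_any(&self) -> &dyn Any { self }
}

/// Tracker-side check: only wrapped filters may be dropped entirely,
/// detected via downcast (mirroring the PR's
/// downcast_ref::<OptionalFilterPhysicalExpr>() check).
fn is_droppable(expr: &Arc<dyn Expr>) -> bool {
    expr.as_any().downcast_ref::<OptionalFilter>().is_some()
}
```

The downcast is the whole contract: any filter not wrapped stays mandatory, so wrapping is an explicit opt-in by the producing operator (here, the hash join).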
3. Removal of the `reorder_filters` config option

The old static `reorder_filters` boolean and its associated heuristic (sort by `required_bytes`, then `can_use_index`) are removed entirely. The adaptive system subsumes this:

- `FilterCandidate` no longer stores `required_bytes` or `can_use_index` fields.
- The `size_of_columns()` and `columns_sorted()` helper functions in `row_filter.rs` are removed.
- Filter ordering is now decided by `SelectivityTracker::partition_filters()` based on measured effectiveness or the byte-ratio fallback.
4. Three new configuration options (in `ParquetOptions`)

- `filter_pushdown_min_bytes_per_sec` -- `0.0` = all promoted, `INFINITY` = none promoted (feature disabled).
- `filter_collecting_byte_ratio_threshold`
- `filter_confidence_z`
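Given the documented extremes (`0.0` = all promoted, `INFINITY` = none), the promotion test presumably reduces to a bytes-saved-per-second comparison. A hypothetical sketch (the function name and inputs are invented, not the PR's API):

```rust
/// A filter earns row-level placement if the bytes it saves per second
/// of evaluation time clears the configured threshold.
/// min_bytes_per_sec == 0.0  -> every filter qualifies;
/// min_bytes_per_sec == INF  -> none does (feature disabled),
/// matching the option's documented extremes.
fn promote_to_row_filter(bytes_saved: f64, eval_secs: f64, min_bytes_per_sec: f64) -> bool {
    if eval_secs <= 0.0 {
        // No measured cost yet: only the "always promote" setting applies.
        return min_bytes_per_sec == 0.0;
    }
    bytes_saved / eval_secs >= min_bytes_per_sec
}
```

The default of 52428800 (50 MiB/s, per the user-facing section below) would then mean a row filter must save I/O at least as fast as a modest disk read to stay promoted.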
5. Changes to `ParquetOpener` / `opener.rs`

- Receives `Vec<(FilterId, Arc<dyn PhysicalExpr>)>` instead of a single combined `Arc<dyn PhysicalExpr>`.
- Calls `selectivity_tracker.partition_filters()` to split filters into row-level vs. post-scan.
- Row-level filters go to `build_row_filter()` (updated signature).
- Adds `apply_post_scan_filters_with_stats()`, a new function that evaluates each filter individually, reports per-filter timing and selectivity back to the tracker, and combines results into a single boolean mask.
- The `limit` is only applied to the Parquet reader when there are no post-scan filters (otherwise limiting would cut off rows before the filter could find matches).
- A new `filter_apply_time` metric tracks post-scan filter evaluation time.
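The evaluate-each-filter-then-combine idea behind `apply_post_scan_filters_with_stats()` can be sketched with `Vec<bool>` standing in for Arrow's boolean mask (a hypothetical simplification, not the real signature):

```rust
/// Evaluate each post-scan filter on its own so per-filter selectivity
/// can be observed, then AND the masks into one. Stats are returned
/// here instead of sent to a tracker, to keep the sketch self-contained.
fn apply_post_scan_filters(
    rows: &[i64],
    filters: &[(usize, fn(i64) -> bool)],
) -> (Vec<bool>, Vec<(usize, f64)>) {
    let mut mask = vec![true; rows.len()];
    let mut selectivities = Vec::new();
    for (filter_id, pred) in filters {
        // Per-filter selectivity: fraction of rows this filter keeps,
        // measured independently of the other filters.
        let passed = rows.iter().filter(|r| pred(**r)).count();
        selectivities.push((*filter_id, passed as f64 / rows.len().max(1) as f64));
        // Fold this filter into the combined mask.
        for (m, r) in mask.iter_mut().zip(rows) {
            *m = *m && pred(*r);
        }
    }
    (mask, selectivities)
}
```

Evaluating filters separately costs slightly more than evaluating one fused conjunction, but it is what makes per-filter promotion/demotion decisions possible.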
6. Changes to `ParquetSource` / `source.rs`

- The predicate field changes from `Option<Arc<dyn PhysicalExpr>>` to `Option<Vec<(FilterId, Arc<dyn PhysicalExpr>)>>`.
- `with_predicate()` now splits the predicate into conjuncts and assigns stable `FilterId`s (indices).
- A `SelectivityTracker` is stored as a shared `Arc` on `ParquetSource` and passed to all openers.
- `with_table_parquet_options()` now builds a fresh `SelectivityTracker` from the three new config values.
- The `with_reorder_filters()` and `reorder_filters()` methods are removed.
7. Changes to `build_row_filter()` / `row_filter.rs`

- Takes `Vec<(FilterId, Arc<dyn PhysicalExpr>)>` plus `&Arc<SelectivityTracker>` instead of `&Arc<dyn PhysicalExpr>` plus `reorder_predicates: bool`.
- Returns `RowFilterWithMetrics` (a new struct) containing both the `RowFilter` and any unbuildable filters that must be applied post-scan.
- `DatafusionArrowPredicate` now carries a `FilterId` and `Arc<SelectivityTracker>`, reporting per-batch evaluation metrics back to the tracker after each `evaluate()` call.
- No reordering happens in `build_row_filter` -- filters arrive pre-ordered by the tracker.
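The per-batch reporting described for `DatafusionArrowPredicate` amounts to timing each `evaluate()` call and handing (filter id, rows in/out, elapsed) to the shared tracker. A std-only sketch, with a `Vec` behind a `Mutex` standing in for the tracker and all names invented:

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

// One record per evaluate() call: (filter_id, rows_in, rows_out, nanos).
type Report = (usize, usize, usize, u128);

struct TimedPredicate {
    filter_id: usize,
    pred: fn(i64) -> bool,
    reports: Arc<Mutex<Vec<Report>>>, // stand-in for the shared tracker
}

impl TimedPredicate {
    /// Evaluate one batch, then report timing and selectivity --
    /// mirroring the per-batch reporting the PR describes.
    fn evaluate(&self, batch: &[i64]) -> Vec<bool> {
        let start = Instant::now();
        let mask: Vec<bool> = batch.iter().map(|r| (self.pred)(*r)).collect();
        let rows_out = mask.iter().filter(|m| **m).count();
        self.reports.lock().unwrap().push((
            self.filter_id,
            batch.len(),
            rows_out,
            start.elapsed().as_nanos(),
        ));
        mask
    }
}
```

Because the reports carry both row counts and elapsed time, the tracker can derive selectivity and evaluation cost from the same stream of records.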
8. Changes to `HashJoinExec`

- Dynamic join filters are wrapped in `OptionalFilterPhysicalExpr` before being pushed down.
- The pushdown path unwraps `OptionalFilterPhysicalExpr` to find the inner `DynamicFilterPhysicalExpr`.

9. Protobuf schema updates
- The `reorder_filters` field (tag 6) is marked as `reserved` in `datafusion_common.proto`.
- New fields: `filter_pushdown_min_bytes_per_sec` (tag 35), `filter_collecting_byte_ratio_threshold` (tag 40), `filter_confidence_z` (tag 41).
- Updated `pbjson.rs`, `prost.rs`, `from_proto`, `to_proto`, and `file_formats.rs`.

10. Test and benchmark updates
- References to `reorder_filters` are removed from tests and benchmarks.
- Tests set `filter_pushdown_min_bytes_per_sec = 0.0` to preserve deterministic behavior (all filters always pushed down).
- Plan snapshots are updated from `DynamicFilter [...]` to `Optional(DynamicFilter [...])`.
- New unit tests in `selectivity.rs` cover: effectiveness calculation, Welford's algorithm, confidence intervals, state machine transitions (initial placement, promotion, demotion, dropping), dynamic filter generation tracking, filter ordering, and integration lifecycle tests.
- A changed result in `explain_analyze.rs` (`output_rows=8` -> `output_rows=5`), due to the adaptive system now placing some filters as post-scan that were previously row-level, causing slight row count differences in EXPLAIN ANALYZE output.

Are these changes tested?
Yes:

- Existing `pushdown_filters` and filter pushdown SLT tests pass (with `filter_pushdown_min_bytes_per_sec = 0.0` to force all filters to row-level for deterministic behavior).
- New unit tests in `selectivity.rs` (~450 lines of tests) cover the `SelectivityStats` calculator, the `TrackerConfig` builder, state machine transitions (initial placement, promotion, demotion, dropping, reset on generation change), filter ordering, and full promotion/demotion lifecycle integration tests.
- Plan snapshot tests are updated for the `Optional(...)` wrapper on dynamic filters.
- `dynamic_filter_pushdown_config.slt`, `information_schema.slt`, `preserve_file_partitioning.slt`, `projection_pushdown.slt`, `push_down_filter.slt`, and `repartition_subset_satisfaction.slt` are updated.
- `benchmarks/results.txt` shows TPC-H (13 faster, 6 slower, 3 unchanged), TPC-DS (33 faster, 31 slower, 35 unchanged, with a notable 24x improvement on Q64), and ClickBench (18 faster, 12 slower, 13 unchanged) results.

Are there any user-facing changes?
Yes:

The `reorder_filters` config option is removed. This is a breaking change: users who `SET datafusion.execution.parquet.reorder_filters = true` will get an error. The adaptive system replaces this functionality automatically.
Three new config options are added under `datafusion.execution.parquet`:

- `filter_pushdown_min_bytes_per_sec` (default: 52428800)
- `filter_collecting_byte_ratio_threshold` (default: 0.15)
- `filter_confidence_z` (default: 2.0)
The default behavior when `pushdown_filters = true` changes. Previously, all filters were unconditionally pushed into the Parquet reader. Now, the adaptive system decides per-filter based on byte-ratio thresholds and runtime effectiveness measurements. To restore the old behavior of pushing all filters unconditionally, set `filter_pushdown_min_bytes_per_sec = 0.0`.
EXPLAIN plan output changes: dynamic join filters now display as `Optional(DynamicFilter [...])` instead of `DynamicFilter [...]`, reflecting their new optional wrapper.
The deprecated `predicate()` method's signature changes: `ParquetSource::predicate()` now returns `Option<Arc<dyn PhysicalExpr>>` (owned) instead of `Option<&Arc<dyn PhysicalExpr>>` (a reference). This method was already deprecated in favor of `filter()`.