
perf: Optimize dynamic IN list evaluation with vectorized Arrow eq kernel #20428

Open
zhangxffff wants to merge 4 commits into apache:main from zhangxffff:feat/opt_dynamic_in

Conversation

@zhangxffff
Contributor

@zhangxffff zhangxffff commented Feb 19, 2026

Which issue does this PR close?

Rationale for this change

The dynamic IN list evaluation path in InListExpr::evaluate() is triggered when the list contains non-constant expressions such as column references (e.g., a IN (b, c, d)). It currently compares values row by row via make_comparator, bypassing Arrow's vectorized SIMD kernels. This makes it orders of magnitude slower than the static filter path (HashSet) used for constant literals.

What changes are included in this PR?

In the dynamic IN list path (None => branch in InListExpr::evaluate()):

  1. Vectorized comparison: Use arrow::compute::kernels::cmp::eq instead of per-row make_comparator for non-nested types (primitives, strings, binary, dictionary). For nested types (Struct, List, Map), fall back to make_comparator since arrow_eq intentionally rejects them due to ambiguous null semantics.
  2. Short-circuit with break: Replace try_fold with an explicit for loop. Check found.true_count() == num_rows before each iteration and break immediately, skipping both evaluate() and the comparison for the remaining list items.
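
The two changes above can be sketched with a std-only model. This is not the PR's code: `Column`, `eq_column`, `or_kleene`, and `in_list_dynamic` are hypothetical names, `Vec<Option<i32>>` stands in for an Arrow array, and the column-at-a-time loop stands in for the Arrow eq kernel. It only illustrates the control flow: compare one whole list column at a time, OR the result into `found` with three-valued logic, and stop once every row is already TRUE.

```rust
/// One column of nullable Int32 values; `None` models a SQL NULL.
/// All columns are assumed to have the same length.
type Column = Vec<Option<i32>>;

/// Column-at-a-time equality: the result is NULL if either side is NULL.
fn eq_column(left: &Column, right: &Column) -> Vec<Option<bool>> {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(a == b),
            _ => None,
        })
        .collect()
}

/// Three-valued (Kleene) OR: TRUE wins over NULL, NULL wins over FALSE.
fn or_kleene(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Evaluates `value IN (list[0], list[1], ...)` row-wise over columns.
/// Returns the result plus how many list columns were actually compared.
fn in_list_dynamic(value: &Column, list: &[Column]) -> (Vec<Option<bool>>, usize) {
    let num_rows = value.len();
    let mut found = vec![Some(false); num_rows];
    let mut compared = 0;
    for item in list {
        // Short-circuit: once every row is TRUE, later items cannot change anything.
        let true_count = found.iter().filter(|v| **v == Some(true)).count();
        if true_count == num_rows {
            break;
        }
        let eq = eq_column(value, item);
        for (f, e) in found.iter_mut().zip(eq) {
            *f = or_kleene(*f, e);
        }
        compared += 1;
    }
    (found, compared)
}
```

In this model a row that has seen a NULL comparison can never become TRUE unless a later item matches, so batches with nulls rarely reach the all-TRUE state that triggers the break.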

Are these changes tested?

Yes. This PR adds 7 test cases covering the dynamic IN path, plus new criterion benchmarks (bench_dynamic_int32, bench_dynamic_utf8).

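Benchmarked with cargo bench --bench in_list -- "in_list_dynamic" on 8192-row batches. Int32 improved by 11–320x and Utf8 by 2–10x. The largest Int32 gain (about 320x) is at match=100% with no nulls, where early termination skips the remaining list items; the Utf8 gains are largest at match=0% (about 10x).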

(zhangxffff) zhangxffff@95d3d60664da ~/W/datafusion (feat/opt_dynamic_in)> critcmp after before
group                                                 after                                  before
-----                                                 -----                                  ------
in_list_dynamic/Int32/list=28/match=0%/nulls=0%       1.00     92.8±0.87µs        ? ?/sec    11.80 1094.4±21.14µs        ? ?/sec
in_list_dynamic/Int32/list=28/match=0%/nulls=20%      1.00    107.9±0.90µs        ? ?/sec    16.57 1788.5±12.38µs        ? ?/sec
in_list_dynamic/Int32/list=28/match=100%/nulls=0%     1.00      3.5±0.04µs        ? ?/sec    319.51 1116.0±21.26µs        ? ?/sec
in_list_dynamic/Int32/list=28/match=100%/nulls=20%    1.00    107.3±1.97µs        ? ?/sec    16.63 1783.9±15.31µs        ? ?/sec
in_list_dynamic/Int32/list=28/match=50%/nulls=0%      1.00     49.6±0.31µs        ? ?/sec    32.46 1610.0±27.97µs        ? ?/sec
in_list_dynamic/Int32/list=28/match=50%/nulls=20%     1.00    107.2±1.15µs        ? ?/sec    19.54     2.1±0.01ms        ? ?/sec
in_list_dynamic/Int32/list=3/match=0%/nulls=0%        1.00     10.1±0.07µs        ? ?/sec    11.61   116.9±1.61µs        ? ?/sec
in_list_dynamic/Int32/list=3/match=0%/nulls=20%       1.00     11.4±0.11µs        ? ?/sec    15.09   172.6±0.97µs        ? ?/sec
in_list_dynamic/Int32/list=3/match=100%/nulls=0%      1.00      3.5±0.02µs        ? ?/sec    33.76   119.7±2.75µs        ? ?/sec
in_list_dynamic/Int32/list=3/match=100%/nulls=20%     1.00     11.4±0.16µs        ? ?/sec    15.12   173.1±0.97µs        ? ?/sec
in_list_dynamic/Int32/list=3/match=50%/nulls=0%       1.00     10.0±0.08µs        ? ?/sec    16.92   170.0±4.09µs        ? ?/sec
in_list_dynamic/Int32/list=3/match=50%/nulls=20%      1.00     11.6±0.19µs        ? ?/sec    17.49   202.5±2.60µs        ? ?/sec
in_list_dynamic/Int32/list=8/match=0%/nulls=0%        1.00     26.6±0.45µs        ? ?/sec    11.73   312.5±2.88µs        ? ?/sec
in_list_dynamic/Int32/list=8/match=0%/nulls=20%       1.00     30.4±0.43µs        ? ?/sec    16.23   493.6±3.45µs        ? ?/sec
in_list_dynamic/Int32/list=8/match=100%/nulls=0%      1.00      3.5±0.03µs        ? ?/sec    90.16   313.6±6.51µs        ? ?/sec
in_list_dynamic/Int32/list=8/match=100%/nulls=20%     1.00     31.2±0.23µs        ? ?/sec    16.04   499.9±4.44µs        ? ?/sec
in_list_dynamic/Int32/list=8/match=50%/nulls=0%       1.00     26.6±0.17µs        ? ?/sec    17.07   453.4±8.14µs        ? ?/sec
in_list_dynamic/Int32/list=8/match=50%/nulls=20%      1.00     30.8±0.37µs        ? ?/sec    19.16   590.6±2.55µs        ? ?/sec
in_list_dynamic/Utf8/list=28/match=0%                 1.00    149.9±2.70µs        ? ?/sec    10.51  1574.8±9.97µs        ? ?/sec
in_list_dynamic/Utf8/list=28/match=100%               1.00    725.9±9.09µs        ? ?/sec    2.13   1548.3±7.20µs        ? ?/sec
in_list_dynamic/Utf8/list=28/match=50%                1.00   1070.0±7.65µs        ? ?/sec    2.93      3.1±0.02ms        ? ?/sec
in_list_dynamic/Utf8/list=3/match=0%                  1.00     16.1±0.17µs        ? ?/sec    10.07   162.4±1.62µs        ? ?/sec
in_list_dynamic/Utf8/list=3/match=100%                1.00     65.8±1.22µs        ? ?/sec    2.39    157.1±1.79µs        ? ?/sec
in_list_dynamic/Utf8/list=3/match=50%                 1.00     93.4±2.15µs        ? ?/sec    3.32    309.6±1.23µs        ? ?/sec
in_list_dynamic/Utf8/list=8/match=0%                  1.00     43.0±0.29µs        ? ?/sec    10.37   445.6±3.09µs        ? ?/sec
in_list_dynamic/Utf8/list=8/match=100%                1.00    197.5±0.93µs        ? ?/sec    2.21    436.9±2.10µs        ? ?/sec
in_list_dynamic/Utf8/list=8/match=50%                 1.00    296.8±2.36µs        ? ?/sec    2.97    880.5±6.52µs        ? ?/sec
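
For reading the table: the number in front of each timing is critcmp's slowdown factor relative to the fastest column (1.00). A small sketch of that arithmetic, using a hypothetical `speedup` helper and the rounded means printed above, so the results differ slightly from critcmp's factors, which are computed from unrounded measurements:

```rust
/// Speedup of the `after` run relative to the `before` run.
/// Both arguments are mean times in the same unit (microseconds here).
fn speedup(before_us: f64, after_us: f64) -> f64 {
    before_us / after_us
}

/// Two rows copied from the table above: (name, before mean, after mean).
fn sample_rows() -> Vec<(&'static str, f64, f64)> {
    vec![
        ("Int32/list=28/match=100%/nulls=0%", 1116.0, 3.5),
        ("Utf8/list=28/match=100%", 1548.3, 725.9),
    ]
}
```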

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Feb 19, 2026
@zhangxffff zhangxffff marked this pull request as draft February 19, 2026 08:52
@zhangxffff zhangxffff marked this pull request as ready for review February 19, 2026 09:43
@adriangb
Contributor

> Yes. Add 7 testcase to cover dynamic in path. Also add New criterion benchmarks (bench_dynamic_int32, bench_dynamic_utf8).

Could you make a PR adding just the benchmarks first? That way we can merge that and then show a clear improvement on this PR

@adriangb
Contributor

It would also be nice to see these seemingly unrelated changes (use kernel and break early) as separate PRs

@zhangxffff
Contributor Author

> > Yes. Add 7 testcase to cover dynamic in path. Also add New criterion benchmarks (bench_dynamic_int32, bench_dynamic_utf8).
>
> Could you make a PR adding just the benchmarks first? That way we can merge that and then show a clear improvement on this PR

Thank you so much for your review!

I’ve opened a separate PR (#20444) focused on adding the benchmarks first, and would greatly appreciate it if you could take a look at this code when you have some time.

I will follow up with further optimizations separately after #20444 is merged.

let rhs = match expr? {
ColumnarValue::Array(array) => {
let use_arrow_eq = !value.data_type().is_nested();
let mut found =
Contributor


Instead of allocating an initial all-false array (i.e. an extra allocation / memset), the result of the first comparison could be used to initialize it separately.
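
One way to read this suggestion, as a std-only sketch rather than the reviewer's intended code: seed the accumulator with the first list item's comparison result and OR in only the remaining items. `eq_values` and `in_list_seeded` are hypothetical names, plain `Vec<bool>` stands in for a BooleanArray, and nulls are left out to keep the model short.

```rust
/// Element-wise equality of two equal-length columns (nulls ignored in this model).
fn eq_values(value: &[i32], item: &[i32]) -> Vec<bool> {
    value.iter().zip(item).map(|(a, b)| a == b).collect()
}

/// `value IN (list...)`, seeding the accumulator from the first comparison
/// instead of allocating an all-false vector up front.
fn in_list_seeded(value: &[i32], list: &[Vec<i32>]) -> Vec<bool> {
    let mut items = list.iter();
    // Only an empty list needs the explicit all-false result.
    let Some(first) = items.next() else {
        return vec![false; value.len()];
    };
    let mut found = eq_values(value, first);
    for item in items {
        // Same early exit as before: stop once every row has matched.
        if found.iter().all(|&f| f) {
            break;
        }
        for (f, e) in found.iter_mut().zip(eq_values(value, item)) {
            *f |= e;
        }
    }
    found
}
```

The first comparison already produces a buffer of the right size, so reusing it avoids one allocation and one OR pass per batch.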

array.as_ref(),
SortOptions::default(),
)?;
(0..num_rows)
Contributor


BooleanBuffer::collect_bool + cloning the original null buffer is much faster.
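
A std-only model of the idea behind this suggestion, under the assumption that the point is to write comparison results straight into a packed bitmap and carry validity as a separate, reused bitmap rather than building a `Vec<Option<bool>>` per row. `collect_bool_words` and `get_bit` are hypothetical helpers, not Arrow APIs; the real `BooleanBuffer::collect_bool` returns an Arrow buffer rather than a `Vec<u64>`.

```rust
/// Packs `len` booleans produced by `f` into 64-bit words, least significant bit first.
/// Each word is built in a register and stored once, instead of one push per row.
fn collect_bool_words(len: usize, mut f: impl FnMut(usize) -> bool) -> Vec<u64> {
    let mut words = Vec::with_capacity((len + 63) / 64);
    for chunk_start in (0..len).step_by(64) {
        let chunk_end = (chunk_start + 64).min(len);
        let mut word = 0u64;
        for i in chunk_start..chunk_end {
            word |= (f(i) as u64) << (i - chunk_start);
        }
        words.push(word);
    }
    words
}

/// Reads bit `i` back out of the packed words.
fn get_bit(words: &[u64], i: usize) -> bool {
    (words[i / 64] >> (i % 64)) & 1 == 1
}
```

With a per-row comparator, the closure passed in would be the comparison itself, for example `|i| a[i] == b[i]` in this model.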


Labels

physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize IN list with columns evaluation with vectorized Arrow eq kernel

3 participants
