perf: Optimize dynamic IN list evaluation with vectorized Arrow eq kernel#20428
perf: Optimize dynamic IN list evaluation with vectorized Arrow eq kernel#20428zhangxffff wants to merge 4 commits intoapache:mainfrom
Conversation
6a309d5 to
61af0e2
Compare
Could you make a PR adding just the benchmarks first? That way we can merge that and then show a clear improvement on this PR |
|
It would also be nice to see these seemingly unrelated changes (use kernel and break early) as separate PRs |
Thank you so much for your review! I’ve opened a separate PR (#20444) focused on adding the benchmarks first, and would greatly appreciate it if you could take a look at this code when you have some time. I will follow up with further optimizations separately after #20444 is merged. |
| let rhs = match expr? { | ||
| ColumnarValue::Array(array) => { | ||
| let use_arrow_eq = !value.data_type().is_nested(); | ||
| let mut found = |
There was a problem hiding this comment.
Instead of setting allocating the first to all -false (i.e. extra allocation / memset), it could initialize the first separately.
| array.as_ref(), | ||
| SortOptions::default(), | ||
| )?; | ||
| (0..num_rows) |
There was a problem hiding this comment.
BooleanBuffer::collect_bool + cloning the original null buffer is much faster.
Which issue does this PR close?
Rationale for this change
The dynamic IN list evaluation path in
InListExpr::evaluate()is triggered when the list contains non-constant expressions such as column references (e.g.,a IN (b, c, d)). It currently compares values row-by-row viamake_comparatorbypassing Arrow's vectorized SIMD kernels. This makes it orders of magnitude slower than the static filter path (HashSet) used for constant literals.What changes are included in this PR?
In the dynamic IN list path (
None =>branch inInListExpr::evaluate()):arrow::compute::kernels::cmp::eqinstead of per-rowmake_comparatorfor non-nested types (primitives, strings, binary, dictionary). For nested types (Struct, List, Map), fall back tomake_comparatorsincearrow_eqintentionally rejects them due to ambiguous null semantics.break: Replacetry_foldwith an explicitforloop. Checkfound.true_count() == num_rowsbefore each iteration andbreakimmediately — skipping bothevaluate()and comparison for remaining list items.Are these changes tested?
Yes. Add 7 testcase to cover dynamic in path. Also add New criterion benchmarks (
bench_dynamic_int32,bench_dynamic_utf8).Benchmarked with
cargo bench --bench in_list -- "in_list_dynamic"on 8192-row batches. Int32 improved 11–320x, Utf8 improved 2–10x, with the largest gains on high match rates due to early termination.Are there any user-facing changes?
No.