
chore(bench): kernel-vs-Thrift performance baseline harness + results #790

Open

vikrantpuppala wants to merge 1 commit into feat/kernel-bind-param from bench/kernel-vs-thrift

Conversation

@vikrantpuppala
Contributor

Stacks on #789 (kernel bind_param). Adds a benchmark script under `scripts/`; no production code change.

### Summary

One-shot benchmark script that runs each (backend × SQL-shape) combination N+1 times against a live warehouse, drops the first run (cache warm-up), and reports min/median/max for session-open, time-to-first-row (TTFR), drain, and RSS delta.

Not a CI gate — single-machine, single-warehouse, high-variance. Meant to be re-run by hand when we want a baseline. Output is a Markdown table you can paste into a PR.

```
set -a && source ~/.databricks/pecotesting-creds && set +a
export DATABRICKS_SERVER_HOSTNAME=${DATABRICKS_HOST#https://}
.venv/bin/python scripts/bench_kernel_vs_thrift.py
```
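
For orientation, the core measurement loop looks roughly like this (a minimal sketch, not the script verbatim; `connect` stands in for whichever backend factory is under test):

```python
import statistics
import time

N_SAMPLES = 5  # timed runs; one extra warm-up run is taken first and dropped

def bench(connect, sql):
    """Time session-open, time-to-first-row, and drain for one backend/shape pair."""
    samples = []
    for i in range(N_SAMPLES + 1):
        t0 = time.perf_counter()
        conn = connect()                   # session-open
        t_open = time.perf_counter()
        cursor = conn.cursor()
        cursor.execute(sql)
        cursor.fetchone()                  # time-to-first-row (TTFR)
        t_first = time.perf_counter()
        cursor.fetchall()                  # drain the remainder
        t_drain = time.perf_counter()
        conn.close()
        if i > 0:                          # drop the first run (cache warm-up)
            samples.append((t_open - t0, t_first - t_open, t_drain - t_first))
    for name, vals in zip(("open", "ttfr", "drain"), zip(*samples)):
        print(f"{name}: min={min(vals):.3f}s  "
              f"median={statistics.median(vals):.3f}s  max={max(vals):.3f}s")
```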

### Results (median of 5 samples, warm-up dropped, dogfood)

| Shape | Thrift drain | Kernel drain | Ratio | Notes |
| --- | --- | --- | --- | --- |
| `SELECT 1` | 387 ms | 1085 ms | 2.8× | Fixed ~700 ms kernel TTFR overhead |
| `range(10k)` | 909 ms | 1347 ms | 1.48× | Overhead amortized |
| `wide-uuid(100k)` | 6907 ms | 9925 ms | 1.44× | Overhead amortized |
| `metadata.catalogs` | 413 ms | 550 ms | 1.33× | Overhead amortized |
| `range(1M)` | 14.0 s | panic | n/a | Kernel-side bug: issue #19 |

Detailed tables with min/max ranges and RSS-delta numbers are in the script output.

### Three findings worth flagging

**1. Fixed kernel TTFR overhead (~700 ms)**

`SELECT 1` is the cleanest signal because drain time is essentially zero. Kernel pays ~700ms more than Thrift on every query. On large queries (drain dominates) the relative cost shrinks to 1.3–1.5×.

Plausible causes (not investigated):

  • Per-Session tokio runtime construction (flagged by the kernel PR review).
  • SEA wait-for-result poll handshake doing an extra round-trip.

A flamegraph would distinguish these. Worth a follow-up.

**2. CloudFetch panic on large results (kernel issue #19)**

`range(1M)` crosses the CloudFetch threshold; the kernel's reader calls `runtime_handle.block_on` from a sync trait method, which panics when invoked from inside our PyO3 `runtime.block_on`. Until this is fixed, `use_sea=True` is unusable in production for any large-result workload. The connector's e2e tests use `range(10000)`, which stays below the CloudFetch threshold; that's why this never surfaced earlier.
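
For concreteness, a minimal repro sketch (assuming the connector's standard `databricks.sql.connect` entry point and treating `use_sea` as a keyword argument there; credentials elided):

```python
from databricks import sql

conn = sql.connect(
    server_hostname="...",  # elided; see the env setup under Summary
    http_path="...",
    access_token="...",
    use_sea=True,           # assumed flag routing through the kernel-backed SEA path
)
cursor = conn.cursor()
# Crosses the CloudFetch threshold: the kernel reader calls
# runtime_handle.block_on from a sync trait method while we are already
# inside the PyO3 runtime.block_on, and panics (kernel issue #19).
cursor.execute("SELECT * FROM range(1000000)")
cursor.fetchall()
```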

**3. RSS overhead of ~+1 MB per kernel session**

Consistent across every shape. Probably tokio worker thread stacks (default ~2MB × N workers, partially committed). Not a problem at small connection counts; maps to the "process-global `OnceLock`" follow-up the kernel reviewer flagged.
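
The RSS numbers are deliberately coarse; for reference, the kind of measurement involved looks like this (a sketch assuming psutil; the script's actual sampling may differ):

```python
import psutil

proc = psutil.Process()

def rss_bytes() -> int:
    return proc.memory_info().rss  # current resident set size

before = rss_bytes()
# ... open a session on the backend under test and run one shape ...
after = rss_bytes()
print(f"RSS delta: {(after - before) / 1024:.0f} KB")
```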

### What this doesn't measure

  • Concurrent-connection scaling (single conn at a time here)
  • Anything past 100k rows on kernel (blocked by the nested `block_on` panic, issue #19)
  • Network-jitter sensitivity
  • Memory profile beyond coarse RSS

Useful next benchmarks once #19 lands: real CloudFetch shape (1M+ rows), concurrent sessions, and a memory-profile pass.
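
For the concurrent-sessions pass, something like this would be a starting point (a sketch; `connect` again stands in for the backend factory, and thread-per-session is an assumption about how we'd drive it):

```python
from concurrent.futures import ThreadPoolExecutor
import statistics
import time

def one_session(connect, sql="SELECT 1"):
    """Open a session, run one query to completion, return wall time."""
    t0 = time.perf_counter()
    conn = connect()
    cursor = conn.cursor()
    cursor.execute(sql)
    cursor.fetchall()
    conn.close()
    return time.perf_counter() - t0

def bench_concurrent(connect, n_sessions=8):
    """Per-session latency with n_sessions sessions running at once."""
    with ThreadPoolExecutor(max_workers=n_sessions) as pool:
        latencies = list(pool.map(lambda _: one_session(connect), range(n_sessions)))
    return min(latencies), statistics.median(latencies), max(latencies)
```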

This pull request and its description were written by Isaac.

One-shot benchmark script under scripts/ that runs each (backend ×
SQL-shape) combination N+1 times against a live warehouse, drops
the first run (cache warm-up), and reports min/median/max for
session-open, time-to-first-row, drain, and RSS delta.

Not a CI gate — single-machine, single-warehouse, high-variance
script meant to be re-run by hand when we want a baseline.

Shapes:
- SELECT 1           (pure round-trip latency, no data)
- range(10k)         (inline result, ~10K rows)
- range(1M)          (crosses CloudFetch threshold; currently
                      panics on the kernel backend — see kernel
                      issue #19, nested block_on bug)
- wide-uuid(100k)    (wider rows, Arrow serialization)
- metadata.catalogs  (metadata round-trip)

Output is a Markdown table you can paste into a PR. Run with:

    set -a && source ~/.databricks/pecotesting-creds && set +a
    export DATABRICKS_SERVER_HOSTNAME=${DATABRICKS_HOST#https://}
    .venv/bin/python scripts/bench_kernel_vs_thrift.py

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala
Contributor Author

Update: kernel issue #19 fixed in databricks-sql-kernel#20.

Re-ran the benchmark with the fix applied:

| Shape | Thrift drain | Kernel drain (was) | Kernel drain (after fix) | Ratio |
| --- | --- | --- | --- | --- |
| `SELECT 1` | 396 ms | 1085 ms | 1079 ms | 2.7× |
| `range(10k)` | 1106 ms | 1347 ms | 1395 ms | 1.26× |
| `range(1M)` | 15009 ms | panic | 7317 ms | 0.49× (2× faster!) |
| `wide-uuid(100k)` | 9776 ms | 9925 ms | 9781 ms | 1.00× |
| `metadata.catalogs` | 420 ms | 550 ms | 546 ms | 1.30× |

The big result: on range(1M) — the shape that previously panicked — the kernel-backed path is now 2× faster than Thrift at drain (7.3s vs 15.0s; 137K rows/s vs 67K rows/s). CloudFetch's parallel chunk download is paying off.

Other observations unchanged:

  • Fixed ~700ms kernel TTFR overhead on small queries (SELECT 1 is the cleanest signal). Probably per-Session tokio runtime construction; separate follow-up.
  • wide-uuid(100k) lines up at ~9.8s on both backends — server-side dominates here.
  • RSS delta is larger on kernel for the 1M shape (+21 MB vs +4 KB) because pyarrow holds the whole drained table in scope at the end of the drain. Both backends should converge if we drain batch-by-batch instead (see the sketch below).
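
A sketch of that batch-by-batch drain, assuming the connector's `fetchmany_arrow` (so only one Arrow batch is resident at a time):

```python
BATCH_ROWS = 100_000

def drain_in_batches(cursor) -> int:
    """Drain a result set without holding the whole pyarrow table in scope."""
    total = 0
    while True:
        table = cursor.fetchmany_arrow(BATCH_ROWS)  # one batch at a time
        if table.num_rows == 0:
            break
        total += table.num_rows  # the batch goes out of scope each iteration
    return total
```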
