bench: demonstrate tokio-uring thread-per-core execution in dfbench #21716
Conversation
Adds a demo integration of `tokio-uring` into `dfbench`:
* `util::tokio_uring_pool` — spawns N OS threads, each running its own
`tokio_uring::start` runtime (a current-thread Tokio reactor driven
by `io_uring`). `pool.spawn(|| async { ... })` ships a Send closure
to a round-robin worker, where it builds a possibly-`!Send` future
that runs locally on that worker's ring.
* `util::tokio_uring_store::TokioUringObjectStore` — local
`ObjectStore` whose `get_opts` and `get_ranges` drive reads through
`tokio_uring::fs::File::read_at`, dispatched across the pool so
every read lands on the next worker's ring. Writes, list, copy, and
delete delegate to `LocalFileSystem`.
* ClickBench uses the pool by default on Linux (one worker per CPU).
Each output partition of the physical plan is `pool.spawn`-ed, so
plan execution itself happens on the io_uring runtimes. Top-level
`CoalescePartitionsExec` / `SortPreservingMergeExec` are stripped
so their N-partition children fan out across workers; for SPM the
merge is rebuilt over a `MemorySourceConfig` and re-executed on one
worker.
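The per-range positional-read shape that `get_ranges` drives through `tokio_uring::fs::File::read_at` can be sketched with std's blocking Unix `read_at` (a synchronous stand-in for the ring-based call; the function name `read_ranges` here is hypothetical, and a production version would loop on short reads):

```rust
use std::fs::File;
use std::ops::Range;
use std::os::unix::fs::FileExt; // read_at: positional read, no shared seek cursor

/// Read each requested byte range of the file at `path` with a positional
/// read, the same per-range pattern the PR dispatches onto io_uring rings.
pub fn read_ranges(path: &str, ranges: &[Range<u64>]) -> std::io::Result<Vec<Vec<u8>>> {
    let file = File::open(path)?;
    ranges
        .iter()
        .map(|r| {
            let mut buf = vec![0u8; (r.end - r.start) as usize];
            // Sketch only: read_at may return a short read; real code retries.
            file.read_at(&mut buf, r.start)?;
            Ok(buf)
        })
        .collect()
}
```

Because `read_at` takes an explicit offset, concurrent range reads on the same `File` need no seek coordination, which is what makes fanning ranges out across workers safe.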
Enabled via `--tokio-uring-workers N` (default: `available_parallelism()`
on Linux, disabled elsewhere; `0` forces the legacy tokio MT path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
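The round-robin dispatch pattern in `util::tokio_uring_pool` can be sketched std-only: each worker here is a plain receive loop instead of a `tokio_uring::start` runtime, and the `UringPool` name and result-channel shape are hypothetical simplifications of the mpsc + oneshot plumbing the commit describes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

pub struct UringPool {
    senders: Vec<mpsc::Sender<Job>>,
    next: AtomicUsize,
}

impl UringPool {
    pub fn new(workers: usize) -> Self {
        let mut senders = Vec::new();
        for _ in 0..workers {
            let (tx, rx) = mpsc::channel::<Job>();
            senders.push(tx);
            // Real version: this thread runs tokio_uring::start(async { .. })
            // and the shipped closure builds a possibly-!Send future that is
            // spawned locally on this worker's ring.
            thread::spawn(move || {
                for job in rx {
                    job();
                }
            });
        }
        UringPool { senders, next: AtomicUsize::new(0) }
    }

    /// Ship a Send closure to the next worker round-robin; a channel
    /// plays the role of the oneshot carrying the result back.
    pub fn spawn<T, F>(&self, f: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.senders.len();
        self.senders[idx]
            .send(Box::new(move || {
                let _ = tx.send(f());
            }))
            .unwrap();
        rx
    }
}
```

The key trick mirrored here is that only the *closure* crosses threads and must be `Send`; whatever it constructs on the worker never leaves that thread.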
When the ObjectStore is called from inside a pool worker (the common case, since plan execution is already dispatched onto the pool), hop-free local dispatch is strictly better than the round-robin `pool.spawn`: bytes land on the same core that will consume them, and we skip the mpsc + oneshot round-trip.

Adds an `IN_WORKER` thread-local set by `run_worker`, and a fast path in `TokioUringObjectStore::read_ranges_uring` that uses `tokio_uring::spawn` on the current ring when `in_worker()` is true. The round-robin path is kept for callers on a non-uring runtime (e.g. during planning on the main tokio MT runtime).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
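The `IN_WORKER` fast-path check can be sketched with a thread-local flag; `dispatch` below is a hypothetical stand-in where the real code chooses between `tokio_uring::spawn` on the current ring and the round-robin `pool.spawn`:

```rust
use std::cell::Cell;
use std::sync::mpsc;
use std::thread;

thread_local! {
    // The real code sets this in `run_worker` for the worker's lifetime;
    // it stays false on every other thread.
    static IN_WORKER: Cell<bool> = Cell::new(false);
}

pub fn in_worker() -> bool {
    IN_WORKER.with(|f| f.get())
}

/// If already on a worker, run the job inline with no cross-thread hop
/// (the real fast path spawns on the current ring). Otherwise fall back
/// to cross-thread dispatch; a single ad-hoc thread stands in for the
/// round-robin pool here.
pub fn dispatch<T: Send + 'static>(job: impl FnOnce() -> T + Send + 'static) -> T {
    if in_worker() {
        // Hop-free: skips the mpsc + oneshot round-trip entirely.
        job()
    } else {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            IN_WORKER.with(|f| f.set(true));
            let _ = tx.send(job());
        });
        rx.recv().unwrap()
    }
}
```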
🤖 Benchmark running (GKE): comparing tokio-uring-benchmark-demo (b127c03) to 3b5008a (merge-base) using tpcds

🤖 Benchmark running (GKE): comparing tokio-uring-benchmark-demo (b127c03) to 3b5008a (merge-base) using clickbench_partitioned

🤖 Benchmark running (GKE): comparing tokio-uring-benchmark-demo (b127c03) to 3b5008a (merge-base) using tpch

🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs branch

🤖 Benchmark completed (GKE): tpcds — base (merge-base) vs branch
Which issue does this PR close?
Rationale for this change
Explores thread-per-core execution using `tokio-uring` in the benchmark harness. Each worker runs its own `tokio_uring::start` runtime (a current-thread Tokio reactor driven by `io_uring`), so Parquet I/O and decode can be scheduled on the same core that will consume the bytes, avoiding the cross-thread hops that the default multi-threaded Tokio runtime introduces.

This is a demo / experiment intended to measure the impact of thread-per-core + `io_uring` on ClickBench, not a production path.

What changes are included in this PR?
Two commits:

**bench: demonstrate tokio-uring thread-per-core execution**

* `util::tokio_uring_pool` — spawns N OS threads, each running its own `tokio_uring::start` runtime. `pool.spawn(|| async { ... })` ships a `Send` closure to a round-robin worker, where it can build a possibly-`!Send` future that runs locally on that worker's ring.
* `util::tokio_uring_store::TokioUringObjectStore` — local `ObjectStore` whose `get_opts` / `get_ranges` drive reads through `tokio_uring::fs::File::read_at`, dispatched across the pool. Writes / list / copy / delete delegate to `LocalFileSystem`.
* ClickBench uses the pool by default on Linux: each output partition of the physical plan is `pool.spawn`-ed, so plan execution itself happens on the io_uring runtimes. Top-level `CoalescePartitionsExec` / `SortPreservingMergeExec` are stripped so their N-partition children fan out across workers; the SPM merge is rebuilt over a `MemorySourceConfig` and re-executed on one worker.
* Enabled via `--tokio-uring-workers N` (default: `available_parallelism()` on Linux, disabled elsewhere; `0` forces the legacy tokio MT path).

**bench: keep tokio-uring reads local to the caller's worker**

* Adds an `IN_WORKER` thread-local set by `run_worker`. `TokioUringObjectStore::read_ranges_uring` uses `tokio_uring::spawn` on the current ring when `in_worker()` is true, skipping the round-robin mpsc + oneshot round-trip so bytes land on the same core that will consume them.

All changes are confined to `benchmarks/` and are Linux-only (`#[cfg(target_os = "linux")]`). No changes to the DataFusion core crates.

Are these changes tested?
Exercised via `dfbench clickbench` on Linux. No new unit tests — this is a benchmark harness wiring experiment.

Are there any user-facing changes?
No public API changes. New opt-in CLI flag `--tokio-uring-workers` on the benchmark binary (Linux only).