perf(pm): sequential writes within tarball extraction#2818
Conversation
Replace intra-package `par_iter` with a sequential loop when writing extracted tar entries to disk. Each tar entry is typically small and writes complete in microseconds, so splitting them into rayon tasks was causing heavy work-stealing (futex park/unpark) and dominating context switches on large dep graphs. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request modifies the tarball extraction logic in crates/pm/src/util/extractor.rs to process entries sequentially instead of in parallel. This change aims to reduce excessive context switching caused by rayon work-stealing on large dependency graphs, while maintaining cross-package parallelism. Feedback suggests consuming the entries collection during iteration to optimize memory usage by dropping file buffers immediately after they are written.
    // Write files sequentially. Cross-package parallelism is handled by the outer
    // rayon::spawn; splitting individual files into rayon tasks caused excessive
    // work-stealing ctx switches on large dep trees.
    for entry in &entries {
Since `entries` is not used after this loop, you can consume it by writing `for entry in entries` instead of iterating by reference. This allows each `ExtractedEntry` (and its potentially large content buffer) to be dropped immediately after it is written to disk, which can significantly reduce peak memory usage when extracting large packages.
Suggested change:

    - for entry in &entries {
    + for entry in entries {
- Cold bench: drop `| tail -1` so hyperfine's full summary (mean, stddev, range) reaches the log. Failure detection now uses exit status instead of piping.
- `BENCH_WARM_RUNS=0` skips the warm phase entirely (previously the warm function always ran and hyperfine would reject `--runs 0`).
- Result aggregator tolerates empty or malformed export-json files (e.g. when a PM's cold install fails): the offending file is reported and skipped instead of crashing the whole summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the sequential `for` loop over extracted tar entries with `par_chunks(WRITE_CHUNK_SIZE)` — each rayon task writes a contiguous run of 32 files sequentially. This retains multi-core IO overlap for large packages while cutting the rayon task count (and its work-stealing futex traffic) by the chunk factor versus a per-file `par_iter`. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`.

Local (macOS, antd-test, 3 runs avg):

    before par_iter: wall 17.2s  sys 6.18s  ivcsw 208k
    for-loop:        wall 15.3s  sys 2.36s  ivcsw  61k
    par_chunks(32):  wall 13.9s  sys 5.77s  ivcsw 191k

chunks wins wall but loses the ctx-switch reduction relative to the pure sequential version; CI with a large dep graph (ant-design-x) is the authoritative measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accumulate wall microseconds for download, extract, and clone across
all packages during install. Print a one-line summary alongside the
existing `added / reused / downloaded` counts, e.g.
+ 513 added · 3017 reused · 123 downloaded
download 135.8s · extract 2.3s · clone 0.4s · 19.0 MB fetched
The sums are non-exclusive across cores: dividing by wall clock
gives the effective concurrency for each phase, and the ratio
between phases shows where cold-install CPU time actually lands.
Overhead is three atomics per downloaded tarball.
Local antd-test (macOS, npmmirror, 77 packages, wall 16s): download
dominates 98% of the CPU budget, extract 1.6%, clone 0.3% — reshapes
where we should look for cold-install wins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needed so the per-phase timings line (`download · extract · clone · bytes`) printed at the end of each install reaches the CI log. Trade-off is noisier logs — registry INFO/WARN lines come through — but that's the price for visibility into where cold-install CPU actually lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Separates three independent measurements for utoo vs bun so each phase's improvement can be judged on its own baseline:

- Phase 1 · resolve — `utoo deps` / `bun install --lockfile-only`
- Phase 3 · cold install — `utoo install` / `bun install` (empty cache)
- Phase 4 · warm link — `utoo install` / `bun install` (cache warm)

Phase 3 uses the lockfile generated by phase 1, with cache reset between iterations. Phase 4 resets only node_modules so only the cache → node_modules link step is measured.

Uses hyperfine `--show-output` so utoo's phase-timings line (`download · extract · clone · bytes`) reaches the CI log alongside the wall-clock summary.

Triggered via workflow_dispatch with configurable project / registry / runs. Defaults to ant-design against npmjs.org, 3 runs per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anch merge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous inline `bash -c` prepare was silently a no-op on CI: utoo's run 2/3 showed '3280 reused', meaning the cache wasn't actually cleared, and bun hit InvalidNPMLockfile because utoo's package-lock.json leaked across iterations.

Now each phase writes a dedicated prepare shell script per-PM that:

- always drops node_modules (incl. workspace package trees),
- clears exactly the lockfiles that would confuse this PM,
- wipes the right cache for this phase,
- prints a '[prep]' line so the CI log proves prepare ran.

Also factored out seed_for_phase so lockfile / cache warmup happens once before the benchmark, not leaking into the measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che wipe Path-based rm -rf of $HOME/.cache/nm wasn't actually emptying the cache on the CI runner — utoo runs 2/3 of phase 3 still showed '3280 reused', wall was 0.8-1.1s instead of the 10s cold-install baseline, hyperfine itself warned about caches not being filled until after run 1. Let each PM clean its own cache via its CLI so we don't rely on guessing where it stores things. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`utoo clean` / `bun pm cache rm` didn't empty the cache on the CI runner either — so now use explicit bench-local paths the `rm -rf` prepare can guarantee to wipe:

- utoo: `--cache-dir=/tmp/utoo-bench-cache` on every invocation
- bun: `BUN_INSTALL_CACHE_DIR=/tmp/bun-bench-cache` (env var)

Gets us deterministic cold/warm state between hyperfine iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop into diagnostic mode to figure out why hyperfine's --prepare still leaves utoo's cache intact across iterations despite the explicit --cache-dir. Prints the generated prepare script, and logs each per-iteration invocation's before/after du -sh of both caches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `case $phase in p1) p3) p4)`-style patterns never matched the actual phase strings like "p1_resolve" / "p3_cold_install" / "p4_warm_link". Result: write_prepare produced a script containing only the common header and no phase-specific cache-wipe logic, so every run after the first hit a warm cache and timings collapsed.

Same off-by-name bug in seed_for_phase: the "p3:utoo" pattern never matched "p3_cold_install:utoo", skipping lockfile seeding and warm-cache priming.

Switched both to "p*_*" globs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-size before/after logs + generated-script dumps were diagnostic scaffolding used to trace the p* vs p*_resolve pattern mismatch. With that fixed, keep the plain hyperfine --prepare invocation so CI logs are readable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time Each hyperfine iteration now runs inside a metrics wrapper that greps /usr/bin/time -v output for RSS, voluntary/involuntary context switches, page faults, and IO read/write counts. Per-PM per-phase averages across the 3 runs are shown alongside the wall-clock table so we can see, e.g., whether utoo's resolve phase costs more syscalls than bun's, or whether its warm-link advantage comes at a memory cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linter-applied formatting cleanup, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original cap was sized for the FuturesUnordered preload that dispatched 128 simd_json parses through `spawn_blocking` in a burst — letting the default 512 cap run gave bimodal wall (M2: 2.7s fast / 6.9s thrash). Capping at `worker_threads` eliminated the thrash peak.

After commit f3f616d (inline parse), preload no longer uses the blocking pool. The dominant consumer is now `cloner.rs` during the install phase: every file's hardlink / clonefile / copy goes through `spawn_blocking`, ~50000 short syscalls per ant-design install. Each syscall is near-instant, so the cap rarely backpressures, but cap=4 on CI does limit how fast cloner can fire syscalls in parallel.

Raise the cap to `max(worker_threads * 4, 32)`: enough headroom for cloner to keep multiple syscalls in flight, low enough that the historical thrash regime (hundreds of churning threads) stays avoided. The pool is per-runtime; idle threads die after 10s.

Expected: small p3_cold_install improvement (current utoo 5.74s vs bun 7.71s); preload phase unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
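The cap formula itself is tiny — a minimal sketch (the `blocking_pool_cap` helper name is hypothetical; in the real code a value like this would feed tokio's `runtime::Builder::max_blocking_threads`):

```rust
// Cap formula from the commit message. `worker_threads` normally comes
// from the runtime config or available_parallelism().
fn blocking_pool_cap(worker_threads: usize) -> usize {
    std::cmp::max(worker_threads * 4, 32)
}

fn main() {
    assert_eq!(blocking_pool_cap(4), 32);  // 4-core CI runner: the floor applies
    assert_eq!(blocking_pool_cap(12), 48); // 12-core dev box: the 4x factor applies
    let host = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("cap on this host: {}", blocking_pool_cap(host));
}
```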
… 32)" This reverts commit 132ef36.
A/B test: replace `entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each` with a plain sequential `for entry in &entries` loop. Each tarball still runs in its own outer `rayon::spawn` task (cross-package parallelism preserved); only the within-tarball write fan-out is removed.

Goal: measure whether rayon's intra-package parallelism still earns its keep after the worker-pool preload rewrite. Cross-package parallelism alone may already saturate IO; if so, removing the inner par_chunks cuts work-stealing futex traffic + thread sync overhead with zero throughput cost.

If p3_cold_install regresses ≥0.3s → intra-package writes are genuinely IO-bound across cores, restore par_chunks. If p3 is unchanged or improves → simpler sequential code wins.

This is a test commit. Will be reverted if a regression is measured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…act" This reverts commit c7c847d.
`clone_dir` (Linux hardlink/copy path) was using `tokio::task::spawn_blocking` per package — at the default cap=4 on CI, only 4 packages cloned at once, each running all file hardlinks sequentially internally. ~3500 packages × N files per install all funneled through that bounded pool.

Switch to the same pattern `extractor.rs` already uses:

- `rayon::spawn` per package replaces `spawn_blocking` (cross-package parallelism via rayon work-stealing — global pool, not capped at worker_threads)
- `par_chunks(CLONE_CHUNK_SIZE)` for the inner hardlink/copy loop (intra-package fan-out across cores; same chunk size = 32 as extractor)

Trade-offs:

- The EXDEV `force_copy` latch is now per-chunk instead of global per clone — chunks each rediscover cross-device errors and fall back locally. A few extra hardlink-then-copy round-trips at chunk boundaries, acceptable for the rare cross-device install.
- Pool unification: the tokio blocking pool is now mostly idle (just git + http tarball + a few one-shot commands); rayon handles all the high-volume IO. Cuts the 3-pool fragmentation observed earlier.

Tested:

- Iter 1 of this loop (cap bump from N to max(N*4, 32)): no p3 win, p4 regressed → the cap raise alone wasn't the answer.
- Iter 2 (drop intra-package par_chunks in extractor): p3 +3.67s, σ exploded 0.04 → 2.85s → intra-package fan-out is essential.
- This commit applies the same fan-out to clone_dir for the same reason.

macOS `clonefile` path (target_os = "macos") unchanged — clonefile is a single syscall per file, a different perf profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts commit 9229e16.
The headline architectural change of #2818. ruborist's preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32, where Send isn't satisfied). Stacks on top of #2826.

## Why

Old design: the main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deep await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after the Mutex / allocator hot-path eliminations, while the standalone manifest-bench (same reqwest stack, no resolver — see #2824) hit 92 at the same cap.

## How

N long-lived `tokio::spawn` workers pull from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, with no cooperative poll bottleneck. The main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + the on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes the others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits.

## Trait surface change

- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); the call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`
- The bound at every public surface up the chain is bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. The worker pool surfaced parse blocking-pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Inline parse on the worker thread eliminates the dispatch + queue overhead; 1-5ms CPU per manifest is acceptable on an async worker.
- **Workspace package.json parallel reads** — `find_workspaces_from_pkg` switched from a sequential `for path in matched_paths { read }` loop to `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — the worker pool delivered the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s.

## Tests

`#[tokio::test(flavor = "multi_thread", worker_threads = 2)]` since the worker pool needs a spawn-able runtime; ruborist's dev-dependencies on `tokio` add the `rt-multi-thread` feature. 164 ruborist + 10 doctests + 248/249 utoo-pm pass (1 pre-existing flake on `test_update_package_binary_fsevents`, runs green alone).

## Wasm

CI cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm, but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
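The dispatched/completed termination protocol can be modeled with std threads alone — a sketch where `Mutex` + `Condvar` stand in for the lock-free `SegQueue` + `Notify`, and `std::sync::mpsc` stands in for the tokio completion channel (the "resolve" step and the `u32` dep type are hypothetical simplifications):

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::{mpsc, Arc, Condvar, Mutex};

struct Pool {
    // (queue, dedup set, dispatched, completed, shutdown) — kept under one
    // lock in this sketch; the real code uses SegQueue/DashSet/atomics.
    state: Mutex<(VecDeque<u32>, HashSet<u32>, usize, usize, bool)>,
    cv: Condvar, // stands in for tokio::sync::Notify
}

fn run(n_workers: usize, roots: Vec<u32>) -> Vec<u32> {
    let pool = Arc::new(Pool {
        state: Mutex::new((VecDeque::new(), HashSet::new(), 0, 0, false)),
        cv: Condvar::new(),
    });
    {
        let mut st = pool.state.lock().unwrap();
        for r in roots {
            if st.1.insert(r) { st.0.push_back(r); st.2 += 1; } // dedup + dispatch
        }
    }
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for _ in 0..n_workers {
        let (pool, tx) = (Arc::clone(&pool), tx.clone());
        handles.push(std::thread::spawn(move || loop {
            let dep = {
                let mut st = pool.state.lock().unwrap();
                loop {
                    if st.4 { return; }              // shutdown raised: exit, dropping tx
                    if let Some(d) = st.0.pop_front() { break d; }
                    st = pool.cv.wait(st).unwrap();  // park until work or shutdown
                }
            };
            // "resolve": every dep < 5 discovers one child dep (dep * 2)
            let child = dep * 2;
            let mut st = pool.state.lock().unwrap();
            if dep < 5 && st.1.insert(child) {
                st.0.push_back(child);
                st.2 += 1;                           // dispatched += 1
                pool.cv.notify_one();
            }
            st.3 += 1;                               // completed += 1
            tx.send(dep).unwrap();                   // report completion to main
            if st.3 == st.2 && st.0.is_empty() {
                st.4 = true;                         // last completion: raise shutdown
                pool.cv.notify_all();
            }
        }));
    }
    drop(tx); // only workers hold senders now; channel closes when they exit
    let mut done: Vec<u32> = rx.iter().collect(); // main task drains completions
    for h in handles { h.join().unwrap(); }
    done.sort();
    done
}

fn main() {
    println!("{:?}", run(3, vec![1, 3]));
}
```

Because completions are counted under the same lock as the queue, the shutdown check (`completed == dispatched && queue empty`) cannot fire while a dep is still in flight — the same invariant the atomics + `Notify` version maintains.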
…p guard

Two folded changes that started life as separate commits on the parent perf branch:

1. **Sequential / chunked parallel writes** (was: ad0dee9 → 7ab17b8). The old per-file `par_iter().for_each(write)` paid work-stealing futex park/unpark overhead per write. Each entry is <64 KB and a single `fs::File::create` + `write_all` returns in μs — the rayon scheduler dominated. Switch to `entries.par_chunks(32).try_for_each(...)`: each rayon task writes a contiguous run of 32 files sequentially. Cuts task count by 32× while keeping multi-core IO-overlap parallelism. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`, which itself had already landed on `next` via earlier work.

   Verified essential by an A/B on the parent branch: removing the intra-package par_chunks (sequential `for entry in &entries` inside each rayon task) regressed CI p3_cold_install +3.67s and exploded σ from 0.04 → 2.85 — IO can't interleave across cores when each tarball serialises its own writes.

2. **Tar Slip guard** — reject tar entries whose path is absolute or contains `..` components before joining with the destination. Without this, an attacker-controlled tarball could overwrite arbitrary files via paths like `../../etc/foo` or `/etc/passwd`. The `tar` crate does not enforce this by default; `npm` and `pnpm` both validate. We log+skip such entries.

Both changes touch the same single function, so they commit together. CI bench shows p3_cold_install at 5.74s vs bun 7.71s (utoo ~2s ahead). The PR description in #2818 documents the full A/B journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
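The Tar Slip check reduces to a small path predicate — a std-only sketch of the rule described above (the `is_safe_entry_path` name is hypothetical; the real guard also logs and skips offenders):

```rust
use std::path::{Component, Path};

// Reject entry paths that are absolute or contain `..` components
// before they are ever joined onto the extraction destination.
fn is_safe_entry_path(entry_path: &Path) -> bool {
    !entry_path.is_absolute()
        && entry_path
            .components()
            .all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}

fn main() {
    assert!(is_safe_entry_path(Path::new("package/lib/index.js")));
    assert!(!is_safe_entry_path(Path::new("../../etc/foo"))); // traversal
    assert!(!is_safe_entry_path(Path::new("/etc/passwd")));   // absolute
    println!("guard ok");
}
```

Checking `Component`s rather than scanning for the literal substring `..` avoids false positives on legitimate names like `..foo`.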
Bundle of independently-motivated allocator + cache hot-path optimisations from the parent perf branch (#2818). Each landed during the worker-pool exploration but doesn't depend on the worker-pool architecture itself — they stand alone as straightforward perf wins for the resolver.

## TLS provider — `aws-lc-rs` instead of `ring`

`reqwest` 0.12's default `rustls-tls-native-roots` feature pins `ring` via Cargo's feature unification. Switch to `rustls-tls-native-roots-no-provider`, build our own `rustls::ClientConfig` with the `aws_lc_rs` provider, and pass it via `Client::use_preconfigured_tls`. CI measurement (4-core ubuntu vs npmjs.org): ring's per-handshake CCS→AppData was 78 ms p50 / 154 ms max, with all 128 parallel handshakes serialising across 4 cores. aws-lc-rs (BoringSSL primitives) is ~3× faster on x86_64. Saved ~420 ms preload on cold ant-design.

## DNS — per-family rotation

`getaddrinfo` typically returns 10 v6 + 12 v4 addresses for npmjs.org. A flat rotation across the joined list meant offsets 0..10 all started inside the v6 range; on hosts where v6 routing fails (GitHub Actions runners), every connection fell through to the *same* first-reachable v4. Rotate per-family so v4 conns cycle across all v4 addresses (and v6 over v6) — an observed pcap on bun shows the same 4×64 distribution we now produce.

## Disk-cache bulk-readdir ETag index

`PackageCache` lazy-builds a `HashSet<String>` of names with existing disk cache entries from a single `read_dir(cache_dir)` + per-`@scope` recurse. `get_versions_from_disk` and `get_version_manifest_from_disk` short-circuit via the index. Restores the warm-run 304 path that was temporarily removed in 46cb803 (per-package `try_exists` was 16 ms avg on the cold-run critical path; now zero).

## Lazy per-version `CoreVersionManifest` via `simd_json::OwnedValue`

`Versions` now stores `keys: Vec<String>` (ordered version list) + `trees: HashMap<String, Arc<simd_json::OwnedValue>>` (pre-parsed JSON subtrees). The strongly-typed `CoreVersionManifest` is materialised on demand via `CoreVersionManifest::deserialize(tree.as_ref())` — zero-copy through `simd_json::OwnedValue`'s `Deserializer` impl, memoised in a `DashMap`. The resolver typically reads 1-3 of the ~500 versions per manifest; the previous design built every one eagerly.

## `Arc<FullManifest>` in `MemoryCache`

The cache previously returned `FullManifest` by value, deep-cloning the per-version HashMap (100-500 entries × String key clone + Arc bump per cache hit) on the resolver hot path. ~2730 cache hits during cold preload × ~200-entry HashMap clone = ~500k allocations on shared resolver threads, contending the allocator. Wrap in `Arc<FullManifest>`; a cache hit becomes one atomic bump.

## `normalize_spec` returns `Cow<'a, str>`

Was unconditionally allocating `(String, String)` even for the ~99% of deps with no `npm:` / `workspace:` prefix. ~5460 String allocations per ant-design preload, all on the resolver hot path. The common path now returns `Cow::Borrowed`.

## Drop `versions.keys.clone()` from the cache-hit path

`resolve_package`'s full-manifest cache-hit branch was cloning the entire `versions.keys: Vec<String>` (~200 entries) just to pass `&[String]` to `resolve_target_version`. Borrow directly via Arc auto-deref. ~360k String allocs eliminated (~1800 cache hits × ~200 entries).

## OnceMap dedup

New `crate::util::oncemap` module: a `DashMap` + `tokio::sync::Notify` coalescer for concurrent `resolve_full_manifest` callers of the same name. The first caller fetches from the network; the others wait on the shared `Notify` and read the cached `Arc<V>`. Replaces the prior per-name `tokio::sync::Mutex<()>` gate that serialised the hot dispatch path.

## tracing file_filter info+ default

File-layer log filter dropped from `utoo=debug` to `utoo=info`. Hot-path `tracing::debug!()` calls (cache hits, BFS dispatch, preload events) emit ~5-10 events per resolved manifest. With 2730+ manifests during cold preload, that's 15-30k events that — even routed through the non_blocking appender's channel — pay format/serialise CPU on the resolving thread before the channel send. Override via `UTOO_FILE_LOG=debug` for diagnostics.

## indicatif progress bar — drop per-package message updates

`PreloadFetching` and `PreloadProgress` used to call `format!("fetching/resolved {}", name)` + `PROGRESS_BAR.set_message()` per event. With ~9000 such calls per ant-design preload and an indicatif-internal `Mutex` per call, this serialised the main loop's fill-and-drain rate. The user can't visually parse 5460 message swaps in 3 seconds anyway. The counter still ticks via `PROGRESS_BAR.inc(1)`.

## HTTP + parse diagnostic infrastructure (used by PR4)

`service/http.rs` ships `start_http_trace` / `finish_http_trace` + `start_parse_trace` / `finish_parse_trace`, plus `record_http_interval` + `record_parse_interval` callbacks. `#[allow(dead_code)]` on the start/finish for now — the preload worker-pool refactor in the next PR (#TBD) wires them in.

Also bumps the `+ Sync` bound on `RegistryClient` callers in `builder.rs` / `preload.rs` / `resolver/registry.rs` — required because the trait's default-method futures gained `+ Send` (needed downstream by tokio::spawn, but already correct for single-threaded resolvers too).

Tests: 164 ruborist + 248/249 utoo-pm pass (1 pre-existing flake on `test_update_package_binary_fsevents` when run in parallel, passes alone).

Stacks: PR4 (preload worker-pool architecture) targets this branch and adds the bound propagation + spawn refactor on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
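The `Cow` change above is the simplest of the bundle to illustrate — a minimal sketch assuming a simplified `npm:name@range` alias format (the real `normalize_spec` parsing is richer and also handles `workspace:`):

```rust
use std::borrow::Cow;

// Sketch of a Cow-returning normalize_spec: the ~99% no-prefix case
// borrows the input, so only aliased specs pay an allocation.
fn normalize_spec(spec: &str) -> Cow<'_, str> {
    if let Some(rest) = spec.strip_prefix("npm:") {
        // hypothetical simplification: `npm:name@range` resolves to `range`
        match rest.rsplit_once('@') {
            Some((_, range)) if !range.is_empty() => Cow::Owned(range.to_string()),
            _ => Cow::Owned(rest.to_string()),
        }
    } else {
        Cow::Borrowed(spec) // common path: zero allocation
    }
}

fn main() {
    assert!(matches!(normalize_spec("^5.0.0"), Cow::Borrowed(_)));
    assert_eq!(normalize_spec("npm:antd@^5.0.0"), "^5.0.0");
    println!("{}", normalize_spec("^1.2.3"));
}
```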
The headline architectural change of #2818 — preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32 where Send isn't satisfied). Stacks on top of #2826. ## Why Old design: main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deeper await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after Mutex / allocator hot-path eliminations, while the standalone manifest-bench (#2824) hit 92 on the same reqwest stack. ## How N long-lived `tokio::spawn` workers pulling from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, no cooperative poll bottleneck. Main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + on_manifest callback. Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits. ## Trait surface change - `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature - `preload_manifests` takes `registry: Arc<R>` (was `&R`); call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`. 
Bound at every public surface up the chain bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible. ## Companion changes folded in - **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. Worker-pool surfaced parse blocking- pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Inline parse on the worker thread eliminates dispatch + queue overhead. - **Workspace package.json parallel reads** — switch the per-pattern `for path in matched_paths` serial loop to `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms. - **Setup phase + lockfile-write timing logs** — round out the per-phase wall account for the bench-comment infrastructure. - **Manifests concurrency cap 64 → 128** — worker-pool delivers the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s. ## Wasm CI cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean. Tests: 164 ruborist + 10 doctests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p guard Two folded changes that started life as separate commits on the parent perf branch: 1. **Sequential / chunked parallel writes** (was: ad0dee9 → 7ab17b8). The old per-file `par_iter().for_each(write)` paid work-stealing futex park/unpark overhead per write. Each entry is <64 KB and a single fs::File::create + write_all returns in μs — rayon scheduler dominated. Switch to `entries.par_chunks(32) .try_for_each(...)`: each rayon task writes a contiguous run of 32 files sequentially. Cuts task count by 32× while keeping multi-core IO-overlap parallelism. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`, which itself was already landed on `next` via earlier work. Verified essential by an A/B on the parent branch: removing the intra-package par_chunks (sequential `for entry in &entries` inside each rayon task) regressed CI p3_cold_install +3.67s and exploded σ from 0.04 → 2.85 — IO can't interleave across cores when each tarball serialises its own writes. 2. **Tar Slip guard** — reject tar entries whose path is absolute or contains `..` components before joining with the destination. Without this an attacker-controlled tarball could overwrite arbitrary files via paths like `../../etc/foo` or `/etc/passwd`. `tar` crate does not enforce this by default; `npm` and `pnpm` both validate. We log+skip such entries. Both changes touch the same single function so they commit together. CI bench shows p3_cold_install at 5.74s vs bun 7.71s (utoo +2s ahead). PR description in #2818 documents the full A/B journey. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle of independently-motivated allocator + cache hot-path optimisations from the parent perf branch (#2818). Each landed during the worker-pool exploration but doesn't depend on the worker-pool architecture itself — they stand alone as straightforward perf wins for the resolver. ## TLS provider — `aws-lc-rs` instead of `ring` `reqwest` 0.12's default `rustls-tls-native-roots` feature pins `ring` via Cargo's feature unification. Switch to `rustls-tls-native-roots-no-provider`, build our own `rustls::ClientConfig` with the `aws_lc_rs` provider, pass via `Client::use_preconfigured_tls`. CI measurement (4-core ubuntu vs npmjs.org): ring's per-handshake CCS→AppData was 78 ms p50 / 154 ms max, all 128 parallel handshakes serialising across 4 cores. aws-lc-rs (BoringSSL primitives) is ~3× faster on x86_64. Saved ~420 ms preload on cold ant-design. ## DNS — per-family rotation `getaddrinfo` typically returns 10 v6 + 12 v4 for npmjs.org. A flat rotation across the joined list meant offsets 0..10 all started inside the v6 range; on hosts where v6 routing fails (GitHub Actions runners), every connection fell through to the *same* first-reachable v4. Rotate per-family so v4 conns cycle across all v4 addresses (and v6 over v6) — observed pcap on bun shows the same 4×64 distribution we now produce. ## Disk-cache bulk-readdir ETag index `PackageCache` lazy-builds a `HashSet<String>` of names with existing disk cache entries from a single `read_dir(cache_dir)` + per-`@scope` recurse. `get_versions_from_disk` and `get_version_manifest_from_disk` short-circuit via the index. Restores the warm-run 304 path that was temporarily removed in 46cb803 (per-package `try_exists` was 16 ms avg on the cold-run critical path; now zero). ## Lazy per-version `CoreVersionManifest` via `simd_json::OwnedValue` `Versions` now stores `keys: Vec<String>` (ordered version list) + `trees: HashMap<String, Arc<simd_json::OwnedValue>>` (pre-parsed JSON subtrees). 
Strongly-typed `CoreVersionManifest` is materialised on demand via `CoreVersionManifest::deserialize(tree.as_ref())` — zero-copy through `simd_json::OwnedValue`'s `Deserializer` impl, memoised in a `DashMap`. Resolver typically reads 1-3 of the ~500 versions per manifest; previous design built every one eagerly. ## `Arc<FullManifest>` in `MemoryCache` Cache previously returned `FullManifest` by value, deep-cloning the per-version HashMap (100-500 entries × String key clone + Arc bump per cache hit) on the resolver hot path. ~2730 cache hits during cold preload × ~200-entry HashMap clone = ~500k allocations on shared resolver threads, contending the allocator. Wrap in `Arc<FullManifest>`; cache hit becomes one atomic bump. ## `normalize_spec` returns `Cow<'a, str>` Was unconditionally allocating `(String, String)` even for the ~99 % of deps with no `npm:` / `workspace:` prefix. ~5460 String allocations per ant-design preload, all on resolver hot path. Common path now returns `Cow::Borrowed`. ## Drop `versions.keys.clone()` from cache-hit path `resolve_package`'s full-manifest cache-hit branch was cloning the entire `versions.keys: Vec<String>` (~200 entries) just to pass `&[String]` to `resolve_target_version`. Borrow directly via Arc auto-deref. ~360k String allocs eliminated (~1800 cache hits × ~200 entries). ## OnceMap dedup New `crate::util::oncemap` module: `DashMap` + `tokio::sync::Notify` coalescer for concurrent `resolve_full_manifest` callers of the same name. First caller fetches the network; others wait on the shared `Notify` and read the cached `Arc<V>`. Replaces the prior per-name `tokio::sync::Mutex<()>` gate that serialised the hot dispatch path. ## tracing file_filter info+ default File-layer log filter dropped from `utoo=debug` to `utoo=info`. Hot-path `tracing::debug!()` calls (cache hits, BFS dispatch, preload events) emit ~5-10 events per resolved manifest. 
With 2730+ manifests during cold preload that's 15-30k events that — even routed through the non_blocking appender's channel — pay format/serialise CPU on the resolving thread before the channel send. Override via `UTOO_FILE_LOG=debug` for diagnostics.

## indicatif progress bar — drop per-package message updates

`PreloadFetching` and `PreloadProgress` used to call `format!("fetching/resolved {}", name)` + `PROGRESS_BAR.set_message()` per event. With ~9000 such calls per ant-design preload and an indicatif-internal `Mutex` per call, this serialised the main loop's fill-and-drain rate. The user can't visually parse 5460 message swaps in 3 seconds anyway. The counter still ticks via `PROGRESS_BAR.inc(1)`.

## HTTP + parse diagnostic infrastructure (used by PR4)

`service/http.rs` ships `start_http_trace` / `finish_http_trace` + `start_parse_trace` / `finish_parse_trace` plus `record_http_interval` + `record_parse_interval` callbacks. `#[allow(dead_code)]` on the start/finish pairs for now — the preload worker-pool refactor in the next PR (#TBD) wires them in.

Also bumps the `+ Sync` bound on `RegistryClient` callers in `builder.rs` / `preload.rs` / `resolver/registry.rs` — required because the trait's default-method futures gained `+ Send` (needed downstream by `tokio::spawn`, but already correct for single-threaded resolvers too).

Tests: 164 ruborist + 248/249 utoo-pm pass (1 pre-existing flake on `test_update_package_binary_fsevents` when run in parallel; passes alone).

Stacks: PR4 (preload worker-pool architecture) targets this branch and adds the bound propagation + spawn refactor on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
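The `Arc<FullManifest>` change described above can be sketched with a hypothetical, pared-down manifest type — a cache hit becomes one refcount bump instead of a deep clone of the versions map:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical, pared-down stand-in for the real FullManifest.
struct FullManifest {
    versions: HashMap<String, String>, // version -> integrity, say
}

struct MemoryCache {
    entries: HashMap<String, Arc<FullManifest>>,
}

impl MemoryCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    fn insert(&mut self, name: &str, manifest: FullManifest) {
        self.entries.insert(name.to_string(), Arc::new(manifest));
    }

    /// A hit is a single atomic refcount bump. Before the change this
    /// returned `FullManifest` by value, deep-cloning the versions map
    /// (100-500 String keys) on every hit.
    fn get(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.entries.get(name).map(Arc::clone)
    }
}

fn main() {
    let mut cache = MemoryCache::new();
    let mut versions = HashMap::new();
    versions.insert("1.0.0".to_string(), "sha512-aaa".to_string());
    cache.insert("left-pad", FullManifest { versions });
    let a = cache.get("left-pad").unwrap();
    let b = cache.get("left-pad").unwrap();
    println!("shared allocation: {}", Arc::ptr_eq(&a, &b));
}
```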
The headline architectural change of #2818 — the preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32, where Send isn't satisfied). Stacks on top of #2826.

## Why

Old design: the main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deep await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after the Mutex / allocator hot-path eliminations, while the standalone manifest-bench (#2824) hit 92 on the same reqwest stack.

## How

N long-lived `tokio::spawn` workers pull from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, with no cooperative poll bottleneck. The main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + the on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes the others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits.

## Trait surface change

- `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); the call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`.
- Bound at every public surface up the chain bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. The worker pool surfaced parse blocking-pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Parsing inline on the worker thread eliminates the dispatch + queue overhead.
- **Workspace package.json parallel reads** — switch the per-pattern `for path in matched_paths` serial loop to a `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the per-phase wall accounting for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — the worker pool delivers the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s.

## Wasm

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm, but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` is clean.

Tests: 164 ruborist + 10 doctests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
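The queue + dedup + `dispatched`/`completed` termination protocol described above can be sketched with std stand-ins (OS threads for tokio workers, `Mutex<VecDeque>` / `Mutex<HashSet>` for `SegQueue` / `DashSet`, spin-yield instead of `Notify`); all names are illustrative:

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

struct Pool {
    queue: Mutex<VecDeque<String>>, // stand-in for SegQueue<Dep>
    seen: Mutex<HashSet<String>>,   // stand-in for DashSet dedup
    dispatched: AtomicUsize,
    completed: AtomicUsize,
    shutdown: AtomicBool,
}

impl Pool {
    fn new() -> Arc<Self> {
        Arc::new(Pool {
            queue: Mutex::new(VecDeque::new()),
            seen: Mutex::new(HashSet::new()),
            dispatched: AtomicUsize::new(0),
            completed: AtomicUsize::new(0),
            shutdown: AtomicBool::new(false),
        })
    }

    /// Dedup: only the first enqueue of a name is dispatched.
    fn push(&self, dep: &str) {
        if self.seen.lock().unwrap().insert(dep.to_string()) {
            self.dispatched.fetch_add(1, Ordering::SeqCst);
            self.queue.lock().unwrap().push_back(dep.to_string());
        }
    }
}

/// Spawn `workers` threads; return every completed dep once the pool drains.
fn run(pool: Arc<Pool>, workers: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    for _ in 0..workers {
        let (pool, tx) = (Arc::clone(&pool), tx.clone());
        thread::spawn(move || loop {
            if pool.shutdown.load(Ordering::SeqCst) {
                break; // the last completion raised the flag
            }
            let dep = pool.queue.lock().unwrap().pop_front();
            match dep {
                Some(dep) => {
                    tx.send(dep).unwrap(); // "resolved" — report completion
                    let done = pool.completed.fetch_add(1, Ordering::SeqCst) + 1;
                    if done == pool.dispatched.load(Ordering::SeqCst)
                        && pool.queue.lock().unwrap().is_empty()
                    {
                        pool.shutdown.store(true, Ordering::SeqCst);
                    }
                }
                None => thread::yield_now(), // real workers park on a Notify
            }
        });
    }
    drop(tx); // main holds no sender: the channel closes when workers exit
    rx.into_iter().collect()
}

fn main() {
    let pool = Pool::new();
    for dep in ["react", "react", "lodash"] {
        pool.push(dep); // duplicate "react" is deduped
    }
    let mut done = run(Arc::clone(&pool), 4);
    done.sort();
    println!("{done:?}");
}
```

As in the real pool, in-flight items keep `completed < dispatched`, so the shutdown flag can only be raised once nothing remains queued or running.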
- delete crates/manifest-bench (debug-only, never merged)
- tombi format crates/ruborist/Cargo.toml
- typos: unparseable → unparsable in bench/pm-bench.sh
📊 pm-bench-phases

npmjs.org
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 9.54s | 0.28s | 10.18s | 10.17s | 653M | 330.6K |
| utoo-npm | 10.23s | 0.20s | 11.61s | 13.29s | 1.13G | 159.4K |
| utoo | 9.21s | 1.07s | 11.20s | 12.27s | 2.26G | 260.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 17.3K | 18.1K | 1.16G | 7M | 1.83G | 1.72G | 1M |
| utoo-npm | 174.2K | 160.1K | 1.14G | 4M | 1.68G | 1.68G | 2M |
| utoo | 79.1K | 40.4K | 1.13G | 5M | 1.68G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.37s | 0.08s | 3.87s | 1.06s | 483M | 174.2K |
| utoo-npm | 6.01s | 0.60s | 6.07s | 1.09s | 430M | 74.5K |
| utoo | 2.65s | 0.05s | 5.62s | 1.97s | 1.44G | 193.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 12.0K | 3.3K | 200M | 3M | 104M | - | 1M |
| utoo-npm | 68.8K | 2.5K | 202M | 2M | 9M | 5M | 2M |
| utoo | 18.3K | 15.1K | 197M | 3M | 7M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 7.47s | 0.71s | 6.14s | 10.00s | 595M | 203.7K |
| utoo-npm | 9.40s | 1.22s | 5.61s | 12.10s | 905M | 122.3K |
| utoo | 7.69s | 2.82s | 5.46s | 10.86s | 878M | 107.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 6.7K | 7.4K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 153.9K | 109.7K | 965M | 4M | 1.67G | 1.67G | 2M |
| utoo | 89.8K | 43.0K | 965M | 3M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.35s | 0.06s | 0.23s | 2.32s | 135M | 32.1K |
| utoo-npm | 2.28s | 0.18s | 0.61s | 3.91s | 84M | 19.6K |
| utoo | 2.12s | 0.04s | 0.40s | 3.42s | 64M | 13.7K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 382 | 32 | 7M | 38K | 1.88G | 1.72G | 1M |
| utoo-npm | 53.0K | 22.0K | 21K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.7K | 8.9K | 20K | 10K | 1.68G | 1.67G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 28.79s | 6.58s | 9.42s | 10.19s | 523M | 367.5K |
| utoo-npm | 30.49s | 13.81s | 8.04s | 14.54s | 681M | 116.6K |
| utoo | 14.32s | 0.63s | 7.42s | 12.27s | 894M | 132.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 92.9K | 5.5K | 1.12G | 12M | 1.85G | 1.73G | 2M |
| utoo-npm | 250.4K | 98.7K | 978M | 9M | 1.67G | 1.68G | 2M |
| utoo | 154.1K | 61.6K | 984M | 9M | 1.67G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.58s | 0.04s | 4.00s | 1.11s | 552M | 185.7K |
| utoo-npm | 6.66s | 1.17s | 2.12s | 0.60s | 74M | 16.1K |
| utoo | 1.12s | 0.10s | 1.15s | 0.38s | 88M | 18.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 5.5K | 5.9K | 151M | 3M | 106M | - | 2M |
| utoo-npm | 47.6K | 622 | 13M | 2M | - | 4M | 2M |
| utoo | 15.1K | 917 | 17M | 3M | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 30.52s | 19.81s | 5.92s | 9.44s | 238M | 95.5K |
| utoo-npm | 44.38s | 33.35s | 6.23s | 12.96s | 612M | 108.7K |
| utoo | 20.24s | 3.22s | 5.80s | 11.57s | 663M | 98.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 69.2K | 3.4K | 999M | 9M | 1.73G | 1.73G | 2M |
| utoo-npm | 198.8K | 99.0K | 984M | 7M | 1.67G | 1.67G | 2M |
| utoo | 137.1K | 49.8K | 968M | 7M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.32s | 0.05s | 0.22s | 2.33s | 135M | 31.3K |
| utoo-npm | 2.57s | 0.15s | 0.63s | 3.98s | 84M | 19.7K |
| utoo | 2.13s | 0.30s | 0.43s | 3.43s | 65M | 14.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 759 | 24 | 7M | 86K | 1.88G | 1.72G | 2M |
| utoo-npm | 55.3K | 23.0K | 39K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.5K | 9.4K | 40K | 12K | 1.67G | 1.67G | 2M |
📊 pm-bench-phases

npmjs.org
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 17.84s | 3.14s | 6.29s | 19.27s | 793M | 51.2K |
| utoo-npm | 23.09s | 0.97s | 11.05s | 27.49s | 970M | 97.7K |
| utoo | 18.90s | 1.47s | 9.54s | 22.98s | 1.97G | 176.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 16.9K | 145.3K | - | - | 1.76G | 1.91G | 1M |
| utoo-npm | 13.2K | 381.6K | - | - | 1.63G | 1.83G | 2M |
| utoo | 4.4K | 216.5K | - | - | 1.63G | 1.88G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.97s | 0.12s | 2.26s | 0.95s | 505M | 32.9K |
| utoo-npm | 4.86s | 0.13s | 3.99s | 2.05s | 542M | 36.7K |
| utoo | 3.00s | 0.13s | 3.92s | 2.06s | 1.62G | 107.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 10 | 24.0K | - | - | 110M | - | 1M |
| utoo-npm | 13 | 78.9K | - | - | 28M | 5M | 2M |
| utoo | 42 | 46.7K | - | - | 27M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 20.05s | 3.20s | 4.01s | 22.14s | 531M | 34.5K |
| utoo-npm | 18.67s | 1.75s | 4.82s | 24.00s | 737M | 80.8K |
| utoo | 12.40s | 2.43s | 3.74s | 17.48s | 718M | 77.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.8K | 138.4K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.5K | 242.7K | - | - | 1.61G | 1.83G | 2M |
| utoo | 1.3K | 154.1K | - | - | 1.61G | 1.83G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 5.12s | 0.62s | 0.11s | 2.25s | 48M | 3.7K |
| utoo-npm | 4.00s | 0.35s | 0.57s | 2.92s | 91M | 6.8K |
| utoo | 3.86s | 0.57s | 0.37s | 2.56s | 82M | 5.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 15.6K | 876 | - | - | 1.86G | 1.90G | 1M |
| utoo-npm | 13.0K | 73.4K | - | - | 1.61G | 1.82G | 2M |
| utoo | 13.7K | 20.1K | - | - | 1.63G | 1.82G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 56.79s | 22.62s | 6.67s | 18.61s | 556M | 36.0K |
| utoo-npm | 63.80s | 40.68s | 8.94s | 24.20s | 641M | 74.0K |
| utoo | 29.08s | 7.31s | 7.43s | 23.56s | 719M | 79.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 13.8K | 175.9K | - | - | 1.79G | 1.90G | 2M |
| utoo-npm | 4.1K | 472.4K | - | - | 1.61G | 1.87G | 2M |
| utoo | 1.9K | 285.5K | - | - | 1.61G | 1.87G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 35.23s | 3.97s | 2.99s | 1.86s | 499M | 32.5K |
| utoo-npm | 27.66s | 16.35s | 2.53s | 1.55s | 80M | 5.8K |
| utoo | 10.49s | 10.44s | 1.62s | 0.71s | 92M | 6.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 49 | 37.1K | - | - | 113M | - | 2M |
| utoo-npm | 15 | 50.4K | - | - | - | 4M | 2M |
| utoo | 31 | 28.4K | - | - | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 23.13s | 0.12s | 3.99s | 19.21s | 269M | 17.8K |
| utoo-npm | 36.33s | 2.43s | 5.41s | 19.41s | 699M | 76.8K |
| utoo | 32.64s | 0.65s | 5.45s | 20.26s | 679M | 77.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 1.8K | 156.7K | - | - | 1.64G | 1.91G | 2M |
| utoo-npm | 1.6K | 334.1K | - | - | 1.60G | 1.83G | 2M |
| utoo | 1.3K | 251.7K | - | - | 1.60G | 1.83G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 4.82s | 0.82s | 0.15s | 2.45s | 53M | 4.0K |
| utoo-npm | 5.29s | 1.06s | 0.81s | 4.04s | 88M | 6.5K |
| utoo | 5.60s | 0.06s | 0.51s | 3.54s | 85M | 6.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.6K | 4.5K | - | - | 1.87G | 1.93G | 2M |
| utoo-npm | 12.4K | 78.9K | - | - | 1.61G | 1.87G | 2M |
| utoo | 13.1K | 20.8K | - | - | 1.61G | 1.87G | 2M |
Summary
Started as "intra-package sequential tarball writes" (single-line perf tweak) and evolved into a multi-week, data-driven preload-resolver overhaul plus a complete bench infrastructure. This PR is now too large to review safely and is being split — see "PR split plan" at the bottom.
This description captures the full exploration journey so subsequent split-PRs can reference it as context.
End-to-end results (ant-design / npmjs.org / GitHub Actions ubuntu)
utoo now matches or beats bun on 3 of 4 phases. The remaining p1 gap (+0.4s) is fully accounted for and at the architectural floor.
Journey timeline — p1_resolve preload wall
Key architectural moves
1. Worker-pool preload (commit ed7b551e) — the core win

Problem: `FuturesUnordered` polled all preload futures from a single main task. Each `resolve_package` call had 5+ awaits (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse). Every yield round-tripped through main, saturating it. Even after killing every Mutex/clone hot path, avg_conc held at 55-60 while the standalone manifest-bench (same reqwest stack, no resolver) sustained 92.

Fix: N long-lived `tokio::spawn` workers pulling work from `Arc<SegQueue<Dep>>` with `DashSet` dedup. Workers run on tokio's global executor independently; the main task only drains an `mpsc::unbounded_channel` for receiver events + the on_manifest callback. Termination via `dispatched` / `completed: AtomicUsize` + `Notify`.

Trait surface: `RegistryClient` futures gained `+ Send` bounds, `MockRegistryClient` derives `Clone`. `preload_manifests` takes `Arc<R>` instead of `&R`. Bound `R: Clone + Send + Sync + 'static, R::Error: Send` propagated up the API chain.

Result: avg_conc 55 → 84 (CI), wall 3.10s → 2.15s (-31%).
2. HTTP stack ceiling — empirically verified (`crates/manifest-bench`)

Built a standalone HTTP-only fetch tool that strips out everything ruborist does on top of the network: BFS, dedup, parse, project cache, lockfile. It dispatches an identical workload through the identical reqwest+rustls+tokio stack.

Key data point (CI ant-design npmjs.org cap=128, controlled for the same Cloudflare conditions):

ruborist now matches the HTTP stack ceiling. Further preload speedup requires a non-reqwest stack (HTTP/3, custom hyper Connector, etc.). This empirically refuted ~10 candidate optimizations as not the bottleneck.
3. Allocator hot-path cleanup

Each `resolve_package` call allocated several `String`s on the resolver hot path — cumulatively ~10k allocs per ant-design preload, all on workers competing for the shared allocator. Eliminated via:

- `normalize_spec` returns `Cow<'_, str>` instead of `(String, String)` (commit 12d34dd4)
- `Arc<FullManifest>` in `MemoryCache` — clone is an atomic bump, not a deep HashMap clone (commit 337d4f26)
- `versions.keys.clone()` removed from the cache-hit path — pass the borrow directly (commit bb256ecd)
- `format!()` dedup key kept (testing removal broke transitive prefetch)

4. Inline parse (drop `spawn_blocking`)

The worker pool surfaced parse blocking-pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests. Cap=4 on CI was funneling all parses through a 4-slot queue.

Fix: inline `simd_json` parse on the worker (commit f3f616d8). 1-5ms CPU per manifest is acceptable on an async worker; eliminates dispatch + queue overhead.

5. Workspace discovery parallelization
`find_workspaces_from_pkg` was reading 200+ workspace `package.json` files sequentially in a `for` loop. Replaced with `FuturesUnordered` (commit bf149957). Saved ~150ms (200×1ms serial → ~10ms parallel).

6. tarball extraction + cloner — install path

`extractor.rs`: `rayon::spawn` per package + `par_chunks(32)` intra-package writes. Verified essential by testing removal — sequential writes regressed p3 +3.67s, σ exploded 0.04→2.85. `cloner.rs`: kept on `tokio::task::spawn_blocking`. Verified essential by testing a rayon migration — that regressed p3 +2.65s due to oversubscription with the extractor's rayon pool. The current `tokio::blocking + cap=worker_threads` is the local optimum for hardlink (a single short syscall, no fan-out benefit).

Failed experiments (kept here so future attempts don't repeat)
All reverted. Each tested with CI bench data.
- 25814552 reverted by c610a582
- 02ef0562 reverted by 4e125908 / ae2a6088 / 5a897e4a
- 1a16d25e reverted by f379f7a9
- de5c83ed reverted by 90e421a5
- `max_blocking_threads = N*4` (132ef36e) reverted by 9a071f29
- c7c847d6 reverted by 2f1092c3
- 9229e160 reverted by e38329c8

Cloudflare per-source-IP throttle — measured
Standalone manifest-bench cap sweep (cap=32/64/96/128/192/256):
Conclusion: npmjs's Cloudflare frontend throttles on a per-source-IP basis (CI runner egress). Cap=128 is the sweet spot for our setup. Going wider triggers per-req inflation that cancels the parallelism gain.
This refuted the original "raise cap aggressively" intuition and 6 cap-sweep experiments.
Diagnostic infrastructure (kept, valuable for future work)
- (commit 5e3c12d2): per-preload `wall / busy / sum / avg_conc / p50/p95/max / cpu_tail`
- (commit 6e0e60e5): `spawn_blocking` queue p50/p95 + exec p50/p95
- (commit b92aa81f): standalone HTTP-only A/B against ruborist
- (commit f6846d6a): every CI run produces standalone control numbers
- (commit 88a4b056): warm 304 path restored without the per-package syscall storm
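The p50/p95/max lines in these reports imply a percentile aggregation over the recorded intervals; a minimal nearest-rank sketch (not the actual implementation, which isn't shown here):

```rust
/// Nearest-rank percentile over recorded interval durations (ms): the
/// smallest sample with at least p% of all samples at or below it.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    let mut ms: Vec<u64> = (1..=100).collect();
    println!(
        "p50={} p95={} max={}",
        percentile(&mut ms.clone(), 50.0),
        percentile(&mut ms.clone(), 95.0),
        percentile(&mut ms, 100.0)
    );
}
```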
Remaining optimization space (out of this PR's scope)
- `rustls-native-certs` lazy load, `reqwest::Client` deferred build
- `FullManifest` raw bytes + index instead of `OwnedValue` tree (1.46GB → ~700MB)

None are small enough to fit this PR.
PR split plan
Per discussion, this PR is being decomposed into 4-5 focused PRs:
This PR will close once 1-4 land. Track in subsequent PRs.
Test plan
- `cargo fmt` + `cargo clippy --all-targets -- -D warnings --no-deps` clean across crates
- `cargo test -p utoo-pm` 245 passed
- `cargo test -p utoo-ruborist` 164 + 10 doctests passed
- `pm-bench-phases` CI run >50 times, results documented above

🤖 Generated with Claude Code