perf(pm): sequential writes within tarball extraction#2818
Conversation
Replace intra-package `par_iter` with a sequential loop when writing extracted tar entries to disk. Each tar entry is typically small and writes complete in microseconds, so splitting them into rayon tasks was causing heavy work-stealing (futex park/unpark) and dominating context switches on large dep graphs. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request modifies the tarball extraction logic in crates/pm/src/util/extractor.rs to process entries sequentially instead of in parallel. This change aims to reduce excessive context switching caused by rayon work-stealing on large dependency graphs, while maintaining cross-package parallelism. Feedback suggests consuming the entries collection during iteration to optimize memory usage by dropping file buffers immediately after they are written.
    // Write files sequentially. Cross-package parallelism is handled by the outer
    // rayon::spawn; splitting individual files into rayon tasks caused excessive
    // work-stealing ctx switches on large dep trees.
    for entry in &entries {
Since `entries` is not used after this loop, you can consume it by writing `for entry in entries` instead of iterating by reference. This allows each `ExtractedEntry` (and its potentially large content buffer) to be dropped immediately after it is written to disk, which can significantly reduce peak memory usage when extracting large packages.
Suggested change:

    - for entry in &entries {
    + for entry in entries {
- Cold bench: drop `| tail -1` so hyperfine's full summary (mean, stddev, range) reaches the log. Failure detection now uses exit status instead of piping.
- `BENCH_WARM_RUNS=0` skips the warm phase entirely (previously the warm function always ran and hyperfine would reject `--runs 0`).
- Result aggregator tolerates empty or malformed export-json files (e.g. when a PM's cold install fails): the offending file is reported and skipped instead of crashing the whole summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the sequential `for` loop over extracted tar entries with `par_chunks(WRITE_CHUNK_SIZE)` — each rayon task writes a contiguous run of 32 files sequentially. This retains multi-core IO overlap for large packages while cutting the rayon task count (and its work-stealing futex traffic) by the chunk factor versus a per-file `par_iter`. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`.

Local (macOS, antd-test, 3 runs avg):

    before par_iter: wall 17.2s  sys 6.18s  ivcsw 208k
    for-loop:        wall 15.3s  sys 2.36s  ivcsw  61k
    par_chunks(32):  wall 13.9s  sys 5.77s  ivcsw 191k

chunks wins wall but loses the ctx-switch reduction relative to the pure sequential version; CI with a large dep graph (ant-design-x) is the authoritative measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accumulate wall microseconds for download, extract, and clone across
all packages during install. Print a one-line summary alongside the
existing `added / reused / downloaded` counts, e.g.
+ 513 added · 3017 reused · 123 downloaded
download 135.8s · extract 2.3s · clone 0.4s · 19.0 MB fetched
The sums are non-exclusive across cores: dividing by wall clock
gives the effective concurrency for each phase, and the ratio
between phases shows where cold-install CPU time actually lands.
Overhead is three atomics per downloaded tarball.
Local antd-test (macOS, npmmirror, 77 packages, wall 16s): download
dominates 98% of the CPU budget, extract 1.6%, clone 0.3% — reshapes
where we should look for cold-install wins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needed so the per-phase timings line (`download · extract · clone · bytes`) printed at the end of each install reaches the CI log. Trade-off is noisier logs — registry INFO/WARN lines come through — but that's the price for visibility into where cold-install CPU actually lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Separates three independent measurements for utoo vs bun so each phase's improvement can be judged on its own baseline:

- Phase 1 · resolve — `utoo deps` / `bun install --lockfile-only`
- Phase 3 · cold install — `utoo install` / `bun install` (empty cache)
- Phase 4 · warm link — `utoo install` / `bun install` (cache warm)

Phase 3 uses the lockfile generated by phase 1, with cache reset between iterations. Phase 4 resets only node_modules so only the cache → node_modules link step is measured.

Uses hyperfine `--show-output` so utoo's phase-timings line (`download · extract · clone · bytes`) reaches the CI log alongside the wall-clock summary.

Triggered via workflow_dispatch with configurable project / registry / runs. Defaults to ant-design against npmjs.org, 3 runs per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anch merge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous inline `bash -c` prepare was silently a no-op on CI: utoo's run 2/3 showed '3280 reused', meaning the cache wasn't actually cleared, and bun hit InvalidNPMLockfile because utoo's package-lock.json leaked across iterations.

Now each phase writes a dedicated prepare shell script per-PM that:

- always drops node_modules (incl. workspace package trees),
- clears exactly the lockfiles that would confuse this PM,
- wipes the right cache for this phase,
- prints a '[prep]' line so the CI log proves prepare ran.

Also factored out seed_for_phase so lockfile / cache warmup happens once before the benchmark, not leaking into the measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che wipe Path-based rm -rf of $HOME/.cache/nm wasn't actually emptying the cache on the CI runner — utoo runs 2/3 of phase 3 still showed '3280 reused', wall was 0.8-1.1s instead of the 10s cold-install baseline, hyperfine itself warned about caches not being filled until after run 1. Let each PM clean its own cache via its CLI so we don't rely on guessing where it stores things. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`utoo clean` / `bun pm cache rm` didn't empty the cache on the CI runner either — so now use explicit bench-local paths the `rm -rf` prepare can guarantee to wipe:

- utoo: `--cache-dir=/tmp/utoo-bench-cache` on every invocation
- bun: `BUN_INSTALL_CACHE_DIR=/tmp/bun-bench-cache` (env var)

Gets us deterministic cold/warm state between hyperfine iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop into diagnostic mode to figure out why hyperfine's --prepare still leaves utoo's cache intact across iterations despite the explicit --cache-dir. Prints the generated prepare script, and logs each per-iteration invocation's before/after du -sh of both caches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `case $phase in p1) p3) p4)`-style patterns never matched the actual phase strings like "p1_resolve" / "p3_cold_install" / "p4_warm_link". Result: write_prepare produced a script containing only the common header and no phase-specific cache-wipe logic, so every run after the first hit a warm cache and timings collapsed.

Same off-by-name bug in seed_for_phase: the "p3:utoo" pattern never matched "p3_cold_install:utoo", skipping lockfile seeding and warm-cache priming.

Switched both to "p*_*" globs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-size before/after logs + generated-script dumps were diagnostic scaffolding used to trace the p* vs p*_resolve pattern mismatch. With that fixed, keep the plain hyperfine --prepare invocation so CI logs are readable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time Each hyperfine iteration now runs inside a metrics wrapper that greps /usr/bin/time -v output for RSS, voluntary/involuntary context switches, page faults, and IO read/write counts. Per-PM per-phase averages across the 3 runs are shown alongside the wall-clock table so we can see, e.g., whether utoo's resolve phase costs more syscalls than bun's, or whether its warm-link advantage comes at a memory cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linter-applied formatting cleanup, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original cap was sized for the FuturesUnordered preload that dispatched 128 simd_json parses through `spawn_blocking` in a burst — letting the default 512 cap run gave bimodal wall (M2: 2.7s fast / 6.9s thrash). Capping at `worker_threads` eliminated the thrash peak.

After commit f3f616d (inline parse), preload no longer uses the blocking pool. The dominant consumer is now `cloner.rs` during the install phase: every file's hardlink / clonefile / copy goes through `spawn_blocking`, ~50000 short syscalls per ant-design install. Each syscall is near-instant, so the cap rarely backpressures, but cap=4 on CI does limit how fast cloner can fire syscalls in parallel.

Raise the cap to `max(worker_threads * 4, 32)`: enough headroom for cloner to keep multiple syscalls in flight, low enough that the historical thrash regime (hundreds of churning threads) stays avoided. The pool is per-runtime; idle threads die after 10s.

Expected: small p3_cold_install improvement (current utoo 5.74s vs bun 7.71s); preload phase unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
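The cap formula itself is tiny — a minimal sketch (the `blocking_pool_cap` helper name is hypothetical; in the real code a value like this would feed tokio's `runtime::Builder::max_blocking_threads`):

```rust
// Cap formula from the commit message. `worker_threads` normally comes
// from the runtime config or available_parallelism().
fn blocking_pool_cap(worker_threads: usize) -> usize {
    std::cmp::max(worker_threads * 4, 32)
}

fn main() {
    assert_eq!(blocking_pool_cap(4), 32);  // 4-core CI runner: the floor applies
    assert_eq!(blocking_pool_cap(12), 48); // 12-core dev box: the 4x factor applies
    let host = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("cap on this host: {}", blocking_pool_cap(host));
}
```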
… 32)" This reverts commit 132ef36.
A/B test: replace `entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each` with a plain sequential `for entry in &entries` loop. Each tarball still runs in its own outer `rayon::spawn` task (cross-package parallelism preserved); only the within-tarball write fan-out is removed.

Goal: measure whether rayon's intra-package parallelism still earns its keep after the worker-pool preload rewrite. Cross-package parallelism alone may already saturate IO; if so, removing the inner par_chunks cuts work-stealing futex traffic + thread sync overhead with zero throughput cost.

If p3_cold_install regresses ≥0.3s → intra-package writes are genuinely IO-bound across cores, restore par_chunks. If p3 is unchanged or improves → simpler sequential code wins.

This is a test commit. Will be reverted if a regression is measured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…act" This reverts commit c7c847d.
`clone_dir` (Linux hardlink/copy path) was using `tokio::task::spawn_blocking` per package — at the default cap=4 on CI, only 4 packages cloned at once, each running all file hardlinks sequentially internally. ~3500 packages × N files per install all funneled through that bounded pool.

Switch to the same pattern `extractor.rs` already uses:

- `rayon::spawn` per package replaces `spawn_blocking` (cross-package parallelism via rayon work-stealing — global pool, not capped at worker_threads)
- `par_chunks(CLONE_CHUNK_SIZE)` for the inner hardlink/copy loop (intra-package fan-out across cores; same chunk size = 32 as extractor)

Trade-offs:

- The EXDEV `force_copy` latch is now per-chunk instead of global per clone — chunks each rediscover cross-device errors and fall back locally. A few extra hardlink-then-copy round-trips at chunk boundaries, acceptable for the rare cross-device install.
- Pool unification: the tokio blocking pool is now mostly idle (just git + http tarball + a few one-shot commands); rayon handles all the high-volume IO. Cuts the 3-pool fragmentation observed earlier.

Tested:

- Iter 1 of this loop (cap bump from N to max(N*4, 32)): no p3 win, p4 regressed → the cap raise alone wasn't the answer.
- Iter 2 (drop intra-package par_chunks in extractor): p3 +3.67s, σ exploded 0.04 → 2.85s → intra-package fan-out is essential.
- This commit applies the same fan-out to clone_dir for the same reason.

macOS `clonefile` path (target_os = "macos") unchanged — clonefile is a single syscall per file, a different perf profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts commit 9229e16.
The headline architectural change of #2818. ruborist's preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32, where Send isn't satisfied). Stacks on top of #2826.

## Why

Old design: the main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deep await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after the Mutex / allocator hot-path eliminations, while the standalone manifest-bench (same reqwest stack, no resolver — see #2824) hit 92 at the same cap.

## How

N long-lived `tokio::spawn` workers pull from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, with no cooperative poll bottleneck. The main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + the on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes the others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits.

## Trait surface change

- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); the call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`
- The bound at every public surface up the chain is bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. The worker pool surfaced parse blocking-pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Inline parse on the worker thread eliminates the dispatch + queue overhead; 1-5ms CPU per manifest is acceptable on an async worker.
- **Workspace package.json parallel reads** — `find_workspaces_from_pkg` switched from a sequential `for path in matched_paths { read }` loop to `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — the worker pool delivered the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s.

## Tests

`#[tokio::test(flavor = "multi_thread", worker_threads = 2)]` since the worker pool needs a spawn-able runtime; ruborist's dev-dependencies on `tokio` add the `rt-multi-thread` feature. 164 ruborist + 10 doctests + 248/249 utoo-pm pass (1 pre-existing flake on `test_update_package_binary_fsevents`, runs green alone).

## Wasm

CI cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm, but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
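The dispatched/completed termination protocol can be modeled with std threads alone — a sketch where `Mutex` + `Condvar` stand in for the lock-free `SegQueue` + `Notify`, and `std::sync::mpsc` stands in for the tokio completion channel (the "resolve" step and the `u32` dep type are hypothetical simplifications):

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::{mpsc, Arc, Condvar, Mutex};

struct Pool {
    // (queue, dedup set, dispatched, completed, shutdown) — kept under one
    // lock in this sketch; the real code uses SegQueue/DashSet/atomics.
    state: Mutex<(VecDeque<u32>, HashSet<u32>, usize, usize, bool)>,
    cv: Condvar, // stands in for tokio::sync::Notify
}

fn run(n_workers: usize, roots: Vec<u32>) -> Vec<u32> {
    let pool = Arc::new(Pool {
        state: Mutex::new((VecDeque::new(), HashSet::new(), 0, 0, false)),
        cv: Condvar::new(),
    });
    {
        let mut st = pool.state.lock().unwrap();
        for r in roots {
            if st.1.insert(r) { st.0.push_back(r); st.2 += 1; } // dedup + dispatch
        }
    }
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for _ in 0..n_workers {
        let (pool, tx) = (Arc::clone(&pool), tx.clone());
        handles.push(std::thread::spawn(move || loop {
            let dep = {
                let mut st = pool.state.lock().unwrap();
                loop {
                    if st.4 { return; }              // shutdown raised: exit, dropping tx
                    if let Some(d) = st.0.pop_front() { break d; }
                    st = pool.cv.wait(st).unwrap();  // park until work or shutdown
                }
            };
            // "resolve": every dep < 5 discovers one child dep (dep * 2)
            let child = dep * 2;
            let mut st = pool.state.lock().unwrap();
            if dep < 5 && st.1.insert(child) {
                st.0.push_back(child);
                st.2 += 1;                           // dispatched += 1
                pool.cv.notify_one();
            }
            st.3 += 1;                               // completed += 1
            tx.send(dep).unwrap();                   // report completion to main
            if st.3 == st.2 && st.0.is_empty() {
                st.4 = true;                         // last completion: raise shutdown
                pool.cv.notify_all();
            }
        }));
    }
    drop(tx); // only workers hold senders now; channel closes when they exit
    let mut done: Vec<u32> = rx.iter().collect(); // main task drains completions
    for h in handles { h.join().unwrap(); }
    done.sort();
    done
}

fn main() {
    println!("{:?}", run(3, vec![1, 3]));
}
```

Because completions are counted under the same lock as the queue, the shutdown check (`completed == dispatched && queue empty`) cannot fire while a dep is still in flight — the same invariant the atomics + `Notify` version maintains.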
…p guard

Two folded changes that started life as separate commits on the parent perf branch:

1. **Sequential / chunked parallel writes** (was: ad0dee9 → 7ab17b8). The old per-file `par_iter().for_each(write)` paid work-stealing futex park/unpark overhead per write. Each entry is <64 KB and a single `fs::File::create` + `write_all` returns in μs — the rayon scheduler dominated. Switch to `entries.par_chunks(32).try_for_each(...)`: each rayon task writes a contiguous run of 32 files sequentially. Cuts task count by 32× while keeping multi-core IO-overlap parallelism. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`, which itself had already landed on `next` via earlier work.

   Verified essential by an A/B on the parent branch: removing the intra-package par_chunks (sequential `for entry in &entries` inside each rayon task) regressed CI p3_cold_install +3.67s and exploded σ from 0.04 → 2.85 — IO can't interleave across cores when each tarball serialises its own writes.

2. **Tar Slip guard** — reject tar entries whose path is absolute or contains `..` components before joining with the destination. Without this, an attacker-controlled tarball could overwrite arbitrary files via paths like `../../etc/foo` or `/etc/passwd`. The `tar` crate does not enforce this by default; `npm` and `pnpm` both validate. We log+skip such entries.

Both changes touch the same single function, so they commit together. CI bench shows p3_cold_install at 5.74s vs bun 7.71s (utoo ~2s ahead). The PR description in #2818 documents the full A/B journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
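The Tar Slip check reduces to a small path predicate — a std-only sketch of the rule described above (the `is_safe_entry_path` name is hypothetical; the real guard also logs and skips offenders):

```rust
use std::path::{Component, Path};

// Reject entry paths that are absolute or contain `..` components
// before they are ever joined onto the extraction destination.
fn is_safe_entry_path(entry_path: &Path) -> bool {
    !entry_path.is_absolute()
        && entry_path
            .components()
            .all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}

fn main() {
    assert!(is_safe_entry_path(Path::new("package/lib/index.js")));
    assert!(!is_safe_entry_path(Path::new("../../etc/foo"))); // traversal
    assert!(!is_safe_entry_path(Path::new("/etc/passwd")));   // absolute
    println!("guard ok");
}
```

Checking `Component`s rather than scanning for the literal substring `..` avoids false positives on legitimate names like `..foo`.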
Bundle of independently-motivated allocator + cache hot-path optimisations from the parent perf branch (#2818). Each landed during the worker-pool exploration but doesn't depend on the worker-pool architecture itself — they stand alone as straightforward perf wins for the resolver.

## TLS provider — `aws-lc-rs` instead of `ring`

`reqwest` 0.12's default `rustls-tls-native-roots` feature pins `ring` via Cargo's feature unification. Switch to `rustls-tls-native-roots-no-provider`, build our own `rustls::ClientConfig` with the `aws_lc_rs` provider, and pass it via `Client::use_preconfigured_tls`. CI measurement (4-core ubuntu vs npmjs.org): ring's per-handshake CCS→AppData was 78 ms p50 / 154 ms max, with all 128 parallel handshakes serialising across 4 cores. aws-lc-rs (BoringSSL primitives) is ~3× faster on x86_64. Saved ~420 ms preload on cold ant-design.

## DNS — per-family rotation

`getaddrinfo` typically returns 10 v6 + 12 v4 addresses for npmjs.org. A flat rotation across the joined list meant offsets 0..10 all started inside the v6 range; on hosts where v6 routing fails (GitHub Actions runners), every connection fell through to the *same* first-reachable v4. Rotate per-family so v4 conns cycle across all v4 addresses (and v6 over v6) — an observed pcap on bun shows the same 4×64 distribution we now produce.

## Disk-cache bulk-readdir ETag index

`PackageCache` lazy-builds a `HashSet<String>` of names with existing disk cache entries from a single `read_dir(cache_dir)` + per-`@scope` recurse. `get_versions_from_disk` and `get_version_manifest_from_disk` short-circuit via the index. Restores the warm-run 304 path that was temporarily removed in 46cb803 (per-package `try_exists` was 16 ms avg on the cold-run critical path; now zero).

## Lazy per-version `CoreVersionManifest` via `simd_json::OwnedValue`

`Versions` now stores `keys: Vec<String>` (ordered version list) + `trees: HashMap<String, Arc<simd_json::OwnedValue>>` (pre-parsed JSON subtrees). The strongly-typed `CoreVersionManifest` is materialised on demand via `CoreVersionManifest::deserialize(tree.as_ref())` — zero-copy through `simd_json::OwnedValue`'s `Deserializer` impl, memoised in a `DashMap`. The resolver typically reads 1-3 of the ~500 versions per manifest; the previous design built every one eagerly.

## `Arc<FullManifest>` in `MemoryCache`

The cache previously returned `FullManifest` by value, deep-cloning the per-version HashMap (100-500 entries × String key clone + Arc bump per cache hit) on the resolver hot path. ~2730 cache hits during cold preload × ~200-entry HashMap clone = ~500k allocations on shared resolver threads, contending the allocator. Wrap in `Arc<FullManifest>`; a cache hit becomes one atomic bump.

## `normalize_spec` returns `Cow<'a, str>`

Was unconditionally allocating `(String, String)` even for the ~99% of deps with no `npm:` / `workspace:` prefix. ~5460 String allocations per ant-design preload, all on the resolver hot path. The common path now returns `Cow::Borrowed`.

## Drop `versions.keys.clone()` from the cache-hit path

`resolve_package`'s full-manifest cache-hit branch was cloning the entire `versions.keys: Vec<String>` (~200 entries) just to pass `&[String]` to `resolve_target_version`. Borrow directly via Arc auto-deref. ~360k String allocs eliminated (~1800 cache hits × ~200 entries).

## OnceMap dedup

New `crate::util::oncemap` module: a `DashMap` + `tokio::sync::Notify` coalescer for concurrent `resolve_full_manifest` callers of the same name. The first caller fetches from the network; the others wait on the shared `Notify` and read the cached `Arc<V>`. Replaces the prior per-name `tokio::sync::Mutex<()>` gate that serialised the hot dispatch path.

## tracing file_filter info+ default

File-layer log filter dropped from `utoo=debug` to `utoo=info`. Hot-path `tracing::debug!()` calls (cache hits, BFS dispatch, preload events) emit ~5-10 events per resolved manifest. With 2730+ manifests during cold preload, that's 15-30k events that — even routed through the non_blocking appender's channel — pay format/serialise CPU on the resolving thread before the channel send. Override via `UTOO_FILE_LOG=debug` for diagnostics.

## indicatif progress bar — drop per-package message updates

`PreloadFetching` and `PreloadProgress` used to call `format!("fetching/resolved {}", name)` + `PROGRESS_BAR.set_message()` per event. With ~9000 such calls per ant-design preload and an indicatif-internal `Mutex` per call, this serialised the main loop's fill-and-drain rate. The user can't visually parse 5460 message swaps in 3 seconds anyway. The counter still ticks via `PROGRESS_BAR.inc(1)`.

## HTTP + parse diagnostic infrastructure (used by PR4)

`service/http.rs` ships `start_http_trace` / `finish_http_trace` + `start_parse_trace` / `finish_parse_trace`, plus `record_http_interval` + `record_parse_interval` callbacks. `#[allow(dead_code)]` on the start/finish for now — the preload worker-pool refactor in the next PR (#TBD) wires them in.

Also bumps the `+ Sync` bound on `RegistryClient` callers in `builder.rs` / `preload.rs` / `resolver/registry.rs` — required because the trait's default-method futures gained `+ Send` (needed downstream by tokio::spawn, but already correct for single-threaded resolvers too).

Tests: 164 ruborist + 248/249 utoo-pm pass (1 pre-existing flake on `test_update_package_binary_fsevents` when run in parallel, passes alone).

Stacks: PR4 (preload worker-pool architecture) targets this branch and adds the bound propagation + spawn refactor on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
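The `Cow` change above is the simplest of the bundle to illustrate — a minimal sketch assuming a simplified `npm:name@range` alias format (the real `normalize_spec` parsing is richer and also handles `workspace:`):

```rust
use std::borrow::Cow;

// Sketch of a Cow-returning normalize_spec: the ~99% no-prefix case
// borrows the input, so only aliased specs pay an allocation.
fn normalize_spec(spec: &str) -> Cow<'_, str> {
    if let Some(rest) = spec.strip_prefix("npm:") {
        // hypothetical simplification: `npm:name@range` resolves to `range`
        match rest.rsplit_once('@') {
            Some((_, range)) if !range.is_empty() => Cow::Owned(range.to_string()),
            _ => Cow::Owned(rest.to_string()),
        }
    } else {
        Cow::Borrowed(spec) // common path: zero allocation
    }
}

fn main() {
    assert!(matches!(normalize_spec("^5.0.0"), Cow::Borrowed(_)));
    assert_eq!(normalize_spec("npm:antd@^5.0.0"), "^5.0.0");
    println!("{}", normalize_spec("^1.2.3"));
}
```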
The headline architectural change of #2818 — preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32 where Send isn't satisfied). Stacks on top of #2826. ## Why Old design: main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deeper await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after Mutex / allocator hot-path eliminations, while the standalone manifest-bench (#2824) hit 92 on the same reqwest stack. ## How N long-lived `tokio::spawn` workers pulling from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, no cooperative poll bottleneck. Main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + on_manifest callback. Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits. ## Trait surface change - `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature - `preload_manifests` takes `registry: Arc<R>` (was `&R`); call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`. 
Bound at every public surface up the chain bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible. ## Companion changes folded in - **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. Worker-pool surfaced parse blocking- pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Inline parse on the worker thread eliminates dispatch + queue overhead. - **Workspace package.json parallel reads** — switch the per-pattern `for path in matched_paths` serial loop to `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms. - **Setup phase + lockfile-write timing logs** — round out the per-phase wall account for the bench-comment infrastructure. - **Manifests concurrency cap 64 → 128** — worker-pool delivers the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s. ## Wasm CI cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean. Tests: 164 ruborist + 10 doctests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p guard Two folded changes that started life as separate commits on the parent perf branch: 1. **Sequential / chunked parallel writes** (was: ad0dee9 → 7ab17b8). The old per-file `par_iter().for_each(write)` paid work-stealing futex park/unpark overhead per write. Each entry is <64 KB and a single fs::File::create + write_all returns in μs — rayon scheduler dominated. Switch to `entries.par_chunks(32) .try_for_each(...)`: each rayon task writes a contiguous run of 32 files sequentially. Cuts task count by 32× while keeping multi-core IO-overlap parallelism. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`, which itself was already landed on `next` via earlier work. Verified essential by an A/B on the parent branch: removing the intra-package par_chunks (sequential `for entry in &entries` inside each rayon task) regressed CI p3_cold_install +3.67s and exploded σ from 0.04 → 2.85 — IO can't interleave across cores when each tarball serialises its own writes. 2. **Tar Slip guard** — reject tar entries whose path is absolute or contains `..` components before joining with the destination. Without this an attacker-controlled tarball could overwrite arbitrary files via paths like `../../etc/foo` or `/etc/passwd`. `tar` crate does not enforce this by default; `npm` and `pnpm` both validate. We log+skip such entries. Both changes touch the same single function so they commit together. CI bench shows p3_cold_install at 5.74s vs bun 7.71s (utoo +2s ahead). PR description in #2818 documents the full A/B journey. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle of independently-motivated allocator + cache hot-path optimisations from the parent perf branch (#2818). Each landed during the worker-pool exploration but doesn't depend on the worker-pool architecture itself — they stand alone as straightforward perf wins for the resolver. ## TLS provider — `aws-lc-rs` instead of `ring` `reqwest` 0.12's default `rustls-tls-native-roots` feature pins `ring` via Cargo's feature unification. Switch to `rustls-tls-native-roots-no-provider`, build our own `rustls::ClientConfig` with the `aws_lc_rs` provider, pass via `Client::use_preconfigured_tls`. CI measurement (4-core ubuntu vs npmjs.org): ring's per-handshake CCS→AppData was 78 ms p50 / 154 ms max, all 128 parallel handshakes serialising across 4 cores. aws-lc-rs (BoringSSL primitives) is ~3× faster on x86_64. Saved ~420 ms preload on cold ant-design. ## DNS — per-family rotation `getaddrinfo` typically returns 10 v6 + 12 v4 for npmjs.org. A flat rotation across the joined list meant offsets 0..10 all started inside the v6 range; on hosts where v6 routing fails (GitHub Actions runners), every connection fell through to the *same* first-reachable v4. Rotate per-family so v4 conns cycle across all v4 addresses (and v6 over v6) — observed pcap on bun shows the same 4×64 distribution we now produce. ## Disk-cache bulk-readdir ETag index `PackageCache` lazy-builds a `HashSet<String>` of names with existing disk cache entries from a single `read_dir(cache_dir)` + per-`@scope` recurse. `get_versions_from_disk` and `get_version_manifest_from_disk` short-circuit via the index. Restores the warm-run 304 path that was temporarily removed in 46cb803 (per-package `try_exists` was 16 ms avg on the cold-run critical path; now zero). ## Lazy per-version `CoreVersionManifest` via `simd_json::OwnedValue` `Versions` now stores `keys: Vec<String>` (ordered version list) + `trees: HashMap<String, Arc<simd_json::OwnedValue>>` (pre-parsed JSON subtrees). 
Strongly-typed `CoreVersionManifest` is materialised on demand via `CoreVersionManifest::deserialize(tree.as_ref())` — zero-copy through `simd_json::OwnedValue`'s `Deserializer` impl, memoised in a `DashMap`. Resolver typically reads 1-3 of the ~500 versions per manifest; previous design built every one eagerly. ## `Arc<FullManifest>` in `MemoryCache` Cache previously returned `FullManifest` by value, deep-cloning the per-version HashMap (100-500 entries × String key clone + Arc bump per cache hit) on the resolver hot path. ~2730 cache hits during cold preload × ~200-entry HashMap clone = ~500k allocations on shared resolver threads, contending the allocator. Wrap in `Arc<FullManifest>`; cache hit becomes one atomic bump. ## `normalize_spec` returns `Cow<'a, str>` Was unconditionally allocating `(String, String)` even for the ~99 % of deps with no `npm:` / `workspace:` prefix. ~5460 String allocations per ant-design preload, all on resolver hot path. Common path now returns `Cow::Borrowed`. ## Drop `versions.keys.clone()` from cache-hit path `resolve_package`'s full-manifest cache-hit branch was cloning the entire `versions.keys: Vec<String>` (~200 entries) just to pass `&[String]` to `resolve_target_version`. Borrow directly via Arc auto-deref. ~360k String allocs eliminated (~1800 cache hits × ~200 entries). ## OnceMap dedup New `crate::util::oncemap` module: `DashMap` + `tokio::sync::Notify` coalescer for concurrent `resolve_full_manifest` callers of the same name. First caller fetches the network; others wait on the shared `Notify` and read the cached `Arc<V>`. Replaces the prior per-name `tokio::sync::Mutex<()>` gate that serialised the hot dispatch path. ## tracing file_filter info+ default File-layer log filter dropped from `utoo=debug` to `utoo=info`. Hot-path `tracing::debug!()` calls (cache hits, BFS dispatch, preload events) emit ~5-10 events per resolved manifest. 
With 2730+ manifests during cold preload that's 15-30k events that — even routed through the non_blocking appender's channel — pay format/serialise CPU on the resolving thread before the channel send. Override via `UTOO_FILE_LOG=debug` for diagnostics.

## indicatif progress bar — drop per-package message updates

`PreloadFetching` and `PreloadProgress` used to call `format!("fetching/resolved {}", name)` + `PROGRESS_BAR.set_message()` per event. With ~9000 such calls per ant-design preload and an indicatif-internal `Mutex` per call, this serialised the main loop's fill-and-drain rate. The user can't visually parse 5460 message swaps in 3 seconds anyway. The counter still ticks via `PROGRESS_BAR.inc(1)`.

## HTTP + parse diagnostic infrastructure (used by PR4)

`service/http.rs` ships `start_http_trace` / `finish_http_trace` + `start_parse_trace` / `finish_parse_trace` plus `record_http_interval` + `record_parse_interval` callbacks. `#[allow(dead_code)]` on the start/finish pairs for now — the preload worker-pool refactor in the next PR (#TBD) wires them in.

Also bumps the `+ Sync` bound on `RegistryClient` callers in `builder.rs` / `preload.rs` / `resolver/registry.rs` — required because the trait's default-method futures gained `+ Send` (needed downstream by `tokio::spawn`, but already correct for single-threaded resolvers too).

Tests: 164 ruborist + 248/249 utoo-pm pass (1 pre-existing flake on `test_update_package_binary_fsevents` when run in parallel; passes alone).

Stacks: PR4 (preload worker-pool architecture) targets this branch and adds the bound propagation + spawn refactor on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
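The `Arc<FullManifest>` change described above can be sketched with a hypothetical, pared-down manifest type — a cache hit becomes one refcount bump instead of a deep clone of the versions map:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical, pared-down stand-in for the real FullManifest.
struct FullManifest {
    versions: HashMap<String, String>, // version -> integrity, say
}

struct MemoryCache {
    entries: HashMap<String, Arc<FullManifest>>,
}

impl MemoryCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    fn insert(&mut self, name: &str, manifest: FullManifest) {
        self.entries.insert(name.to_string(), Arc::new(manifest));
    }

    /// A hit is a single atomic refcount bump. Before the change this
    /// returned `FullManifest` by value, deep-cloning the versions map
    /// (100-500 String keys) on every hit.
    fn get(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.entries.get(name).map(Arc::clone)
    }
}

fn main() {
    let mut cache = MemoryCache::new();
    let mut versions = HashMap::new();
    versions.insert("1.0.0".to_string(), "sha512-aaa".to_string());
    cache.insert("left-pad", FullManifest { versions });
    let a = cache.get("left-pad").unwrap();
    let b = cache.get("left-pad").unwrap();
    println!("shared allocation: {}", Arc::ptr_eq(&a, &b));
}
```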
The headline architectural change of #2818 — the preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32, where Send isn't satisfied). Stacks on top of #2826.

## Why

Old design: the main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deep await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after the Mutex / allocator hot-path eliminations, while the standalone manifest-bench (#2824) hit 92 on the same reqwest stack.

## How

N long-lived `tokio::spawn` workers pull from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, with no cooperative poll bottleneck. The main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + the on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes the others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits.

## Trait surface change

- `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); the call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`.
- Bound at every public surface up the chain bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. The worker pool surfaced parse blocking-pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Parsing inline on the worker thread eliminates the dispatch + queue overhead.
- **Workspace package.json parallel reads** — switch the per-pattern `for path in matched_paths` serial loop to a `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the per-phase wall accounting for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — the worker pool delivers the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s.

## Wasm

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm, but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` is clean.

Tests: 164 ruborist + 10 doctests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
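The queue + dedup + `dispatched`/`completed` termination protocol described above can be sketched with std stand-ins (OS threads for tokio workers, `Mutex<VecDeque>` / `Mutex<HashSet>` for `SegQueue` / `DashSet`, spin-yield instead of `Notify`); all names are illustrative:

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

struct Pool {
    queue: Mutex<VecDeque<String>>, // stand-in for SegQueue<Dep>
    seen: Mutex<HashSet<String>>,   // stand-in for DashSet dedup
    dispatched: AtomicUsize,
    completed: AtomicUsize,
    shutdown: AtomicBool,
}

impl Pool {
    fn new() -> Arc<Self> {
        Arc::new(Pool {
            queue: Mutex::new(VecDeque::new()),
            seen: Mutex::new(HashSet::new()),
            dispatched: AtomicUsize::new(0),
            completed: AtomicUsize::new(0),
            shutdown: AtomicBool::new(false),
        })
    }

    /// Dedup: only the first enqueue of a name is dispatched.
    fn push(&self, dep: &str) {
        if self.seen.lock().unwrap().insert(dep.to_string()) {
            self.dispatched.fetch_add(1, Ordering::SeqCst);
            self.queue.lock().unwrap().push_back(dep.to_string());
        }
    }
}

/// Spawn `workers` threads; return every completed dep once the pool drains.
fn run(pool: Arc<Pool>, workers: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();
    for _ in 0..workers {
        let (pool, tx) = (Arc::clone(&pool), tx.clone());
        thread::spawn(move || loop {
            if pool.shutdown.load(Ordering::SeqCst) {
                break; // the last completion raised the flag
            }
            let dep = pool.queue.lock().unwrap().pop_front();
            match dep {
                Some(dep) => {
                    tx.send(dep).unwrap(); // "resolved" — report completion
                    let done = pool.completed.fetch_add(1, Ordering::SeqCst) + 1;
                    if done == pool.dispatched.load(Ordering::SeqCst)
                        && pool.queue.lock().unwrap().is_empty()
                    {
                        pool.shutdown.store(true, Ordering::SeqCst);
                    }
                }
                None => thread::yield_now(), // real workers park on a Notify
            }
        });
    }
    drop(tx); // main holds no sender: the channel closes when workers exit
    rx.into_iter().collect()
}

fn main() {
    let pool = Pool::new();
    for dep in ["react", "react", "lodash"] {
        pool.push(dep); // duplicate "react" is deduped
    }
    let mut done = run(Arc::clone(&pool), 4);
    done.sort();
    println!("{done:?}");
}
```

As in the real pool, in-flight items keep `completed < dispatched`, so the shutdown flag can only be raised once nothing remains queued or running.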
- delete crates/manifest-bench (debug-only, never merged)
- tombi format crates/ruborist/Cargo.toml
- typos: unparseable → unparsable in bench/pm-bench.sh
📊 pm-bench-phases

npmjs.org
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 9.54s | 0.28s | 10.18s | 10.17s | 653M | 330.6K |
| utoo-npm | 10.23s | 0.20s | 11.61s | 13.29s | 1.13G | 159.4K |
| utoo | 9.21s | 1.07s | 11.20s | 12.27s | 2.26G | 260.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 17.3K | 18.1K | 1.16G | 7M | 1.83G | 1.72G | 1M |
| utoo-npm | 174.2K | 160.1K | 1.14G | 4M | 1.68G | 1.68G | 2M |
| utoo | 79.1K | 40.4K | 1.13G | 5M | 1.68G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.37s | 0.08s | 3.87s | 1.06s | 483M | 174.2K |
| utoo-npm | 6.01s | 0.60s | 6.07s | 1.09s | 430M | 74.5K |
| utoo | 2.65s | 0.05s | 5.62s | 1.97s | 1.44G | 193.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 12.0K | 3.3K | 200M | 3M | 104M | - | 1M |
| utoo-npm | 68.8K | 2.5K | 202M | 2M | 9M | 5M | 2M |
| utoo | 18.3K | 15.1K | 197M | 3M | 7M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 7.47s | 0.71s | 6.14s | 10.00s | 595M | 203.7K |
| utoo-npm | 9.40s | 1.22s | 5.61s | 12.10s | 905M | 122.3K |
| utoo | 7.69s | 2.82s | 5.46s | 10.86s | 878M | 107.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 6.7K | 7.4K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 153.9K | 109.7K | 965M | 4M | 1.67G | 1.67G | 2M |
| utoo | 89.8K | 43.0K | 965M | 3M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.35s | 0.06s | 0.23s | 2.32s | 135M | 32.1K |
| utoo-npm | 2.28s | 0.18s | 0.61s | 3.91s | 84M | 19.6K |
| utoo | 2.12s | 0.04s | 0.40s | 3.42s | 64M | 13.7K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 382 | 32 | 7M | 38K | 1.88G | 1.72G | 1M |
| utoo-npm | 53.0K | 22.0K | 21K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.7K | 8.9K | 20K | 10K | 1.68G | 1.67G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 28.79s | 6.58s | 9.42s | 10.19s | 523M | 367.5K |
| utoo-npm | 30.49s | 13.81s | 8.04s | 14.54s | 681M | 116.6K |
| utoo | 14.32s | 0.63s | 7.42s | 12.27s | 894M | 132.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 92.9K | 5.5K | 1.12G | 12M | 1.85G | 1.73G | 2M |
| utoo-npm | 250.4K | 98.7K | 978M | 9M | 1.67G | 1.68G | 2M |
| utoo | 154.1K | 61.6K | 984M | 9M | 1.67G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.58s | 0.04s | 4.00s | 1.11s | 552M | 185.7K |
| utoo-npm | 6.66s | 1.17s | 2.12s | 0.60s | 74M | 16.1K |
| utoo | 1.12s | 0.10s | 1.15s | 0.38s | 88M | 18.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 5.5K | 5.9K | 151M | 3M | 106M | - | 2M |
| utoo-npm | 47.6K | 622 | 13M | 2M | - | 4M | 2M |
| utoo | 15.1K | 917 | 17M | 3M | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 30.52s | 19.81s | 5.92s | 9.44s | 238M | 95.5K |
| utoo-npm | 44.38s | 33.35s | 6.23s | 12.96s | 612M | 108.7K |
| utoo | 20.24s | 3.22s | 5.80s | 11.57s | 663M | 98.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 69.2K | 3.4K | 999M | 9M | 1.73G | 1.73G | 2M |
| utoo-npm | 198.8K | 99.0K | 984M | 7M | 1.67G | 1.67G | 2M |
| utoo | 137.1K | 49.8K | 968M | 7M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.32s | 0.05s | 0.22s | 2.33s | 135M | 31.3K |
| utoo-npm | 2.57s | 0.15s | 0.63s | 3.98s | 84M | 19.7K |
| utoo | 2.13s | 0.30s | 0.43s | 3.43s | 65M | 14.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 759 | 24 | 7M | 86K | 1.88G | 1.72G | 2M |
| utoo-npm | 55.3K | 23.0K | 39K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.5K | 9.4K | 40K | 12K | 1.67G | 1.67G | 2M |
📊 pm-bench-phases

npmjs.org
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 17.84s | 3.14s | 6.29s | 19.27s | 793M | 51.2K |
| utoo-npm | 23.09s | 0.97s | 11.05s | 27.49s | 970M | 97.7K |
| utoo | 18.90s | 1.47s | 9.54s | 22.98s | 1.97G | 176.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 16.9K | 145.3K | - | - | 1.76G | 1.91G | 1M |
| utoo-npm | 13.2K | 381.6K | - | - | 1.63G | 1.83G | 2M |
| utoo | 4.4K | 216.5K | - | - | 1.63G | 1.88G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.97s | 0.12s | 2.26s | 0.95s | 505M | 32.9K |
| utoo-npm | 4.86s | 0.13s | 3.99s | 2.05s | 542M | 36.7K |
| utoo | 3.00s | 0.13s | 3.92s | 2.06s | 1.62G | 107.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 10 | 24.0K | - | - | 110M | - | 1M |
| utoo-npm | 13 | 78.9K | - | - | 28M | 5M | 2M |
| utoo | 42 | 46.7K | - | - | 27M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 20.05s | 3.20s | 4.01s | 22.14s | 531M | 34.5K |
| utoo-npm | 18.67s | 1.75s | 4.82s | 24.00s | 737M | 80.8K |
| utoo | 12.40s | 2.43s | 3.74s | 17.48s | 718M | 77.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.8K | 138.4K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.5K | 242.7K | - | - | 1.61G | 1.83G | 2M |
| utoo | 1.3K | 154.1K | - | - | 1.61G | 1.83G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 5.12s | 0.62s | 0.11s | 2.25s | 48M | 3.7K |
| utoo-npm | 4.00s | 0.35s | 0.57s | 2.92s | 91M | 6.8K |
| utoo | 3.86s | 0.57s | 0.37s | 2.56s | 82M | 5.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 15.6K | 876 | - | - | 1.86G | 1.90G | 1M |
| utoo-npm | 13.0K | 73.4K | - | - | 1.61G | 1.82G | 2M |
| utoo | 13.7K | 20.1K | - | - | 1.63G | 1.82G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 56.79s | 22.62s | 6.67s | 18.61s | 556M | 36.0K |
| utoo-npm | 63.80s | 40.68s | 8.94s | 24.20s | 641M | 74.0K |
| utoo | 29.08s | 7.31s | 7.43s | 23.56s | 719M | 79.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 13.8K | 175.9K | - | - | 1.79G | 1.90G | 2M |
| utoo-npm | 4.1K | 472.4K | - | - | 1.61G | 1.87G | 2M |
| utoo | 1.9K | 285.5K | - | - | 1.61G | 1.87G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 35.23s | 3.97s | 2.99s | 1.86s | 499M | 32.5K |
| utoo-npm | 27.66s | 16.35s | 2.53s | 1.55s | 80M | 5.8K |
| utoo | 10.49s | 10.44s | 1.62s | 0.71s | 92M | 6.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 49 | 37.1K | - | - | 113M | - | 2M |
| utoo-npm | 15 | 50.4K | - | - | - | 4M | 2M |
| utoo | 31 | 28.4K | - | - | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 23.13s | 0.12s | 3.99s | 19.21s | 269M | 17.8K |
| utoo-npm | 36.33s | 2.43s | 5.41s | 19.41s | 699M | 76.8K |
| utoo | 32.64s | 0.65s | 5.45s | 20.26s | 679M | 77.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 1.8K | 156.7K | - | - | 1.64G | 1.91G | 2M |
| utoo-npm | 1.6K | 334.1K | - | - | 1.60G | 1.83G | 2M |
| utoo | 1.3K | 251.7K | - | - | 1.60G | 1.83G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 4.82s | 0.82s | 0.15s | 2.45s | 53M | 4.0K |
| utoo-npm | 5.29s | 1.06s | 0.81s | 4.04s | 88M | 6.5K |
| utoo | 5.60s | 0.06s | 0.51s | 3.54s | 85M | 6.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.6K | 4.5K | - | - | 1.87G | 1.93G | 2M |
| utoo-npm | 12.4K | 78.9K | - | - | 1.61G | 1.87G | 2M |
| utoo | 13.1K | 20.8K | - | - | 1.61G | 1.87G | 2M |
Summary
Started as "intra-package sequential tarball writes" (single-line perf tweak) and evolved into a multi-week, data-driven preload-resolver overhaul plus a complete bench infrastructure. This PR is now too large to review safely and is being split — see "PR split plan" at the bottom.
This description captures the full exploration journey so subsequent split-PRs can reference it as context.
End-to-end results (ant-design / npmjs.org / GitHub Actions ubuntu)
utoo now matches or beats bun on 3 of 4 phases. The remaining p1 gap (+0.4s) is fully accounted for and at the architectural floor.
Journey timeline — p1_resolve preload wall
Key architectural moves
1. Worker-pool preload (commit ed7b551e) — the core win

Problem: `FuturesUnordered` polled all preload futures from a single main task. Each `resolve_package` call had 5+ awaits (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse). Every yield round-tripped through main, saturating it. Even after killing every Mutex/clone hot path, avg_conc held at 55-60 while the standalone manifest-bench (same reqwest stack, no resolver) sustained 92.

Fix: N long-lived `tokio::spawn` workers pulling work from `Arc<SegQueue<Dep>>` with `DashSet` dedup. Workers run on tokio's global executor independently; the main task only drains an `mpsc::unbounded_channel` for receiver events + the on_manifest callback. Termination via `dispatched` / `completed: AtomicUsize` + `Notify`.

Trait surface: `RegistryClient` futures gained `+ Send` bounds, `MockRegistryClient` derives `Clone`. `preload_manifests` takes `Arc<R>` instead of `&R`. Bound `R: Clone + Send + Sync + 'static, R::Error: Send` propagated up the API chain.

Result: avg_conc 55 → 84 (CI), wall 3.10s → 2.15s (-31%).
2. HTTP stack ceiling — empirically verified (`crates/manifest-bench`)

Built a standalone HTTP-only fetch tool that strips out everything ruborist does on top of the network: BFS, dedup, parse, project cache, lockfile. It dispatches an identical workload through the identical reqwest+rustls+tokio stack.

Key data point (CI ant-design npmjs.org cap=128, controlled for the same Cloudflare conditions):

ruborist now matches the HTTP stack ceiling. Further preload speedup requires a non-reqwest stack (HTTP/3, custom hyper Connector, etc.). This empirically refuted ~10 candidate optimizations as not the bottleneck.
3. Allocator hot-path cleanup

Each `resolve_package` call allocated several `String`s on the resolver hot path — cumulatively ~10k allocs per ant-design preload, all on workers competing for the shared allocator. Eliminated via:

- `normalize_spec` returns `Cow<'_, str>` instead of `(String, String)` (commit 12d34dd4)
- `Arc<FullManifest>` in `MemoryCache` — clone is an atomic bump, not a deep HashMap clone (commit 337d4f26)
- `versions.keys.clone()` removed from the cache-hit path — pass the borrow directly (commit bb256ecd)
- `format!()` dedup key kept (testing removal broke transitive prefetch)

4. Inline parse (drop `spawn_blocking`)

The worker pool surfaced parse blocking-pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests. Cap=4 on CI was funneling all parses through a 4-slot queue.

Fix: inline `simd_json` parse on the worker (commit f3f616d8). 1-5ms CPU per manifest is acceptable on an async worker; eliminates dispatch + queue overhead.

5. Workspace discovery parallelization
`find_workspaces_from_pkg` was reading 200+ workspace `package.json` files sequentially in a `for` loop. Replaced with `FuturesUnordered` (commit bf149957). Saved ~150ms (200×1ms serial → ~10ms parallel).

6. tarball extraction + cloner — install path

`extractor.rs`: `rayon::spawn` per package + `par_chunks(32)` intra-package writes. Verified essential by testing removal — sequential writes regressed p3 +3.67s, σ exploded 0.04→2.85. `cloner.rs`: kept on `tokio::task::spawn_blocking`. Verified essential by testing a rayon migration — that regressed p3 +2.65s due to oversubscription with the extractor's rayon pool. The current `tokio::blocking + cap=worker_threads` is the local optimum for hardlink (a single short syscall, no fan-out benefit).

Failed experiments (kept here so future attempts don't repeat)
All reverted. Each tested with CI bench data.
- 25814552 reverted by c610a582
- 02ef0562 reverted by 4e125908 / ae2a6088 / 5a897e4a
- 1a16d25e reverted by f379f7a9
- de5c83ed reverted by 90e421a5
- `max_blocking_threads = N*4` (132ef36e) reverted by 9a071f29
- c7c847d6 reverted by 2f1092c3
- 9229e160 reverted by e38329c8

Cloudflare per-source-IP throttle — measured
Standalone manifest-bench cap sweep (cap=32/64/96/128/192/256):
Conclusion: npmjs's Cloudflare frontend throttles on a per-source-IP basis (CI runner egress). Cap=128 is the sweet spot for our setup. Going wider triggers per-req inflation that cancels the parallelism gain.
This refuted the original "raise cap aggressively" intuition and 6 cap-sweep experiments.
Diagnostic infrastructure (kept, valuable for future work)
- (commit 5e3c12d2): per-preload `wall / busy / sum / avg_conc / p50/p95/max / cpu_tail`
- (commit 6e0e60e5): `spawn_blocking` queue p50/p95 + exec p50/p95
- (commit b92aa81f): standalone HTTP-only A/B against ruborist
- (commit f6846d6a): every CI run produces standalone control numbers
- (commit 88a4b056): warm 304 path restored without the per-package syscall storm
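The p50/p95/max lines in these reports imply a percentile aggregation over the recorded intervals; a minimal nearest-rank sketch (not the actual implementation, which isn't shown here):

```rust
/// Nearest-rank percentile over recorded interval durations (ms): the
/// smallest sample with at least p% of all samples at or below it.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    let mut ms: Vec<u64> = (1..=100).collect();
    println!(
        "p50={} p95={} max={}",
        percentile(&mut ms.clone(), 50.0),
        percentile(&mut ms.clone(), 95.0),
        percentile(&mut ms, 100.0)
    );
}
```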
Remaining optimization space (out of this PR's scope)
- `rustls-native-certs` lazy load, `reqwest::Client` deferred build
- `FullManifest` raw bytes + index instead of `OwnedValue` tree (1.46GB → ~700MB)

None are small enough to fit this PR.
PR split plan
Per discussion, this PR is being decomposed into 4-5 focused PRs:
This PR will close once 1-4 land. Track in subsequent PRs.
Test plan
- `cargo fmt` + `cargo clippy --all-targets -- -D warnings --no-deps` clean across crates
- `cargo test -p utoo-pm` 245 passed
- `cargo test -p utoo-ruborist` 164 + 10 doctests passed
- `pm-bench-phases` CI run >50 times, results documented above

🤖 Generated with Claude Code