perf(pm): #2818 rebase reproduce — bench probe #2834
Conversation
Replace intra-package `par_iter` with a sequential loop when writing extracted tar entries to disk. Each tar entry is typically small and writes complete in microseconds, so splitting them into rayon tasks was causing heavy work-stealing (futex park/unpark) and dominating context switches on large dep graphs. Cross-package parallelism is preserved by the outer `rayon::spawn` in `extract_tarball`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cold bench: drop `| tail -1` so hyperfine's full summary (mean, stddev, range) reaches the log. Failure detection now uses exit status instead of piping.
- `BENCH_WARM_RUNS=0` skips the warm phase entirely (previously the warm function always ran and hyperfine would reject --runs 0).
- Result aggregator tolerates empty or malformed export-json files (e.g. when a PM's cold install fails): the offending file is reported and skipped instead of crashing the whole summary.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the sequential `for` loop over extracted tar entries with `par_chunks(WRITE_CHUNK_SIZE)` — each rayon task writes a contiguous run of 32 files sequentially. This retains multi-core IO overlap for large packages while cutting the rayon task count (and its work-stealing futex traffic) by the chunk factor versus a per-file par_iter. Cross-package parallelism is preserved by the outer rayon::spawn in extract_tarball.
Local (macOS, antd-test, 3 runs avg):
- before par_iter: wall 17.2s sys 6.18s ivcsw 208k
- for-loop: wall 15.3s sys 2.36s ivcsw 61k
- par_chunks(32): wall 13.9s sys 5.77s ivcsw 191k
chunks wins wall but loses the ctx-switch reduction relative to the pure sequential version; CI with a large dep graph (ant-design-x) is the authoritative measurement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
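For reference, the chunked write shape could look roughly like the sketch below. `ExtractedEntry` and `write_entries` are hypothetical stand-ins (the real extractor's entry also carries mode bits, symlink targets, etc.); only `par_chunks(WRITE_CHUNK_SIZE)` and the chunk size of 32 come from the commit.

```rust
use std::{fs, io, path::PathBuf};
use rayon::prelude::*;

/// Hypothetical stand-in for the extractor's in-memory tar entry.
struct ExtractedEntry {
    dest: PathBuf,
    data: Vec<u8>,
}

const WRITE_CHUNK_SIZE: usize = 32;

/// Each rayon task writes a contiguous run of 32 files sequentially,
/// cutting task count (and work-stealing futex traffic) by the chunk
/// factor versus a per-file par_iter.
fn write_entries(entries: &[ExtractedEntry]) -> io::Result<()> {
    entries
        .par_chunks(WRITE_CHUNK_SIZE)
        .try_for_each(|chunk| -> io::Result<()> {
            for entry in chunk {
                if let Some(parent) = entry.dest.parent() {
                    fs::create_dir_all(parent)?;
                }
                fs::write(&entry.dest, &entry.data)?;
            }
            Ok(())
        })
}
```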
Accumulate wall microseconds for download, extract, and clone across
all packages during install. Print a one-line summary alongside the
existing `added / reused / downloaded` counts, e.g.
+ 513 added · 3017 reused · 123 downloaded
download 135.8s · extract 2.3s · clone 0.4s · 19.0 MB fetched
The sums are non-exclusive across cores: dividing by wall clock
gives the effective concurrency for each phase, and the ratio
between phases shows where cold-install CPU time actually lands.
Overhead is three atomics per downloaded tarball.
Local antd-test (macOS, npmmirror, 77 packages, wall 16s): download
dominates 98% of the CPU budget, extract 1.6%, clone 0.3% — reshapes
where we should look for cold-install wins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
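The counter plumbing is cheap enough to sketch in a few lines. A minimal version, assuming global statics and a synchronous closure for brevity (the real code keeps equivalent counters in the install context and times async phases):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Hypothetical globals; sums are non-exclusive across cores.
static DOWNLOAD_US: AtomicU64 = AtomicU64::new(0);
static EXTRACT_US: AtomicU64 = AtomicU64::new(0);
static CLONE_US: AtomicU64 = AtomicU64::new(0);

/// Wrap a phase and accumulate its wall microseconds.
fn timed<T>(counter: &AtomicU64, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    counter.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
    out
}

fn print_phase_summary(fetched_mb: f64) {
    let secs = |c: &AtomicU64| c.load(Ordering::Relaxed) as f64 / 1e6;
    println!(
        "download {:.1}s · extract {:.1}s · clone {:.1}s · {:.1} MB fetched",
        secs(&DOWNLOAD_US), secs(&EXTRACT_US), secs(&CLONE_US), fetched_mb
    );
}
```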
Needed so the per-phase timings line (`download · extract · clone · bytes`) printed at the end of each install reaches the CI log. Trade-off is noisier logs — registry INFO/WARN lines come through — but that's the price for visibility into where cold-install CPU actually lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Separates three independent measurements for utoo vs bun so each phase's improvement can be judged on its own baseline:
- Phase 1 · resolve — utoo deps / bun install --lockfile-only
- Phase 3 · cold install — utoo install / bun install (empty cache)
- Phase 4 · warm link — utoo install / bun install (cache warm)
Phase 3 uses the lockfile generated by phase 1, with cache reset between iterations. Phase 4 resets only node_modules so only the cache → node_modules link step is measured.
Uses hyperfine --show-output so utoo's phase-timings line (`download · extract · clone · bytes`) reaches the CI log alongside the wall-clock summary.
Triggered via workflow_dispatch with configurable project / registry / runs. Defaults to ant-design against npmjs.org, 3 runs per phase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anch merge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous inline bash -c prepare was silently a no-op on CI: utoo's run 2/3 showed '3280 reused', meaning the cache wasn't actually cleared, and bun hit InvalidNPMLockfile because utoo's package-lock.json leaked across iterations.
Now each phase writes a dedicated prepare shell script per-PM that:
- always drops node_modules (incl. workspace package trees),
- clears exactly the lockfiles that would confuse this PM,
- wipes the right cache for this phase,
- prints a '[prep]' line so the CI log proves prepare ran.
Also factored out seed_for_phase so lockfile / cache warmup happens once before the benchmark, not leaking into the measurement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che wipe
Path-based rm -rf of $HOME/.cache/nm wasn't actually emptying the cache on the CI runner — utoo runs 2/3 of phase 3 still showed '3280 reused', wall was 0.8-1.1s instead of the 10s cold-install baseline, and hyperfine itself warned about caches not being filled until after run 1. Let each PM clean its own cache via its CLI so we don't rely on guessing where it stores things. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`utoo clean` / `bun pm cache rm` didn't empty the cache on the CI runner either — so now use explicit bench-local paths the rm -rf prepare can guarantee to wipe: utoo: --cache-dir=/tmp/utoo-bench-cache on every invocation bun: BUN_INSTALL_CACHE_DIR=/tmp/bun-bench-cache (env var) Gets us deterministic cold/warm state between hyperfine iterations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop into diagnostic mode to figure out why hyperfine's --prepare still leaves utoo's cache intact across iterations despite the explicit --cache-dir. Prints the generated prepare script, and logs each per-iteration invocation's before/after du -sh of both caches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `case $phase in p1) p3) p4)`-style patterns never matched against actual phase strings like "p1_resolve" / "p3_cold_install" / "p4_warm_link". Result: write_prepare produced a script containing only the common header and no phase-specific cache-wipe logic, so every run after the first hit a warm cache and timings collapsed. Same off-by-name bug in seed_for_phase: the "p3:utoo" pattern never matched "p3_cold_install:utoo", skipping lockfile seeding and warm-cache priming. Switched both to "p*_*" globs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-size before/after logs + generated-script dumps were diagnostic scaffolding used to trace the p* vs p*_resolve pattern mismatch. With that fixed, keep the plain hyperfine --prepare invocation so CI logs are readable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time
Each hyperfine iteration now runs inside a metrics wrapper that greps /usr/bin/time -v output for RSS, voluntary/involuntary context switches, page faults, and IO read/write counts. Per-PM per-phase averages across the 3 runs are shown alongside the wall-clock table so we can see, e.g., whether utoo's resolve phase costs more syscalls than bun's, or whether its warm-link advantage comes at a memory cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the metrics wrapper to collect everything that's cheap on Linux:
- user / sys CPU seconds (from /usr/bin/time -v, lets us see CPU share)
- RSS, voluntary + involuntary ctx, major + minor page faults
- network RX / TX bytes (system-wide /proc/net/dev delta, excludes lo)
- disk page-in / page-out bytes (/proc/vmstat pgpg{in,out} × 4K pages)
Summary prints two tables per phase:
A. wall / ±σ / user / sys / RSS / minor faults
B. vCtx / iCtx / net RX / net TX / disk R / disk W
This makes resolve-phase vs link-phase comparison legible: e.g. network
cost should dominate download phases while disk writes dominate link.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous run attributed 525MB of writes to utoo's resolve phase when local check showed utoo only wrote ~28MB to its cache. The overshoot came from /proc/vmstat pgpgout being system-wide — it picked up ext4 journal, page-cache writeback, and other kernel activity unrelated to the benchmarked process. Switch to du-before/after on the paths that matter (cache dir, project node_modules, lockfiles) for a per-PM figure that reflects what the install actually produced. Summary now shows Δcache / Δnode_mod / Δlock per phase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measuring disk footprint via du before+after each iteration added 2-3s of traversal to every run (wall jumped from 2.3s → 4.9s on the warm-link phase). Both snapshots happened inside hyperfine's timed region because the wrapper runs as the benchmark command. Hot path keeps only /usr/bin/time + /proc/net/dev snapshots now. After hyperfine exits, capture_footprint does one du pass per phase/PM to record the final on-disk size of the cache, node_modules, and lockfile. Summary prints absolute sizes instead of per-iteration deltas — single sample is enough to compare what each PM produced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parseKey matched both `_${phase}_${pm}.json` (hyperfine export) and
`_${phase}_${pm}_footprint.json` (our new du snapshot), so the loop
tried to read .results[0] off the footprint and crashed the whole
summary. Add footprint suffix to the exclusion filter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
npm registries compress manifest responses ~13× (antd abbreviated goes from 4.2MB to 309KB with gzip), but ruborist's reqwest client had neither compression feature enabled — so it never advertised `Accept-Encoding: gzip,br` and the server delivered raw JSON. Adding `gzip` + `brotli` to the feature list cuts the cold `utoo deps` manifest traffic on ant-design from ~275 MB of JSON over the wire to ~21 MB. Wall improvement is modest on high-latency links (connection setup dominates) but the bandwidth reduction is real and the CPU cost of decompression is negligible next to simd_json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
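For context, a minimal sketch of what the feature change buys at the client-builder level. `manifest_client` is a hypothetical name; `gzip(true)` / `brotli(true)` are real reqwest ClientBuilder methods that only compile with the `gzip` / `brotli` cargo features the commit enables:

```rust
use reqwest::Client;

fn manifest_client() -> reqwest::Result<Client> {
    // With the features compiled in, the client advertises
    // `Accept-Encoding: gzip, br` and transparently decompresses bodies.
    Client::builder().gzip(true).brotli(true).build()
}
```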
reqwest's HTTP/2 client multiplexes every manifest fetch over a SINGLE TCP connection to each registry host. Bun opens ~10 parallel HTTP/2 connections and gets proportional extra bandwidth; we can't reproduce that through reqwest without custom pooling. Falling back to HTTP/1.1 with pool_max_idle_per_host(64) lets the pool open independent connections (one request per connection, 64 parallel).
Local cold `utoo deps` on ant-design against registry.antgroup-inc.cn:
- HTTP/2 single connection: 4.9s avg
- HTTP/1.1 + pool of 64: 4.0s avg (-18%)
- bun (reference): 3.2s
Full parity with bun still wants multi-connection HTTP/2 (bun's strategy), which reqwest doesn't expose without a custom client pool — future work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etching" This reverts commit 51b5ede.
Temporary diagnostic. Tracks send_us / body_us / bytes per fetch_full_manifest call and prints p50/p90/p99/max every 500 samples so the final output reflects the tail distribution of the full run. Remove before merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest multiplexes all requests over a single HTTP/2 connection by default, which causes head-of-line blocking on npm registries with high RTT: a slow tail response stalls the whole manifest fetch phase. An HTTP/1.1 pool lets concurrent manifest requests open independent TCP streams, so a single slow response no longer blocks the rest. Locally on ant-design with npmjs, this cut cold deps-resolve from ~121s (H2 single) to ~21s (H1 pool) — 5.75× faster. On low-latency registries (antgroup) the two are neutral, so there is no downside. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
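The pool shape described above, sketched against real reqwest ClientBuilder APIs. The 64-connection cap is the commit's value; the idle timeout is illustrative, not from the source:

```rust
use std::time::Duration;
use reqwest::Client;

fn manifest_client() -> reqwest::Result<Client> {
    Client::builder()
        .http1_only()                // opt out of H2 multiplexing → no head-of-line blocking
        .pool_max_idle_per_host(64)  // up to 64 independent TCP connections can park
        .pool_idle_timeout(Duration::from_secs(90)) // illustrative value
        .build()
}
```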
Adds a per-name single-flight gate to UnifiedRegistry::resolve_full_manifest. Concurrent callers for the same package name now serialize on a per-name mutex; the first caller hits the network and populates the memory cache, the rest re-check the cache after the gate and return the cached manifest. On ant-design cold deps this eliminates ~100+ duplicate full-manifest fetches observed when many deps point at the same transitive package. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
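The gate shape could look roughly like this minimal sketch. `SingleFlight` and the usage names are hypothetical; the real gate lives inside `UnifiedRegistry::resolve_full_manifest`:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Default)]
struct SingleFlight {
    gates: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl SingleFlight {
    /// Fetch (or create) the per-name gate mutex.
    async fn gate(&self, name: &str) -> Arc<Mutex<()>> {
        let mut gates = self.gates.lock().await;
        gates.entry(name.to_string()).or_default().clone()
    }
}

// Usage sketch inside resolve_full_manifest:
//   let gate = flight.gate(name).await;
//   let _held = gate.lock().await;               // serialize same-name callers
//   if let Some(m) = cache.get_full_manifest(name) {
//       return Ok(m);                            // re-check after the gate
//   }
//   let m = fetch_from_network(name).await?;     // first caller only
//   cache.set_full_manifest(name, m.clone());
```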
Reverts the temporary record_sample() and per-request timing diagnostics added in 14f2777 / 50a7014. The distribution data was used to identify HTTP/2 head-of-line blocking; now that H1 + pool and dedup are in, the diagnostic prints are no longer needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the complete cold install (utoo install / bun install) with everything wiped — lockfile, all caches, node_modules. Matches the end-to-end "freshly cloned repo" user scenario and is directly comparable to pm-bench.yml's cold install number. Reported alongside the existing p1_resolve / p3_cold_install / p4_warm_link phases; does not replace any of them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest pins every new connection to the first resolved IP even when DNS returns multiple A records. On registries backed by a CDN with many IPs (antgroup returns 8, npm/Cloudflare returns 2-4) this means all concurrent pool connections land on one IP, which caps effective parallelism regardless of `pool_max_idle_per_host`. Rotate the returned address list by an atomic counter on every `resolve` call so reqwest's connect loop picks a different IP per new connection. Connections end up uniformly distributed across all A records returned by DNS.
Measured on ant-design / antgroup registry (cold deps, local):
- utoo-h1 (single IP): 5.38s HTTP phase, 120 conn on 1 IP
- utoo-h1 + DNS rotation: 3.95s HTTP phase, 8 IPs × 8 conn each
- bun baseline: 3.72s HTTP phase, 4 IPs × 64 conn each
Total deps-resolve wall time now matches bun (~3.3s vs 3.3s).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
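The rotation itself is a few lines. A sketch of the core idea, assuming a `rotate_addrs` helper that the real code would plug into reqwest's custom `dns::Resolve` hook so the connect loop sees the rotated order:

```rust
use std::net::SocketAddr;
use std::sync::atomic::{AtomicUsize, Ordering};

static ROTATE: AtomicUsize = AtomicUsize::new(0);

/// Rotate the resolved address list so each new connection starts
/// from a different A record.
fn rotate_addrs(mut addrs: Vec<SocketAddr>) -> Vec<SocketAddr> {
    if addrs.len() > 1 {
        let shift = ROTATE.fetch_add(1, Ordering::Relaxed) % addrs.len();
        addrs.rotate_left(shift);
    }
    addrs
}
```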
Local antgroup runs show DNS rotation cuts utoo's resolve HTTP phase from 5.38s to 3.95s (matching bun). On CI against npmjs however the resolve wall time is flat — possibly because:
- npmjs from GH Actions returns fewer A records (Cloudflare Anycast)
- low RTT already masks the HOL tail
Capture a single cold resolve run per PM under tcpdump so we can see the actual connection topology on CI and compare against the local antgroup evidence. Output uploaded as pm-bench-pcap artifact. Runs once after the main phased bench; reuses the already-cloned project directory and wipes lockfiles + caches itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pcap comparison against bun on both local (antgroup) and CI (npmjs) consistently shows bun opens ~256 parallel TCP connections during a cold install (4 IPs × 64 conn each), while utoo was capped at 64 — ~1/4 the effective parallelism even after the DNS round-robin fix, because reqwest treats all addresses of a host as a single pool rather than per-IP like bun. Raise the default concurrent manifest fetch count from 64 to 256 to match bun's observed network footprint. The CLI flag `--manifests-concurrency-limit` still overrides it. Pool idle cap bumped to 256 so the keep-alive pool can park every in-flight connection without churning. Risk: with DNS returning few A records the 256 connections may concentrate on one IP and trigger per-IP rate limits. Pushing to CI to measure before committing to this as the default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone manifest-bench cap=128 hits avg_conc=95 with the same reqwest stack; ruborist stuck at avg_conc=56 even after dropping indicatif Mutex calls (commit 2b89d0b). Same-CI-run comparison under matched Cloudflare conditions: standalone wall=2.06s vs ruborist wall=3.09s — a 15-conc gap that isn't HTTP, isn't parse, and isn't progress-bar lock contention.
Hypothesis: `MemoryCache::get_full_manifest` returned `FullManifest` by value, deep-cloning the per-version `HashMap<String, Arc<simd_json::OwnedValue>>` (100-500 entries, key Strings + Arc bumps per entry) on every cache hit. Each `resolve_package` call issues this read at line 226 of registry.rs as its first sync step, running on the main task that owns `FuturesUnordered` — so the deep clone serialises directly with the fill-and-drain loop and caps the in-flight count.
Change cache storage to `Arc<FullManifest>`:
- `MemoryCache.full_manifests: RwLock<HashMap<String, Arc<FullManifest>>>`
- `get_full_manifest -> Option<Arc<FullManifest>>` (atomic-bump clone)
- `set_full_manifest(name, Arc<FullManifest>)` (avoid wrapping at boundary)
- `FullManifestResult::Full(Arc<FullManifest>)` so OnceMap dedup also hands shared `Arc`s to coalesced waiters instead of cloning the whole struct per caller
`UnifiedRegistry::resolve_full_manifest` constructs the `Arc` once on the network path (line 281, 318) and passes the same handle to both `cache.set` and `Ok(FullManifestResult::Full)`. Trait method `get_cached_full_manifest` keeps its `Option<FullManifest>` signature (one external caller is `ut view`, off the hot path) and deep-clones on demand from the `Arc`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
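A minimal sketch of the new cache shape, with the `FullManifest` fields elided. A cache hit is now an atomic refcount bump rather than a deep clone of the per-version map:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct FullManifest { /* versions: HashMap<String, Arc<OwnedValue>>, dist-tags, ... */ }

#[derive(Default)]
struct MemoryCache {
    full_manifests: RwLock<HashMap<String, Arc<FullManifest>>>,
}

impl MemoryCache {
    /// Cache hit: clone the Arc handle (refcount bump), never the struct.
    fn get_full_manifest(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.full_manifests.read().unwrap().get(name).cloned()
    }

    /// Caller hands in an already-wrapped Arc (no re-wrapping at the boundary).
    fn set_full_manifest(&self, name: String, manifest: Arc<FullManifest>) {
        self.full_manifests.write().unwrap().insert(name, manifest);
    }
}
```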
Final hypothesis after Arc<FullManifest> didn't lift the avg_conc=56 ceiling: ruborist hot paths emit ~5-10 `tracing::debug!()` per resolved manifest (cache hits, preload events, BFS dispatch). With 2730+ manifests during cold preload that's 15-30k events. Even through tracing_appender's non_blocking channel, each event pays format/serialise CPU on the resolving thread before the channel send. The standalone manifest-bench has zero tracing calls and hits avg_conc=92 at cap=128 with the same reqwest stack. Drop file-layer default from `utoo=debug` to `utoo=info`. The hot debug events stop firing entirely (no format, no channel send). Override path preserved: `UTOO_FILE_LOG=debug` (or any RUST_LOG-style spec) re-enables verbose file capture when actually diagnosing. Console filter behaviour unchanged. Expected: avg_conc lifts from 56 toward standalone's 92, p1_resolve preload wall drops toward standalone's 2.0-2.4 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
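The filter setup could look roughly like this, assuming tracing_subscriber's `EnvFilter` is what builds the file-layer filter (the `UTOO_FILE_LOG` env var name comes from the commit above):

```rust
use tracing_subscriber::EnvFilter;

/// File-layer filter: default `utoo=info`; UTOO_FILE_LOG (any
/// RUST_LOG-style spec) re-enables verbose capture for diagnosis.
fn file_filter() -> EnvFilter {
    EnvFilter::try_from_env("UTOO_FILE_LOG")
        .unwrap_or_else(|_| EnvFilter::new("utoo=info"))
}
```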
`resolve_package`'s full-manifest cache-hit branch (registry.rs:541) was cloning the entire `versions.keys: Vec<String>` (100-500 entries per package) just to pass `&[String]` to `resolve_target_version`. Cold ant-design preload hits this branch ~1800 times (every dep beyond the first unique-(name) pop falls through here once preload has populated the full manifest). 1800 × ~200 entries = ≈360k String allocations on the resolver worker pool — global allocator contention that doesn't show up in our HTTP/parse diag because it runs on resumed-future threads, not the main task. Borrow `&full_manifest.versions.keys` directly; `Arc<FullManifest>` auto-derefs and the slice coercion satisfies the API. Zero alloc. Diagnostic context: standalone manifest-bench cap=128 hits avg_conc=92 with the same reqwest stack; ruborist held at 55-57 even after Mutex/clone hot-path eliminations elsewhere. Allocator pressure on resolver threads is a remaining structural source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`normalize_spec` unconditionally allocated `(String, String)` — including the ~99 % case where the spec has no `npm:` or `workspace:` prefix and no normalisation is needed. ~5460 String allocs per ant-design preload (2 per `resolve_package` call × 2730 unique deps), all on resolver futures driven by main task's cooperative polling. Switch return type to `(Cow<'a, str>, Cow<'a, str>)`. Common path returns `Cow::Borrowed` and pays zero allocations. `npm:` / `workspace:` prefix paths still build the substring borrow without allocating (they're already slices into the input). Callers (3 sites: traits/registry.rs, service/registry.rs, resolver/registry.rs) work unchanged thanks to Cow's `Deref<Target=str>`. Diagnostic context: standalone manifest-bench cap=128 reaches avg_conc=92 with the same reqwest stack; ruborist held at 55-58 even after Mutex / FullManifest / progress-bar / tracing / keys.clone() eliminations. Allocator pressure on the resolver worker pool — each per-future hot-path String alloc compounds across 2700+ futures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
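A simplified sketch of the zero-alloc common path (the real signature and prefix semantics live in ruborist; the `workspace:` branch is omitted here and the `npm:` parsing is reduced to its shape):

```rust
use std::borrow::Cow;

fn normalize_spec<'a>(name: &'a str, spec: &'a str) -> (Cow<'a, str>, Cow<'a, str>) {
    if let Some(rest) = spec.strip_prefix("npm:") {
        // "npm:@scope/pkg@^1.0" → aliased name + version range; both are
        // still slices into the input, so no String allocation here either.
        if let Some((aliased, version)) = rest.rsplit_once('@').filter(|(n, _)| !n.is_empty()) {
            return (Cow::Borrowed(aliased), Cow::Borrowed(version));
        }
    }
    // ~99% case: nothing to normalize — borrow both inputs as-is.
    (Cow::Borrowed(name), Cow::Borrowed(spec))
}
```

Callers keep working unchanged because `Cow<'_, str>` derefs to `&str`.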
Old design: main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deep await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse `spawn_blocking`) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after Mutex / allocator hot-path eliminations, while the standalone manifest-bench (same reqwest stack, no resolver) hit 92 at the same cap.
New design: N long-lived `tokio::spawn` workers pulling from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, no cooperative poll bottleneck. Main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + the on_manifest callback.
Termination: workers track `dispatched`/`completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes the others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits.
Trait surface changes:
- `RegistryClient`'s default-method futures gained `+ Send` bounds (and `Self: Sync` where a blanket-default fn calls into `&self`)
- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); the call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`. Bounds at every public surface up the chain bumped to `R: RegistryClient + Clone + Send + Sync + 'static`, `R::Error: Send`.
- `resolve_package` / `resolve_registry_dep` / `process_dependency` helper bounds gained `+ Sync` (their `R::Future: Send` bounds are inherited from the trait change above).
Local npmmirror smoke (cap=256 via DEFAULT_CONCURRENCY): avg_conc jumped from ~55 (old) to 86.8 (new). The worker pool delivers the parallelism the standalone manifest-bench was already showing.
Tests use `#[tokio::test(flavor = "multi_thread", worker_threads = 2)]` since the worker pool needs a spawn-able runtime; ruborist's dev-dependencies on `tokio` add the `rt-multi-thread` feature.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
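A heavily simplified skeleton of the queue/worker/termination shape described above, assuming the crossbeam `SegQueue` and dashmap `DashSet` named in the commit. `Dep`, `preload`, and the resolve stub are hypothetical; the real termination handshake must also close the park/notify race this sketch glosses over:

```rust
use std::sync::{
    atomic::{AtomicBool, AtomicUsize, Ordering},
    Arc,
};
use crossbeam_queue::SegQueue;
use dashmap::DashSet;
use tokio::sync::{mpsc, Notify};

struct Dep { name: String }

async fn preload(seed: Vec<Dep>, workers: usize) {
    let queue = Arc::new(SegQueue::new());
    let seen: Arc<DashSet<String>> = Arc::new(DashSet::new());
    let dispatched = Arc::new(AtomicUsize::new(0));
    let completed = Arc::new(AtomicUsize::new(0));
    let shutdown = Arc::new(AtomicBool::new(false));
    let wake = Arc::new(Notify::new());
    let (tx, mut rx) = mpsc::unbounded_channel::<Dep>();

    for dep in seed {
        if seen.insert(dep.name.clone()) {          // DashSet dedup
            dispatched.fetch_add(1, Ordering::SeqCst);
            queue.push(dep);
        }
    }

    for _ in 0..workers {
        let (queue, dispatched, completed, shutdown, wake, tx) = (
            queue.clone(), dispatched.clone(), completed.clone(),
            shutdown.clone(), wake.clone(), tx.clone(),
        );
        tokio::spawn(async move {
            loop {
                match queue.pop() {
                    Some(dep) => {
                        // resolve_package(&dep).await would run here,
                        // fully independent of the other workers.
                        let done = completed.fetch_add(1, Ordering::SeqCst) + 1;
                        let _ = tx.send(dep); // completion → main task
                        if done == dispatched.load(Ordering::SeqCst) && queue.is_empty() {
                            shutdown.store(true, Ordering::SeqCst);
                            wake.notify_waiters();
                        }
                    }
                    None if shutdown.load(Ordering::SeqCst) => break,
                    None => wake.notified().await, // park until work or shutdown
                }
            }
            // worker's tx clone drops here; the last drop closes the channel
        });
    }
    drop(tx);

    // Main task only drains completions (fires receiver events / on_manifest);
    // newly discovered deps would be deduped, counted, pushed, and notified here.
    while let Some(_dep) = rx.recv().await {}
}
```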
Worker-pool preload (ruborist ed7b551) sustains avg_conc=66 at cap=96 on CI vs the prior FuturesUnordered's 58 — and same-run standalone manifest-bench reached 93/2.14s at cap=128 with the identical reqwest stack. With workers running independently on tokio's global executor (no cooperative-poll serialisation through one task), more cap slots translate directly to more parallel TCP requests in flight. The Cloudflare per-req throttle curve we measured under the old architecture (per-req wall doubled at cap 128→256) was conflated with the FuturesUnordered ceiling. With workers decoupled the curve needs re-measurement; cap=128 is the cheapest experiment that brings ruborist to standalone parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool sweep on CI ant-design p1_resolve:
cap=96: wall=2.23s avg_conc=66 per-req=53ms
cap=128: wall=2.15s avg_conc=84 per-req=66ms
→ per-req drops with cap (refutes the FuturesUnordered-era
"server throttle past 70 conc" reading; that was main-task
saturation). Same-run standalone manifest-bench cap=192 hit
130 conc / 2.10s, so cap=160 should bring another 0.1-0.2s
out of preload before the curve flattens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool preload at cap=160 surfaced parse blocking-pool queue saturation: parse diag showed `queue p95=200ms sum=70-89s` over 2730 manifests — ~26ms average queue wait per parse. That accounted for the entire ruborist-vs-standalone per-req gap (55ms vs 28ms under identical Cloudflare conditions). Cause: blocking pool is sized to `worker_threads` (= num_cpus = 4 on CI). Worker-pool preload sustains 80+ concurrent fetches; each spawn_blocking parse goes into a 4-slot queue and waits behind others. Original spawn_blocking offload was justified under FuturesUnordered + main-task polling (would have stalled the single poll loop), but worker-pool runs each future on tokio's global executor — a brief 1-5ms sync CPU burst on a worker is cheaper than spawn_blocking dispatch + queue wait. Inline simd_json parse on the resolving worker. Each worker thread parses its own response immediately after `bytes().await`; no extra hop. Worker-pool's independent task scheduling means one stalled worker doesn't starve the others — we just lose ~5ms of one worker's cycle, which is far less than the dispatch-and-queue round-trip we were paying. Both fetch sites updated (`fetch_full_manifest` for npmjs full manifest path, `fetch_version_manifest` for semver registries like npmmirror). Expected: ruborist preload per-req drops from 55-66ms → ~30-40ms (matching standalone), wall toward ~1.7s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
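The inline-parse shape, sketched under assumptions: `anyhow` for error plumbing and a hypothetical `fetch_and_parse` wrapper. The real change touches `fetch_full_manifest` / `fetch_version_manifest`; the key move is that `simd_json::to_owned_value` runs right after `bytes().await` on the same worker, with no spawn_blocking hop:

```rust
use simd_json::OwnedValue;

async fn fetch_and_parse(client: &reqwest::Client, url: &str) -> anyhow::Result<OwnedValue> {
    let body = client.get(url).send().await?.bytes().await?;
    let mut buf = body.to_vec();                 // simd_json parses &mut [u8] in place
    let value = simd_json::to_owned_value(&mut buf)?;
    Ok(value)                                    // 1-5ms sync burst on this worker
}
```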
cap=160 + inline parse pushed avg_conc to 119 — past the per-source Cloudflare throttle threshold. Per-req inflated 55 ms → 93 ms; net wall flat at 2.14s. cap=128 + inline parse: avg_conc target ~85-95 (matching standalone manifest-bench cap=128 = 70-90 / 1.6-2.0s under similar Cloudflare conditions). Inline parse alone (no spawn_blocking queue) plus sane concurrency should land preload at ~1.7-1.8s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`find_workspaces_from_pkg` was reading every workspace's package.json
sequentially in a `for path in matched_paths { read_package_json(...).await }`
loop. Ant-design has ~200 workspace packages; at ~1 ms per single-file
async FS round-trip on CI runners that's ~150-200 ms of serial I/O —
the largest unmeasured chunk between preload completion and lockfile
write (hyperfine total p1 minus instrumented sub-phases).
Collect workspace paths from every glob pattern first, then dispatch
all `read_package_json` calls into a `FuturesUnordered` for parallel
execution. Each read is small (typical workspace package.json < 4 KB)
so completion order is irrelevant — just push results as they land.
Expected: ant-design p1_resolve hyperfine wall drops by 100-150 ms
(toward ~2.40s vs current 2.58s).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
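The fan-out could look roughly like this sketch, with `serde_json::Value` standing in for whatever `read_package_json` actually returns:

```rust
use std::path::PathBuf;
use futures::stream::{FuturesUnordered, StreamExt};

/// Read every workspace package.json concurrently; completion order
/// is irrelevant, so results are pushed as they land.
async fn read_all_workspaces(paths: Vec<PathBuf>) -> Vec<(PathBuf, serde_json::Value)> {
    let mut tasks: FuturesUnordered<_> = paths
        .into_iter()
        .map(|p| async move {
            let bytes = tokio::fs::read(p.join("package.json")).await.ok()?;
            let pkg: serde_json::Value = serde_json::from_slice(&bytes).ok()?;
            Some((p, pkg))
        })
        .collect();

    let mut out = Vec::new();
    while let Some(res) = tasks.next().await {
        if let Some(item) = res {
            out.push(item);
        }
    }
    out
}
```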
p1_resolve hyperfine still has ~80 ms of unmeasured wall after parallel workspace reads (commit bf14995). Suspected: 2-3 MB package-lock.json serialize + atomic-write-rename. Add per-step timing log so we know which knob to turn (compact-json, to_writer streaming, async fs::rename quirks, etc). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add timer covering find_root_path → read root package.json → engines inject → graph init → root edges → workspace discovery → workspace nodes/edges. This is the chunk between hyperfine start and build_deps entry — currently uninstrumented, and the source of the residual ~85ms gap now that lockfile timing showed save is only 11ms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linter-applied formatting cleanup, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original cap was sized for the FuturesUnordered preload that dispatched 128 simd_json parses through `spawn_blocking` in a burst — letting the default 512 cap run gave bimodal wall (M2: 2.7s fast / 6.9s thrash). Capping at `worker_threads` eliminated the thrash peak. After commit f3f616d (inline parse) preload no longer uses the blocking pool. The dominant consumer is now `cloner.rs` during the install phase: every file's hardlink / clonefile / copy goes through `spawn_blocking`, ~50000 short syscalls per ant-design install. Each syscall is near-instant, so the cap rarely backpressures, but cap=4 on CI does limit how fast cloner can fire syscalls in parallel. Raise cap to `max(worker_threads * 4, 32)`: enough headroom for cloner to keep multiple syscalls in flight, low enough that the historical thrash regime (hundreds of churning threads) stays avoided. Pool is per-runtime; idle threads die after 10s. Expected: small p3_cold_install improvement (current utoo 5.74s vs bun 7.71s); preload phase unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
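For reference, the cap formula this commit sets (and which the next commit reverts), sketched against real tokio runtime-builder APIs with a hypothetical `build_runtime` wrapper:

```rust
use tokio::runtime::{Builder, Runtime};

fn build_runtime(worker_threads: usize) -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(worker_threads)
        // Headroom for cloner's ~50k short spawn_blocking syscalls to
        // stay in flight, while avoiding the historical thrash regime.
        .max_blocking_threads(std::cmp::max(worker_threads * 4, 32))
        .enable_all()
        .build()
}
```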
… 32)" This reverts commit 132ef36.
A/B test: replace `entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each` with a plain sequential `for entry in &entries` loop. Each tarball still runs in its own outer `rayon::spawn` task (cross-package parallelism preserved); only the within-tarball write fan-out is removed. Goal: measure whether rayon's intra-package parallelism still earns its keep after the worker-pool preload rewrite. Cross-package parallelism alone may already saturate IO; if so, removing the inner par_chunks cuts work-stealing futex traffic + thread sync overhead with zero throughput cost. If p3_cold_install regresses ≥0.3s → intra-package writes are genuinely IO-bound across cores, restore par_chunks. If p3 unchanged or improves → simpler sequential code wins. This is a test commit. Will be reverted if regression measured. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…act" This reverts commit c7c847d.
`clone_dir` (Linux hardlink/copy path) was using `tokio::task::spawn_blocking` per package — at default cap=4 on CI, only 4 packages cloned at once, each running all file hardlinks sequentially internally. ~3500 packages × N files per install all funneled through that bounded pool.
Switch to the same pattern `extractor.rs` already uses:
- `rayon::spawn` per package replaces `spawn_blocking` (cross-package parallelism via rayon work-stealing — global pool, not capped at worker_threads)
- `par_chunks(CLONE_CHUNK_SIZE)` for the inner hardlink/copy loop (intra-package fan-out across cores; same chunk size = 32 as extractor)
Trade-offs:
- EXDEV `force_copy` latch is now per-chunk instead of global per clone — chunks each rediscover cross-device errors and fall back locally. A few extra hardlink-then-copy round-trips at chunk boundaries, acceptable for the rare cross-device install.
- Pool unification: tokio blocking pool now mostly idle (just git + http tarball + a few one-shot commands), rayon handles all the high-volume IO. Cuts the 3-pool fragmentation observed earlier.
Tested:
- Iter 1 of this loop (cap bump from N to max(N*4, 32)): no p3 win, p4 regressed → cap raise alone wasn't the answer.
- Iter 2 (drop intra-package par_chunks in extractor): p3 +3.67s, σ exploded 0.04 → 2.85s → intra-package fan-out is essential.
- This commit applies the same fan-out to clone_dir for the same reason.
macOS `clonefile` path (target_os = "macos") unchanged — clonefile is a single syscall per file, different perf profile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
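A sketch of the inner fan-out for one package, with the per-chunk latch described above. `clone_files` and the `(src, dst)` pair slice are hypothetical; the real clone_dir wraps this in a per-package `rayon::spawn`:

```rust
use std::{fs, io, path::PathBuf};
use rayon::prelude::*;

const CLONE_CHUNK_SIZE: usize = 32;
const EXDEV: i32 = 18; // libc::EXDEV on Linux: cross-device link attempt

fn clone_files(pairs: &[(PathBuf, PathBuf)]) -> io::Result<()> {
    pairs
        .par_chunks(CLONE_CHUNK_SIZE)
        .try_for_each(|chunk| -> io::Result<()> {
            let mut force_copy = false; // per-chunk latch, not global
            for (src, dst) in chunk {
                if !force_copy {
                    match fs::hard_link(src, dst) {
                        Ok(()) => continue,
                        // Cross-device: this chunk falls back to copying.
                        Err(e) if e.raw_os_error() == Some(EXDEV) => force_copy = true,
                        Err(e) => return Err(e),
                    }
                }
                fs::copy(src, dst)?;
            }
            Ok(())
        })
}
```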
This reverts commit 9229e16.
- delete crates/manifest-bench (debug-only, never merged)
- tombi format crates/ruborist/Cargo.toml
- typos: unparseable → unparsable in bench/pm-bench.sh
Code Review
This pull request implements a comprehensive suite of performance optimizations for the package manager's resolver and installation pipeline. Key enhancements include a transition to a worker-pool architecture for manifest preloading, lazy JSON parsing with memoization, and the adoption of aws-lc-rs for accelerated TLS handshakes. Additionally, the changes introduce round-robin DNS rotation to distribute network load, parallelized workspace discovery, and significant allocation reductions in critical paths. Review feedback suggests reverting the Rust edition to a stable version in the new benchmark utility, optimizing string splitting operations, and improving diagnostics for TLS certificate loading.
crates/manifest-bench/Cargo.toml (4)
The 2024 Rust edition has not been released yet. Using it will cause a compilation error. Please use a stable edition, such as 2021.
edition = "2021"
crates/manifest-bench/src/main.rs (158-160)
This implementation can be made more efficient and concise by using rsplit and next to get the last segment without creating an intermediate Vec.
key.rsplit("node_modules/").next().unwrap_or("").to_string()
crates/manifest-bench/src/main.rs (269)
The load_native_certs call can produce errors for certificates it fails to parse. These are available in native.errors. For better diagnostics, you should consider logging these errors, similar to the implementation in crates/ruborist/src/service/http.rs. This can help debug TLS issues on different environments.
if !native.errors.is_empty() {
eprintln!(
"warning: rustls-native-certs reported {} non-fatal load issues",
native.errors.len()
);
}
📊 pm-bench-phases
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 8.96s | 0.12s | 10.02s | 10.12s | 673M | 323.5K |
| utoo-npm | 10.20s | 0.22s | 11.28s | 13.04s | 1.37G | 165.4K |
| utoo | 9.81s | 1.50s | 11.12s | 12.29s | 2.33G | 264.8K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 17.0K | 18.0K | 1.16G | 7M | 1.83G | 1.72G | 1M |
| utoo-npm | 164.1K | 156.0K | 1.14G | 4M | 1.68G | 1.68G | 2M |
| utoo | 89.4K | 48.9K | 1.13G | 6M | 1.68G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.28s | 0.05s | 3.88s | 1.07s | 483M | 177.4K |
| utoo-npm | 5.45s | 0.04s | 5.96s | 1.11s | 433M | 74.7K |
| utoo | 2.62s | 0.07s | 5.60s | 2.06s | 1.45G | 194.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 11.0K | 3.7K | 203M | 3M | 106M | - | 1M |
| utoo-npm | 66.1K | 2.7K | 204M | 2M | 9M | 5M | 2M |
| utoo | 18.0K | 15.0K | 197M | 3M | 7M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.92s | 0.85s | 6.09s | 9.96s | 616M | 199.8K |
| utoo-npm | 7.96s | 1.55s | 5.46s | 11.40s | 812M | 101.9K |
| utoo | 6.63s | 1.29s | 5.40s | 10.50s | 828M | 100.8K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 6.1K | 7.6K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 124.7K | 88.7K | 963M | 3M | 1.67G | 1.67G | 2M |
| utoo | 77.2K | 42.6K | 964M | 3M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.19s | 0.00s | 0.19s | 2.33s | 138M | 32.5K |
| utoo-npm | 2.68s | 0.56s | 0.55s | 3.84s | 84M | 20.0K |
| utoo | 2.11s | 0.07s | 0.42s | 3.33s | 63M | 14.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 291 | 18 | 7M | 31K | 1.88G | 1.70G | 1M |
| utoo-npm | 45.7K | 18.6K | 425K | 12K | 1.67G | 1.67G | 2M |
| utoo | 15.9K | 8.8K | 427K | 15K | 1.68G | 1.67G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 52.93s | 30.91s | 8.99s | 9.43s | 539M | 380.9K |
| utoo-npm | 96.34s | 122.65s | 8.13s | 14.37s | 666M | 106.9K |
| utoo | 16.29s | 3.91s | 7.17s | 12.11s | 737M | 130.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 74.9K | 3.3K | 1.12G | 11M | 1.84G | 1.72G | 2M |
| utoo-npm | 257.2K | 73.4K | 990M | 9M | 1.67G | 1.68G | 2M |
| utoo | 154.7K | 64.5K | 998M | 9M | 1.67G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.55s | 0.08s | 3.92s | 1.07s | 590M | 183.9K |
| utoo-npm | 4.38s | 0.30s | 1.94s | 0.57s | 75M | 15.9K |
| utoo | 5.98s | 8.61s | 1.20s | 0.38s | 85M | 18.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 5.2K | 5.7K | 151M | 3M | 106M | - | 2M |
| utoo-npm | 45.7K | 854 | 14M | 2M | - | 4M | 2M |
| utoo | 18.0K | 691 | 17M | 3M | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 19.03s | 1.12s | 5.59s | 8.82s | 234M | 87.9K |
| utoo-npm | 52.35s | 37.91s | 6.09s | 13.00s | 547M | 99.5K |
| utoo | 47.76s | 42.65s | 5.86s | 11.63s | 659M | 94.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 53.8K | 2.9K | 985M | 8M | 1.70G | 1.70G | 2M |
| utoo-npm | 199.3K | 87.5K | 965M | 7M | 1.67G | 1.67G | 2M |
| utoo | 151.3K | 42.3K | 977M | 7M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.04s | 0.10s | 0.17s | 2.24s | 135M | 31.0K |
| utoo-npm | 2.29s | 0.03s | 0.57s | 3.80s | 85M | 19.9K |
| utoo | 2.07s | 0.12s | 0.42s | 3.38s | 65M | 14.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 509 | 27 | 3M | 48K | 1.82G | 1.72G | 2M |
| utoo-npm | 47.3K | 20.1K | 430K | 14K | 1.67G | 1.67G | 2M |
| utoo | 16.4K | 9.0K | 431K | 38K | 1.67G | 1.67G | 2M |
📊 pm-bench-phases
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 13.80s | 0.81s | 5.38s | 14.06s | 782M | 50.5K |
| utoo-npm | 15.93s | 0.92s | 8.03s | 16.25s | 971M | 100.6K |
| utoo | 14.36s | 1.73s | 7.85s | 16.08s | 1.85G | 162.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 16.1K | 137.4K | - | - | 1.76G | 1.91G | 1M |
| utoo-npm | 14.1K | 373.8K | - | - | 1.63G | 1.83G | 2M |
| utoo | 4.6K | 220.0K | - | - | 1.63G | 1.84G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.39s | 0.21s | 2.48s | 1.08s | 498M | 32.3K |
| utoo-npm | 6.48s | 2.33s | 4.58s | 2.65s | 559M | 37.8K |
| utoo | 3.78s | 0.97s | 4.08s | 2.09s | 1.65G | 109.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 8 | 22.9K | - | - | 110M | - | 1M |
| utoo-npm | 14 | 74.5K | - | - | 28M | 5M | 2M |
| utoo | 26 | 41.5K | - | - | 27M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 16.90s | 4.06s | 3.40s | 15.68s | 543M | 35.3K |
| utoo-npm | 15.74s | 3.91s | 4.08s | 16.72s | 767M | 80.2K |
| utoo | 15.84s | 0.63s | 4.39s | 22.11s | 621M | 72.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.8K | 127.6K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.5K | 234.5K | - | - | 1.60G | 1.83G | 2M |
| utoo | 1.4K | 165.6K | - | - | 1.60G | 1.83G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.93s | 0.83s | 0.15s | 2.87s | 49M | 3.7K |
| utoo-npm | 6.45s | 1.09s | 0.86s | 4.28s | 94M | 6.9K |
| utoo | 4.92s | 0.41s | 0.44s | 3.02s | 92M | 6.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 17.8K | 1.1K | - | - | 1.86G | 1.91G | 1M |
| utoo-npm | 13.2K | 78.4K | - | - | 1.61G | 1.86G | 2M |
| utoo | 13.9K | 21.0K | - | - | 1.63G | 1.86G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 44.34s | 20.14s | 7.29s | 21.21s | 556M | 35.9K |
| utoo-npm | 34.00s | 1.15s | 7.48s | 21.12s | 717M | 75.7K |
| utoo | 57.13s | 35.45s | 6.73s | 19.59s | 714M | 79.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.1K | 162.3K | - | - | 1.79G | 1.89G | 2M |
| utoo-npm | 4.4K | 420.2K | - | - | 1.61G | 1.88G | 2M |
| utoo | 1.0K | 310.6K | - | - | 1.61G | 1.87G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.25s | 1.38s | 2.88s | 1.45s | 563M | 36.6K |
| utoo-npm | 14.42s | 15.76s | 2.00s | 1.18s | 77M | 5.6K |
| utoo | 2.23s | 0.67s | 1.31s | 0.49s | 83M | 6.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 13 | 21.5K | - | - | 111M | - | 2M |
| utoo-npm | 5 | 43.6K | - | - | - | 4M | 2M |
| utoo | 17 | 14.7K | - | - | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 21.23s | 2.08s | 3.33s | 14.07s | 286M | 18.9K |
| utoo-npm | 59.91s | 39.18s | 5.67s | 20.47s | 779M | 80.1K |
| utoo | 32.47s | 4.90s | 5.69s | 22.57s | 650M | 78.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 2.1K | 155.9K | - | - | 1.65G | 1.92G | 2M |
| utoo-npm | 1.9K | 349.6K | - | - | 1.60G | 1.87G | 2M |
| utoo | 1.8K | 225.9K | - | - | 1.60G | 1.87G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.86s | 0.41s | 0.08s | 1.78s | 44M | 3.4K |
| utoo-npm | 3.40s | 0.40s | 0.53s | 2.70s | 92M | 6.9K |
| utoo | 3.09s | 0.73s | 0.33s | 2.19s | 77M | 5.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 13.0K | 933 | - | - | 1.78G | 1.90G | 2M |
| utoo-npm | 12.4K | 74.2K | - | - | 1.61G | 1.84G | 2M |
| utoo | 13.2K | 20.9K | - | - | 1.61G | 1.84G | 2M |
Summary
Rebase of #2818 (`perf/extract-sequential-writes`) onto current `origin/next` to test whether the original umbrella perf gains reproduce on a fresh `next` baseline.
101 commits stacked, includes:
Why this PR exists
- #2830 (spawn_local-only): no-op on p1_resolve
- #2832 (mt-pool minimal): mean drop -13% but σ=1.66s on utoo (huge)
Neither isolated PR cleanly reproduced #2818's claimed -30%. We need to verify the original umbrella commit set actually ships that gain on current `next` before continuing isolation work — otherwise we're chasing a number that may have been a one-time CI-noise artifact.
Plan
🤖 Generated with Claude Code