perf(pm): probe — #2818 minus worker-pool #2836

Closed
elrrrrrrr wants to merge 103 commits into next from perf/strip-worker-pool

Conversation

@elrrrrrrr
Contributor

Summary

Take #2818's full bundle (rebased in #2834, gives p1_resolve -52%) and revert ONLY the preload worker-pool commit (`ce574d58 perf(ruborist): preload worker-pool replaces FuturesUnordered`).

If perf is still -52% → the worker-pool isn't part of the driver mix; the gain comes purely from the network/cache layer (aws-lc-rs, OnceMap, DNS, etc.)
If perf drops back toward baseline → the worker-pool IS part of the synergistic driver

Context

probe p1_resolve takeaway:

  • #2832 mt-pool only: 4.59s ±1.66 (small mean drop, huge σ)
  • #2835 aws-lc-rs only: 6.13s ±1.00 (no improvement)
  • #2834 all 101 commits: 2.62s ±0.07 (-52%, very tight σ)
  • this PR = #2834 − worker-pool: TBD

Test plan

  • cargo build pass
  • CI bench-phases-linux

🤖 Generated with Claude Code

elrrrrrrr and others added 30 commits April 27, 2026 18:02
Replace intra-package `par_iter` with a sequential loop when writing
extracted tar entries to disk. Each tar entry is typically small and
writes complete in microseconds, so splitting them into rayon tasks
was causing heavy work-stealing (futex park/unpark) and dominating
context switches on large dep graphs. Cross-package parallelism is
preserved by the outer `rayon::spawn` in `extract_tarball`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cold bench: drop `| tail -1` so hyperfine's full summary (mean,
  stddev, range) reaches the log. Failure detection now uses exit
  status instead of piping.
- `BENCH_WARM_RUNS=0` skips the warm phase entirely (previously the
  warm function always ran and hyperfine would reject --runs 0).
- Result aggregator tolerates empty or malformed export-json files
  (e.g. when a PM's cold install fails): the offending file is
  reported and skipped instead of crashing the whole summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the sequential `for` loop over extracted tar entries with
`par_chunks(WRITE_CHUNK_SIZE)` — each rayon task writes a contiguous
run of 32 files sequentially. This retains multi-core IO overlap for
large packages while cutting the rayon task count (and its work-
stealing futex traffic) by the chunk factor versus a per-file
par_iter. Cross-package parallelism is preserved by the outer
rayon::spawn in extract_tarball.

Local (macOS, antd-test, 3 runs avg):
  before par_iter: wall 17.2s  sys 6.18s  ivcsw 208k
  for-loop:        wall 15.3s  sys 2.36s  ivcsw  61k
  par_chunks(32):  wall 13.9s  sys 5.77s  ivcsw 191k

chunks wins wall but loses the ctx-switch reduction relative to the
pure sequential version; CI with a large dep graph (ant-design-x)
is the authoritative measurement.
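The chunking idea above can be sketched with std threads standing in for rayon tasks (the real code uses `par_chunks` on rayon's pool; `write_entry` here is a hypothetical stand-in for the per-file disk write):

```rust
use std::thread;

// Hypothetical stand-in for writing one extracted tar entry to disk;
// returning the length makes the chunked fan-out observable in a test.
fn write_entry(entry: &str) -> usize {
    entry.len()
}

const WRITE_CHUNK_SIZE: usize = 32;

// One task per contiguous run of 32 entries, mirroring
// par_chunks(WRITE_CHUNK_SIZE): multi-core overlap is retained while
// task count (and work-stealing traffic) drops by the chunk factor
// versus one task per file.
fn write_entries_chunked(entries: &[String]) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = entries
            .chunks(WRITE_CHUNK_SIZE)
            .map(|chunk| s.spawn(move || chunk.iter().map(|e| write_entry(e)).sum::<usize>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```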

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accumulate wall microseconds for download, extract, and clone across
all packages during install. Print a one-line summary alongside the
existing `added / reused / downloaded` counts, e.g.

  + 513 added · 3017 reused · 123 downloaded
    download 135.8s · extract 2.3s · clone 0.4s · 19.0 MB fetched

The sums are non-exclusive across cores: dividing by wall clock
gives the effective concurrency for each phase, and the ratio
between phases shows where cold-install CPU time actually lands.
Overhead is three atomics per downloaded tarball.

Local antd-test (macOS, npmmirror, 77 packages, wall 16s): download
dominates 98% of the CPU budget, extract 1.6%, clone 0.3% — reshapes
where we should look for cold-install wins.
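A minimal sketch of the accumulator pattern, assuming the three-atomics design described above (static names are illustrative, not the actual utoo symbols):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-phase wall-microsecond accumulators: one relaxed fetch_add per
// package, summed across all concurrent download/extract/clone tasks.
static DOWNLOAD_US: AtomicU64 = AtomicU64::new(0);
static EXTRACT_US: AtomicU64 = AtomicU64::new(0);

fn record(counter: &AtomicU64, micros: u64) {
    counter.fetch_add(micros, Ordering::Relaxed);
}

// The sums are non-exclusive across cores, so dividing by wall clock
// yields the effective concurrency of a phase.
fn effective_concurrency(sum_us: u64, wall_us: u64) -> f64 {
    sum_us as f64 / wall_us as f64
}
```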

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needed so the per-phase timings line (`download · extract · clone · bytes`)
printed at the end of each install reaches the CI log. Trade-off is noisier
logs — registry INFO/WARN lines come through — but that's the price for
visibility into where cold-install CPU actually lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Separates three independent measurements for utoo vs bun so each
phase's improvement can be judged on its own baseline:

  Phase 1 · resolve     utoo deps          / bun install --lockfile-only
  Phase 3 · cold install utoo install      / bun install   (empty cache)
  Phase 4 · warm link    utoo install      / bun install   (cache warm)

Phase 3 uses the lockfile generated by phase 1, with cache reset
between iterations. Phase 4 resets only node_modules so only the
cache → node_modules link step is measured.

Uses hyperfine --show-output so utoo's phase-timings line
(`download · extract · clone · bytes`) reaches the CI log alongside
the wall-clock summary.

Triggered via workflow_dispatch with configurable project / registry
/ runs. Defaults to ant-design against npmjs.org, 3 runs per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anch merge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous inline bash -c prepare was silently no-op on CI: utoo's run 2/3
showed '3280 reused' meaning the cache wasn't actually cleared, and bun hit
InvalidNPMLockfile because utoo's package-lock.json leaked across
iterations.

Now each phase writes a dedicated prepare shell script per-PM that:
- always drops node_modules (incl. workspace package trees),
- clears exactly the lockfiles that would confuse this PM,
- wipes the right cache for this phase,
- prints a '[prep]' line so the CI log proves prepare ran.

Also factored out seed_for_phase so lockfile / cache warmup happens once
before the benchmark, not leaking into the measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che wipe

Path-based rm -rf of $HOME/.cache/nm wasn't actually emptying the cache
on the CI runner — utoo runs 2/3 of phase 3 still showed '3280 reused',
wall was 0.8-1.1s instead of the 10s cold-install baseline, hyperfine
itself warned about caches not being filled until after run 1.

Let each PM clean its own cache via its CLI so we don't rely on
guessing where it stores things.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`utoo clean` / `bun pm cache rm` didn't empty the cache on the CI
runner either — so now use explicit bench-local paths the rm -rf
prepare can guarantee to wipe:

  utoo: --cache-dir=/tmp/utoo-bench-cache on every invocation
  bun:  BUN_INSTALL_CACHE_DIR=/tmp/bun-bench-cache (env var)

Gets us deterministic cold/warm state between hyperfine iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop into diagnostic mode to figure out why hyperfine's --prepare
still leaves utoo's cache intact across iterations despite the
explicit --cache-dir. Prints the generated prepare script, and logs
each per-iteration invocation's before/after du -sh of both caches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `case $phase in` patterns (`p1)`, `p3)`, `p4)`) never matched
against actual phase strings like "p1_resolve" / "p3_cold_install" /
"p4_warm_link". Result: write_prepare produced a script containing
only the common header and no phase-specific cache-wipe logic, so
every run after the first hit a warm cache and timings collapsed.

Same off-by-name bug in seed_for_phase: "p3:utoo" pattern never
matched "p3_cold_install:utoo", skipping lockfile seeding and
warm-cache priming. Switched both to "p*_*" globs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-size before/after logs + generated-script dumps were
diagnostic scaffolding used to trace the p* vs p*_resolve pattern
mismatch. With that fixed, keep the plain hyperfine --prepare
invocation so CI logs are readable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time

Each hyperfine iteration now runs inside a metrics wrapper that greps
/usr/bin/time -v output for RSS, voluntary/involuntary context switches,
page faults, and IO read/write counts. Per-PM per-phase averages across
the 3 runs are shown alongside the wall-clock table so we can see, e.g.,
whether utoo's resolve phase costs more syscalls than bun's, or whether
its warm-link advantage comes at a memory cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the metrics wrapper to collect everything that's cheap on Linux:

- user / sys CPU seconds (from /usr/bin/time -v, lets us see CPU share)
- RSS, voluntary + involuntary ctx, major + minor page faults
- network RX / TX bytes (system-wide /proc/net/dev delta, excludes lo)
- disk page-in / page-out bytes (/proc/vmstat pgpg{in,out} × 4K pages)

Summary prints two tables per phase:
  A. wall / ±σ / user / sys / RSS / minor faults
  B. vCtx / iCtx / net RX / net TX / disk R / disk W

This makes resolve-phase vs link-phase comparison legible: e.g. network
cost should dominate download phases while disk writes dominate link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous run attributed 525MB of writes to utoo's resolve phase when
local check showed utoo only wrote ~28MB to its cache. The overshoot
came from /proc/vmstat pgpgout being system-wide — it picked up ext4
journal, page-cache writeback, and other kernel activity unrelated to
the benchmarked process.

Switch to du-before/after on the paths that matter (cache dir, project
node_modules, lockfiles) for a per-PM figure that reflects what the
install actually produced. Summary now shows Δcache / Δnode_mod / Δlock
per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measuring disk footprint via du before+after each iteration added
2-3s of traversal to every run (wall jumped from 2.3s → 4.9s on the
warm-link phase). Both snapshots happened inside hyperfine's timed
region because the wrapper runs as the benchmark command.

Hot path keeps only /usr/bin/time + /proc/net/dev snapshots now. After
hyperfine exits, capture_footprint does one du pass per phase/PM to
record the final on-disk size of the cache, node_modules, and
lockfile. Summary prints absolute sizes instead of per-iteration
deltas — single sample is enough to compare what each PM produced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parseKey matched both `_${phase}_${pm}.json` (hyperfine export) and
`_${phase}_${pm}_footprint.json` (our new du snapshot), so the loop
tried to read .results[0] off the footprint and crashed the whole
summary. Add footprint suffix to the exclusion filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
npm registries compress manifest responses ~13× (antd abbreviated goes
from 4.2MB to 309KB with gzip), but ruborist's reqwest client had
neither compression feature enabled — so it never advertised
`Accept-Encoding: gzip,br` and the server delivered raw JSON.

Adding `gzip` + `brotli` to the feature list cuts the cold
`utoo deps` manifest traffic on ant-design from ~275 MB of JSON
over the wire to ~21 MB. Wall improvement is modest on high-latency
links (connection setup dominates) but the bandwidth reduction is
real and the CPU cost of decompression is negligible next to simd_json.
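The change amounts to a Cargo feature toggle; a sketch of the dependency line (the version number is a placeholder, not taken from the repo):

```toml
# Enabling these reqwest features makes the client advertise
# Accept-Encoding: gzip, br and transparently decompress responses.
reqwest = { version = "0.12", features = ["gzip", "brotli"] }
```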

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest's HTTP/2 client multiplexes every manifest fetch over a SINGLE
TCP connection to each registry host. Bun opens ~10 parallel HTTP/2
connections and gets proportional extra bandwidth; we can't reproduce
that through reqwest without custom pooling.

Falling back to HTTP/1.1 with pool_max_idle_per_host(64) lets the pool
open independent connections (one request per connection, 64 parallel).
Local cold `utoo deps` on ant-design against registry.antgroup-inc.cn:

  HTTP/2 single connection: 4.9s avg
  HTTP/1.1 + pool of 64:    4.0s avg  (-18%)
  bun (reference):          3.2s

Full parity with bun still wants multi-connection HTTP/2 (bun's
strategy), which reqwest doesn't expose without a custom client pool —
future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Temporary diagnostic. Tracks send_us / body_us / bytes per
fetch_full_manifest call and prints p50/p90/p99/max every 500 samples
so the final output reflects the tail distribution of the full run.

Remove before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest multiplexes all requests over a single HTTP/2 connection by
default, which causes head-of-line blocking on npm registries with
high RTT: a slow tail response stalls the whole manifest fetch phase.

An HTTP/1.1 pool lets concurrent manifest requests open independent
TCP streams, so a single slow response no longer blocks the rest.
Locally on ant-design with npmjs, this cut cold deps-resolve from
~121s (H2 single) to ~21s (H1 pool) — 5.75× faster. On low-latency
registries (antgroup) the two are neutral, so there is no downside.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-name single-flight gate to UnifiedRegistry::resolve_full_manifest.
Concurrent callers for the same package name now serialize on a per-name
mutex; the first caller hits the network and populates the memory cache,
the rest re-check the cache after the gate and return the cached manifest.

On ant-design cold deps this eliminates ~100+ duplicate full-manifest
fetches observed when many deps point at the same transitive package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the temporary record_sample() and per-request timing diagnostics
added in 14f2777 / 50a7014. The distribution data was used to identify
HTTP/2 head-of-line blocking; now that H1 + pool and dedup are in, the
diagnostic prints are no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the complete cold install (utoo install / bun install) with
everything wiped — lockfile, all caches, node_modules. Matches the
end-to-end "freshly cloned repo" user scenario and is directly
comparable to pm-bench.yml's cold install number.

Reported alongside the existing p1_resolve / p3_cold_install / p4_warm_link
phases; does not replace any of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest pins every new connection to the first resolved IP even when
DNS returns multiple A records. On registries backed by a CDN with
many IPs (antgroup returns 8, npm/Cloudflare returns 2-4) this means
all concurrent pool connections land on one IP, which caps effective
parallelism regardless of `pool_max_idle_per_host`.

Rotate the returned address list by an atomic counter on every
`resolve` call so reqwest's connect loop picks a different IP per
new connection. Connections end up uniformly distributed across all
A records returned by DNS.

Measured on ant-design / antgroup registry (cold deps, local):
- utoo-h1 (single IP): 5.38s HTTP phase, 120 conn on 1 IP
- utoo-h1 + DNS rotation: 3.95s HTTP phase, 8 IPs × 8 conn each
- bun baseline: 3.72s HTTP phase, 4 IPs × 64 conn each

Total deps-resolve wall time now matches bun (~3.3s vs 3.3s).
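The rotation itself is a one-atomic-counter trick; a std-only sketch (the real code hooks this into reqwest's `dns_resolver`, which is not shown here):

```rust
use std::net::{IpAddr, Ipv4Addr};
use std::sync::atomic::{AtomicUsize, Ordering};

// Shared counter rotates the A-record list on every resolve() so each
// new connection starts its connect loop at a different IP.
struct RotatingResolver {
    counter: AtomicUsize,
}

impl RotatingResolver {
    fn resolve(&self, addrs: &[IpAddr]) -> Vec<IpAddr> {
        if addrs.is_empty() {
            return Vec::new();
        }
        let start = self.counter.fetch_add(1, Ordering::Relaxed) % addrs.len();
        // Rotate: [start..] then [..start], preserving fallback order.
        addrs[start..].iter().chain(addrs[..start].iter()).copied().collect()
    }
}
```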

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local antgroup runs show DNS rotation cuts utoo's resolve HTTP phase
from 5.38s to 3.95s (matching bun). On CI against npmjs however the
resolve wall time is flat — possibly because:
  - npmjs from GH Actions returns fewer A records (Cloudflare Anycast)
  - low RTT already masks HOL tail

Capture a single cold resolve run per PM under tcpdump so we can see
the actual connection topology on CI and compare against the local
antgroup evidence. Output uploaded as pm-bench-pcap artifact.

Runs once after the main phased bench; reuses the already-cloned
project directory and wipes lockfiles + caches itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pcap comparison against bun on both local (antgroup) and CI (npmjs)
consistently shows bun opens ~256 parallel TCP connections during
a cold install (4 IPs × 64 conn each), while utoo was capped at
64 — ~1/4 the effective parallelism even after the DNS round-robin
fix, because reqwest treats all addresses of a host as a single pool
rather than per-IP like bun.

Raise the default concurrent manifest fetch count from 64 to 256 to
match bun's observed network footprint. The CLI flag
`--manifests-concurrency-limit` still overrides it. Pool idle cap
bumped to 256 so the keep-alive pool can park every in-flight
connection without churning.

Risk: with DNS returning few A records the 256 connections may
concentrate on one IP and trigger per-IP rate limits. Pushing to
CI to measure before committing to this as the default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr and others added 22 commits April 27, 2026 18:03
Standalone manifest-bench cap=128 hits avg_conc=95 with the same
reqwest stack; ruborist stalls at avg_conc=56. Per-completion
indicatif Mutex contention is the remaining gap source after
dropping log_progress(format!()) (commit f455a0b) and reverting
the over-aggressive dedup-by-name.

Each PreloadQueued / PreloadProgress event calls
PROGRESS_BAR.inc[_length](1), each grabbing indicatif's internal
ProgressBar Mutex. With 4571 dispatches + 4571 completions the
main task pays ~9000 lock acquisitions during a 3-4 s phase, all
contending with the steady_tick draw thread (100 ms). That cap on
main loop throughput is what holds avg_conc at 56 vs the
standalone reqwest-only sweep's 95.

Drop the per-event bar updates entirely during preload. Phase
spinner still animates via steady_tick so the user sees activity;
PreloadComplete prints the final ok/fail summary. The numeric
during-preload counter is gone but the phase is short (3-4 s) and
the user sees the finished totals.

Expected: ruborist p1_resolve preload wall drops toward standalone
manifest-bench's 2.4 s, closing most of the remaining gap to bun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone manifest-bench cap=128 hits avg_conc=95 with the same
reqwest stack; ruborist stuck at avg_conc=56 even after dropping
indicatif Mutex calls (commit 2b89d0b). Same-CI-run comparison
under matched Cloudflare conditions: standalone wall=2.06s vs
ruborist wall=3.09s — 15-conc gap that isn't HTTP, isn't parse, and
isn't progress-bar lock contention.

Hypothesis: `MemoryCache::get_full_manifest` returned `FullManifest`
by value, deep-cloning the per-version `HashMap<String,
Arc<simd_json::OwnedValue>>` (100-500 entries, key Strings + Arc
bumps per entry) on every cache hit. Each `resolve_package` call
issues this read at line 226 of registry.rs as its first sync step,
running on the main task that owns `FuturesUnordered` — so the
deep clone serialises directly with the fill-and-drain loop and
caps in-flight count.

Change cache storage to `Arc<FullManifest>`:
- `MemoryCache.full_manifests: RwLock<HashMap<String, Arc<FullManifest>>>`
- `get_full_manifest -> Option<Arc<FullManifest>>` (atomic-bump clone)
- `set_full_manifest(name, Arc<FullManifest>)` (avoid wrapping at boundary)
- `FullManifestResult::Full(Arc<FullManifest>)` so OnceMap dedup also
  hands shared `Arc`s to coalesced waiters instead of cloning the
  whole struct per caller

`UnifiedRegistry::resolve_full_manifest` constructs the `Arc` once
on the network path (line 281, 318) and passes the same handle to
both `cache.set` and `Ok(FullManifestResult::Full)`. Trait method
`get_cached_full_manifest` keeps its `Option<FullManifest>`
signature (one external caller is `ut view`, off the hot path) and
deep-clones on demand from the `Arc`.
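The shape of the Arc-backed cache can be sketched with `FullManifest` reduced to a plain struct (types simplified for illustration; the real per-version values are `Arc<simd_json::OwnedValue>`):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct FullManifest {
    versions: HashMap<String, String>, // simplified stand-in
}

struct MemoryCache {
    full_manifests: RwLock<HashMap<String, Arc<FullManifest>>>,
}

impl MemoryCache {
    // Cache hit clones only the Arc handle (an atomic refcount bump),
    // not the per-version HashMap.
    fn get_full_manifest(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.full_manifests.read().unwrap().get(name).map(Arc::clone)
    }

    // Caller passes an Arc so the boundary never re-wraps.
    fn set_full_manifest(&self, name: String, manifest: Arc<FullManifest>) {
        self.full_manifests.write().unwrap().insert(name, manifest);
    }
}
```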

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final hypothesis after Arc<FullManifest> didn't lift the avg_conc=56
ceiling: ruborist hot paths emit ~5-10 `tracing::debug!()` per
resolved manifest (cache hits, preload events, BFS dispatch). With
2730+ manifests during cold preload that's 15-30k events. Even
through tracing_appender's non_blocking channel, each event pays
format/serialise CPU on the resolving thread before the channel
send. The standalone manifest-bench has zero tracing calls and
hits avg_conc=92 at cap=128 with the same reqwest stack.

Drop file-layer default from `utoo=debug` to `utoo=info`. The hot
debug events stop firing entirely (no format, no channel send).

Override path preserved: `UTOO_FILE_LOG=debug` (or any
RUST_LOG-style spec) re-enables verbose file capture when actually
diagnosing. Console filter behaviour unchanged.

Expected: avg_conc lifts from 56 toward standalone's 92, p1_resolve
preload wall drops toward standalone's 2.0-2.4 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`resolve_package`'s full-manifest cache-hit branch (registry.rs:541)
was cloning the entire `versions.keys: Vec<String>` (100-500 entries
per package) just to pass `&[String]` to `resolve_target_version`.

Cold ant-design preload hits this branch ~1800 times (every dep
beyond the first unique-(name) pop falls through here once preload
has populated the full manifest). 1800 × ~200 entries = ≈360k
String allocations on the resolver worker pool — global allocator
contention that doesn't show up in our HTTP/parse diag because it
runs on resumed-future threads, not the main task.

Borrow `&full_manifest.versions.keys` directly; `Arc<FullManifest>`
auto-derefs and the slice coercion satisfies the API. Zero alloc.

Diagnostic context: standalone manifest-bench cap=128 hits
avg_conc=92 with the same reqwest stack; ruborist held at 55-57
even after Mutex/clone hot-path eliminations elsewhere. Allocator
pressure on resolver threads is a remaining structural source.
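The clone-vs-borrow fix in miniature (types and the version-selection helper are simplified stand-ins, not the real ruborist signatures):

```rust
struct FullManifest {
    version_keys: Vec<String>, // stand-in for the cached versions.keys
}

// Stand-in for resolve_target_version: it only needs a borrowed slice.
fn resolve_target_version(keys: &[String]) -> Option<&String> {
    keys.last()
}

// Zero-alloc path: &Vec<String> coerces to &[String], so the cache-hit
// branch never clones the key Vec (previously ~200 String allocations
// per call just to satisfy the parameter type).
fn pick_with_borrow(m: &FullManifest) -> Option<&String> {
    resolve_target_version(&m.version_keys)
}
```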

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`normalize_spec` unconditionally allocated `(String, String)` —
including the ~99 % case where the spec has no `npm:` or
`workspace:` prefix and no normalisation is needed. ~5460 String
allocs per ant-design preload (2 per `resolve_package` call ×
2730 unique deps), all on resolver futures driven by main task's
cooperative polling.

Switch return type to `(Cow<'a, str>, Cow<'a, str>)`. Common path
returns `Cow::Borrowed` and pays zero allocations. `npm:` /
`workspace:` prefix paths still build the substring borrow without
allocating (they're already slices into the input). Callers (3
sites: traits/registry.rs, service/registry.rs, resolver/registry.rs)
work unchanged thanks to Cow's `Deref<Target=str>`.
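A sketch of the Cow-returning shape, assuming a simplified `npm:` rewrite rule (the real normalize_spec also handles `workspace:` and more spec forms):

```rust
use std::borrow::Cow;

// Common case returns Cow::Borrowed and pays zero allocations; the
// npm:alias@range path also returns borrowed sub-slices of the input.
fn normalize_spec<'a>(name: &'a str, spec: &'a str) -> (Cow<'a, str>, Cow<'a, str>) {
    if let Some(rest) = spec.strip_prefix("npm:") {
        match rest.rsplit_once('@') {
            Some((alias, range)) if !alias.is_empty() => {
                (Cow::Borrowed(alias), Cow::Borrowed(range))
            }
            _ => (Cow::Borrowed(rest), Cow::Borrowed("*")),
        }
    } else {
        // ~99% case: nothing to normalize.
        (Cow::Borrowed(name), Cow::Borrowed(spec))
    }
}
```

Callers keep working unchanged because `Cow<str>` derefs to `&str`.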

Diagnostic context: standalone manifest-bench cap=128 reaches
avg_conc=92 with the same reqwest stack; ruborist held at 55-58
even after Mutex / FullManifest / progress-bar / tracing /
keys.clone() eliminations. Allocator pressure on the resolver
worker pool — each per-future hot-path String alloc compounds
across 2700+ futures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Old design: main task owned `FuturesUnordered`, polled all preload
futures cooperatively, and ran every per-future continuation
(post-await body, completion handler, dispatch refill) on the same
single task. The deeper await chain inside `resolve_package`
(cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` +
`bytes` + parse `spawn_blocking`) made each future yield 5+ times,
and every yield round-tripped through main — saturating it. CI
ant-design preload sustained avg_conc=55-61 even after Mutex /
allocator hot-path eliminations, while the standalone manifest-bench
(same reqwest stack, no resolver) hit 92 at the same cap.

New design: N long-lived `tokio::spawn` workers pulling from a
shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker
owns an `Arc<R>` clone and runs `resolve_package` on tokio's global
executor — futures progress fully independently, no cooperative
poll bottleneck. Main task only drains an `mpsc::unbounded_channel`
of completions to fire receiver events + on_manifest callback.

Termination: workers track `dispatched`/`completed: AtomicUsize` and
park on a shared `Notify` when the queue is empty. When the last
completion makes `completed == dispatched` and the queue is empty,
the finishing worker raises a `shutdown` flag and wakes others; all
workers drop their result_tx clones, the channel closes, and the
main `recv().await` loop exits.
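The queue/counter/termination shape can be sketched with std threads standing in for tokio workers (the real code uses a lock-free `SegQueue`, `DashSet` dedup, and a `Notify` for parking; here a `Mutex<VecDeque>` and yield-spin keep the sketch self-contained, and `format!` stands in for `resolve_package`):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn preload(deps: Vec<String>, workers: usize) -> Vec<String> {
    let queue = Arc::new(Mutex::new(VecDeque::from(deps)));
    let dispatched = Arc::new(AtomicUsize::new(queue.lock().unwrap().len()));
    let completed = Arc::new(AtomicUsize::new(0));
    let (tx, rx) = mpsc::channel::<String>();

    for _ in 0..workers {
        let (queue, completed, dispatched, tx) =
            (queue.clone(), completed.clone(), dispatched.clone(), tx.clone());
        thread::spawn(move || loop {
            let dep = queue.lock().unwrap().pop_front();
            match dep {
                Some(d) => {
                    // Stand-in for resolve_package + completion event.
                    tx.send(format!("resolved:{d}")).unwrap();
                    completed.fetch_add(1, Ordering::SeqCst);
                }
                // Queue empty and every dispatch accounted for → shut down;
                // dropping this tx clone moves the channel toward closing.
                None if completed.load(Ordering::SeqCst)
                    == dispatched.load(Ordering::SeqCst) => break,
                None => thread::yield_now(),
            }
        });
    }
    drop(tx); // main holds no sender; recv ends once all workers exit
    rx.into_iter().collect() // main task only drains completions
}
```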

Trait surface change:
- `RegistryClient`'s default-method futures gained `+ Send` bounds
  (and `Self: Sync` where blanket-default fn calls into `&self`)
- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so tests
  can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call site
  in `run_preload_phase` clones the borrowed registry into a fresh
  `Arc`. Bound at every public surface up the chain bumped to
  `R: RegistryClient + Clone + Send + Sync + 'static`,
  `R::Error: Send`.
- `resolve_package` / `resolve_registry_dep` / `process_dependency`
  helper bounds gained `+ Sync` (their `R::Future: Send` bounds are
  inherited from the trait change above).

Local npmmirror smoke (cap=256 via DEFAULT_CONCURRENCY): avg_conc
jumped from ~55 (old) to 86.8 (new). Worker-pool delivers the
parallelism standalone manifest-bench was already showing.

Tests use `#[tokio::test(flavor = "multi_thread", worker_threads = 2)]`
since worker-pool needs spawn-able runtime; ruborist's
dev-dependencies on `tokio` add the `rt-multi-thread` feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool preload (ruborist ed7b551) sustains avg_conc=66 at
cap=96 on CI vs the prior FuturesUnordered's 58 — and same-run
standalone manifest-bench reached 93/2.14s at cap=128 with the
identical reqwest stack. With workers running independently on
tokio's global executor (no cooperative-poll serialisation through
one task), more cap slots translate directly to more parallel
TCP requests in flight.

The Cloudflare per-req throttle curve we measured under the old
architecture (per-req wall doubled at cap 128→256) was conflated
with the FuturesUnordered ceiling. With workers decoupled the
curve needs re-measurement; cap=128 is the cheapest experiment
that brings ruborist to standalone parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool sweep on CI ant-design p1_resolve:
  cap=96:  wall=2.23s avg_conc=66 per-req=53ms
  cap=128: wall=2.15s avg_conc=84 per-req=66ms
  → per-req drops with cap (refutes the FuturesUnordered-era
    "server throttle past 70 conc" reading; that was main-task
    saturation). Same-run standalone manifest-bench cap=192 hit
    130 conc / 2.10s, so cap=160 should bring another 0.1-0.2s
    out of preload before the curve flattens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool preload at cap=160 surfaced parse blocking-pool queue
saturation: parse diag showed `queue p95=200ms sum=70-89s` over
2730 manifests — ~26ms average queue wait per parse. That accounted
for the entire ruborist-vs-standalone per-req gap (55ms vs 28ms
under identical Cloudflare conditions).

Cause: blocking pool is sized to `worker_threads` (= num_cpus = 4 on
CI). Worker-pool preload sustains 80+ concurrent fetches; each
spawn_blocking parse goes into a 4-slot queue and waits behind
others. Original spawn_blocking offload was justified under
FuturesUnordered + main-task polling (would have stalled the single
poll loop), but worker-pool runs each future on tokio's global
executor — a brief 1-5ms sync CPU burst on a worker is cheaper than
spawn_blocking dispatch + queue wait.

Inline simd_json parse on the resolving worker. Each worker thread
parses its own response immediately after `bytes().await`; no extra
hop. Worker-pool's independent task scheduling means one stalled
worker doesn't starve the others — we just lose ~5ms of one
worker's cycle, which is far less than the dispatch-and-queue
round-trip we were paying.

Both fetch sites updated (`fetch_full_manifest` for npmjs full
manifest path, `fetch_version_manifest` for semver registries like
npmmirror).

Expected: ruborist preload per-req drops from 55-66ms → ~30-40ms
(matching standalone), wall toward ~1.7s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cap=160 + inline parse pushed avg_conc to 119 — past the
per-source Cloudflare throttle threshold. Per-req inflated
55 ms → 93 ms; net wall flat at 2.14s.

cap=128 + inline parse: avg_conc target ~85-95 (matching standalone
manifest-bench cap=128 = 70-90 / 1.6-2.0s under similar Cloudflare
conditions). Inline parse alone (no spawn_blocking queue) plus
sane concurrency should land preload at ~1.7-1.8s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`find_workspaces_from_pkg` was reading every workspace's package.json
sequentially in a `for path in matched_paths { read_package_json(...).await }`
loop. Ant-design has ~200 workspace packages; at ~1 ms per single-file
async FS round-trip on CI runners that's ~150-200 ms of serial I/O —
the largest unmeasured chunk between preload completion and lockfile
write (hyperfine total p1 minus instrumented sub-phases).

Collect workspace paths from every glob pattern first, then dispatch
all `read_package_json` calls into a `FuturesUnordered` for parallel
execution. Each read is small (typical workspace package.json < 4 KB)
so completion order is irrelevant — just push results as they land.
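The collect-then-fan-out pattern, sketched with std threads in place of async fs + `FuturesUnordered` (function name is illustrative):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;

// All workspace paths are gathered first, then every package.json read
// runs concurrently instead of one awaited read per loop iteration.
fn read_all_package_jsons(paths: &[&Path]) -> Vec<io::Result<String>> {
    thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || fs::read_to_string(p)))
            .collect();
        // Completion order is irrelevant; join in dispatch order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```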

Expected: ant-design p1_resolve hyperfine wall drops by 100-150 ms
(toward ~2.40s vs current 2.58s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
p1_resolve hyperfine still has ~80 ms of unmeasured wall after
parallel workspace reads (commit bf14995). Suspected: 2-3 MB
package-lock.json serialize + atomic-write-rename. Add per-step
timing log so we know which knob to turn (compact-json,
to_writer streaming, async fs::rename quirks, etc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add timer covering find_root_path → read root package.json → engines
inject → graph init → root edges → workspace discovery → workspace
nodes/edges. This is the chunk between hyperfine start and
build_deps entry — currently uninstrumented and the residual ~85ms
gap source after lockfile timing showed save is only 11ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linter-applied formatting cleanup, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original cap was sized for the FuturesUnordered preload that
dispatched 128 simd_json parses through `spawn_blocking` in a
burst — letting the default 512 cap run gave bimodal wall (M2:
2.7s fast / 6.9s thrash). Capping at `worker_threads` eliminated
the thrash peak.

After commit f3f616d (inline parse) preload no longer uses the
blocking pool. The dominant consumer is now `cloner.rs` during
the install phase: every file's hardlink / clonefile / copy goes
through `spawn_blocking`, ~50000 short syscalls per ant-design
install. Each syscall is near-instant, so the cap rarely
backpressures, but cap=4 on CI does limit how fast cloner can
fire syscalls in parallel.

Raise cap to `max(worker_threads * 4, 32)`: enough headroom for
cloner to keep multiple syscalls in flight, low enough that the
historical thrash regime (hundreds of churning threads) stays
avoided. Pool is per-runtime; idle threads die after 10s.

Expected: small p3_cold_install improvement (current utoo 5.74s
vs bun 7.71s); preload phase unchanged.
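The cap formula itself is a one-liner; a sketch of the sizing rule described above:

```rust
// Scale the blocking-pool cap with cores, but never drop below a floor
// that keeps cloner's short syscalls overlapped (32 here, per the
// max(worker_threads * 4, 32) rule).
fn blocking_pool_cap(worker_threads: usize) -> usize {
    std::cmp::max(worker_threads * 4, 32)
}
```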

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A/B test: replace `entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each`
with a plain sequential `for entry in &entries` loop. Each tarball
still runs in its own outer `rayon::spawn` task (cross-package
parallelism preserved); only the within-tarball write fan-out is
removed.

Goal: measure whether rayon's intra-package parallelism still earns
its keep after the worker-pool preload rewrite. Cross-package
parallelism alone may already saturate IO; if so, removing the
inner par_chunks cuts work-stealing futex traffic + thread sync
overhead with zero throughput cost.

If p3_cold_install regresses ≥0.3s → intra-package writes are
genuinely IO-bound across cores, restore par_chunks.
If p3 unchanged or improves → simpler sequential code wins.

This is a test commit. Will be reverted if regression measured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
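The A/B being measured reduces to this std-only sketch: scoped threads stand in for rayon's `par_chunks`, and the per-entry disk write is stubbed as a sum so the two variants are directly comparable. Names here are hypothetical.

```rust
// Variant B (this commit): plain sequential loop over entries.
fn process_sequential(entries: &[u32]) -> u32 {
    entries.iter().sum()
}

// Variant A (previous code): chunked fan-out, here via std scoped threads
// instead of rayon's par_chunks(WRITE_CHUNK_SIZE).
fn process_chunked_parallel(entries: &[u32]) -> u32 {
    std::thread::scope(|s| {
        let handles: Vec<_> = entries
            .chunks(32)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u32>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let entries: Vec<u32> = (0..1000).collect();
    // Both variants produce the same result; only scheduling overhead differs,
    // which is exactly what the p3_cold_install A/B is probing.
    assert_eq!(process_sequential(&entries), process_chunked_parallel(&entries));
}
```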
`clone_dir` (Linux hardlink/copy path) was using
`tokio::task::spawn_blocking` per package — at default cap=4 on CI,
only 4 packages cloned at once, each running all file hardlinks
sequentially internally. ~3500 packages × N files per install all
funneled through that bounded pool.

Switch to the same pattern `extractor.rs` already uses:
- `rayon::spawn` per package replaces `spawn_blocking` (cross-package
  parallelism via rayon work-stealing — global pool, not capped at
  worker_threads)
- `par_chunks(CLONE_CHUNK_SIZE)` for the inner hardlink/copy loop
  (intra-package fan-out across cores; same chunk size = 32 as
  extractor)

Trade-offs:
- EXDEV `force_copy` latch is now per-chunk instead of global per
  clone — chunks each rediscover cross-device errors and fall back
  locally. A few extra hardlink-then-copy round-trips at chunk
  boundaries, acceptable for the rare cross-device install.
- Pool unification: tokio blocking pool now mostly idle (just git +
  http tarball + a few one-shot commands), rayon handles all the
  high-volume IO. Cuts the 3-pool fragmentation observed earlier.

Tested:
- Iter 1 of this loop (cap bump from N to max(N*4, 32)): no p3 win,
  p4 regressed → cap raise alone wasn't the answer.
- Iter 2 (drop intra-package par_chunks in extractor): p3 +3.67s,
  σ exploded 0.04 → 2.85s → intra-package fan-out is essential.
- This commit applies the same fan-out to clone_dir for the same
  reason.

macOS `clonefile` path (target_os = "macos") unchanged — clonefile
is a single syscall per file, different perf profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
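A std-only sketch of the two-level fan-out described above: scoped threads stand in for `rayon::spawn` (outer, per package) and `par_chunks` (inner, per file chunk), and the hardlink/copy syscall is stubbed as an atomic counter. `clone_all` and the data shapes are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const CLONE_CHUNK_SIZE: usize = 32;

// Each inner Vec stands in for one package's file list.
fn clone_all(packages: &[Vec<u32>]) -> usize {
    let linked = AtomicUsize::new(0);
    std::thread::scope(|s| {
        for files in packages {
            let linked = &linked;
            // Outer fan-out: one task per package (rayon::spawn in the real code).
            s.spawn(move || {
                // Inner fan-out: chunked like par_chunks(CLONE_CHUNK_SIZE).
                std::thread::scope(|inner| {
                    for chunk in files.chunks(CLONE_CHUNK_SIZE) {
                        inner.spawn(move || {
                            for _file in chunk {
                                // Stands in for the hardlink / copy syscall.
                                linked.fetch_add(1, Ordering::Relaxed);
                            }
                        });
                    }
                });
            });
        }
    });
    linked.load(Ordering::Relaxed)
}

fn main() {
    let packages: Vec<Vec<u32>> = (0..8).map(|_| (0..100).collect()).collect();
    assert_eq!(clone_all(&packages), 800);
}
```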
- delete crates/manifest-bench (debug-only, never merged)
- tombi format crates/ruborist/Cargo.toml
- typos: unparseable → unparsable in bench/pm-bench.sh
@elrrrrrrr elrrrrrrr added benchmark Run pm-bench on PR bench-phases Run pm-bench-phases workflow labels Apr 27, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant performance optimizations to the dependency resolver and installer, targeting CPU overhead, memory allocations, and network efficiency. Key enhancements include refactoring manifest parsing to use lazy simd_json subtrees with memoization, integrating aws-lc-rs for faster TLS handshakes, and adopting lock-free queues for dependency management. The update also features a round-robin DNS resolver, parallelized workspace discovery, and fire-and-forget disk cache writes. Detailed diagnostic logging for HTTP and parsing has been added to monitor pipeline performance. Review feedback suggests further reducing allocations in the DNS rotation logic and addressing a thundering herd risk in manifest fetch deduplication.

Comment on lines +113 to +140
    fn rotate_addrs(addrs: &[SocketAddr], offset: usize) -> Vec<SocketAddr> {
        if addrs.is_empty() {
            return Vec::new();
        }
        let rotate = |slice: &[SocketAddr]| -> Vec<SocketAddr> {
            if slice.is_empty() {
                return Vec::new();
            }
            let start = offset % slice.len();
            slice[start..]
                .iter()
                .chain(&slice[..start])
                .copied()
                .collect()
        };
        let v6: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv6()).copied().collect();
        let v4: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv4()).copied().collect();
        let v6_rot = rotate(&v6);
        let v4_rot = rotate(&v4);
        // Preserve v6-first ordering if that's what the resolver gave us;
        // Happy Eyeballs will still prefer v6 when it's reachable.
        let v6_first = addrs.first().map(|a| a.is_ipv6()).unwrap_or(true);
        if v6_first {
            v6_rot.into_iter().chain(v4_rot).collect()
        } else {
            v4_rot.into_iter().chain(v6_rot).collect()
        }
    }
medium

The rotate_addrs function is on the hot path for connection establishment and currently performs up to 5 allocations per call (two for filtering families, two for rotating them, and one for the final collection). Given the PR's focus on reducing allocator pressure, this can be optimized to a single allocation by iterating over the input slice multiple times instead of creating intermediate vectors.

    fn rotate_addrs(addrs: &[SocketAddr], offset: usize) -> Vec<SocketAddr> {
        if addrs.is_empty() {
            return Vec::new();
        }
        let mut result = Vec::with_capacity(addrs.len());
        let v6_first = addrs[0].is_ipv6();

        let mut append_family = |is_v6: bool| {
            let family_count = addrs.iter().filter(|a| a.is_ipv6() == is_v6).count();
            if family_count == 0 {
                return;
            }
            let start_offset = offset % family_count;
            // First pass: from start_offset to end
            let mut current_count = 0;
            for addr in addrs.iter().filter(|a| a.is_ipv6() == is_v6) {
                if current_count >= start_offset {
                    result.push(*addr);
                }
                current_count += 1;
            }
            // Second pass: from beginning to start_offset
            current_count = 0;
            for addr in addrs.iter().filter(|a| a.is_ipv6() == is_v6) {
                if current_count < start_offset {
                    result.push(*addr);
                    current_count += 1;
                } else {
                    break;
                }
            }
        };

        if v6_first {
            append_family(true);
            append_family(false);
        } else {
            append_family(false);
            append_family(true);
        }
        result
    }

Comment on lines +234 to +249
    let shared = self
        .inflight
        .get_or_init(name.to_string(), || async {
            self.fetch_full_manifest_network(name).await.ok()
        })
        .await;

    match shared {
        Some(arc) => Ok((*arc).clone()),
        None => {
            // OnceMap clears the key on None, so the next caller
            // retries the fetch. Retry once here with a fresh error
            // so we surface a useful message to this caller.
            self.fetch_full_manifest_network(name).await
        }
    }

medium

This implementation introduces a thundering herd problem when a manifest fetch fails. The OnceMap::get_or_init closure returns None on failure (via .ok()), which typically causes the OnceMap to clear the entry. Consequently, all concurrent callers waiting for the same package will receive None and proceed to execute fetch_full_manifest_network simultaneously at line 247.

To fix this, consider storing a Result<Arc<FullManifestResult>, Arc<RegistryError>> in the OnceMap so that failures are also deduplicated and shared among all waiters.
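A hedged, synchronous sketch of that suggestion: cache the `Result` itself so a failure is shared with every waiter instead of each retrying. Types and names here are stand-ins — the real async code would store `Result<Arc<FullManifestResult>, Arc<RegistryError>>` in the `OnceMap`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in types: a manifest body and an error string.
type FetchResult = Result<Arc<String>, Arc<String>>;

struct ManifestCache {
    inner: Mutex<HashMap<String, FetchResult>>,
}

impl ManifestCache {
    // The first caller runs `fetch`; everyone else gets the cached Result,
    // including a cached *error* — no thundering herd on failure.
    // (A real implementation would evict cached errors after a while
    // so later installs can retry.)
    fn get_or_fetch(
        &self,
        name: &str,
        fetch: impl FnOnce() -> Result<String, String>,
    ) -> FetchResult {
        let mut map = self.inner.lock().unwrap();
        map.entry(name.to_string())
            .or_insert_with(|| fetch().map(Arc::new).map_err(Arc::new))
            .clone()
    }
}

fn main() {
    let cache = ManifestCache { inner: Mutex::new(HashMap::new()) };
    let mut calls = 0;
    let first = cache.get_or_fetch("left-pad", || { calls += 1; Err("503".into()) });
    let second = cache.get_or_fetch("left-pad", || { calls += 1; Err("503".into()) });
    assert!(first.is_err() && second.is_err());
    assert_eq!(calls, 1); // the failing fetch ran exactly once
}
```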

@github-actions

📊 pm-bench-phases · a263293 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.15s | 0.14s | 10.05s | 9.87s | 636M | 325.2K |
| utoo-npm | 12.65s | 2.31s | 11.33s | 13.56s | 1.27G | 163.7K |
| utoo | 8.80s | 1.08s | 10.25s | 11.93s | 2.11G | 253.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.1K | 17.1K | 1.17G | 6M | 1.83G | 1.72G | 1M |
| utoo-npm | 197.1K | 172.7K | 1.14G | 6M | 1.68G | 1.68G | 2M |
| utoo | 112.6K | 65.9K | 1.13G | 6M | 1.68G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.85s | 0.93s | 3.78s | 1.15s | 513M | 190.6K |
| utoo-npm | 5.59s | 0.13s | 5.02s | 1.77s | 431M | 73.4K |
| utoo | 4.27s | 0.05s | 4.13s | 2.06s | 1.37G | 169.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 11.3K | 3.6K | 202M | 3M | 104M | - | 1M |
| utoo-npm | 66.1K | 2.6K | 202M | 2M | 9M | 5M | 2M |
| utoo | 82.7K | 5.2K | 197M | 3M | 7M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 7.20s | 0.86s | 6.18s | 9.53s | 598M | 202.5K |
| utoo-npm | 7.56s | 1.39s | 5.54s | 11.04s | 932M | 117.6K |
| utoo | 9.74s | 0.57s | 5.50s | 10.93s | 840M | 110.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 5.6K | 6.7K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 120.8K | 78.0K | 965M | 3M | 1.67G | 1.67G | 2M |
| utoo | 142.4K | 67.0K | 966M | 5M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.52s | 0.08s | 0.16s | 2.51s | 134M | 31.0K |
| utoo-npm | 2.48s | 0.21s | 0.61s | 3.92s | 82M | 19.1K |
| utoo | 2.07s | 0.06s | 0.41s | 3.40s | 62M | 13.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 194 | 20 | 6K | 33K | 1.84G | 1.73G | 1M |
| utoo-npm | 50.3K | 21.7K | 19K | 15K | 1.67G | 1.67G | 2M |
| utoo | 16.4K | 9.4K | 17K | 25K | 1.68G | 1.67G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 25.70s | 6.24s | 9.35s | 9.92s | 553M | 397.7K |
| utoo-npm | 21.52s | 3.34s | 8.00s | 13.43s | 719M | 114.7K |
| utoo | 12.80s | 2.86s | 7.22s | 11.53s | 770M | 119.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 58.7K | 5.6K | 1.12G | 10M | 1.85G | 1.73G | 2M |
| utoo-npm | 236.0K | 103.1K | 977M | 8M | 1.67G | 1.68G | 2M |
| utoo | 153.3K | 59.3K | 983M | 8M | 1.67G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.52s | 0.09s | 4.04s | 1.12s | 655M | 188.4K |
| utoo-npm | 3.18s | 0.08s | 1.47s | 0.80s | 75M | 16.3K |
| utoo | 0.86s | 0.03s | 0.87s | 0.34s | 81M | 17.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 5.1K | 6.2K | 152M | 2M | 106M | - | 2M |
| utoo-npm | 44.4K | 1.1K | 12M | 2M | - | 4M | 2M |
| utoo | 16.4K | 321 | 16M | 2M | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 18.30s | 1.06s | 5.85s | 8.86s | 248M | 99.3K |
| utoo-npm | 21.92s | 0.87s | 6.24s | 12.20s | 714M | 88.9K |
| utoo | 18.95s | 6.97s | 5.85s | 10.98s | 667M | 84.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 36.7K | 3.5K | 998M | 7M | 1.73G | 1.73G | 2M |
| utoo-npm | 195.2K | 110.2K | 965M | 6M | 1.67G | 1.67G | 2M |
| utoo | 135.8K | 61.1K | 966M | 6M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.19s | 0.18s | 0.20s | 2.40s | 136M | 31.6K |
| utoo-npm | 2.49s | 0.20s | 0.60s | 3.91s | 82M | 19.6K |
| utoo | 2.19s | 0.10s | 0.42s | 3.45s | 62M | 13.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 383 | 22 | 7M | 41K | 1.88G | 1.72G | 2M |
| utoo-npm | 48.2K | 20.7K | 59K | 11K | 1.67G | 1.67G | 2M |
| utoo | 16.7K | 8.9K | 62K | 12K | 1.67G | 1.67G | 2M |

@github-actions

📊 pm-bench-phases · a263293 · mac (macos-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 14.72s | 0.20s | 5.56s | 14.61s | 794M | 51.3K |
| utoo-npm | 13.94s | 0.51s | 7.49s | 14.74s | 900M | 98.3K |
| utoo | 13.62s | 1.37s | 6.95s | 14.11s | 1.96G | 173.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.0K | 142.6K | - | - | 1.76G | 1.91G | 1M |
| utoo-npm | 13.0K | 362.5K | - | - | 1.63G | 1.86G | 2M |
| utoo | 9.8K | 224.6K | - | - | 1.63G | 1.86G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.28s | 0.02s | 2.46s | 1.00s | 478M | 31.2K |
| utoo-npm | 4.67s | 0.21s | 3.80s | 1.74s | 546M | 37.4K |
| utoo | 4.31s | 0.10s | 3.55s | 1.98s | 1.62G | 106.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 10 | 25.5K | - | - | 110M | - | 1M |
| utoo-npm | 16 | 77.2K | - | - | 28M | 5M | 2M |
| utoo | 36 | 91.5K | - | - | 27M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 14.79s | 4.42s | 3.17s | 14.23s | 520M | 33.9K |
| utoo-npm | 11.87s | 3.13s | 3.25s | 12.93s | 823M | 80.7K |
| utoo | 10.21s | 0.42s | 3.08s | 12.75s | 714M | 79.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 5.4K | 133.0K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.4K | 235.7K | - | - | 1.60G | 1.87G | 2M |
| utoo | 1.3K | 156.6K | - | - | 1.60G | 1.87G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 4.17s | 0.87s | 0.10s | 2.06s | 52M | 3.9K |
| utoo-npm | 2.91s | 0.13s | 0.48s | 2.47s | 88M | 6.6K |
| utoo | 2.87s | 0.40s | 0.31s | 2.16s | 82M | 5.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.7K | 1.4K | - | - | 1.86G | 1.91G | 1M |
| utoo-npm | 12.4K | 70.8K | - | - | 1.60G | 1.85G | 2M |
| utoo | 13.2K | 19.1K | - | - | 1.63G | 1.85G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 25.29s | 0.80s | 5.66s | 14.83s | 583M | 37.7K |
| utoo-npm | 24.58s | 5.15s | 6.13s | 16.64s | 734M | 75.2K |
| utoo | 15.90s | 2.48s | 4.93s | 14.12s | 687M | 74.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.3K | 150.5K | - | - | 1.79G | 1.89G | 2M |
| utoo-npm | 4.0K | 434.8K | - | - | 1.61G | 1.84G | 2M |
| utoo | 4.4K | 256.9K | - | - | 1.61G | 1.87G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.54s | 0.13s | 2.35s | 1.14s | 534M | 34.8K |
| utoo-npm | 4.82s | 0.04s | 2.25s | 1.34s | 81M | 5.9K |
| utoo | 7.05s | 8.48s | 1.40s | 0.57s | 82M | 6.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 8 | 30.2K | - | - | 111M | - | 2M |
| utoo-npm | 5 | 41.8K | - | - | - | 4M | 2M |
| utoo | 30 | 25.3K | - | - | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 21.42s | 0.84s | 3.33s | 14.06s | 251M | 16.7K |
| utoo-npm | 27.86s | 0.88s | 4.54s | 15.20s | 650M | 72.4K |
| utoo | 25.61s | 6.95s | 4.03s | 13.92s | 747M | 72.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 1.9K | 137.0K | - | - | 1.70G | 1.94G | 2M |
| utoo-npm | 1.5K | 374.5K | - | - | 1.61G | 1.87G | 2M |
| utoo | 1.3K | 230.6K | - | - | 1.61G | 1.87G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.97s | 0.49s | 0.10s | 1.96s | 50M | 3.8K |
| utoo-npm | 3.65s | 0.01s | 0.52s | 2.64s | 97M | 7.1K |
| utoo | 3.59s | 0.32s | 0.33s | 2.29s | 87M | 6.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 13.8K | 1.1K | - | - | 1.87G | 1.91G | 2M |
| utoo-npm | 12.3K | 72.3K | - | - | 1.61G | 1.83G | 2M |
| utoo | 13.3K | 19.9K | - | - | 1.61G | 1.83G | 2M |
