perf(pm): sequential writes within tarball extraction#2818

Draft
elrrrrrrr wants to merge 104 commits into next from
perf/extract-sequential-writes

Conversation


@elrrrrrrr elrrrrrrr commented Apr 21, 2026

Summary

Started as "intra-package sequential tarball writes" (single-line perf tweak) and evolved into a multi-week, data-driven preload-resolver overhaul plus a complete bench infrastructure. This PR is now too large to review safely and is being split — see "PR split plan" at the bottom.

This description captures the full exploration journey so subsequent split-PRs can reference it as context.

End-to-end results (ant-design / npmjs.org / GitHub Actions ubuntu)

| Phase | bun | utoo (start) | utoo (now) | utoo Δ | vs bun |
|---|---|---|---|---|---|
| p0_full_cold | 9.07-9.32s | ~14s | 8.89-9.06s | −5s | utoo wins / matches |
| p1_resolve | 1.98-2.34s | 5.78s | 2.65s | −54% | gap +0.31-0.67s |
| p3_cold_install | 7.71-7.96s | ~9s | 5.74s | −36% | utoo +2s ahead |
| p4_warm_link | 3.42-3.51s | ~3s | 1.98s | −34% | utoo +1.5s ahead |

utoo now matches or beats bun on 3 of 4 phases. The remaining p1 gap (~0.4s) is fully accounted for and sits at the architectural floor.

Journey timeline — p1_resolve preload wall

5.78s baseline
  → 4.47s  eager Versions parse → kept the parse, refactored architecture
  → 4.09s  inline extract_transitive_deps + direct deserialize
  → 3.82s  lock-free SegQueue pending queue (was Mutex<VecDeque>)
  → 3.30s  rustls + aws-lc-rs (replaces ring; CCS→AppData 154ms→17ms)
  → 3.10s  tracing file_filter info+ (drop debug! eval cost)
  → 2.96s  Arc<FullManifest> in cache (no more deep clone)
  → 2.65s  drop indicatif Mutex from per-completion path
  → 2.15s  worker-pool replaces FuturesUnordered ⭐ biggest single win
  → 2.65s  current (Cloudflare variance: best 2.04s, typical 2.65s incl BFS+lockfile)

Key architectural moves

1. Worker-pool preload (commit ed7b551e) — the core win

Problem: FuturesUnordered polled all preload futures from a single main task. Each resolve_package call had 5+ awaits (cache check + OnceMap::get_or_init + RetryIf + request.send + bytes + parse). Every yield round-tripped through main, saturating it. Even after killing every Mutex/clone hot-path, avg_conc held at 55-60 while standalone manifest-bench (same reqwest stack, no resolver) sustained 92.

Fix: N long-lived tokio::spawn workers pulling work from Arc<SegQueue<Dep>> with DashSet dedup. Workers run on tokio's global executor independently; main task only drains an mpsc::unbounded_channel for receiver events + on_manifest callback. Termination via dispatched/completed: AtomicUsize + Notify.

Trait surface: RegistryClient futures gained + Send bounds, MockRegistryClient derives Clone. preload_manifests takes Arc<R> instead of &R. Bound R: Clone + Send + Sync + 'static, R::Error: Send propagated up the API chain.

Result: avg_conc 55 → 84 (CI), wall 3.10s → 2.15s (-31%).
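The worker-pool shape can be sketched with std threads (a simplification: the real implementation uses N `tokio::spawn` workers over a lock-free `SegQueue` with `DashSet` dedup and `Notify` parking; `Pool`, `preload`, and the worker count of 4 here are illustrative):

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

// Shared state: pending queue, dedup set, and the dispatched/completed
// counters that drive termination.
struct Pool {
    queue: Mutex<VecDeque<String>>,
    seen: Mutex<HashSet<String>>,
    dispatched: AtomicUsize,
    completed: AtomicUsize,
}

impl Pool {
    fn push(&self, dep: String) {
        let fresh = self.seen.lock().unwrap().insert(dep.clone());
        if fresh {
            self.dispatched.fetch_add(1, Ordering::SeqCst);
            self.queue.lock().unwrap().push_back(dep);
        }
    }
}

fn preload(roots: Vec<String>, resolve: impl Fn(&str) -> Vec<String> + Send + Sync) -> usize {
    let pool = Pool {
        queue: Mutex::new(VecDeque::new()),
        seen: Mutex::new(HashSet::new()),
        dispatched: AtomicUsize::new(0),
        completed: AtomicUsize::new(0),
    };
    for r in roots {
        pool.push(r);
    }
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| loop {
                let next = pool.queue.lock().unwrap().pop_front();
                match next {
                    Some(dep) => {
                        // transitive prefetch: discovered deps feed the queue
                        for child in resolve(&dep) {
                            pool.push(child);
                        }
                        pool.completed.fetch_add(1, Ordering::SeqCst);
                    }
                    None => {
                        // queue drained: done only once every dispatched dep completed
                        if pool.completed.load(Ordering::SeqCst)
                            == pool.dispatched.load(Ordering::SeqCst)
                        {
                            break;
                        }
                        thread::yield_now();
                    }
                }
            });
        }
    });
    pool.completed.load(Ordering::SeqCst)
}
```

The `dispatched`/`completed` pair is the same termination protocol as in the PR: an empty queue alone doesn't mean done, because an in-flight resolve may still push children.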

2. HTTP stack ceiling — empirically verified (crates/manifest-bench)

Built a standalone HTTP-only fetch tool that strips out everything ruborist does on top of network: BFS, dedup, parse, project cache, lockfile. Dispatches identical workload through identical reqwest+rustls+tokio stack.

Key data point (CI ant-design npmjs.org cap=128, controlled for same Cloudflare conditions):

  • standalone manifest-bench wall ≈ 2.10-2.30s, avg_conc 89-95
  • ruborist preload wall ≈ 2.04-2.15s, avg_conc 84

ruborist now matches the HTTP stack ceiling. Further preload speedup requires a non-reqwest stack (HTTP/3, custom hyper Connector, etc.). This empirically ruled out ~10 candidate optimizations as not being the bottleneck.

3. Allocator hot-path cleanup

Each resolve_package call allocated several Strings on the resolver hot path. Cumulatively ~10k allocs per ant-design preload, all on workers competing for shared allocator. Eliminated via:

  • normalize_spec returns Cow<'_, str> instead of (String, String) (commit 12d34dd4)
  • Arc<FullManifest> in MemoryCache — clone is atomic bump not deep HashMap clone (commit 337d4f26)
  • versions.keys.clone() removed from cache-hit path — pass borrow directly (commit bb256ecd)
  • format!() dedup key kept (testing removal broke transitive prefetch)
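The `Cow` change can be sketched as follows (simplified: the real `normalize_spec` also splits name/spec and handles `workspace:` prefixes):

```rust
use std::borrow::Cow;

// The ~99% no-prefix path borrows the input; only `npm:`-aliased specs
// allocate. Previously every call returned owned Strings.
fn normalize_spec(spec: &str) -> Cow<'_, str> {
    match spec.strip_prefix("npm:") {
        Some(aliased) => Cow::Owned(aliased.to_string()),
        None => Cow::Borrowed(spec),
    }
}
```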

4. Inline parse (drop spawn_blocking)

Worker-pool surfaced parse blocking-pool queue saturation: queue p95=200ms sum=70-89s over 2730 manifests. Cap=4 on CI was funneling all parses through a 4-slot queue.

Fix: inline simd_json parse on the worker (commit f3f616d8). 1-5ms CPU per manifest is acceptable on async worker; eliminates dispatch + queue overhead.

5. Workspace discovery parallelization

find_workspaces_from_pkg was reading 200+ workspace package.json files sequentially in a for loop. Replaced with FuturesUnordered (commit bf149957). Saved 150ms (200×1ms serial → ~10ms parallel).
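A thread-scope sketch of the same fan-out (the real change uses `FuturesUnordered` on tokio, but the shape is identical: launch all reads at once instead of awaiting each file in a for-loop; `read_all` is a hypothetical helper name):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

// Spawn one read per path, then join in order so results stay aligned
// with the input path list.
fn read_all(paths: &[PathBuf]) -> io::Result<Vec<String>> {
    thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || fs::read_to_string(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```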

6. tarball extraction + cloner — install path

  • extractor.rs: rayon::spawn per package + par_chunks(32) intra-package writes. Verified essential by testing removal — sequential write regressed p3 +3.67s, σ exploded 0.04→2.85.
  • cloner.rs: kept on tokio::task::spawn_blocking. Verified essential by testing rayon migration — that regressed p3 +2.65s due to oversubscription with extractor's rayon pool. The current tokio::blocking + cap=worker_threads is the local optimum for hardlink (single short syscall, no fan-out benefit).
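The chunked-write idea, sketched with std threads (the real extractor uses rayon's `par_chunks(32)` inside a per-package `rayon::spawn`; `write_chunked` and the entry tuple shape are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

const WRITE_CHUNK_SIZE: usize = 32;

// Each task writes a contiguous run of entries sequentially, cutting the
// task count (and work-stealing traffic) by the chunk factor versus
// one task per file.
fn write_chunked(entries: &[(String, Vec<u8>)], write: &(impl Fn(&str, &[u8]) + Sync)) {
    thread::scope(|s| {
        for chunk in entries.chunks(WRITE_CHUNK_SIZE) {
            s.spawn(move || {
                for (path, buf) in chunk {
                    write(path, buf); // sequential within one chunk
                }
            });
        }
    });
}
```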

Failed experiments (kept here so future attempts don't repeat)

All reverted. Each tested with CI bench data.

| Experiment | Hypothesis | Result | Commits |
|---|---|---|---|
| cap=256 | More concurrency = faster wall | +0.65s, per-req doubled | 25814552, reverted by c610a582 |
| cap=64 | Avoid Cloudflare throttle | wall same, per-req halved (cancels) | 02ef0562 |
| Per-IP multi-client | Cloudflare throttles per-destination | No change at avg_conc | 4e125908 / ae2a6088 / 5a897e4a |
| HTTP/2 negotiate | Multiplex many streams over fewer conns | wall identical to H1 (Cloudflare throttle is per-source-IP) | 1a16d25e, reverted by f379f7a9 |
| dedup-by-name only | Skip same-name spec duplicates | preload wall down BUT BFS exploded +5s (lost transitive prefetch) | de5c83ed, reverted by 90e421a5 |
| max_blocking_threads = N*4 | Cloner cap=4 was throttle | p3 unchanged, p4 +0.35s (cold-pool spawn cost) | 132ef36e, reverted by 9a071f29 |
| Drop intra-package par_chunks in extractor | Cross-package parallelism enough | p3 +3.67s catastrophic | c7c847d6, reverted by 2f1092c3 |
| Migrate cloner to rayon | Pool unification | p3 +2.65s (oversubscription with extractor rayon pool) | 9229e160, reverted by e38329c8 |

Cloudflare per-source-IP throttle — measured

Standalone manifest-bench cap sweep (cap=32/64/96/128/192/256):

  • per-req wall grows with cap: 30ms → 38ms → 53ms → 70ms → 107ms → 146ms
  • sum more than doubles between cap=128 and cap=256
  • avg_conc plateau at cap≈128 around 90 effective concurrent

Conclusion: npmjs's Cloudflare frontend throttles on a per-source-IP basis (CI runner egress). Cap=128 is the sweet spot for our setup. Going wider triggers per-req inflation that cancels the parallelism gain.

This refuted the original "raise cap aggressively" intuition and 6 cap-sweep experiments.

Diagnostic infrastructure (kept, valuable for future work)

  • HTTP diag (5e3c12d2): per-preload wall / busy / sum / avg_conc / p50/p95/max / cpu_tail
  • parse diag (6e0e60e5): spawn_blocking queue p50/p95 + exec p50/p95
  • manifest-bench tool (b92aa81f): standalone HTTP-only A/B against ruborist
  • manifest-bench CI step (f6846d6a): every CI run produces standalone control numbers
  • Phase timing (preload, build_deps, serialize, project_cache_save, save_package_lock, setup) — full account of every ms in p1
  • bulk-readdir disk ETag index (88a4b056): warm 304 path restored without per-package syscall storm

Total wall account (p1_resolve, latest CI)

Setup phase (workspace + graph init):    0.6ms ← workspace 200x parallel reads
Preload phase (HTTP):                  2.15s ← at HTTP stack ceiling
Build deps (= preload + BFS resolve):  2.41s → BFS resolve = 0.26s
Serialize graph to lockfile:            12ms
Project cache export + save:            40ms
Save package-lock.json:                 11ms (serialize 10 / write 0 / rename 0)
─────────────────────────────────────────────
Sum of instrumented:                   2.51s
Hyperfine total p1:                    2.65s
Process startup (tokio init + native_certs + clap):  ~140ms ← Rust binary fixed cost

Remaining optimization space (out of this PR's scope)

  1. BFS layer parallelization (~0.26s → ~0.13s achievable)
  2. Process startup ~140ms (rustls-native-certs lazy load, reqwest::Client deferred build)
  3. HTTP/3 / 0-RTT TLS (~50-100ms TCP+TLS handshake savings)
  4. RSS reduction: FullManifest raw bytes + index instead of OwnedValue tree (1.46GB → ~700MB)

None are small enough to fit this PR.

PR split plan

Per discussion, this PR is being decomposed into 4-5 focused PRs:

| # | Theme | Risk | Status |
|---|---|---|---|
| 1 | bench-phase infra + manifest-bench tool + PR auto-comment | low | TBD |
| 2 | install path: extractor sequential writes + chunked parallel writes | low | TBD |
| 3 | manifest cache & alloc cleanup (Arc, normalize_spec Cow, aws-lc-rs, bulk-readdir, tracing filter, etc.) | medium | TBD |
| 4 | preload worker-pool architecture rewrite (+ wasm CI fix) | high | TBD |
| 5 | (optional) config tweaks | low | TBD |

This PR will close once 1-4 land. Track in subsequent PRs.

Test plan

  • cargo fmt + cargo clippy --all-targets -- -D warnings --no-deps clean across crates
  • cargo test -p utoo-pm 245 passed
  • cargo test -p utoo-ruborist 164 + 10 doctests passed
  • pm-bench-phases CI run >50 times, results documented above
  • Standalone manifest-bench CI integrated for control measurements

🤖 Generated with Claude Code

Replace intra-package `par_iter` with a sequential loop when writing
extracted tar entries to disk. Each tar entry is typically small and
writes complete in microseconds, so splitting them into rayon tasks
was causing heavy work-stealing (futex park/unpark) and dominating
context switches on large dep graphs. Cross-package parallelism is
preserved by the outer `rayon::spawn` in `extract_tarball`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added the benchmark Run pm-bench on PR label Apr 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the tarball extraction logic in crates/pm/src/util/extractor.rs to process entries sequentially instead of in parallel. This change aims to reduce excessive context switching caused by rayon work-stealing on large dependency graphs, while maintaining cross-package parallelism. Feedback suggests consuming the entries collection during iteration to optimize memory usage by dropping file buffers immediately after they are written.

Comment thread: crates/pm/src/util/extractor.rs (Outdated)
// Write files sequentially. Cross-package parallelism is handled by the outer
// rayon::spawn; splitting individual files into rayon tasks caused excessive
// work-stealing ctx switches on large dep trees.
for entry in &entries {
medium

Since entries is not used after this loop, you can consume it by using for entry in entries instead of iterating by reference. This allows each ExtractedEntry (and its potentially large content buffer) to be dropped immediately after it is written to disk, which can significantly reduce the peak memory usage during the extraction of large packages.

Suggested change:

```diff
-for entry in &entries {
+for entry in entries {
```

elrrrrrrr and others added 6 commits April 21, 2026 23:21
- Cold bench: drop `| tail -1` so hyperfine's full summary (mean,
  stddev, range) reaches the log. Failure detection now uses exit
  status instead of piping.
- `BENCH_WARM_RUNS=0` skips the warm phase entirely (previously the
  warm function always ran and hyperfine would reject --runs 0).
- Result aggregator tolerates empty or malformed export-json files
  (e.g. when a PM's cold install fails): the offending file is
  reported and skipped instead of crashing the whole summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the sequential `for` loop over extracted tar entries with
`par_chunks(WRITE_CHUNK_SIZE)` — each rayon task writes a contiguous
run of 32 files sequentially. This retains multi-core IO overlap for
large packages while cutting the rayon task count (and its work-
stealing futex traffic) by the chunk factor versus a per-file
par_iter. Cross-package parallelism is preserved by the outer
rayon::spawn in extract_tarball.

Local (macOS, antd-test, 3 runs avg):
  before par_iter: wall 17.2s  sys 6.18s  ivcsw 208k
  for-loop:        wall 15.3s  sys 2.36s  ivcsw  61k
  par_chunks(32):  wall 13.9s  sys 5.77s  ivcsw 191k

chunks wins wall but loses the ctx-switch reduction relative to the
pure sequential version; CI with a large dep graph (ant-design-x)
is the authoritative measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accumulate wall microseconds for download, extract, and clone across
all packages during install. Print a one-line summary alongside the
existing `added / reused / downloaded` counts, e.g.

  + 513 added · 3017 reused · 123 downloaded
    download 135.8s · extract 2.3s · clone 0.4s · 19.0 MB fetched

The sums are non-exclusive across cores: dividing by wall clock
gives the effective concurrency for each phase, and the ratio
between phases shows where cold-install CPU time actually lands.
Overhead is three atomics per downloaded tarball.

Local antd-test (macOS, npmmirror, 77 packages, wall 16s): download
dominates 98% of the CPU budget, extract 1.6%, clone 0.3% — reshapes
where we should look for cold-install wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needed so the per-phase timings line (`download · extract · clone · bytes`)
printed at the end of each install reaches the CI log. Trade-off is noisier
logs — registry INFO/WARN lines come through — but that's the price for
visibility into where cold-install CPU actually lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Separates three independent measurements for utoo vs bun so each
phase's improvement can be judged on its own baseline:

  Phase 1 · resolve     utoo deps          / bun install --lockfile-only
  Phase 3 · cold install utoo install      / bun install   (empty cache)
  Phase 4 · warm link    utoo install      / bun install   (cache warm)

Phase 3 uses the lockfile generated by phase 1, with cache reset
between iterations. Phase 4 resets only node_modules so only the
cache → node_modules link step is measured.

Uses hyperfine --show-output so utoo's phase-timings line
(`download · extract · clone · bytes`) reaches the CI log alongside
the wall-clock summary.

Triggered via workflow_dispatch with configurable project / registry
/ runs. Defaults to ant-design against npmjs.org, 3 runs per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anch merge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added the bench-phases Run pm-bench-phases workflow label Apr 22, 2026
Previous inline `bash -c` prepare was silently a no-op on CI: utoo's run 2/3
showed '3280 reused' meaning the cache wasn't actually cleared, and bun hit
InvalidNPMLockfile because utoo's package-lock.json leaked across
iterations.

Now each phase writes a dedicated prepare shell script per-PM that:
- always drops node_modules (incl. workspace package trees),
- clears exactly the lockfiles that would confuse this PM,
- wipes the right cache for this phase,
- prints a '[prep]' line so the CI log proves prepare ran.

Also factored out seed_for_phase so lockfile / cache warmup happens once
before the benchmark, not leaking into the measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added bench-phases Run pm-bench-phases workflow and removed bench-phases Run pm-bench-phases workflow labels Apr 22, 2026
…che wipe

Path-based rm -rf of $HOME/.cache/nm wasn't actually emptying the cache
on the CI runner — utoo runs 2/3 of phase 3 still showed '3280 reused',
wall was 0.8-1.1s instead of the 10s cold-install baseline, hyperfine
itself warned about caches not being filled until after run 1.

Let each PM clean its own cache via its CLI so we don't rely on
guessing where it stores things.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added bench-phases Run pm-bench-phases workflow and removed bench-phases Run pm-bench-phases workflow labels Apr 22, 2026
`utoo clean` / `bun pm cache rm` didn't empty the cache on the CI
runner either — so now use explicit bench-local paths the rm -rf
prepare can guarantee to wipe:

  utoo: --cache-dir=/tmp/utoo-bench-cache on every invocation
  bun:  BUN_INSTALL_CACHE_DIR=/tmp/bun-bench-cache (env var)

Gets us deterministic cold/warm state between hyperfine iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added bench-phases Run pm-bench-phases workflow and removed bench-phases Run pm-bench-phases workflow labels Apr 22, 2026
Drop into diagnostic mode to figure out why hyperfine's --prepare
still leaves utoo's cache intact across iterations despite the
explicit --cache-dir. Prints the generated prepare script, and logs
each per-iteration invocation's before/after du -sh of both caches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added bench-phases Run pm-bench-phases workflow and removed bench-phases Run pm-bench-phases workflow labels Apr 22, 2026
The `case $phase in p1) p3) p4)`-style patterns never matched
against actual phase strings like "p1_resolve" / "p3_cold_install" /
"p4_warm_link". Result: write_prepare produced a script containing
only the common header and no phase-specific cache-wipe logic, so
every run after the first hit a warm cache and timings collapsed.

Same off-by-name bug in seed_for_phase: "p3:utoo" pattern never
matched "p3_cold_install:utoo", skipping lockfile seeding and
warm-cache priming. Switched both to "p*_*" globs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added bench-phases Run pm-bench-phases workflow and removed bench-phases Run pm-bench-phases workflow labels Apr 22, 2026
The cache-size before/after logs + generated-script dumps were
diagnostic scaffolding used to trace the p* vs p*_resolve pattern
mismatch. With that fixed, keep the plain hyperfine --prepare
invocation so CI logs are readable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added bench-phases Run pm-bench-phases workflow and removed bench-phases Run pm-bench-phases workflow labels Apr 22, 2026
…time

Each hyperfine iteration now runs inside a metrics wrapper that greps
/usr/bin/time -v output for RSS, voluntary/involuntary context switches,
page faults, and IO read/write counts. Per-PM per-phase averages across
the 3 runs are shown alongside the wall-clock table so we can see, e.g.,
whether utoo's resolve phase costs more syscalls than bun's, or whether
its warm-link advantage comes at a memory cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr removed the bench-phases Run pm-bench-phases workflow label Apr 22, 2026
elrrrrrrr and others added 7 commits April 25, 2026 20:24
Linter-applied formatting cleanup, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original cap was sized for the FuturesUnordered preload that
dispatched 128 simd_json parses through `spawn_blocking` in a
burst — letting the default 512 cap run gave bimodal wall (M2:
2.7s fast / 6.9s thrash). Capping at `worker_threads` eliminated
the thrash peak.

After commit f3f616d (inline parse) preload no longer uses the
blocking pool. The dominant consumer is now `cloner.rs` during
the install phase: every file's hardlink / clonefile / copy goes
through `spawn_blocking`, ~50000 short syscalls per ant-design
install. Each syscall is near-instant, so the cap rarely
backpressures, but cap=4 on CI does limit how fast cloner can
fire syscalls in parallel.

Raise cap to `max(worker_threads * 4, 32)`: enough headroom for
cloner to keep multiple syscalls in flight, low enough that the
historical thrash regime (hundreds of churning threads) stays
avoided. Pool is per-runtime; idle threads die after 10s.
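The cap formula, as a one-line sketch (`blocking_cap` is a hypothetical helper name):

```rust
// max(worker_threads * 4, 32): headroom for the cloner's syscall fan-out,
// while keeping the historical hundreds-of-churning-threads regime out of reach.
fn blocking_cap(worker_threads: usize) -> usize {
    (worker_threads * 4).max(32)
}
```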

Expected: small p3_cold_install improvement (current utoo 5.74s
vs bun 7.71s); preload phase unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A/B test: replace `entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each`
with a plain sequential `for entry in &entries` loop. Each tarball
still runs in its own outer `rayon::spawn` task (cross-package
parallelism preserved); only the within-tarball write fan-out is
removed.

Goal: measure whether rayon's intra-package parallelism still earns
its keep after the worker-pool preload rewrite. Cross-package
parallelism alone may already saturate IO; if so, removing the
inner par_chunks cuts work-stealing futex traffic + thread sync
overhead with zero throughput cost.

If p3_cold_install regresses ≥0.3s → intra-package writes are
genuinely IO-bound across cores, restore par_chunks.
If p3 unchanged or improves → simpler sequential code wins.

This is a test commit. Will be reverted if regression measured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`clone_dir` (Linux hardlink/copy path) was using
`tokio::task::spawn_blocking` per package — at default cap=4 on CI,
only 4 packages cloned at once, each running all file hardlinks
sequentially internally. ~3500 packages × N files per install all
funneled through that bounded pool.

Switch to the same pattern `extractor.rs` already uses:
- `rayon::spawn` per package replaces `spawn_blocking` (cross-package
  parallelism via rayon work-stealing — global pool, not capped at
  worker_threads)
- `par_chunks(CLONE_CHUNK_SIZE)` for the inner hardlink/copy loop
  (intra-package fan-out across cores; same chunk size = 32 as
  extractor)

Trade-offs:
- EXDEV `force_copy` latch is now per-chunk instead of global per
  clone — chunks each rediscover cross-device errors and fall back
  locally. A few extra hardlink-then-copy round-trips at chunk
  boundaries, acceptable for the rare cross-device install.
- Pool unification: tokio blocking pool now mostly idle (just git +
  http tarball + a few one-shot commands), rayon handles all the
  high-volume IO. Cuts the 3-pool fragmentation observed earlier.

Tested:
- Iter 1 of this loop (cap bump from N to max(N*4, 32)): no p3 win,
  p4 regressed → cap raise alone wasn't the answer.
- Iter 2 (drop intra-package par_chunks in extractor): p3 +3.67s,
  σ exploded 0.04 → 2.85s → intra-package fan-out is essential.
- This commit applies the same fan-out to clone_dir for the same
  reason.

macOS `clonefile` path (target_os = "macos") unchanged — clonefile
is a single syscall per file, different perf profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr added a commit that referenced this pull request Apr 25, 2026
The headline architectural change of #2818. ruborist's preload
phase shifts from a single-task `FuturesUnordered` cooperative
poller to N long-lived `tokio::spawn` workers (or
`wasm_bindgen_futures::spawn_local` on wasm32 where Send isn't
satisfied). Stacks on top of #2826.

## Why

Old design: main task owned `FuturesUnordered`, polled all
preload futures cooperatively, and ran every per-future
continuation (post-await body, completion handler, dispatch
refill) on the same single task. The deeper await chain inside
`resolve_package` (cache check + `OnceMap::get_or_init` +
`RetryIf` + `request.send` + `bytes` + parse spawn_blocking)
made each future yield 5+ times, and every yield round-tripped
through main — saturating it. CI ant-design preload sustained
avg_conc=55-61 even after Mutex / allocator hot-path
eliminations, while the standalone manifest-bench (same reqwest
stack, no resolver — see #2824) hit 92 at the same cap.

## How

N long-lived `tokio::spawn` workers pulling from a shared
lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker
owns an `Arc<R>` clone and runs `resolve_package` on tokio's
global executor — futures progress fully independently, no
cooperative poll bottleneck. Main task only drains an
`mpsc::unbounded_channel` of completions to fire receiver events
+ on_manifest callback.

Termination: workers track `dispatched` / `completed:
AtomicUsize` and park on a shared `Notify` when the queue is
empty. When the last completion makes `completed == dispatched`
and the queue is empty, the finishing worker raises a `shutdown`
flag and wakes others; all workers drop their result_tx clones,
the channel closes, and the main `recv().await` loop exits.

## Trait surface change

- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so
  tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call
  site in `run_preload_phase` clones the borrowed registry into
  a fresh `Arc`. Bound at every public surface up the chain
  bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync +
  'static`, `R::Error: MaybeSend`. The `MaybeSend` /
  `MaybeSync` shims (added in #2826) keep the trait surface
  wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking`
  in `service/manifest.rs`. Worker-pool surfaced parse blocking-
  pool queue saturation: `queue p95=200ms sum=70-89s` over 2730
  manifests on cap=4 CI runners. Inline parse on the worker
  thread eliminates dispatch + queue overhead; 1-5ms CPU per
  manifest is acceptable on async worker.
- **Workspace package.json parallel reads** — `find_workspaces_from_pkg`
  switched from sequential `for path in matched_paths { read }`
  loop to `FuturesUnordered` fan-out. ant-design has ~200
  workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the
  per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — worker-pool
  delivered the parallelism that justifies the cap raise. CI
  ant-design avg_conc 84 at cap=128 (up from 55 under the old
  architecture); preload wall 3.10s → 2.15s.

## Tests

`#[tokio::test(flavor = "multi_thread", worker_threads = 2)]`
since worker-pool needs a spawn-able runtime; ruborist's
dev-dependencies on `tokio` add the `rt-multi-thread` feature.

164 ruborist + 10 doctests + 248/249 utoo-pm pass (1 pre-existing
flake on `test_update_package_binary_fsevents`, runs green alone).

## Wasm CI

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local`
on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers
still run independently — single-threaded under wasm but the
queue + Notify + mpsc termination story is unchanged.
`cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr added a commit that referenced this pull request Apr 25, 2026
…p guard

Two folded changes that started life as separate commits on the
parent perf branch:

1. **Sequential / chunked parallel writes** (was: ad0dee97ab17b8). The old per-file `par_iter().for_each(write)` paid
   work-stealing futex park/unpark overhead per write. Each entry is
   <64 KB and a single fs::File::create + write_all returns in μs —
   rayon scheduler dominated. Switch to `entries.par_chunks(32)
   .try_for_each(...)`: each rayon task writes a contiguous run of
   32 files sequentially. Cuts task count by 32× while keeping
   multi-core IO-overlap parallelism.

   Cross-package parallelism is preserved by the outer
   `rayon::spawn` in `extract_tarball`, which itself was already
   landed on `next` via earlier work.

   Verified essential by an A/B on the parent branch: removing the
   intra-package par_chunks (sequential `for entry in &entries`
   inside each rayon task) regressed CI p3_cold_install +3.67s and
   exploded σ from 0.04 → 2.85 — IO can't interleave across cores
   when each tarball serialises its own writes.

2. **Tar Slip guard** — reject tar entries whose path is absolute
   or contains `..` components before joining with the destination.
   Without this an attacker-controlled tarball could overwrite
   arbitrary files via paths like `../../etc/foo` or
   `/etc/passwd`. `tar` crate does not enforce this by default;
   `npm` and `pnpm` both validate. We log+skip such entries.
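The guard's core check can be sketched as a path predicate (`is_safe_entry_path` is a hypothetical helper name; the real code logs and skips rejected entries):

```rust
use std::path::{Component, Path};

// An entry path may only be joined with the destination if it is relative
// and never steps upward or back to the root.
fn is_safe_entry_path(p: &Path) -> bool {
    !p.is_absolute()
        && p.components()
            .all(|c| !matches!(c, Component::ParentDir | Component::RootDir))
}
```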

Both changes touch the same single function so they commit together.

CI bench shows p3_cold_install at 5.74s vs bun 7.71s (utoo +2s
ahead). PR description in #2818 documents the full A/B journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr added a commit that referenced this pull request Apr 25, 2026
Bundle of independently-motivated allocator + cache hot-path
optimisations from the parent perf branch (#2818). Each landed
during the worker-pool exploration but doesn't depend on the
worker-pool architecture itself — they stand alone as
straightforward perf wins for the resolver.

## TLS provider — `aws-lc-rs` instead of `ring`

`reqwest` 0.12's default `rustls-tls-native-roots` feature pins
`ring` via Cargo's feature unification. Switch to
`rustls-tls-native-roots-no-provider`, build our own
`rustls::ClientConfig` with the `aws_lc_rs` provider, pass via
`Client::use_preconfigured_tls`. CI measurement (4-core ubuntu vs
npmjs.org): ring's per-handshake CCS→AppData was 78 ms p50 / 154
ms max, all 128 parallel handshakes serialising across 4 cores.
aws-lc-rs (BoringSSL primitives) is ~3× faster on x86_64. Saved
~420 ms preload on cold ant-design.

## DNS — per-family rotation

`getaddrinfo` typically returns 10 v6 + 12 v4 for npmjs.org. A
flat rotation across the joined list meant offsets 0..10 all
started inside the v6 range; on hosts where v6 routing fails
(GitHub Actions runners), every connection fell through to the
*same* first-reachable v4. Rotate per-family so v4 conns cycle
across all v4 addresses (and v6 over v6) — observed pcap on bun
shows the same 4×64 distribution we now produce.

## Disk-cache bulk-readdir ETag index

`PackageCache` lazy-builds a `HashSet<String>` of names with
existing disk cache entries from a single `read_dir(cache_dir)` +
per-`@scope` recurse. `get_versions_from_disk` and
`get_version_manifest_from_disk` short-circuit via the index.
Restores the warm-run 304 path that was temporarily removed in
46cb803 (per-package `try_exists` was 16 ms avg on the cold-run
critical path; now zero).

## Lazy per-version `CoreVersionManifest` via `simd_json::OwnedValue`

`Versions` now stores `keys: Vec<String>` (ordered version list)
+ `trees: HashMap<String, Arc<simd_json::OwnedValue>>`
(pre-parsed JSON subtrees). Strongly-typed `CoreVersionManifest`
is materialised on demand via
`CoreVersionManifest::deserialize(tree.as_ref())` — zero-copy
through `simd_json::OwnedValue`'s `Deserializer` impl, memoised in
a `DashMap`. Resolver typically reads 1-3 of the ~500 versions
per manifest; previous design built every one eagerly.

## `Arc<FullManifest>` in `MemoryCache`

Cache previously returned `FullManifest` by value, deep-cloning
the per-version HashMap (100-500 entries × String key clone + Arc
bump per cache hit) on the resolver hot path. ~2730 cache hits
during cold preload × ~200-entry HashMap clone =
~500k allocations on shared resolver threads, contending the
allocator. Wrap in `Arc<FullManifest>`; cache hit becomes one
atomic bump.
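The before/after shape, with stand-in types (field names here are illustrative assumptions, not the crate's definitions):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for the real manifest: the point is the per-version map
// that used to be deep-cloned on every cache hit.
struct FullManifest {
    versions: HashMap<String, String>,
}

struct MemoryCache {
    entries: HashMap<String, Arc<FullManifest>>,
}

impl MemoryCache {
    /// A cache hit is now a single atomic refcount bump; the
    /// multi-hundred-entry `versions` map behind the Arc is shared,
    /// never cloned.
    fn get(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.entries.get(name).cloned()
    }
}
```

Two hits on the same name return pointers to the same allocation, which is exactly what the by-value API could not do.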

## `normalize_spec` returns `Cow<'a, str>`

Was unconditionally allocating `(String, String)` even for the
~99% of deps with no `npm:` / `workspace:` prefix — ~5460 String
allocations per ant-design preload, all on the resolver hot path.
The common path now returns `Cow::Borrowed`.
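A toy version of the `Cow` shape (the real `normalize_spec` also handles `workspace:` and returns a richer result; this sketch covers only the `npm:` alias case to show the allocation split):

```rust
use std::borrow::Cow;

/// Only specs carrying an `npm:` alias prefix take the allocating
/// branch; the ~99% common case borrows the input untouched.
fn normalize_spec(spec: &str) -> Cow<'_, str> {
    match spec.strip_prefix("npm:") {
        // Rare rewrite path (allocation shown for illustration; the
        // real code builds a new name/range pair here).
        Some(aliased) => Cow::Owned(aliased.to_string()),
        // Zero-allocation path.
        None => Cow::Borrowed(spec),
    }
}
```

Callers that only read the spec pay nothing on the borrowed path; only the rare alias rewrite allocates.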

## Drop `versions.keys.clone()` from cache-hit path

`resolve_package`'s full-manifest cache-hit branch was cloning
the entire `versions.keys: Vec<String>` (~200 entries) just to
pass `&[String]` to `resolve_target_version`. Borrow directly via
Arc auto-deref. ~360k String allocs eliminated (~1800 cache hits
× ~200 entries).

## OnceMap dedup

New `crate::util::oncemap` module: `DashMap` + `tokio::sync::Notify`
coalescer for concurrent `resolve_full_manifest` callers of the
same name. First caller fetches the network; others wait on the
shared `Notify` and read the cached `Arc<V>`. Replaces the prior
per-name `tokio::sync::Mutex<()>` gate that serialised the hot
dispatch path.
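A minimal synchronous sketch of the OnceMap idea, with std's `Condvar` standing in for `tokio::sync::Notify` and a locked `HashMap` for the `DashMap` (names and signature are assumptions): the first caller for a key runs `init`; concurrent callers for the same key park until the shared `Arc<V>` is published, then read it without re-fetching.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

struct OnceMap<V> {
    slots: Mutex<HashMap<String, Option<Arc<V>>>>, // None = fetch in flight
    cv: Condvar,
}

impl<V> OnceMap<V> {
    fn new() -> Self {
        Self { slots: Mutex::new(HashMap::new()), cv: Condvar::new() }
    }

    fn get_or_init(&self, key: &str, init: impl FnOnce() -> V) -> Arc<V> {
        let mut map = self.slots.lock().unwrap();
        loop {
            // Clone the slot out so no borrow of `map` outlives the match.
            match map.get(key).cloned() {
                Some(Some(v)) => return v,                      // already published
                Some(None) => map = self.cv.wait(map).unwrap(), // coalesce: park
                None => {
                    map.insert(key.to_string(), None);          // claim the fetch
                    drop(map);
                    let v = Arc::new(init());                   // runs exactly once
                    map = self.slots.lock().unwrap();
                    map.insert(key.to_string(), Some(Arc::clone(&v)));
                    self.cv.notify_all();                       // wake waiters
                    return v;
                }
            }
        }
    }
}
```

Unlike a per-key `Mutex<()>` gate, waiters never serialise through a lock on the dispatch path — they park once and wake with the finished value.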

## tracing file_filter info+ default

File-layer log filter dropped from `utoo=debug` to `utoo=info`.
Hot-path `tracing::debug!()` calls (cache hits, BFS dispatch,
preload events) emit ~5-10 events per resolved manifest. With
2730+ manifests during cold preload that's 15-30k events that —
even routed through the non_blocking appender's channel — pay
format/serialise CPU on the resolving thread before the channel
send. Override via `UTOO_FILE_LOG=debug` for diagnostics.

## indicatif progress bar — drop per-package message updates

`PreloadFetching` and `PreloadProgress` used to call
`format!("fetching/resolved {}", name)` + `PROGRESS_BAR.set_message()`
per event. With ~9000 such calls per ant-design preload and an
indicatif-internal `Mutex` per call, this serialised the main
loop's fill-and-drain rate. The user can't visually parse 5460
message swaps in 3 seconds anyway. Counter still ticks via
`PROGRESS_BAR.inc(1)`.

## HTTP + parse diagnostic infrastructure (used by PR4)

`service/http.rs` ships `start_http_trace` / `finish_http_trace`
+ `start_parse_trace` / `finish_parse_trace` plus
`record_http_interval` + `record_parse_interval` callbacks.
`#[allow(dead_code)]` on the start/finish for now — the preload
worker-pool refactor in the next PR (#TBD) wires them in.

Also bumps the `+ Sync` bound on `RegistryClient` callers in
`builder.rs` / `preload.rs` / `resolver/registry.rs` — required
because the trait's default-method futures gained `+ Send`
(needed downstream by tokio::spawn, but already correct for
single-threaded resolvers too).

Tests: 164 ruborist + 248/249 utoo-pm pass (1 pre-existing flake
on `test_update_package_binary_fsevents` when run in parallel,
passes alone).

Stacks: PR4 (preload worker-pool architecture) targets this
branch and adds the bound propagation + spawn refactor on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr added a commit that referenced this pull request Apr 25, 2026
The headline architectural change of #2818 — preload phase shifts
from a single-task `FuturesUnordered` cooperative poller to N
long-lived `tokio::spawn` workers (or
`wasm_bindgen_futures::spawn_local` on wasm32 where Send isn't
satisfied). Stacks on top of #2826.

## Why

Old design: main task owned `FuturesUnordered`, polled all preload
futures cooperatively, and ran every per-future continuation
(post-await body, completion handler, dispatch refill) on the same
single task. The deeper await chain inside `resolve_package`
(cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send`
+ `bytes` + parse spawn_blocking) made each future yield 5+ times,
and every yield round-tripped through main — saturating it. CI
ant-design preload sustained avg_conc=55-61 even after Mutex /
allocator hot-path eliminations, while the standalone
manifest-bench (#2824) hit 92 on the same reqwest stack.

## How

N long-lived `tokio::spawn` workers pulling from a shared
lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns
an `Arc<R>` clone and runs `resolve_package` on tokio's global
executor — futures progress fully independently, no cooperative
poll bottleneck. Main task only drains an `mpsc::unbounded_channel`
of completions to fire receiver events + on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize`
and park on a shared `Notify` when the queue is empty. When the
last completion makes `completed == dispatched` and the queue is
empty, the finishing worker raises a `shutdown` flag and wakes
others; all workers drop their result_tx clones, the channel
closes, and the main `recv().await` loop exits.
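The termination protocol above can be sketched with std threads. This is a hedged stand-in, not the PR's code: `Condvar` replaces `Notify`, a locked `VecDeque` replaces the lock-free `SegQueue`, `u32` deps replace real deps, and the completion channel is omitted — only the `dispatched`/`completed` shutdown logic is modelled.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct Pool {
    queue: Mutex<VecDeque<u32>>,
    cv: Condvar,
    dispatched: AtomicUsize, // counts every push
    completed: AtomicUsize,  // counts every finished resolve
}

impl Pool {
    fn push(&self, dep: u32) {
        self.dispatched.fetch_add(1, Ordering::SeqCst);
        self.queue.lock().unwrap().push_back(dep);
        self.cv.notify_one();
    }

    fn worker(&self) {
        let mut q = self.queue.lock().unwrap();
        loop {
            if let Some(dep) = q.pop_front() {
                drop(q); // "resolve" outside the lock
                if dep < 4 {
                    self.push(2 * dep);     // toy resolve: fan out two
                    self.push(2 * dep + 1); // child deps per node
                }
                // Children are pushed *before* completing, so
                // completed == dispatched implies nothing is in flight.
                let done = self.completed.fetch_add(1, Ordering::SeqCst) + 1;
                q = self.queue.lock().unwrap();
                if done == self.dispatched.load(Ordering::SeqCst) {
                    self.cv.notify_all(); // last completion: wake everyone
                    return;
                }
            } else if self.completed.load(Ordering::SeqCst)
                == self.dispatched.load(Ordering::SeqCst)
            {
                return; // nothing queued, nothing in flight
            } else {
                q = self.cv.wait(q).unwrap(); // park until work or shutdown
            }
        }
    }
}
```

Pushing dep `1` fans out to deps 2..=7 over three levels; four workers drain the tree and all exit once the two counters meet.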

## Trait surface change

- `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can
  wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call
  site in `run_preload_phase` clones the borrowed registry into a
  fresh `Arc`. Bound at every public surface up the chain bumped
  to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`,
  `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims
  (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking`
  in `service/manifest.rs`. The worker pool surfaced blocking-pool
  queue saturation for parses: `queue p95=200ms sum=70-89s` over
  2730 manifests on cap=4 CI runners. Parsing inline on the worker
  thread eliminates the dispatch + queue overhead.
- **Workspace package.json parallel reads** — switch the per-pattern
  `for path in matched_paths` serial loop to `FuturesUnordered`
  fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the
  per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — worker-pool delivers
  the parallelism that justifies the cap raise. CI ant-design
  avg_conc 84 at cap=128 (up from 55 under the old architecture);
  preload wall 3.10s → 2.15s.

## Wasm CI

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local`
on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers
still run independently — single-threaded under wasm but the
queue + Notify + mpsc termination story is unchanged.

`cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Tests: 164 ruborist + 10 doctests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
…p guard

Two folded changes that started life as separate commits on the
parent perf branch:

1. **Sequential / chunked parallel writes** (was: ad0dee97ab17b8).
   The old per-file `par_iter().for_each(write)` paid work-stealing
   futex park/unpark overhead per write. Each entry is <64 KB and a
   single `fs::File::create` + `write_all` returns in μs — the rayon
   scheduler dominated. Switch to `entries.par_chunks(32)
   .try_for_each(...)`: each rayon task writes a contiguous run of
   32 files sequentially. Cuts task count by 32× while keeping
   multi-core IO-overlap parallelism.

   Cross-package parallelism is preserved by the outer
   `rayon::spawn` in `extract_tarball`, which itself was already
   landed on `next` via earlier work.

   Verified essential by an A/B on the parent branch: removing the
   intra-package par_chunks (sequential `for entry in &entries`
   inside each rayon task) regressed CI p3_cold_install +3.67s and
   exploded σ from 0.04 → 2.85 — IO can't interleave across cores
   when each tarball serialises its own writes.
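   The chunked-write shape can be sketched without rayon, using std scoped threads in place of the rayon pool (the entry type and function name are assumptions): one task per 32-file chunk, each chunk written sequentially.

   ```rust
   use std::fs;
   use std::io::Write;
   use std::path::Path;
   use std::thread;

   /// One spawned task per chunk of 32 entries; the inner loop writes
   /// its run of small files sequentially, so scheduler overhead is
   /// paid once per 32 files instead of once per file, while chunks
   /// still overlap IO across cores.
   fn write_entries(dest: &Path, entries: &[(String, Vec<u8>)]) -> std::io::Result<()> {
       thread::scope(|s| {
           let handles: Vec<_> = entries
               .chunks(32)
               .map(|chunk| {
                   s.spawn(move || -> std::io::Result<()> {
                       for (rel, data) in chunk {
                           // one create + write_all per small file
                           let mut f = fs::File::create(dest.join(rel))?;
                           f.write_all(data)?;
                       }
                       Ok(())
                   })
               })
               .collect();
           for h in handles {
               h.join().unwrap()?;
           }
           Ok(())
       })
   }
   ```

   The rayon version keeps the same structure but lets the pool steal whole chunks rather than individual writes.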

2. **Tar Slip guard** — reject tar entries whose path is absolute
   or contains `..` components before joining it with the
   destination. Without this, an attacker-controlled tarball could
   overwrite arbitrary files via paths like `../../etc/foo` or
   `/etc/passwd`. The `tar` crate does not enforce this by default;
   `npm` and `pnpm` both validate. We log and skip such entries.
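   The guard amounts to a component walk before the join — a minimal sketch (function name is an assumption; the real code logs and skips where this returns `None`):

   ```rust
   use std::path::{Component, Path, PathBuf};

   /// Join `entry_path` onto `dest` only if it is a plain relative
   /// path: any root, prefix, or `..` component is treated as an
   /// escape attempt and rejected.
   fn safe_join(dest: &Path, entry_path: &Path) -> Option<PathBuf> {
       let mut out = dest.to_path_buf();
       for comp in entry_path.components() {
           match comp {
               Component::Normal(part) => out.push(part),
               Component::CurDir => {} // harmless "./"
               // RootDir, Prefix, or ParentDir: reject the entry
               _ => return None,
           }
       }
       Some(out)
   }
   ```

   Checking components (rather than substring-matching `".."`) also correctly accepts filenames that merely contain dots, like `..config`.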

Both changes touch the same single function so they commit together.

CI bench shows p3_cold_install at 5.74s vs bun 7.71s (utoo +2s
ahead). PR description in #2818 documents the full A/B journey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
Bundle of independently-motivated allocator + cache hot-path
optimisations from the parent perf branch (#2818). Each landed
during the worker-pool exploration but doesn't depend on the
worker-pool architecture itself — they stand alone as
straightforward perf wins for the resolver.

elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
- delete crates/manifest-bench (debug-only, never merged)
- tombi format crates/ruborist/Cargo.toml
- typos: unparseable → unparsable in bench/pm-bench.sh
@github-actions

📊 pm-bench-phases · ec1b50b · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.54s | 0.28s | 10.18s | 10.17s | 653M | 330.6K |
| utoo-npm | 10.23s | 0.20s | 11.61s | 13.29s | 1.13G | 159.4K |
| utoo | 9.21s | 1.07s | 11.20s | 12.27s | 2.26G | 260.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 17.3K | 18.1K | 1.16G | 7M | 1.83G | 1.72G | 1M |
| utoo-npm | 174.2K | 160.1K | 1.14G | 4M | 1.68G | 1.68G | 2M |
| utoo | 79.1K | 40.4K | 1.13G | 5M | 1.68G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.37s | 0.08s | 3.87s | 1.06s | 483M | 174.2K |
| utoo-npm | 6.01s | 0.60s | 6.07s | 1.09s | 430M | 74.5K |
| utoo | 2.65s | 0.05s | 5.62s | 1.97s | 1.44G | 193.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 12.0K | 3.3K | 200M | 3M | 104M | - | 1M |
| utoo-npm | 68.8K | 2.5K | 202M | 2M | 9M | 5M | 2M |
| utoo | 18.3K | 15.1K | 197M | 3M | 7M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 7.47s | 0.71s | 6.14s | 10.00s | 595M | 203.7K |
| utoo-npm | 9.40s | 1.22s | 5.61s | 12.10s | 905M | 122.3K |
| utoo | 7.69s | 2.82s | 5.46s | 10.86s | 878M | 107.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 6.7K | 7.4K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 153.9K | 109.7K | 965M | 4M | 1.67G | 1.67G | 2M |
| utoo | 89.8K | 43.0K | 965M | 3M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.35s | 0.06s | 0.23s | 2.32s | 135M | 32.1K |
| utoo-npm | 2.28s | 0.18s | 0.61s | 3.91s | 84M | 19.6K |
| utoo | 2.12s | 0.04s | 0.40s | 3.42s | 64M | 13.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 382 | 32 | 7M | 38K | 1.88G | 1.72G | 1M |
| utoo-npm | 53.0K | 22.0K | 21K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.7K | 8.9K | 20K | 10K | 1.68G | 1.67G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 28.79s | 6.58s | 9.42s | 10.19s | 523M | 367.5K |
| utoo-npm | 30.49s | 13.81s | 8.04s | 14.54s | 681M | 116.6K |
| utoo | 14.32s | 0.63s | 7.42s | 12.27s | 894M | 132.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 92.9K | 5.5K | 1.12G | 12M | 1.85G | 1.73G | 2M |
| utoo-npm | 250.4K | 98.7K | 978M | 9M | 1.67G | 1.68G | 2M |
| utoo | 154.1K | 61.6K | 984M | 9M | 1.67G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.58s | 0.04s | 4.00s | 1.11s | 552M | 185.7K |
| utoo-npm | 6.66s | 1.17s | 2.12s | 0.60s | 74M | 16.1K |
| utoo | 1.12s | 0.10s | 1.15s | 0.38s | 88M | 18.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 5.5K | 5.9K | 151M | 3M | 106M | - | 2M |
| utoo-npm | 47.6K | 622 | 13M | 2M | - | 4M | 2M |
| utoo | 15.1K | 917 | 17M | 3M | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 30.52s | 19.81s | 5.92s | 9.44s | 238M | 95.5K |
| utoo-npm | 44.38s | 33.35s | 6.23s | 12.96s | 612M | 108.7K |
| utoo | 20.24s | 3.22s | 5.80s | 11.57s | 663M | 98.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 69.2K | 3.4K | 999M | 9M | 1.73G | 1.73G | 2M |
| utoo-npm | 198.8K | 99.0K | 984M | 7M | 1.67G | 1.67G | 2M |
| utoo | 137.1K | 49.8K | 968M | 7M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.32s | 0.05s | 0.22s | 2.33s | 135M | 31.3K |
| utoo-npm | 2.57s | 0.15s | 0.63s | 3.98s | 84M | 19.7K |
| utoo | 2.13s | 0.30s | 0.43s | 3.43s | 65M | 14.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 759 | 24 | 7M | 86K | 1.88G | 1.72G | 2M |
| utoo-npm | 55.3K | 23.0K | 39K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.5K | 9.4K | 40K | 12K | 1.67G | 1.67G | 2M |

@github-actions

📊 pm-bench-phases · ec1b50b · mac (macos-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 17.84s | 3.14s | 6.29s | 19.27s | 793M | 51.2K |
| utoo-npm | 23.09s | 0.97s | 11.05s | 27.49s | 970M | 97.7K |
| utoo | 18.90s | 1.47s | 9.54s | 22.98s | 1.97G | 176.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.9K | 145.3K | - | - | 1.76G | 1.91G | 1M |
| utoo-npm | 13.2K | 381.6K | - | - | 1.63G | 1.83G | 2M |
| utoo | 4.4K | 216.5K | - | - | 1.63G | 1.88G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.97s | 0.12s | 2.26s | 0.95s | 505M | 32.9K |
| utoo-npm | 4.86s | 0.13s | 3.99s | 2.05s | 542M | 36.7K |
| utoo | 3.00s | 0.13s | 3.92s | 2.06s | 1.62G | 107.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 10 | 24.0K | - | - | 110M | - | 1M |
| utoo-npm | 13 | 78.9K | - | - | 28M | 5M | 2M |
| utoo | 42 | 46.7K | - | - | 27M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 20.05s | 3.20s | 4.01s | 22.14s | 531M | 34.5K |
| utoo-npm | 18.67s | 1.75s | 4.82s | 24.00s | 737M | 80.8K |
| utoo | 12.40s | 2.43s | 3.74s | 17.48s | 718M | 77.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 4.8K | 138.4K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.5K | 242.7K | - | - | 1.61G | 1.83G | 2M |
| utoo | 1.3K | 154.1K | - | - | 1.61G | 1.83G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 5.12s | 0.62s | 0.11s | 2.25s | 48M | 3.7K |
| utoo-npm | 4.00s | 0.35s | 0.57s | 2.92s | 91M | 6.8K |
| utoo | 3.86s | 0.57s | 0.37s | 2.56s | 82M | 5.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 15.6K | 876 | - | - | 1.86G | 1.90G | 1M |
| utoo-npm | 13.0K | 73.4K | - | - | 1.61G | 1.82G | 2M |
| utoo | 13.7K | 20.1K | - | - | 1.63G | 1.82G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 56.79s | 22.62s | 6.67s | 18.61s | 556M | 36.0K |
| utoo-npm | 63.80s | 40.68s | 8.94s | 24.20s | 641M | 74.0K |
| utoo | 29.08s | 7.31s | 7.43s | 23.56s | 719M | 79.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 13.8K | 175.9K | - | - | 1.79G | 1.90G | 2M |
| utoo-npm | 4.1K | 472.4K | - | - | 1.61G | 1.87G | 2M |
| utoo | 1.9K | 285.5K | - | - | 1.61G | 1.87G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 35.23s | 3.97s | 2.99s | 1.86s | 499M | 32.5K |
| utoo-npm | 27.66s | 16.35s | 2.53s | 1.55s | 80M | 5.8K |
| utoo | 10.49s | 10.44s | 1.62s | 0.71s | 92M | 6.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 49 | 37.1K | - | - | 113M | - | 2M |
| utoo-npm | 15 | 50.4K | - | - | - | 4M | 2M |
| utoo | 31 | 28.4K | - | - | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 23.13s | 0.12s | 3.99s | 19.21s | 269M | 17.8K |
| utoo-npm | 36.33s | 2.43s | 5.41s | 19.41s | 699M | 76.8K |
| utoo | 32.64s | 0.65s | 5.45s | 20.26s | 679M | 77.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 1.8K | 156.7K | - | - | 1.64G | 1.91G | 2M |
| utoo-npm | 1.6K | 334.1K | - | - | 1.60G | 1.83G | 2M |
| utoo | 1.3K | 251.7K | - | - | 1.60G | 1.83G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 4.82s | 0.82s | 0.15s | 2.45s | 53M | 4.0K |
| utoo-npm | 5.29s | 1.06s | 0.81s | 4.04s | 88M | 6.5K |
| utoo | 5.60s | 0.06s | 0.51s | 3.54s | 85M | 6.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.6K | 4.5K | - | - | 1.87G | 1.93G | 2M |
| utoo-npm | 12.4K | 78.9K | - | - | 1.61G | 1.87G | 2M |
| utoo | 13.1K | 20.8K | - | - | 1.61G | 1.87G | 2M |

Labels

`bench-phases` — Run pm-bench-phases workflow · `benchmark` — Run pm-bench on PR
