
fix(git): cap default zstd-threads to 4 in server config#281

Draft
worstell wants to merge 1 commit into main from worstell/cap-zstd-threads-default


@worstell
Contributor

Problem

The git strategy's zstd-threads config defaulted to 0, which expands to runtime.NumCPU() at compression/decompression time (see client/archive.go).

On hosts with many cores (e.g. 32–64 vCPU), every snapshot subprocess invokes zstd -T<NumCPU>, and each zstd worker holds several MB of buffers. With one worker per core, a single snapshot subprocess can already use a few hundred MB.

This compounds badly with cachewd's snapshot scheduling: scheduleSnapshotJobs submits three periodic jobs per warm mirror (snapshot-periodic, lfs-snapshot-periodic, mirror-snapshot-periodic), and on startup all three fire immediately for every mirror discovered by warmExistingRepos. Total transient memory during warm-up scales as mirrors × 3 × cores, easily reaching tens of GB on a many-mirror server.
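The scaling can be checked with back-of-the-envelope arithmetic. The sketch below treats every periodic job for every mirror as running at once, which is the worst case on warm-up. The 8 MB per-worker figure and the 50-mirror server are illustrative assumptions, because the PR only says "several MB" and "many-mirror". Under these assumptions one subprocess at 64 threads holds about 64 × 8 = 512 MB.

```go
package main

import "fmt"

// estimateWarmupMB gives a rough upper bound on transient zstd buffer
// memory when every periodic snapshot job for every mirror runs at once.
// perWorkerMB is an assumed figure; the PR only says "several MB".
func estimateWarmupMB(mirrors, jobsPerMirror, threads, perWorkerMB int) int {
	return mirrors * jobsPerMirror * threads * perWorkerMB
}

func main() {
	const (
		mirrors       = 50 // hypothetical server size
		jobsPerMirror = 3  // snapshot, lfs-snapshot, mirror-snapshot
		perWorkerMB   = 8  // assumed buffer size per zstd worker
	)
	// Legacy default on a 64-vCPU host: 50 × 3 × 64 × 8 = 76800 MB.
	fmt.Println(estimateWarmupMB(mirrors, jobsPerMirror, 64, perWorkerMB))
	// Capped default of 4 threads: 50 × 3 × 4 × 8 = 4800 MB.
	fmt.Println(estimateWarmupMB(mirrors, jobsPerMirror, 4, perWorkerMB))
}
```

Under these assumptions the cap shrinks the worst case by the ratio of cores to threads (64 / 4 = 16×). It does not bound the mirrors × 3 factor, which is what the follow-ups below target.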

Change

Cap the server-side default to 4 threads. Operators who want the legacy behaviour can set zstd-threads = 0 (or any other value) explicitly in their HCL config.
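One way to model "capped default, but an explicit 0 still means all cores" is to distinguish an unset field from an explicit value. The sketch below does this with a pointer field. `GitStrategyConfig` and `effectiveZstdThreads` are hypothetical names, and the real change may simply alter a default in the config struct rather than use a pointer.

```go
package main

import "fmt"

// defaultServerZstdThreads is the capped server-side default described above.
const defaultServerZstdThreads = 4

// GitStrategyConfig is a hypothetical stand-in for the server's git config.
// A nil ZstdThreads means the operator did not set zstd-threads in HCL.
type GitStrategyConfig struct {
	ZstdThreads *int
}

// effectiveZstdThreads returns the operator's value when set (including an
// explicit 0, which is later expanded to runtime.NumCPU()), otherwise the
// capped default.
func effectiveZstdThreads(cfg GitStrategyConfig) int {
	if cfg.ZstdThreads == nil {
		return defaultServerZstdThreads
	}
	return *cfg.ZstdThreads
}

func main() {
	legacy := 0
	fmt.Println(effectiveZstdThreads(GitStrategyConfig{}))                     // 4
	fmt.Println(effectiveZstdThreads(GitStrategyConfig{ZstdThreads: &legacy})) // 0
}
```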

Why 4?

Four threads are enough to keep zstd off the critical path for typical mirror sizes without letting any single snapshot spike memory and CPU across the whole node. It's a conservative starting point; operators sizing for very large monorepos can still tune it up.

What this does NOT change

The CLI flags in cmd/cachew/{main,git}.go keep 0 as their default, because cachew save/restore/git clone are short-lived single operations where using all cores is the right behaviour.

Follow-ups (separate PRs)

  • Stagger the initial fire of the three periodic snapshot jobs per repo so they don't all kick off simultaneously on warm-up.
  • Consider a global semaphore around concurrent snapshot generation to bound total in-flight zstd memory regardless of zstd-threads setting.

The git strategy's zstd-threads config defaulted to 0, which expands to
runtime.NumCPU() at compression/decompression time. On high-core hosts
(e.g. 32-64 vCPU nodes) this means each snapshot subprocess spawns one
zstd worker per core, with each worker holding several MB of buffers.

cachewd schedules three periodic jobs per warm mirror (snapshot,
lfs-snapshot, mirror-snapshot) and they fire concurrently across all
mirrors at startup. Combined with all-cores zstd, total transient memory
during warm-up scales as repos x 3 x cores, easily reaching tens of GB
on a many-mirror server and contributing to OOM kills.

Cap the server-side default to 4 threads, which is enough to keep zstd
off the critical path without blowing per-process memory. Operators that
want the legacy behaviour can still set zstd-threads = 0 explicitly.

The CLI's zstd-threads flags (cmd/cachew) keep 0 as their default
because CLI invocations are short-lived single operations where using
all cores is the right behaviour.

Amp-Thread-ID: https://ampcode.com/threads/T-019ddaf9-b3be-7604-90b5-18566c260d1f
Co-authored-by: Amp <amp@ampcode.com>