Hey guys...this one's a doozy, I'm sorry. Trying to push us to use AI more (which this issue's author clearly is), and frankly, I never would have found or understood this one otherwise. Erring on the side of too much info rather than too little.
Summary
The official Helm chart (oci://ghcr.io/triggerdotdev/charts/trigger) pins the Bitnami ClickHouse subchart to clickhouse-9.3.7, which in turn pins bitnami/clickhouse:25.6.1-debian-12-r0. Under sustained ingest, ClickHouse 25.6.1 hits a memory-tracker accounting bug that causes the global memory counter to overflow to ~7 EiB (≈ 2^63), at which point every query — reads and writes — is rejected by OvercommitTracker until the pod is restarted. The trigger.dev webapp dashboard surfaces this as "Unable to load your task runs", and event/run telemetry stops being persisted.
The condition is not self-clearing. The host has plenty of free memory (RSS ~2 GiB out of a 21.6 GiB limit), but ClickHouse's internal accounting is wedged. Only an in-process restart resets it.
Environment
- Trigger.dev Helm chart: 4.0.5 (also reproduced on 4.4.5 since both pin the same subchart — see below)
- App image: ghcr.io/triggerdotdev/trigger.dev:v4.4.4
- Kubernetes: EKS, single-shard ClickHouse StatefulSet (trigger-clickhouse-shard0-0), ~20 GiB PVC
- ClickHouse resources: requests 13.5Gi / limits 24Gi (chart values block)
- Workload: typical Trigger.dev v4 ingest — task runs + trace events flowing through task_runs_v2, task_events_v2, and the raw_task_runs_payload_v1 staging table
Reproduction
This was a production incident rather than a controlled synthetic repro, but the trigger appears to be sustained write pressure with concurrent background merges on raw_task_runs_payload_v1. Once the merge tasks start failing with MEMORY_LIMIT_EXCEEDED, retries pile up and never recover.
What we observed
ClickHouse pod logs (representative, repeating thousands of times per minute):
Code: 241. DB::Exception: (total) memory limit exceeded:
would use 7.00 EiB (attempt to allocate chunk of 4.00 MiB bytes),
current RSS: 2.12 GiB, maximum: 21.60 GiB.
OvercommitTracker decision: Query was selected to stop by OvercommitTracker.
(MEMORY_LIMIT_EXCEEDED)
... while reading from part .../raw_task_runs_payload_v1/...
... in query: INSERT INTO trigger_dev.task_runs_v2 ...
... in query: INSERT INTO trigger_dev.task_events_v2 ...
The 7.00 EiB figure is the giveaway — that's ~2^63 bytes, i.e. a signed-integer overflow in ClickHouse's global memory tracker. The actual RSS is ~2 GiB.
Webapp logs (during the event):
EventRepo.DynamicFlushScheduler Error attempting to flush batch
consecutiveFailures: 19438
table: trigger_dev.task_events_v2
error: InsertError: (total) memory limit exceeded ... 7.00 EiB ...
The webapp accumulated ~1.75M backlogged events that had to drain after we restarted ClickHouse. The UI showed "Unable to load your task runs" because the dashboard's reads against task_runs_v2 were rejected by the same tracker.
Root cause (best assessment)
This is a known class of ClickHouse memory-tracker accounting bug: a free() is double-counted (or an alloc() is missed) in one of the hot paths (background merges, async inserts, materialized views), the global atomic counter wraps, the wrapped value is reported as an absurd multi-exbibyte size (the ~7 EiB seen here), and OvercommitTracker rejects every subsequent allocation. There are multiple upstream ClickHouse commits in 25.7+ touching memory-tracker accuracy and overflow handling.
Mitigation (workaround)
kubectl rollout restart statefulset/trigger-clickhouse-shard0 -n trigger
Recovery is immediate — error rate goes from ~11k/min to ~0 within seconds of the pod coming back up. Webapp consecutiveFailures drops from ~47k to single digits as soon as ClickHouse is reachable again.
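If useful, a quick way to confirm the tracker is wedged (and later that it has reset) is to read ClickHouse's own MemoryTracking metric. This is just a sketch — the pod and namespace names come from this report, and clickhouse-client may need credentials depending on how the chart configured auth:

```bash
# Before the restart this reports an absurd exbibyte-scale value despite normal RSS;
# after the restart it should drop back to at most a few GiB.
kubectl exec -n trigger trigger-clickhouse-shard0-0 -- \
  clickhouse-client -q \
  "SELECT metric, formatReadableSize(value) AS tracked FROM system.metrics WHERE metric = 'MemoryTracking'"
```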
Suggested resolution (chart-side)
The trigger Helm chart's Bitnami CH subchart pin (charts/clickhouse-9.3.7 → bitnami/clickhouse:25.6.1-debian-12-r0) hasn't moved since trigger@4.0.5 and is still the same in trigger@4.4.5. Bumping the Bitnami subchart pin in hosting/k8s/helm/Chart.yaml to a current version (latest Bitnami CH chart on main ships 25.7.5-debian-12-r0) would pull in upstream tracker fixes for everyone running self-hosted, without each operator having to override clickhouse.image.tag themselves.
Operators currently on trigger@4.0.5–4.4.5 are exposed to this regardless of chart version, since the subchart pin is unchanged.
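Until the pin moves, an operator-side override is possible. A minimal sketch, assuming the subchart follows the standard Bitnami clickhouse.image.* values layout (verify against the chart's values.yaml before applying):

```yaml
# values.override.yaml (sketch) -- force a newer ClickHouse image under the
# bundled Bitnami subchart until the chart's own pin is bumped.
clickhouse:
  image:
    tag: 25.7.5-debian-12-r0
```

Something like helm upgrade trigger oci://ghcr.io/triggerdotdev/charts/trigger -f values.override.yaml --reuse-values would apply it; running a 25.7 image under the 25.6-era chart templates should be verified in a staging namespace first.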
Suggested resolution (operational hardening, optional)
A few small additions would make this much less painful even if the underlying bug isn't fully fixed:
- A liveness probe that exercises the query path (e.g. clickhouse-client -q "SELECT 1") on the bundled CH StatefulSet. The current TCP/HTTP probe stays green when the pod is wedged — a query-based probe would let Kubernetes auto-restart the pod within a minute. Today an operator has to notice the user-visible failure first. See the probe sketch after this list.
- Webapp circuit breaker / backlog cap on EventRepo.DynamicFlushScheduler. When consecutiveFailures crosses some threshold, sample/drop trace events instead of accumulating millions of items in memory. Losing 5 minutes of partial trace data is preferable to a 1.75M-item backlog that takes hours to drain post-recovery and increases the chance of re-tripping the tracker. A rough sketch of the policy also follows the list.
- Document the failure mode and recovery in the self-hosting docs, so other operators recognize "Unable to load your task runs" + MEMORY_LIMIT_EXCEEDED ... 7 EiB as a single condition with a known one-line fix.
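For the probe idea, a minimal sketch, assuming the Bitnami subchart exposes a customLivenessProbe value (most Bitnami charts do) and that the admin password is available in the container environment under the usual Bitnami variable name — both assumptions should be checked against the pinned subchart before use:

```yaml
# Sketch of a query-path liveness probe for the bundled ClickHouse StatefulSet.
clickhouse:
  customLivenessProbe:
    exec:
      command:
        - /bin/sh
        - -c
        # Env var name assumed from the Bitnami image; adjust if your setup differs.
        - clickhouse-client --password "$CLICKHOUSE_ADMIN_PASSWORD" -q "SELECT 1"
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
```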
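For the backlog cap, a hypothetical sketch of the policy only — this is not the actual EventRepo.DynamicFlushScheduler API, whose internals I haven't read; all names here are invented for illustration. The idea is to stop buffering once consecutive failures or queue depth cross a threshold, and resume once a flush succeeds:

```typescript
// Hypothetical sketch: shed load instead of growing an unbounded in-memory backlog.
type FlushFn<T> = (batch: T[]) => Promise<void>;

class BoundedFlushBuffer<T> {
  private queue: T[] = [];
  private consecutiveFailures = 0;

  constructor(
    private flush: FlushFn<T>,
    private maxBacklog = 100_000,   // hard cap on buffered items
    private failureThreshold = 50,  // start dropping after this many failed flushes
  ) {}

  add(event: T): void {
    const shedding =
      this.consecutiveFailures >= this.failureThreshold ||
      this.queue.length >= this.maxBacklog;
    if (shedding) return; // drop (or sample) rather than accumulate millions of items
    this.queue.push(event);
  }

  async flushOnce(batchSize = 1000): Promise<void> {
    if (this.queue.length === 0) return;
    const batch = this.queue.slice(0, batchSize);
    try {
      await this.flush(batch);
      this.queue.splice(0, batch.length);
      this.consecutiveFailures = 0; // circuit closes again on first success
    } catch {
      this.consecutiveFailures += 1; // keep the batch queued for retry
    }
  }
}
```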
Happy to test a chart bump on our end and report back, or open a small PR against hosting/k8s/helm/ for the subchart bump if helpful.
— Filed by an operator running self-hosted Trigger.dev v4 on EKS