Skip to content

Surface readTreehash errors instead of swallowing#118

Open
sbalabanov wants to merge 1 commit into
fix-cache-goroutine-ctxfrom
readtreehash-surface-errors
Open

Surface readTreehash errors instead of swallowing#118
sbalabanov wants to merge 1 commit into
fix-cache-goroutine-ctxfrom
readtreehash-surface-errors

Conversation

@sbalabanov

Copy link
Copy Markdown
Contributor

Summary

readTreehash collapsed two very different cases — a legitimate cache miss (treehash not yet computed) and a real infra failure (e.g. storage backend rejecting a missing-TTL request) — into the same return value: "". Callers couldn't distinguish them, so an outage that disabled the cache showed up only as silent recomputation, visible only by tailing logs.

Changes

  1. readTreehash now returns (string, error):

    • NotFound("", nil): the expected miss, callers fall through to recompute
    • any other failure → ("", err): wrapped with the storage key for context
    • success → (treehash, nil)
  2. Pre-cache lookup in GetChangedTargets surfaces the error as a request failure tagged failureReasonTreehashRead / ErrorTypeInfra. Storage breakage now becomes a visible request-level failure rather than silent degradation, so dashboards and on-call alerts can see it.

  3. Cache-write goroutine can't return synchronously, so it logs at Error level (was Warn) and abandons the write. The cache opportunity is lost but the failure stays loud.

  4. New failureReasonTreehashRead failure-reason constant for metric tagging.

Tests

  • TestReadTreehash covers the four signature outcomes: NotFound miss, generic storage error, nil response, success.
  • TestGetChangedTargets_TreehashReadError replaces the old GetGraphError test (which only "worked" because errors were swallowed and the request still reached graph fetch) and asserts the request fails with the new failureReasonTreehashRead classification.

Test plan

  • `go build ./...`
  • `go test ./controller/...`
  • `./tools/bazel test //controller/...`
  • CI green

Stacking

Based on `fix-cache-goroutine-ctx` (#115). Merge order: #115 first, then this. Once #115 lands, this PR's base should be retargeted to `main`.

🤖 Generated with Claude Code

@sbalabanov sbalabanov requested review from a team as code owners June 15, 2026 23:54
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@sbalabanov sbalabanov force-pushed the readtreehash-surface-errors branch from d0f73b9 to fa8e46d Compare June 16, 2026 00:05
@sbalabanov sbalabanov force-pushed the fix-cache-goroutine-ctx branch from f7d9283 to 6bd1655 Compare June 16, 2026 16:05
@sbalabanov sbalabanov force-pushed the readtreehash-surface-errors branch from fa8e46d to 793d755 Compare June 16, 2026 16:07
@sbalabanov sbalabanov force-pushed the fix-cache-goroutine-ctx branch from 6bd1655 to 12fbcd7 Compare June 16, 2026 18:28
readTreehash previously logged a warning and returned "" for any error,
collapsing two very different cases into one signal: a legitimate cache
miss (treehash not yet computed) and a real infra failure (e.g. storage
backend rejecting a missing-TTL request). The caller couldn't tell them
apart, so an outage that disabled the cache showed up only as silent
recomputation — visible only by tailing logs.

Change the signature to (string, error):
  - NotFound          -> ("", nil): expected miss, callers fall through
  - any other failure -> ("", err): wrapped with the storage key
  - success           -> (treehash, nil)

Pre-cache lookup in GetChangedTargets now surfaces the error as a
request failure tagged failureReasonTreehashRead / ErrorTypeInfra so
storage breakage becomes a visible request-level failure rather than
silent degradation.

The cache-write goroutine can't return synchronously, so it logs at
Error level and abandons the write — losing the cache opportunity but
keeping the failure loud.

Tests:
  - TestReadTreehash covers all four signature outcomes.
  - TestGetChangedTargets_TreehashReadError (replaces the old
    GetGraphError test, which only worked because errors were swallowed)
    asserts the request fails with the new failure_reason classification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sbalabanov sbalabanov force-pushed the readtreehash-surface-errors branch from 793d755 to 5b7fe0b Compare June 16, 2026 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants