perf(operator): parallelize secret reads, remove monitorPod goroutine… #656

Open
syntaxsdev wants to merge 5 commits into ambient-code:main from syntaxsdev:feat/operator-enhancements

Conversation

@syntaxsdev

  • Replace sequential secret/config reads with errgroup-based parallel fetches, reducing API round-trips when creating sessions under load
  • Remove monitorPod goroutine-per-session pattern; controller-runtime reconciler handles pod status via its watch/requeue loop
  • Eliminate redundant CR re-GET in ReconcilePendingSession; use the object already provided by the controller-runtime informer cache
  • Switch gauge callbacks in otel_metrics.go to use the informer cache instead of a cluster-wide LIST against the API server every 30s
  • Replace clearAnnotation GET+UPDATE (2 API calls) with a single JSON merge patch
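
For readers unfamiliar with the pattern, here is a minimal, hedged sketch of the Stage 1 read fan-out described above (secret names, variable names, and the clientset are illustrative placeholders, not the PR's actual identifiers):

package handlers

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// readSessionInputs fetches independent secrets in parallel. Goroutines always
// return nil; per-read errors land in dedicated result variables and are
// evaluated after Wait(), mirroring the PR's Stage 1 structure.
func readSessionInputs(ctx context.Context, cs kubernetes.Interface, ns string) error {
    g, gctx := errgroup.WithContext(ctx)
    var runnerSecretsErr, integrationErr error

    g.Go(func() error {
        _, runnerSecretsErr = cs.CoreV1().Secrets(ns).Get(gctx, "ambient-runner-secrets", metav1.GetOptions{})
        return nil
    })
    g.Go(func() error {
        _, integrationErr = cs.CoreV1().Secrets(ns).Get(gctx, "integration-secrets", metav1.GetOptions{})
        return nil
    })
    _ = g.Wait() // always nil by construction; Wait() still synchronizes the result vars

    if runnerSecretsErr != nil {
        return fmt.Errorf("checking runner secrets: %w", runnerSecretsErr)
    }
    _ = integrationErr // optional secret; absence tolerated in this sketch
    return nil
}

Because each goroutine writes a distinct variable here, no mutex is needed in this reduced sketch; the PR uses a shared mutex where goroutines update overlapping state.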

@syntaxsdev
Author

hold off


syntaxsdev marked this pull request as ready for review on February 19, 2026 at 14:56
syntaxsdev force-pushed the feat/operator-enhancements branch from 842bda9 to 1a7c0b5 on February 19, 2026 at 16:05
@github-actions
Contributor

github-actions bot commented Feb 19, 2026

Claude Code Review

Summary

This PR parallelizes operator secret reads with errgroup, removes the monitorPod goroutine-per-session pattern in favour of controller-runtime's work queue, switches OTel gauge callbacks to the informer cache, and replaces a GET+UPDATE annotation clear with a single JSON merge patch. The direction is architecturally correct and the load-test validation (30 sessions in ~21s, zero errors) is good evidence. However, there is one critical regression worth examining before merge.

Issues by Severity

🚫 Blocker Issues

None identified.

🔴 Critical Issues

1. Inactivity auto-stop is silently removed from the Running phase

shouldAutoStop / triggerInactivityStop were previously called on every monitorPod tick (every 5 seconds). After this PR neither function appears anywhere in the production reconciliation path. reconcileRunning in reconcile_phases.go:279-344 checks: pod existence, desired-phase annotation, generation drift, and token refresh — but not inactivity.

inactivity.go itself still contains the full implementation, and its package comment still references monitorPod:

// for agentic sessions. The operator's monitorPod loop calls shouldAutoStop()
// on each tick to check whether a session has exceeded its configured inactivity timeout

Consequence: sessions will now run indefinitely regardless of inactivityTimeoutSeconds in the session spec or ProjectSettings. This breaks the auto-stop feature introduced in the previous commit.

The fix is straightforward — call shouldAutoStop / triggerInactivityStop from reconcileRunning (or export them and call them from the reconciler). Since reconcileRunning already requeues every 30 seconds, this is compatible with the new architecture.
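
A hedged sketch of that wiring, assuming the functions get exported as ShouldAutoStop / TriggerInactivityStop (names, signatures, and the surrounding reconciler shape are assumptions, not the repository's actual API):

// Inside reconcileRunning, after the existing pod / annotation / token checks.
stop, err := handlers.ShouldAutoStop(ctx, session) // assumed exported wrapper
if err != nil {
    return ctrl.Result{RequeueAfter: 30 * time.Second}, err
}
if stop {
    if err := handlers.TriggerInactivityStop(ctx, session); err != nil {
        return ctrl.Result{RequeueAfter: 5 * time.Second}, err
    }
    return ctrl.Result{}, nil
}
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil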

🟡 Major Issues

2. Non-standard import grouping in sessions.go

sync was moved out of the stdlib block into its own blank-line-separated group:

    "time"           // stdlib group ends here

    "sync"           // ← isolated stdlib import

    "ambient-code-operator/internal/config"

goimports requires stdlib imports to be in a single group. The CI go-lint.yml workflow runs gofmt/golangci-lint, which will flag this. Running goimports -w components/operator/internal/handlers/sessions.go before push would resolve it.
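
For reference, the goimports-clean grouping would be:

    import (
        "sync"
        "time"

        "ambient-code-operator/internal/config"
    )

(Other imports elided; all stdlib imports belong in one group, project imports in a separate group.)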

3. Fragile error classification by string matching

In sessions.go (Stage 2 evaluation, ~line 600):

if strings.Contains(err.Error(), "runner token") {
    // set Failed status
}
return err

The error returned by the writeGroup goroutine is fmt.Errorf("failed to provision runner token for session %s: %v", name, err). If the message text changes, the gate silently stops setting status before returning the error, leaving the session stuck. A typed sentinel error (e.g. var ErrTokenProvision = errors.New("token provision failed")) or wrapping with a custom type would be more robust.

4. Stale package comment in inactivity.go

The package doc still describes monitorPod which no longer exists. Stale comments create confusion for maintainers and were explicitly called out in the existing CLAUDE.md review checklist.

🔵 Minor Issues

5. errgroup used where sync.WaitGroup is semantically cleaner for Stage 1

All readGroup goroutines return nil unconditionally (errors go into result variables). errgroup.WithContext creates a context that is cancelled on first non-nil return — a signal that is never sent. Using sync.WaitGroup for readGroup would be more honest about intent. errgroup remains appropriate for writeGroup where goroutines can return real errors.

// Instead of: readGroup, readCtx := errgroup.WithContext(context.TODO())
// → all goroutines always return nil, so readCtx is never cancelled
var wg sync.WaitGroup
wg.Add(numReads) // one Add per read goroutine launched below
// ... launch the read goroutines, each calling wg.Done() when finished ...
wg.Wait()

6. cachedClient package-level global is not test-isolated

cachedClient in otel_metrics.go is set once via InitMetrics and never reset. In parallel tests (or if InitMetrics is called multiple times) this is racy. Consider accepting the client as a parameter on countSessionsByPhase rather than storing it in package state.
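
A sketch of the parameter-passing alternative (the GVK below is a placeholder and the return shape is assumed for illustration):

import (
    "context"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// countSessionsByPhase reads from whatever client is handed to it, so tests
// can inject a fake client instead of mutating package state.
func countSessionsByPhase(ctx context.Context, c client.Client) (map[string]int64, error) {
    var list unstructured.UnstructuredList
    list.SetGroupVersionKind(schema.GroupVersionKind{
        Group:   "example.dev", // placeholder; substitute the real AgenticSession group
        Version: "v1alpha1",
        Kind:    "AgenticSessionList",
    })
    if err := c.List(ctx, &list); err != nil {
        return nil, err
    }
    counts := make(map[string]int64)
    for _, item := range list.Items {
        phase, _, _ := unstructured.NestedString(item.Object, "status", "phase")
        counts[phase]++
    }
    return counts, nil
}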

7. Pod-exists check moved before runner-secrets check changes error order

Previously, a missing ambient-runner-secrets secret would fail before even checking pod existence. Now pod existence is checked first (early return). This changes observable behaviour slightly: operators checking for the "SecretsMissing" condition on sessions created while the pod somehow already exists will no longer see that condition. Likely not an issue in practice but worth noting for test coverage.

Positive Highlights

  • Parallel reads with proper result aggregation — the mu-protected variable pattern for collecting errgroup results is idiomatic and correct.
  • JSON merge patch for clearAnnotation — reducing 2 API calls (GET + UPDATE) to 1 patch call is both faster and safer (no resource-version conflict possible).
  • Informer cache for OTel gauges — eliminating a cluster-wide LIST every 30 seconds is the right architectural choice, and the graceful fallback to config.DynamicClient is appreciated.
  • Stage separation with ASCII section headers — the ── Stage 1: Parallel reads ── / ── Stage 2: Parallel writes ── markers significantly improve readability in a 500+ line function.
  • Net -185 lines — 470 removed, 285 added; the goroutine-per-pod model eliminated cleanly.
  • Load-test evidence in commit message — empirical validation (30 sessions, ~21s, 20 concurrent reconcilers, zero errors) is exactly the kind of evidence that justifies a performance PR.
  • go.mod cleanup — correctly promoting golang.org/x/sync from indirect to direct dependency.

Recommendations

Priority order:

  1. Fix the inactivity regression — export shouldAutoStop / triggerInactivityStop (or equivalent) and call them from reconcileRunning. A 30-second requeue interval is sufficient granularity for inactivity detection.
  2. Fix import grouping — run goimports -w on sessions.go to avoid CI failures.
  3. Update inactivity.go package comment — replace the reference to monitorPod with a reference to reconcileRunning.
  4. Replace string-match error classification — use a sentinel or typed error for token provisioning failures in the writeGroup.Wait() block.
  5. (Optional) Swap readGroup to sync.WaitGroup for clarity, keep errgroup only for writeGroup.

🤖 Generated with Claude Code


What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

…s, use informer cache

- Replace sequential secret/config reads with errgroup-based parallel
  fetches, reducing API round-trips when creating sessions under load
- Remove monitorPod goroutine-per-session pattern; controller-runtime
  reconciler handles pod status via its watch/requeue loop
- Eliminate redundant CR re-GET in ReconcilePendingSession; use the
  object already provided by the controller-runtime informer cache
- Switch gauge callbacks in otel_metrics.go to use the informer cache
  instead of a cluster-wide LIST against the API server every 30s
- Replace clearAnnotation GET+UPDATE (2 API calls) with a single
  JSON merge patch

Validated with load-test-sessions.sh: 30 sessions reach Running in
~21s with 20 concurrent reconcilers, zero errors.
syntaxsdev force-pushed the feat/operator-enhancements branch from 1a7c0b5 to d761f07 on February 19, 2026 at 17:07
@github-actions
Contributor

github-actions bot commented Feb 19, 2026

Claude Code Review

Summary

This PR delivers valuable performance work: parallelizing operator secret reads with errgroup, eliminating the goroutine-per-session monitorPod pattern, switching OTel gauge callbacks to the controller-runtime informer cache, and reducing annotation clears from 2 API calls to 1 patch. The load-test result (30 sessions in ~21s, 20 concurrent reconcilers, zero errors) is solid empirical backing.

Files changed: 5 | Net: +285 / -470 lines

One critical regression and three style/robustness issues require attention before merge.


Issues by Severity

🔴 Critical Issues

1. Inactivity auto-stop is removed with no replacement

shouldAutoStop and triggerInactivityStop (defined in internal/handlers/inactivity.go) were previously called every 5 seconds inside monitorPod. After this PR they have zero call sites in the production reconciliation path — confirmed by code inspection.

The reconcileRunning function (internal/controller/reconcile_phases.go:279–344) handles pod existence, desired-phase annotation, generation drift, and token refresh (correctly delegates to handlers.EnsureFreshRunnerToken — ✅) but does not call shouldAutoStop or triggerInactivityStop.

Both functions are unexported (shouldAutoStop, triggerInactivityStop), so reconcile_phases.go (in the controller package) cannot call them directly without an exported wrapper.

Consequence: inactivityTimeoutSeconds from the session spec and ProjectSettings is silently ignored. Sessions will run indefinitely until manually stopped or until the pod exits. This regresses the feature introduced in the previous commit (#651).

The inactivity.go package comment still references the removed monitorPod:

// for agentic sessions. The operator's monitorPod loop calls shouldAutoStop()
// on each tick to check whether a Running session has exceeded its configured
// inactivity timeout, and triggerInactivityStop() to initiate the shutdown.

Fix: Export ShouldAutoStop / TriggerInactivityStop (or add a single CheckAndHandleInactivity wrapper) and call it from reconcileRunning. The existing 30-second requeue interval is adequate granularity for inactivity detection.


🟡 Major Issues

2. sync stdlib import orphaned in its own group

In internal/handlers/sessions.go lines 14–21, the "sync" import was moved into its own blank-line-separated group between the stdlib block and the project import block:

    "time"           // ← stdlib group ends here

    "sync"           // ← isolated (wrong group)

    "ambient-code-operator/internal/config"

goimports (and by extension golangci-lint) requires all stdlib imports in a single group. The CI go-lint.yml workflow enforces this and will fail. Running goimports -w components/operator/internal/handlers/sessions.go resolves it.

3. Fragile error classification by string matching in writeGroup.Wait()

if err := writeGroup.Wait(); err != nil {
    if strings.Contains(err.Error(), "runner token") {  // ← brittle
        statusPatch.SetField("phase", "Failed")
        statusPatch.AddCondition(...)
        _ = statusPatch.Apply()
    }
    return err
}

The error text is constructed as fmt.Errorf("failed to provision runner token for session %s: %v", ...). If the message changes, the strings.Contains gate silently stops setting the Failed status before returning the error, leaving the session stuck in the previous phase with no condition explaining why.

A sentinel error or custom type would be more robust:

var errTokenProvision = fmt.Errorf("token provision failed")
// In goroutine: return fmt.Errorf("%w: %v", errTokenProvision, err)
// In check:    if errors.Is(err, errTokenProvision) { ... }

4. Stale package-doc in inactivity.go

The package comment references monitorPod which no longer exists. Per the CLAUDE.md pre-commit checklist ("Logging: Structured logs with relevant context"), stale documentation that describes removed behaviour should be updated along with the code change.


🔵 Minor Issues

5. errgroup semantics misleading for Stage 1 reads

All readGroup goroutines unconditionally return nil (errors captured in result variables instead). errgroup.WithContext creates a cancellable context intended to propagate first-failure cancellation — a signal that is never triggered here. sync.WaitGroup would communicate intent more accurately. errgroup remains appropriate for writeGroup where goroutines do return real errors.

6. cachedClient package-level global is not test-isolated

cachedClient in otel_metrics.go is set once via InitMetrics and never reset. Parallel tests that call InitMetrics multiple times could race on this variable. Passing the client as a parameter into countSessionsByPhase rather than storing it in package state would eliminate this concern.

7. Missing status update for vertex secret copy failure

If copySecretToNamespace fails inside writeGroup (distinct from the "secret not found" case handled in Stage 1), writeGroup.Wait() returns an error that does not contain "runner token", so no status condition is set before return err. The session is left without a diagnostic condition. This was also the behaviour in the original sequential code, so it is not a regression — but the refactor is a good opportunity to close this gap.


Positive Highlights

  • JSON merge patch for clearAnnotation — replaces GET + full UPDATE (2 calls, subject to resource-version conflicts) with a single atomic PATCH. Clean and correct.
  • Informer cache for OTel gauges — eliminating a cluster-wide LIST every 30 seconds is the right architectural choice; the graceful fallback to config.DynamicClient is a sensible safety net.
  • Parallel read stage with proper mutex protection — the mu-protected result-variable pattern is idiomatic; the Stage 1 / Stage 2 section headers with ASCII separators make a dense function significantly more readable.
  • Token refresh correctly wired — reconcileRunning delegates to handlers.EnsureFreshRunnerToken (line 338 of reconcile_phases.go), so that part of monitorPod's responsibility is already covered.
  • go.mod hygiene — correctly promoting golang.org/x/sync from indirect to direct dependency.
  • Load-test evidence in the commit message — empirical validation (30 sessions, ~21s, zero errors) is exactly the right justification for a performance change.
  • getContainerStatusByName retained — confirmed still used in internal/handlers/reconciler.go:275; not dead code.

Recommendations (priority order)

  1. Fix inactivity regression — Export ShouldAutoStop / TriggerInactivityStop and call from reconcileRunning. Update inactivity.go package comment to reference reconcileRunning.
  2. Fix import groupinggoimports -w components/operator/internal/handlers/sessions.go (will unblock CI).
  3. Replace string-match error gate — Use a sentinel or errors.Is pattern for runner-token errors in writeGroup.Wait().
  4. (Optional) Swap readGroup to sync.WaitGroup to better communicate intent.
  5. (Optional) Accept countSessionsByPhase client as parameter rather than package-level global.

Review performed using repository standards from CLAUDE.md, .claude/context/backend-development.md, .claude/context/security-standards.md, .claude/patterns/error-handling.md, .claude/patterns/k8s-client-usage.md, and direct inspection of reconcile_phases.go, inactivity.go, and the full PR diff.

🤖 Generated with Claude Code



@github-actions
Contributor

github-actions bot commented Feb 19, 2026

Claude Code Review

Summary

This PR delivers meaningful performance wins for the operator under load: parallel secret reads with errgroup, single JSON merge patch for annotation clearing, informer cache for OTel metrics, and the removal of ~232 lines of goroutine-per-session monitoring code. The architectural direction is correct — moving from a hand-rolled polling loop to the controller-runtime reconciliation model. However, the removal of monitorPod without fully porting its pod completion logic into reconcileRunning introduces a critical regression: sessions whose pods complete will get stuck in Running phase indefinitely.


Issues by Severity

🚫 Blocker Issues

1. reconcileRunning never handles pod completion — sessions will be stuck in Running forever

monitorPod was responsible for detecting pod.Status.Phase == PodSucceeded / PodFailed and calling UpdateSessionFromPodStatus to transition the session. That function is now only called from reconcileCreating (reconcile_phases.go:235).

When a running pod transitions to Succeeded:

  1. The pod watch predicate fires (it triggers on phase changes — agenticsession_controller.go:265)
  2. The reconciler calls reconcileRunning
  3. reconcileRunning fetches the pod, finds it in Succeeded state — and then does nothing with that information
  4. Session stays Running and requeues after 30 seconds, repeating forever

And if the pod eventually gets garbage-collected (Kubernetes TTL), reconcileRunning will then see errors.IsNotFound and call ResetToPending (reconcile_phases.go:295-298), causing the session to restart rather than complete.

The fix is to add the pod-completion check at the top of reconcileRunning (after the pod fetch), mirroring what reconcileCreating does:

// Handle terminal pod phases
if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
    if err := handlers.UpdateSessionFromPodStatus(ctx, session, pod); err != nil {
        return ctrl.Result{RequeueAfter: 5 * time.Second}, err
    }
    return ctrl.Result{}, nil
}

The load test (30 sessions reach Running in ~21s) validates pod creation throughput but does not exercise this completion path.


🔴 Critical Issues

2. Original error from regenerateRunnerToken is silently discarded

In sessions.go (Stage 2 write goroutine):

if err := regenerateRunnerToken(sessionNamespace, name, currentObj); err != nil {
    return fmt.Errorf("failed to provision runner token for session %s: %w", name, ErrTokenProvision)
    // ↑ actual err is never included — swallowed
}

The actual error from regenerateRunnerToken is thrown away. The sentinel is used for type checking via stderrors.Is, which is correct, but the diagnostic context is lost. Should be:

return fmt.Errorf("%w for session %s: %v", ErrTokenProvision, name, err)

This will preserve both the sentinel (for errors.Is matching in writeGroup.Wait()) and the cause.


🟡 Major Issues

3. Pod watch predicate only fires on pod.Status.Phase changes, not container-state changes

podPredicate.UpdateFunc (agenticsession_controller.go:259-265) only enqueues when the pod phase changes. Container-level error states like ImagePullBackOff and CrashLoopBackOff occur within the Pending pod phase (phase stays Pending while the container waits). reconcileCreating handles these via UpdateSessionFromPodStatus, but only if reconciliation is triggered.

With the old monitorPod polling at 5-second intervals, these errors would be caught within one tick. With the new model, they only surface if the pod's phase changes or the 30-second RequeueAfter fires (for reconcileCreating it's 2 seconds). Consider adding container-status changes to the predicate to maintain responsiveness:

UpdateFunc: func(e event.TypedUpdateEvent[*corev1.Pod]) bool {
    if !strings.HasSuffix(e.ObjectNew.Name, "-runner") {
        return false
    }
    if e.ObjectOld.Status.Phase != e.ObjectNew.Status.Phase {
        return true
    }
    // Also trigger on container waiting/terminated state changes
    return containerStateChanged(e.ObjectOld, e.ObjectNew)
},
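
One possible shape for the containerStateChanged helper referenced above (a sketch, not code from the PR; assumes corev1 "k8s.io/api/core/v1"):

func containerStateChanged(oldPod, newPod *corev1.Pod) bool {
    key := func(statuses []corev1.ContainerStatus) string {
        s := ""
        for _, cs := range statuses {
            switch {
            case cs.State.Waiting != nil:
                s += cs.Name + "=waiting:" + cs.State.Waiting.Reason + ";" // catches ImagePullBackOff / CrashLoopBackOff
            case cs.State.Terminated != nil:
                s += cs.Name + "=terminated;"
            default:
                s += cs.Name + "=running;"
            }
        }
        return s
    }
    return key(oldPod.Status.ContainerStatuses) != key(newPod.Status.ContainerStatuses)
}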

🔵 Minor Issues

4. InitMetrics variadic signature is a leaky abstraction

func InitMetrics(ctx context.Context, c ...client.Client) (func(), error)

The variadic signature was chosen to avoid a breaking change, but it hides an important initialization dependency. A named parameter with a nil default (or a separate SetCachedClient(c client.Client) function) would make the optional nature explicit. As-is, c[0] is silently ignored if nil.

5. ErrTokenProvision sentinel is defined 200+ lines below its first use

var ErrTokenProvision = stderrors.New("token provision failed") sits near the bottom of the file (sessions.go:2232), but it's used in the Stage 2 write group near line 775. Go compiles this fine, but moving the sentinel to the top of the file (near other package-level vars) improves readability.

6. Missing observability: "found integration secrets" log dropped

The old code logged "Found %s secret in %s, will inject as env vars" when integrationSecretsExist was true. That log is absent from the parallel read goroutine. For operators debugging secret injection, this log was useful. Trivial to add after readGroup.Wait().

7. context.TODO() for both errgroup roots

Both readGroup and writeGroup are anchored to context.TODO(). This means cancellation / deadlines from callers cannot propagate. handleAgenticSessionEvent doesn't receive a context today, so this is not a regression — but it's worth noting as a future improvement target when the function signature gets a context parameter.


Positive Highlights

  • Parallel read stage is well-designed: All GETs behind a single errgroup, shared sync.Mutex, errors captured in named result vars, _ = readGroup.Wait() with clear comment explaining why it's always nil. Clean.
  • clearAnnotation → JSON merge patch: Reducing a GET+UPDATE (with a full object round-trip) to a single MergePatchType with null is exactly the right approach. The annotationKey is properly format-quoted (%q) preventing JSON injection.
  • golang.org/x/sync promoted from indirect to direct: Correct housekeeping.
  • Informer cache for OTel gauge callbacks: Eliminating a cluster-wide LIST every 30 seconds is a meaningful reduction in API server load at scale. The fallback to DynamicClient for non-cached scenarios is sensible.
  • ShouldAutoStop / TriggerInactivityStop export: The rename + export is the right way to expose these to the controller package without a circular dependency.
  • Test coverage: The inactivity_test.go updates correctly follow the rename.
  • monitoredPods map removal: The controller-runtime work queue deduplicates events natively, so the hand-rolled mutex+map was redundant. Good cleanup.
  • Load test validation ("30 sessions reach Running in ~21s with 20 concurrent reconcilers, zero errors") gives confidence in the parallel creation path.

Recommendations

Priority order:

  1. [Blocker] Add pod-completion handling to reconcileRunning — call UpdateSessionFromPodStatus when pod.Status.Phase is Succeeded or Failed, and handle the deletion + phase transition. Without this, sessions will never leave Running.
  2. [Critical] Preserve the actual error from regenerateRunnerToken when wrapping with ErrTokenProvision.
  3. [Major] Extend the pod watch predicate to fire on container state changes (not just pod phase) to preserve responsiveness to ImagePullBackOff and CrashLoopBackOff.
  4. Supplement the load test to cover the full session lifecycle (Pending → Creating → Running → Completed/Failed) before merge.


@github-actions
Contributor

github-actions bot commented Feb 19, 2026

Claude Code Review

Summary

PR #656 is a focused operator performance improvement that eliminates goroutine-per-session monitoring, parallelizes secret reads/writes using errgroup, switches OTel gauge callbacks to the informer cache, and reduces clearAnnotation from 2 API calls to 1 via JSON merge-patch. The architectural direction (controller-runtime reconciler replacing goroutines) is correct. Several issues warrant attention before merge.


Issues by Severity

Blocker Issues

None.


Critical Issues

1. writeGroup error handling is inconsistent with its own comment (sessions.go:710-719)

The comment states "Vertex copy failure is fatal; token failure is fatal", but only the ErrTokenProvision path updates the session status to Failed. If copySecretToNamespace fails (e.g., RBAC denied), the session silently re-enters the retry loop with no user-visible Failed phase. Sessions with a persistent vertex-secret permission problem will spin forever in Pending.

Fix: Apply the same Failed status update for any non-retriable writeGroup error, or clearly separate retriable vs. non-retriable errors.


Major Issues

2. context.TODO() for both errgroups breaks cancellation (sessions.go:513, 653)

handleAgenticSessionEvent is called from ReconcilePendingSession, which receives a context.Context from the controller-runtime reconciler. Using context.TODO() means operator shutdown or reconciler timeout will not cancel in-flight Kubernetes API calls during the parallel stages. Under a graceful shutdown this can delay pod termination. This is pre-existing in the function signature but newly amplified by the errgroup pattern.

Fix: Thread the reconciler ctx through handleAgenticSessionEvent and use it as the base context for both errgroups.
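
Sketched against the signatures implied above (assumed, not verified against the repository):

func ReconcilePendingSession(ctx context.Context, session *unstructured.Unstructured) error {
    return handleAgenticSessionEvent(ctx, session) // pass the reconciler's ctx down
}

func handleAgenticSessionEvent(ctx context.Context, session *unstructured.Unstructured) error {
    readGroup, readCtx := errgroup.WithContext(ctx) // was context.TODO()
    // ... Stage 1 goroutines issue their API calls with readCtx ...
    _ = readGroup.Wait()

    writeGroup, writeCtx := errgroup.WithContext(ctx) // was context.TODO()
    // ... Stage 2 goroutines use writeCtx ...
    return writeGroup.Wait()
}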

3. runnerTokenCheckErr condition inverts semantics (sessions.go:696)

errors.IsNotFound(runnerTokenCheckErr) is true only when the secret definitively does not exist, which is the expected trigger for provisioning. But the outer if !runnerTokenExists already fires on any non-nil error, including transient errors. If runnerTokenCheckErr is a transient error (not NotFound), the else-if branch just logs a warning and skips provisioning, so pod creation proceeds without a runner token.

Fix: Check errors.IsNotFound alongside !runnerTokenExists and return a retriable error for transient failures instead of silently proceeding.


Minor Issues

4. Discarded readGroup.Wait() return misleads linters (sessions.go:593)

All goroutines intentionally return nil, so discarding the Wait() return is correct. However, the blank identifier suppresses the linter on a line that does matter. Consider adding a //nolint:errcheck comment to make the intent explicit to tooling.

5. cachedClient package-level variable lacks synchronization (otel_metrics.go:30, 59-61)

Safe in production (set once at startup, then read-only). In tests that call InitMetrics concurrently, the race detector will fire. Consider a sync.Once or passing the client directly to countSessionsByPhase to avoid the global state.

6. clearAnnotation now always makes an API call (helpers.go:230-238)

The old implementation returned early if the annotation key was absent (avoiding a round-trip). The new merge-patch always sends a PATCH even when the annotation does not exist. Functionally correct and idempotent but generates unnecessary API traffic on steady-state reconcile paths.
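
A sketch that restores the early return while keeping the single-call patch (function shape, parameters, and client are assumptions):

import (
    "context"
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func clearAnnotation(ctx context.Context, c client.Client, obj *unstructured.Unstructured, key string) error {
    // The informer-cached object may be slightly stale; a false positive here
    // only costs one idempotent PATCH, so the guard is safe.
    if _, ok := obj.GetAnnotations()[key]; !ok {
        return nil // annotation already absent; skip the API round-trip
    }
    patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:null}}}`, key))
    return c.Patch(ctx, obj, client.RawPatch(types.MergePatchType, patch))
}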

7. Reconcile requeue interval for inactivity is 30s (reconcile_phases.go:355)

If the configured inactivity timeout is short (e.g., 60s), sessions could overrun by up to 30s before being stopped. Consider documenting the expected detection precision in the inactivity timeout configuration.

8. Missing tests for new behaviour

  • clearAnnotation merge-patch logic (no test validates the new single-call path).
  • metricsEnabled() guard (no test verifies nil-safety when OTEL_EXPORTER_OTLP_ENDPOINT is unset).
  • Parallel read/write stages in handleAgenticSessionEvent (no unit test for the errgroup path).

inactivity_test.go was correctly updated; similar coverage would strengthen the parallel-fetch code.


Positive Highlights

  • Architecture: Removing goroutine-per-session monitoring eliminates an entire class of goroutine-leak risk and aligns with controller-runtime's event-driven model. This is the right direction.
  • Parallel reads: The errgroup and mutex pattern for collecting results is idiomatic Go and applied correctly. All goroutines return nil, errors are stored in closed-over variables, and evaluation happens after Wait().
  • Informer cache for metrics: cachedClient.List in gauge callbacks avoids a cluster-wide LIST every 30s, a meaningful API server load reduction at scale.
  • metricsEnabled() guard: Prevents nil-pointer panics when metrics are not configured. Clean, minimal fix.
  • Merge-patch for clearAnnotation: Correct use of null in a JSON merge-patch to remove an annotation key in a single call. Proper escaping prevents JSON injection.
  • ErrTokenProvision sentinel: Correct use of %w wrapping and errors.Is for typed error detection in errgroup result chains.
  • TODO annotations: The TODO(controller-runtime-migration) comments accurately capture the transitional state and help future contributors understand what to clean up.
  • go.mod hygiene: Promoting golang.org/x/sync from indirect to direct correctly reflects actual usage.

Recommendations

  1. (Critical) Fix writeGroup error handling to consistently set phase=Failed for any non-retriable failure, not just ErrTokenProvision.
  2. (Major) Fix the runnerTokenCheckErr condition inversion to avoid proceeding without a token when a transient read error occurs.
  3. (Major) Thread a real context.Context through handleAgenticSessionEvent so errgroup API calls respect reconciler cancellation.
  4. (Minor) Document inactivity-timeout precision in operator configuration, or log a warning when the requeue interval exceeds half the configured timeout.
  5. (Minor) Add a sync.Once around cachedClient assignment to keep the race detector clean in tests.


@github-actions
Contributor

github-actions bot commented Feb 19, 2026

Claude Code Review

Summary

This PR delivers meaningful operator performance improvements: parallel secret/config reads via errgroup, removal of the monitorPod goroutine-per-session in favor of the controller-runtime reconcile loop, informer-cache-backed OTel gauge queries, a JSON merge-patch for annotation removal, and nil-guard checks on all Record* metric functions. The overall direction is architecturally sound. A few correctness issues and behavioral inconsistencies are worth addressing before merge.


Issues by Severity

🔴 Critical Issues

1. Root cause swallowed in token provision error (sessions.go)

The actual error from regenerateRunnerToken is discarded. The goroutine does:

return fmt.Errorf("failed to provision runner token for session %s: %w", name, ErrTokenProvision)

It wraps the sentinel but not err — the real failure reason (RBAC, network, API server) is silently dropped from both the log and the status condition message. Fix:

return fmt.Errorf("failed to provision runner token for session %s: %w: %w", name, ErrTokenProvision, err)

or log the original error before returning.


2. Transient API errors on runner-secrets check treated as "secret missing" (sessions.go)

When runnerSecretsExist is false, the evaluation block returns fmt.Errorf("runner secret %s missing...") regardless of whether runnerSecretsCheckErr is a NotFound or a transient network/RBAC error. A transient failure marks the session as Failed with no retry opportunity. The token-provision logic in this same PR handles this correctly (non-NotFound -> returns a retriable error). The runner-secrets check should apply the same pattern:

if !runnerSecretsExist {
    if runnerSecretsCheckErr != nil && !errors.IsNotFound(runnerSecretsCheckErr) {
        // Transient — let the reconciler retry
        return fmt.Errorf("transient error checking runner secret %s: %w", runnerSecretsName, runnerSecretsCheckErr)
    }
    // Definitively missing — fail the session
    statusPatch.AddCondition(...)
    _ = statusPatch.Apply()
    return fmt.Errorf("runner secret %s missing in namespace %s", runnerSecretsName, sessionNamespace)
}

🟡 Major Issues

3. context.TODO() still used extensively inside handleAgenticSessionEvent (sessions.go)

handleAgenticSessionEvent now accepts a ctx context.Context parameter, and the parallel read/write groups correctly use readCtx/writeCtx. However, dozens of subsequent Get, Create, Delete, Patch calls in the same function still pass context.TODO(). If the reconcile context is cancelled (timeout, shutdown), those calls will not respect it. Threading ctx through consistently completes the intent of this PR.


4. cachedClient is mutable package-level state in otel_metrics.go

The variadic InitMetrics(ctx, c ...client.Client) pattern works at runtime (single initialisation at startup) but makes the package hard to unit-test: tests touching countSessionsByPhase indirectly will either need a live cluster or fall through to the DynamicClient path. A SetMetricsCacheClient(c client.Client) setter, or passing the client directly to countSessionsByPhase, would improve testability. Not blocking, but worth noting given the project standard of table-driven tests.


5. Silent write cancellation when writeCtx is cancelled mid-copy (sessions.go)

writeGroup uses errgroup.WithContext(ctx). If the token-provision goroutine returns an error, writeCtx is cancelled. In-progress copySecretToNamespace goroutines (which create their own 30-second timeout from writeCtx) will fail with context.Canceled, but since those goroutines always return nil, the cancellation is silently discarded. The session is correctly marked Failed due to the token error, but adding a log line inside the copy goroutines when copyCtx.Err() != nil would make the abandonment visible.


🔵 Minor Issues

6. Integration-secrets "found" log removed (sessions.go)

The original code logged "Found %s secret in %s, will inject as env vars" when integration secrets were present. The new code omits this, reducing observability into which optional secrets are active for a session.

7. TriggerInactivityStop still uses context.TODO() (inactivity.go)

obj, err := config.DynamicClient.Resource(gvr).Namespace(namespace).Get(context.TODO(), name, v1.GetOptions{})

Now that this function is called from the controller reconcile loop that provides a real context, threading it through would be consistent.

8. _ = readGroup.Wait() discards errors silently

The comment "All goroutines return nil; errors are in result vars" is a runtime invariant the compiler cannot enforce. A future contributor adding an erroring goroutine might not notice Wait() is discarded. Prefer:

if err := readGroup.Wait(); err != nil {
    log.Printf("Unexpected error from readGroup: %v", err) // should not happen
}

Positive Highlights

  • Goroutine leak eliminated: Removing monitorPod + monitoredPods map is the most impactful correctness fix in this PR. The old pattern had inherent races (duplicate goroutines, goroutines not exiting on operator restart) that the controller-runtime watch/requeue loop handles correctly.
  • Parallel reads with errgroup: Clean implementation. The mutex-protected result vars pattern is idiomatic, and the fallback path for environments without a cached client is well-handled.
  • JSON merge-patch for annotation removal: Eliminates the GET+UPDATE race on object resourceVersion — a real correctness improvement, not just a performance one.
  • metricsEnabled() guard: Consistent nil-guard across all Record* functions prevents panics when OTEL_EXPORTER_OTLP_ENDPOINT is unset.
  • Informer-cache for OTel gauges: Avoiding a cluster-wide LIST every 30 seconds is exactly the right optimization for a frequently-called gauge callback.
  • Removing redundant re-GET in ReconcilePendingSession: The informer delivers an up-to-date object; re-fetching it was wasted latency with no correctness benefit.
  • ErrTokenProvision sentinel: The pattern is right — a typed sentinel enables errors.Is matching for the caller's reason logic. Just needs the real error chained (issue #1).
  • Moving InitMetrics after manager creation in main.go: Necessary to pass mgr.GetClient() — correct sequencing.

Recommendations

Priority order for pre-merge fixes:

  1. Fix root-cause swallowing in the token provision goroutine (issue #1 — 2-line fix).
  2. Fix transient-error handling for the runner-secrets check to allow retries (issue #2 — ~5-line change).
  3. Thread ctx through the remaining context.TODO() callsites inside handleAgenticSessionEvent (issue #3 — mechanical).
  4. Address the silent write-cancellation logging gap (issue #5 — add log lines).
  5. Issues #4, #6, #7, and #8 can be deferred to a follow-up PR.


@github-actions
Contributor

github-actions bot commented Feb 19, 2026

Claude Code Review

Summary

This PR delivers meaningful operator performance improvements through four complementary changes: parallelizing secret/config reads with errgroup, removing the goroutine-per-session monitorPod pattern in favor of controller-runtime's requeue loop, eliminating a redundant CR re-GET by trusting the informer cache, and reducing the OTel metrics gauge callback from a cluster-wide LIST to an informer cache read. The clearAnnotation merge-patch optimization is a clean bonus. The code is well-documented, the migration rationale is clear, and the established K8s operator patterns are followed correctly.


Issues by Severity

🚫 Blocker Issues

None.


🔴 Critical Issues

1. Double %w verb requires Go 1.20+ for full wrapping semantics

sessions.go (line ~702):

return fmt.Errorf("failed to provision runner token for session %s: %w: %w", name, ErrTokenProvision, err)

In Go 1.13–1.19, fmt.Errorf silently ignores the second %w and treats it as %v. This means err (the root cause) is not accessible via errors.Unwrap or errors.As on older Go toolchains. The errors.Is(err, ErrTokenProvision) check on line ~718 still works (first %w is correctly wrapped), but debugging is harder because the underlying error is not traversable.

Recommendation: Verify go.mod specifies go 1.20 or later, or rewrite as:

return fmt.Errorf("%w: failed to provision runner token for session %s: %w", ErrTokenProvision, name, err)

…keeping ErrTokenProvision as the sentinel and err as the cause, with the Is check working correctly at the call site.


🟡 Major Issues

2. No new tests for the parallel read/write paths

The most complex new behavior — the two errgroup stages with mutex-protected result variables and the sentinel-error-based write-group failure handler — has no new unit tests. The existing inactivity_test.go updates are correct and appreciated, but the parallelization logic is the highest-risk change in this PR.

Recommendation: Add table-driven tests covering at minimum:

  • Stage 1: one read goroutine fails transiently (non-IsNotFound) → expect retriable error propagated
  • Stage 1: vertex secret definitively missing when Vertex enabled → expect Failed phase
  • Stage 2: token provision fails → expect TokenProvisionFailed reason in status
  • Stage 2: write group success → expect normal pod creation flow
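
A skeleton of that table, shown below (the readErr/tokenErr injection seams are hypothetical and would need matching test hooks in the handlers package):

func TestHandleAgenticSessionEvent_ErrorPaths(t *testing.T) {
    // assumes imports: errors, testing, apierrors "k8s.io/apimachinery/pkg/api/errors",
    // schema "k8s.io/apimachinery/pkg/runtime/schema"
    cases := []struct {
        name      string
        readErr   error  // injected into a Stage 1 secret read
        tokenErr  error  // injected into the Stage 2 token provision
        wantPhase string // expected session phase afterwards
        wantErr   bool
    }{
        {"transient read error", errors.New("connection refused"), nil, "", true},
        {"vertex secret missing", apierrors.NewNotFound(schema.GroupResource{Resource: "secrets"}, "vertex"), nil, "Failed", true},
        {"token provision fails", nil, errors.New("rbac denied"), "Failed", true},
        {"happy path", nil, nil, "Creating", false},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            // arrange fakes for the seams above, call handleAgenticSessionEvent,
            // then assert on the resulting phase and returned error class
        })
    }
}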

3. ReconcilePendingSession ignores its appConfig parameter

reconciler.go line 27:

func ReconcilePendingSession(ctx context.Context, session *unstructured.Unstructured, appConfig *config.Config) error {
    return handleAgenticSessionEvent(ctx, session)  // appConfig is dropped
}

Inside handleAgenticSessionEvent, config.LoadConfig() is called again (line ~484), defeating the purpose of injecting appConfig. This causes an extra config load per reconciliation and the signature is misleading.

Recommendation: Either pass appConfig through to handleAgenticSessionEvent (requires signature change), or remove the parameter from ReconcilePendingSession until the refactor is done. The TODO comment is appreciated but the parameter should not be silently dropped.


🔵 Minor Issues

4. context.TODO() pervasive in helpers called from the new ctx-aware path

handleAgenticSessionEvent now accepts ctx context.Context, but many helpers it calls internally still use context.TODO() (e.g., mutateAgenticSessionStatus, updateAnnotations, ensureSessionIsInteractive). This means cancellation/deadline propagation is lost for the bulk of the reconciliation work.

The TODO comment in the file acknowledges this is transitional, but it's worth tracking. Consider a follow-up issue to thread ctx into these helpers — otherwise the new ctx parameter provides false assurance of proper context propagation.

5. Confusing no-op statements on unreachable paths

sessions.go lines ~455 and ~460:

_ = "Pending" // phase reset handled by status update

These no-op assignments are remnants of earlier refactoring and convey no information. They're harmless but add confusion when reading the Creating-phase recovery logic.

6. Mutex scope includes log.Printf in integration secrets goroutine

sessions.go (integration secrets goroutine):

mu.Lock()
integrationSecretsExist = (err == nil)
mu.Unlock()
if err != nil && !errors.IsNotFound(err) {
    log.Printf("Error checking for %s secret in %s: %v", integrationSecretsName, sessionNamespace, err)
}

This is correct — the log is outside the lock. Just noting it's consistent with the other goroutines (runner token goroutine logs while holding the lock on one read path). Minor style inconsistency, no functional issue.

7. cachedClient package-level global may cause races in tests

otel_metrics.go:

var cachedClient client.Client

This package-level global is set once in production, but if metrics tests ever call InitMetrics concurrently (e.g., parallel test packages), there would be a data race. Not a production issue, but worth noting for future test harness work.

8. otlpmetricgrpc.WithInsecure() comment

otel_metrics.go line 84:

otlpmetricgrpc.WithInsecure(), // Use TLS in production

This comment has existed in the codebase, but it's worth flagging: if this code is deployed to production with OTEL_EXPORTER_OTLP_ENDPOINT set, metrics will be exported unencrypted. Consider making TLS opt-out explicit (via OTEL_EXPORTER_INSECURE=true) rather than always-insecure.
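
One way to sketch that opt-in (OTEL_EXPORTER_INSECURE is the name proposed above, not an OTel-standard variable; endpoint is assumed to be in scope):

opts := []otlpmetricgrpc.Option{otlpmetricgrpc.WithEndpoint(endpoint)}
if os.Getenv("OTEL_EXPORTER_INSECURE") == "true" {
    opts = append(opts, otlpmetricgrpc.WithInsecure()) // explicit opt-out of TLS
}
exporter, err := otlpmetricgrpc.New(ctx, opts...) // TLS by default otherwise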


Positive Highlights

  • errgroup pattern is correctly implemented. The mutex protects result variables, goroutines return nil so the group itself never cancels, and errors are evaluated after Wait(). The writeCtx-scoped timeouts for secret copies are a nice touch.
  • clearAnnotation merge patch is idiomatic and correct. The %q format verb safely quotes the annotation key, preventing JSON injection.
  • Backward-compatible InitMetrics variadic signature is clean. Callers not passing a client continue to work without modification.
  • Inactivity check in reconcileRunning properly double-checks the live CR inside TriggerInactivityStop() before setting the annotation, avoiding the race where the user sent a message between the check and the trigger.
  • ErrTokenProvision sentinel enables structured error classification at the writeGroup.Wait() call site without needing string matching.
  • Removal of monitoredPods map and monitorPod goroutines is a significant architectural cleanup. This eliminates N long-lived goroutines (one per Running session) and replaces them with the controller-runtime requeue mechanism — correct and resource-efficient.
  • metricsEnabled() guard is a clean, idiomatic fix for the nil-pointer panic class when OTLP is unconfigured.
  • Inline TODO comments clearly explain the transitional architecture and what remains to be done, which makes the dual-path (legacy + controller-runtime) easier to reason about.

Recommendations

  1. Verify go.mod specifies go 1.20+ or adjust the double-%w pattern (Critical #1).
  2. Add tests for the new errgroup stages before this merges — especially the error classification paths (Major #2).
  3. Remove or thread the unused appConfig parameter in ReconcilePendingSession (Major #3).
  4. Open a follow-up issue to thread ctx through the helper functions (Minor #4) — the existing TODO is a good start but an issue ensures it's tracked.

Overall this is a well-reasoned, focused performance PR that follows established operator patterns. Items 1–3 should be addressed before merge; the rest can be tracked as follow-ups.

🤖 Reviewed by Claude Code (claude-sonnet-4-6)


