perf(operator): parallelize secret reads, remove monitorPod goroutine…#656
perf(operator): parallelize secret reads, remove monitorPod goroutine…#656syntaxsdev wants to merge 5 commits intoambient-code:mainfrom
Conversation
syntaxsdev
commented
Feb 18, 2026
- Replace sequential secret/config reads with errgroup-based parallel fetches, reducing API round-trips when creating sessions under load
- Remove monitorPod goroutine-per-session pattern; controller-runtime reconciler handles pod status via its watch/requeue loop
- Eliminate redundant CR re-GET in ReconcilePendingSession; use the object already provided by the controller-runtime informer cache
- Switch gauge callbacks in otel_metrics.go to use the informer cache instead of a cluster-wide LIST against the API server every 30s
- Replace clearAnnotation GET+UPDATE (2 API calls) with a single JSON merge patch
|
hold off |
This comment has been minimized.
This comment has been minimized.
842bda9 to
1a7c0b5
Compare
Claude Code ReviewSummaryThis PR parallelizes operator secret reads with Issues by Severity🚫 Blocker IssuesNone identified. 🔴 Critical Issues1. Inactivity auto-stop is silently removed from the Running phase
Consequence: sessions will now run indefinitely regardless of The fix is straightforward — call 🟡 Major Issues2. Non-standard import grouping in
"time" // stdlib group ends here
"sync" // ← isolated stdlib import
"ambient-code-operator/internal/config"
3. Fragile error classification by string matching In if strings.Contains(err.Error(), "runner token") {
// set Failed status
}
return errThe error returned by the writeGroup goroutine is 4. Stale package comment in The package doc still describes 🔵 Minor Issues5. All // readGroup, readCtx := errgroup.WithContext(context.TODO())
// → all goroutines always return nil, readCtx is never cancelled
var wg sync.WaitGroup
wg.Add(N)
// ...
wg.Wait()6.
7. Pod-exists check moved before runner-secrets check changes error order Previously, a missing Positive Highlights
RecommendationsPriority order:
🤖 Generated with Claude Code 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
…s, use informer cache - Replace sequential secret/config reads with errgroup-based parallel fetches, reducing API round-trips when creating sessions under load - Remove monitorPod goroutine-per-session pattern; controller-runtime reconciler handles pod status via its watch/requeue loop - Eliminate redundant CR re-GET in ReconcilePendingSession; use the object already provided by the controller-runtime informer cache - Switch gauge callbacks in otel_metrics.go to use the informer cache instead of a cluster-wide LIST against the API server every 30s - Replace clearAnnotation GET+UPDATE (2 API calls) with a single JSON merge patch Validated with load-test-sessions.sh: 30 sessions reach Running in ~21s with 20 concurrent reconcilers, zero errors.
1a7c0b5 to
d761f07
Compare
Claude Code ReviewSummaryThis PR delivers valuable performance work: parallelizing operator secret reads with Files changed: 5 | Net: +285 / -470 lines One critical regression and three style/robustness issues require attention before merge. Issues by Severity🔴 Critical Issues1. Inactivity auto-stop is removed with no replacement
The Both functions are unexported ( Consequence: The // for agentic sessions. The operator's monitorPod loop calls shouldAutoStop()
// on each tick to check whether a Running session has exceeded its configured
// inactivity timeout, and triggerInactivityStop() to initiate the shutdown.Fix: Export 🟡 Major Issues2. In "time" // ← stdlib group ends here
"sync" // ← isolated (wrong group)
"ambient-code-operator/internal/config"
3. Fragile error classification by string matching in if err := writeGroup.Wait(); err != nil {
if strings.Contains(err.Error(), "runner token") { // ← brittle
statusPatch.SetField("phase", "Failed")
statusPatch.AddCondition(...)
_ = statusPatch.Apply()
}
return err
}The error text is constructed as A sentinel error or custom type would be more robust: var errTokenProvision = fmt.Errorf("token provision failed")
// In goroutine: return fmt.Errorf("%w: %v", errTokenProvision, err)
// In check: if errors.Is(err, errTokenProvision) { ... }4. Stale package-doc in The package comment references 🔵 Minor Issues5. All 6.
7. Missing status update for vertex secret copy failure If Positive Highlights
Recommendations (priority order)
Review performed using repository standards from 🤖 Generated with Claude Code 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Claude Code ReviewSummaryThis PR delivers meaningful performance wins for the operator under load: parallel secret reads with Issues by Severity🚫 Blocker Issues1.
When a running pod transitions to
And if the pod eventually gets garbage-collected (Kubernetes TTL), The fix is to add the pod-completion check at the top of // Handle terminal pod phases
if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
if err := handlers.UpdateSessionFromPodStatus(ctx, session, pod); err != nil {
return ctrl.Result{RequeueAfter: 5 * time.Second}, err
}
return ctrl.Result{}, nil
}The load test ( 🔴 Critical Issues2. Original error from In if err := regenerateRunnerToken(sessionNamespace, name, currentObj); err != nil {
return fmt.Errorf("failed to provision runner token for session %s: %w", name, ErrTokenProvision)
// ↑ actual err is never included — swallowed
}The actual error from return fmt.Errorf("%w for session %s: %v", ErrTokenProvision, name, err)This will preserve both the sentinel (for 🟡 Major Issues3. Pod watch predicate only fires on
With the old UpdateFunc: func(e event.TypedUpdateEvent[*corev1.Pod]) bool {
if !strings.HasSuffix(e.ObjectNew.Name, "-runner") {
return false
}
if e.ObjectOld.Status.Phase != e.ObjectNew.Status.Phase {
return true
}
// Also trigger on container waiting/terminated state changes
return containerStateChanged(e.ObjectOld, e.ObjectNew)
},🔵 Minor Issues4. func InitMetrics(ctx context.Context, c ...client.Client) (func(), error)The variadic signature was chosen to avoid a breaking change, but it hides an important initialization dependency. A named parameter with a 5.
6. Missing observability: "found integration secrets" log dropped The old code logged 7. Both Positive Highlights
RecommendationsPriority order:
🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Claude Code ReviewSummaryPR #656 is a focused operator performance improvement that eliminates goroutine-per-session monitoring, parallelizes secret reads/writes using errgroup, switches OTel gauge callbacks to the informer cache, and reduces clearAnnotation from 2 API calls to 1 via JSON merge-patch. The architectural direction (controller-runtime reconciler replacing goroutines) is correct. Several issues warrant attention before merge. Issues by SeverityBlocker IssuesNone. Critical Issues1. writeGroup error handling is inconsistent with its own comment (sessions.go:710-719) The comment says Vertex copy failure is fatal; token failure is fatal but only the ErrTokenProvision path updates session status to Failed. If copySecretToNamespace fails (e.g., RBAC denied), the session silently re-enters the retry loop with no user-visible Failed phase. Sessions with a persistent vertex secret permission problem will spin forever in Pending. Fix: Apply the same Failed status update for any non-retriable writeGroup error, or clearly separate retriable vs. non-retriable errors. Major Issues2. context.TODO() for both errgroups breaks cancellation (sessions.go:513, 653) handleAgenticSessionEvent is called from ReconcilePendingSession, which receives a context.Context from the controller-runtime reconciler. Using context.TODO() means operator shutdown or reconciler timeout will not cancel in-flight Kubernetes API calls during the parallel stages. Under a graceful shutdown this can delay pod termination. This is pre-existing in the function signature but newly amplified by the errgroup pattern. Fix: Thread the reconciler ctx through handleAgenticSessionEvent and use it as the base context for both errgroups. 3. runnerTokenCheckErr condition inverts semantics (sessions.go:696) errors.IsNotFound(runnerTokenCheckErr) is true only when the secret definitively does not exist, which is the expected path to provision. But the outer if !runnerTokenExists already fires on any non-nil error, including transient errors. If runnerTokenCheckErr is a transient error (not NotFound), the else-if branch just logs a warning and skips provisioning so pod creation proceeds without a runner token. Fix: Check errors.IsNotFound alongside !runnerTokenExists and return a retriable error for transient failures instead of silently proceeding. Minor Issues4. Discarded readGroup.Wait() return misleads linters (sessions.go:593) All goroutines intentionally return nil, so discarding the Wait() return is correct. However the blank identifier suppresses the linter on a line that does matter. Consider adding a nolint:errcheck comment for linter tooling clarity. 5. cachedClient package-level variable lacks synchronization (otel_metrics.go:30, 59-61) Safe in production (set once at startup, then read-only). In tests that call InitMetrics concurrently, the race detector will fire. Consider a sync.Once or passing the client directly to countSessionsByPhase to avoid the global state. 6. clearAnnotation now always makes an API call (helpers.go:230-238) The old implementation returned early if the annotation key was absent (avoiding a round-trip). The new merge-patch always sends a PATCH even when the annotation does not exist. Functionally correct and idempotent but generates unnecessary API traffic on steady-state reconcile paths. 7. Reconcile requeue interval for inactivity is 30s (reconcile_phases.go:355) If the configured inactivity timeout is short (e.g., 60s), sessions could overrun by up to 30s before being stopped. Consider documenting the expected detection precision in the inactivity timeout configuration. 8. Missing tests for new behaviour
inactivity_test.go was correctly updated; similar coverage would strengthen the parallel-fetch code. Positive Highlights
Recommendations
🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Claude Code ReviewSummaryThis PR delivers meaningful operator performance improvements: parallel secret/config reads via Issues by Severity🔴 Critical Issues1. Root cause swallowed in token provision error ( The actual error from return fmt.Errorf("failed to provision runner token for session %s: %w", name, ErrTokenProvision)It wraps the sentinel but not return fmt.Errorf("failed to provision runner token for session %s: %w: %w", name, ErrTokenProvision, err)or log the original error before returning. 2. Transient API errors on runner-secrets check treated as "secret missing" ( When if !runnerSecretsExist {
if runnerSecretsCheckErr != nil && !errors.IsNotFound(runnerSecretsCheckErr) {
// Transient — let the reconciler retry
return fmt.Errorf("transient error checking runner secret %s: %w", runnerSecretsName, runnerSecretsCheckErr)
}
// Definitively missing — fail the session
statusPatch.AddCondition(...)
_ = statusPatch.Apply()
return fmt.Errorf("runner secret %s missing in namespace %s", runnerSecretsName, sessionNamespace)
}🟡 Major Issues3.
4. The variadic 5. Silent write cancellation when
:blue_circle: Minor Issues6. Integration-secrets "found" log removed ( The original code logged 7. obj, err := config.DynamicClient.Resource(gvr).Namespace(namespace).Get(context.TODO(), name, v1.GetOptions{})Now that this function is called from the controller reconcile loop that provides a real context, threading it through would be consistent. 8. The comment if err := readGroup.Wait(); err != nil {
log.Printf("Unexpected error from readGroup: %v", err) // should not happen
}Positive Highlights
RecommendationsPriority order for pre-merge fixes:
🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |
Claude Code ReviewSummaryThis PR delivers meaningful operator performance improvements through four complementary changes: parallelizing secret/config reads with Issues by Severity🚫 Blocker IssuesNone. 🔴 Critical Issues1. Double
return fmt.Errorf("failed to provision runner token for session %s: %w: %w", name, ErrTokenProvision, err)In Go 1.13–1.19, Recommendation: Verify return fmt.Errorf("%w: failed to provision runner token for session %s: %w", ErrTokenProvision, name, err)…keeping 🟡 Major Issues2. No new tests for the parallel read/write paths The most complex new behavior — the two Recommendation: Add table-driven tests covering at minimum:
3.
func ReconcilePendingSession(ctx context.Context, session *unstructured.Unstructured, appConfig *config.Config) error {
return handleAgenticSessionEvent(ctx, session) // appConfig is dropped
}Inside Recommendation: Either pass 🔵 Minor Issues4.
The TODO comment in the file acknowledges this is transitional, but it's worth tracking. Consider a follow-up issue to thread 5. Confusing no-op statements on unreachable paths
_ = "Pending" // phase reset handled by status updateThese no-op assignments are remnants of earlier refactoring and convey no information. They're harmless but add confusion when reading the Creating-phase recovery logic. 6. Mutex scope includes
mu.Lock()
integrationSecretsExist = (err == nil)
mu.Unlock()
if err != nil && !errors.IsNotFound(err) {
log.Printf("Error checking for %s secret in %s: %v", integrationSecretsName, sessionNamespace, err)
}This is correct — the log is outside the lock. Just noting it's consistent with the other goroutines (runner token goroutine logs while holding the lock on one read path). Minor style inconsistency, no functional issue. 7.
var cachedClient client.ClientThis package-level global is set once in production, but if metrics tests ever call 8.
otlpmetricgrpc.WithInsecure(), // Use TLS in productionThis comment has existed in the codebase, but it's worth flagging: if this code is deployed to production with Positive Highlights
Recommendations
Overall this is a well-reasoned, focused performance PR that follows established operator patterns. Items 1–3 should be addressed before merge; the rest can be tracked as follow-ups. 🤖 Reviewed by Claude Code (claude-sonnet-4-6) 🔍 View AI decision process (logs available for 90 days) 📋 View memory system files loaded (click to expand)What Amber Loaded for Code ReviewAmber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines. |