Problem
`buildRunningPlan` (planner.go:708) only builds a `NodeUpdate` plan when `spec.Image != status.CurrentImage`. The StatefulSets use the `OnDelete` update strategy (noderesource.go:204), so changes to pod template env, args, mounts, securityContext, or volumeMounts do not propagate to existing running pods unless the operator manually deletes the pods or bumps `spec.Image`.
Surfaced during cross-review of #234 (HOME env var rollout) by the platform-engineer agent.
Impact
Any controller upgrade that changes `PodTemplateSpec` without touching the image is silently a no-op on existing Running nodes. Examples that already hit this:
Relevant experts
- kubernetes-specialist (planner / plan-builder semantics)
- platform-engineer (rollout shape, operator playbook implications)
Proposed approach
Two options surfaced in review:
- Extend `buildRunningPlan` to detect pod template hash drift: hash the materialized pod template (after `buildSidecarMainContainer` etc.), persist the hash to `status.podTemplateHash` after a successful rollout, and compare against it on the next reconcile. On drift, build the same `NodeUpdate` plan shape (apply-statefulset → replace-pod → observe-image → mark-ready); `replace-pod` is already the right primitive.
- Introduce a new `RestartNode` plan type with a smaller task set (apply-statefulset → replace-pod → mark-ready, no observe-image since the image didn't change). Cleaner separation but more code.
Option 1 is the smaller change and reuses validated machinery. Option 2 is cleaner semantically.
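To make the drift check in the first option concrete, here is a minimal sketch using only the Go standard library. It is not the planner's actual code: `PodTemplate`, `hashPodTemplate`, and `needsNodeUpdate` are hypothetical names, the struct is a trimmed stand-in for the real materialized pod template, and treating an empty recorded hash as "no drift" is an assumption about how nodes that predate `status.podTemplateHash` might be handled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PodTemplate is a hypothetical, trimmed stand-in for the materialized
// pod template. The real type would be the full PodTemplateSpec.
type PodTemplate struct {
	Image string            `json:"image"`
	Env   map[string]string `json:"env,omitempty"`
	Args  []string          `json:"args,omitempty"`
}

// hashPodTemplate returns a hex-encoded SHA-256 of the template's JSON
// encoding. encoding/json writes map keys in sorted order, so the same
// template always yields the same hash.
func hashPodTemplate(t PodTemplate) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsNodeUpdate keeps the existing image comparison and adds a hash
// comparison on top of it.
func needsNodeUpdate(specImage, currentImage, desiredHash, recordedHash string) bool {
	if specImage != currentImage {
		return true
	}
	// Assumption: an empty recorded hash means the node predates the
	// status field, so it is not treated as drift here.
	if recordedHash == "" {
		return false
	}
	return desiredHash != recordedHash
}

func main() {
	running := PodTemplate{Image: "node:v1"}
	desired := PodTemplate{Image: "node:v1", Env: map[string]string{"HOME": "/home/app"}}

	recorded, err := hashPodTemplate(running)
	if err != nil {
		panic(err)
	}
	wanted, err := hashPodTemplate(desired)
	if err != nil {
		panic(err)
	}

	// Same image, different env: the image check alone would miss this.
	fmt.Println(needsNodeUpdate("node:v1", "node:v1", wanted, recorded)) // true
	fmt.Println(needsNodeUpdate("node:v1", "node:v1", recorded, recorded)) // false
}
```

How an empty recorded hash is handled matters on the first upgrade: treating it as drift would restart every existing node once, while treating it as "no drift" (as in this sketch) leaves nodes on their old template until the hash has been recorded.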
Acceptance criteria
Out of scope
- Re-architecting plan types beyond what's needed for pod template drift detection
- Per-node force-restart annotation as user-facing operator escape hatch (separate UX work)
References