Trigger NodeUpdate plan on pod template hash drift (not just spec.image) #235

@bdchatham

Problem

`buildRunningPlan` (planner.go:708) only builds a `NodeUpdate` plan when `spec.Image != status.CurrentImage`. StatefulSets use the `OnDelete` update strategy (noderesource.go:204), so the StatefulSet controller never replaces pods on its own: changes to pod template env, args, mounts, securityContext, or volumeMounts do not reach already-running pods unless the operator manually deletes them or bumps `spec.Image`.

Surfaced during cross-review of #234 (HOME env var rollout) by the platform-engineer agent.

Impact

Any controller upgrade that changes `PodTemplateSpec` without touching the image is a silent no-op on existing Running nodes. Examples that already hit this:

Relevant experts

  • kubernetes-specialist (planner / plan-builder semantics)
  • platform-engineer (rollout shape, operator playbook implications)

Proposed approach

Two options surfaced in review:

  1. Extend `buildRunningPlan` to detect pod template hash drift. Hash the materialized pod template after `buildSidecarMainContainer` etc., persist to `status.podTemplateHash` after a successful rollout, drift-detect on next reconcile. Build the same `NodeUpdate` plan shape (apply-statefulset → replace-pod → observe-image → mark-ready) — `replace-pod` is already the right primitive.
  2. Introduce a new `RestartNode` plan type with a smaller task set (apply-statefulset → replace-pod → mark-ready, no observe-image since the image didn't change). Cleaner separation but more code.

Option 1 is the smaller change and reuses validated machinery. Option 2 is cleaner semantically.
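
A minimal sketch of the drift check described in option 1, under stated assumptions: `PodTemplate` and `NodeStatus` are hypothetical, trimmed-down stand-ins (the real code would hash the materialized `PodTemplateSpec`), and `hashPodTemplate` / `needsNodeUpdate` are illustrative names, not functions that exist in planner.go. The hash here is SHA-256 over the JSON encoding, which is one reasonable choice rather than the project's decided one.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// PodTemplate is a hypothetical stand-in for the materialized pod template.
type PodTemplate struct {
	Image string            `json:"image"`
	Args  []string          `json:"args,omitempty"`
	Env   map[string]string `json:"env,omitempty"`
}

// NodeStatus carries only the two status fields discussed in this issue.
type NodeStatus struct {
	CurrentImage    string
	PodTemplateHash string
}

// hashPodTemplate returns a hex SHA-256 digest of the template's JSON form.
// encoding/json emits map keys in sorted order, so equal templates hash equally.
func hashPodTemplate(t PodTemplate) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsNodeUpdate keeps the existing image comparison and adds the hash
// comparison, so either kind of change would trigger a NodeUpdate plan.
func needsNodeUpdate(desired PodTemplate, st NodeStatus) (bool, error) {
	if desired.Image != st.CurrentImage {
		return true, nil
	}
	h, err := hashPodTemplate(desired)
	if err != nil {
		return false, err
	}
	return h != st.PodTemplateHash, nil
}

func main() {
	tmpl := PodTemplate{Image: "example/node:v1", Env: map[string]string{"LOG_LEVEL": "info"}}
	h, err := hashPodTemplate(tmpl)
	if err != nil {
		panic(err)
	}
	// Status as it would look after a successful rollout of tmpl.
	st := NodeStatus{CurrentImage: tmpl.Image, PodTemplateHash: h}

	// Change only an env var; the image stays the same.
	tmpl.Env = map[string]string{"LOG_LEVEL": "debug"}
	drift, err := needsNodeUpdate(tmpl, st)
	if err != nil {
		panic(err)
	}
	fmt.Println("needs NodeUpdate:", drift) // prints "needs NodeUpdate: true"
}
```

One point the sketch leaves open: already-running nodes will have an empty `status.podTemplateHash` on the first reconcile after this lands, and as written that counts as drift. Whether to backfill the hash or accept a one-time roll is a rollout decision.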

Acceptance criteria

  • Pod template change without image change builds a plan that rolls existing pods
  • No regression on actual image rollouts (currentImage tracking still works)
  • Status surface (`status.podTemplateHash` or equivalent) is documented
  • PR 3 (`/sei` → `/.sei` flip) blocked on this landing first

Out of scope

  • Re-architecting plan types beyond what's needed for pod template drift detection
  • Per-node force-restart annotation as user-facing operator escape hatch (separate UX work)
