Problem
buildRunningPlan (internal/planner/planner.go:708-716) detects only image drift and sidecar re-approval. Adding spec.sidecar.tls to a Running SeiNode is silently ignored. The TLS provisioning tasks (ApplySidecarCert, ApplyRBACProxyConfig) live only in buildBasePlan (internal/planner/planner.go:531-535) — they run once, during Pending → Running init. The doc comment on SidecarConfig.TLS (api/v1alpha1/common_types.go:195-198) calls this "Init-only" — accurate today, but by accident rather than by design.
The only path to enable TLS on a Running node today is kubectl delete seinode <name> per child to force the SND to recreate it. Operationally hostile for fleet rollouts.
Impact
Blocks the prod TLS rollout. The per-chain CA issuers that landed in sei-protocol/platform#545 establish the trust hierarchy; this controller gap prevents wiring SeiNodes to them without a manual per-pod delete dance. The arctic-1, atlantic-2, and pacific-1 archive + syncer + node fleets all need TLS enabled; manual delete-and-recreate per child does not scale and surrenders the SND's orchestration.
Relevant experts
kubernetes-specialist — plan task ordering, cert-manager async race, single-patch reconcile model
platform-engineer — developer ergonomics (spec edit is the contract)
Proposed approach
Design doc: docs/design-seinode-sidecar-tls-toggle-lld.md (in repo). One-page LLD; not duplicated here.
Summary:
- Extend buildRunningPlan to compute both imageDrift and tlsDrift flags and dispatch into a single buildNodeUpdatePlan(node, imageDrift, tlsDrift)
- The pod-cycle middle (ApplyStatefulSet → ApplyService → ReplacePod) is shared. The plan conditionally prepends ApplySidecarCert → WaitForSidecarTLSSecret → ApplyRBACProxyConfig on TLS drift, and conditionally appends ObserveImage / ObserveSidecarTLS before MarkReady
- New task WaitForSidecarTLSSecret — load-bearing; without it the pod crash-loops on the volume mount while cert-manager issues the certificate asynchronously
- One condition (ConditionNodeUpdateInProgress) covers both image rollout and TLS toggle; reason discriminates: UpdateStarted / TLSToggleStarted / UpdateAndTLSToggleStarted
- New status field currentSidecarTLS *SidecarTLSStatus ({issuerName, issuerKind}) — mirrors the currentImage pattern
- Co-drift case (image + TLS in the same edit) is handled in one plan and one pod cycle; both observers stamp before the plan terminates
- Drop the now-incorrect "Init-only" doc comment on SidecarConfig.TLS in the same PR
- Enable-only for this PR; the disable path (tls: set → nil) is deferred to a follow-up
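The dispatch shape described above can be sketched with plain stand-ins. Task names and reason strings come from this summary; the Task type, slice-of-tasks plan, and function signatures are hypothetical simplifications of the real planner, not its actual API:

```go
package main

import "fmt"

// Task is a simplified stand-in for the planner's task representation.
type Task string

const (
	ApplySidecarCert        Task = "ApplySidecarCert"
	WaitForSidecarTLSSecret Task = "WaitForSidecarTLSSecret"
	ApplyRBACProxyConfig    Task = "ApplyRBACProxyConfig"
	ApplyStatefulSet        Task = "ApplyStatefulSet"
	ApplyService            Task = "ApplyService"
	ReplacePod              Task = "ReplacePod"
	ObserveImage            Task = "ObserveImage"
	ObserveSidecarTLS       Task = "ObserveSidecarTLS"
	MarkReady               Task = "MarkReady"
)

// buildNodeUpdatePlan: TLS pre-tasks are conditionally prepended, the
// pod-cycle middle is always shared, and one observer per drift kind is
// appended before MarkReady. Co-drift yields one plan and one pod cycle.
func buildNodeUpdatePlan(imageDrift, tlsDrift bool) []Task {
	var plan []Task
	if tlsDrift {
		plan = append(plan, ApplySidecarCert, WaitForSidecarTLSSecret, ApplyRBACProxyConfig)
	}
	plan = append(plan, ApplyStatefulSet, ApplyService, ReplacePod) // shared middle
	if imageDrift {
		plan = append(plan, ObserveImage)
	}
	if tlsDrift {
		plan = append(plan, ObserveSidecarTLS)
	}
	return append(plan, MarkReady)
}

// nodeUpdateReason discriminates the single ConditionNodeUpdateInProgress
// condition by reason, so dashboards can tell the rollouts apart.
func nodeUpdateReason(imageDrift, tlsDrift bool) string {
	switch {
	case imageDrift && tlsDrift:
		return "UpdateAndTLSToggleStarted"
	case tlsDrift:
		return "TLSToggleStarted"
	default:
		return "UpdateStarted"
	}
}

func main() {
	fmt.Println(buildNodeUpdatePlan(true, true))
	fmt.Println(nodeUpdateReason(false, true))
}
```

The point of the sketch is that both drift kinds collapse into one plan shape and one condition, rather than two parallel plan types.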
Acceptance criteria
- status.currentSidecarTLS field + SidecarTLSStatus type added; CRD regenerated via make manifests generate
- buildRunningPlan computes imageDrift + tlsDrift flags; both feed buildNodeUpdatePlan
- sidecarTLSDrift returns false when spec.sidecar.tls == nil (enable-only); matrix unit tests cover spec/current/issuer combinations
- buildNodeUpdatePlan(node, imageDrift, tlsDrift) conditionally prepends cert pre-tasks and conditionally appends ObserveImage / ObserveSidecarTLS
- nodeUpdateReason returns one of UpdateStarted / TLSToggleStarted / UpdateAndTLSToggleStarted
- TaskTypeWaitForSidecarTLSSecret + TaskTypeObserveSidecarTLS implemented; both registered in the internal/task/task.go deserializer map
- WaitForSidecarTLSSecret polls for a non-empty tls.crt; returns a transient error until ready
- ObserveSidecarTLS stamps status.currentSidecarTLS to match spec.sidecar.tls
- classifyPlan recognizes TLS-only and TLS+image plans for metrics labels
- Single Status().Patch with MergeFromWithOptimisticLock{} at reconcile end
- Toggling SidecarConfig.TLS on a Running node converges with the proxy serving on :8443; subsequent reconciles are no-ops
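The WaitForSidecarTLSSecret readiness gate described above can be sketched as a pure check. The secretData map models corev1.Secret.Data; the function name, errTransient sentinel, and transient-error convention are hypothetical simplifications of the real task interface:

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient stands in for whatever requeue-and-retry signal the
// planner's task runner uses.
var errTransient = errors.New("sidecar TLS secret not ready; requeue")

// checkSidecarTLSSecret gates the pod cycle: the Secret must carry a
// non-empty tls.crt before the StatefulSet mounts it, because
// cert-manager issues the certificate asynchronously. Without this gate
// the pod crash-loops on the volume mount.
func checkSidecarTLSSecret(secretData map[string][]byte) error {
	if len(secretData["tls.crt"]) == 0 {
		return errTransient // transient: reconcile retries until issuance completes
	}
	return nil
}

func main() {
	fmt.Println(checkSidecarTLSSecret(map[string][]byte{"tls.crt": []byte("PEM bytes")}))
	fmt.Println(checkSidecarTLSSecret(nil))
}
```

Returning a transient error (rather than failing the plan) keeps the toggle eventually consistent with cert-manager's issuance latency.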
Out of scope
- Disable path (tls: set → nil): adds DeleteSidecarCert + DeleteRBACProxyConfig cleanup tasks and a spec=nil, current!=nil branch in sidecarTLSDrift. ~30 LOC. Deferred to a follow-up issue. Risk: spec.sidecar.tls = nil on a TLS-enabled Running node is silently ignored until SeiNode delete cascades the Cert + ConfigMap. Tracked.
- Generalizing the drift detector to other spec.sidecar subfields (Image, Port, Resources). Today none of them triggers drift handling. The pattern is a tlsDrift-style flag — no plan-shape changes needed when added.
- Cert rotation as an explicit feature. Issuer swap is covered organically by the drift detector (an IssuerName change triggers a NodeUpdate plan).
- SND-level maxUnavailable tuning for fleet rollouts. SND orchestration already gates concurrent child reconciles — confirmed in design doc §7; operator validation is a deploy-time concern, not a controller code change.
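The enable-only boundary above (spec nil never drifts, even when status says TLS is on) can be made concrete. The SidecarTLS struct and sidecarTLSDrift signature are hypothetical stand-ins for the real api/v1alpha1 types and planner helper:

```go
package main

import "fmt"

// SidecarTLS is a simplified stand-in for the spec/status TLS shapes
// ({issuerName, issuerKind}).
type SidecarTLS struct {
	IssuerName string
	IssuerKind string
}

// sidecarTLSDrift compares desired (spec.sidecar.tls) against observed
// (status.currentSidecarTLS). Enable-only for this PR: a nil spec never
// drifts, so tls: set → nil is silently ignored until the follow-up
// adds the disable branch and cleanup tasks.
func sidecarTLSDrift(spec, current *SidecarTLS) bool {
	if spec == nil {
		return false // enable-only: no disable path yet
	}
	if current == nil {
		return true // TLS newly requested on a Running node
	}
	return *spec != *current // e.g. an issuer swap triggers a NodeUpdate plan
}

func main() {
	fmt.Println(sidecarTLSDrift(&SidecarTLS{IssuerName: "chain-ca"}, nil))
}
```

The same comparison is what makes cert rotation via issuer swap fall out for free, per the out-of-scope note above.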
Design choice: one plan, one condition
A parallel SidecarReprovision plan + a separate SidecarTLSToggleInProgress condition were considered and rejected. The kube-rbac-proxy container shares a pod with seid; there is no "cycle just the sidecar" path. Both plans would share the entire pod-cycle middle, and the implicit co-drift handling ("image takes precedence, regenerates StatefulSet with proxy") leaves status.currentSidecarTLS unstamped, triggering a redundant second pod cycle. Dashboard isolation between image rollout and TLS toggle is delivered via condition.reason. See LLD §0.1.
References
- docs/design-seinode-sidecar-tls-toggle-lld.md — full LLD with code shapes
- sei-protocol/platform#545 — per-chain internal CA issuers (now merged); establishes the prod trust hierarchy this work consumes
- seictl#165 — original "Init-only TLS toggle" gap report
- internal/planner/planner.go:708-716 — buildRunningPlan, the function to extend
- internal/planner/planner.go:743-770 — buildNodeUpdatePlan, the function to extend with drift flags