Skip to content

Sidecar TLS toggle on Running SeiNodes — extend drift detection + reprovision plan #247

@bdchatham

Description

@bdchatham

Problem

buildRunningPlan (internal/planner/planner.go:708-716) detects only image drift + sidecar re-approval. Adding spec.sidecar.tls to a Running SeiNode is silently ignored. The TLS provisioning tasks (ApplySidecarCert, ApplyRBACProxyConfig) live only in buildBasePlan (internal/planner/planner.go:531-535) — they run once, during Pending → Running init. The doc comment on SidecarConfig.TLS (api/v1alpha1/common_types.go:195-198) calls this "Init-only", today truthfully but unintendedly.

The only path to enable TLS on a Running node today is kubectl delete seinode <name> per child to force the SND to recreate it. Operationally hostile for fleet rollouts.

Impact

Blocks the prod TLS rollout. The per-chain CA issuers landed in sei-protocol/platform#545 establish the trust hierarchy; this controller gap prevents wiring SeiNodes to them without a manual per-pod delete dance. arctic-1, atlantic-2, pacific-1 archive + syncer + node fleets all need TLS enabled; manual delete-and-recreate per child does not scale and surrenders the SND's orchestration.

Relevant experts

  • kubernetes-specialist — plan task ordering, cert-manager async race, single-patch reconcile model
  • platform-engineer — developer ergonomics (spec edit is the contract)

Proposed approach

Design doc: docs/design-seinode-sidecar-tls-toggle-lld.md (in repo). One-page LLD; not duplicated here.

Summary:

  • Extend buildRunningPlan to compute both imageDrift and tlsDrift flags and dispatch into a single buildNodeUpdatePlan(node, imageDrift, tlsDrift)
  • The pod-cycle middle (ApplyStatefulSet → ApplyService → ReplacePod) is shared. The plan conditionally prepends ApplySidecarCert → WaitForSidecarTLSSecret → ApplyRBACProxyConfig when TLS drift, and conditionally appends ObserveImage / ObserveSidecarTLS before MarkReady
  • New task WaitForSidecarTLSSecret — load-bearing; without it the pod crash-loops on volume mount while cert-manager issues async
  • One condition (ConditionNodeUpdateInProgress) covers both image rollout and TLS toggle. reason discriminates: UpdateStarted / TLSToggleStarted / UpdateAndTLSToggleStarted
  • New status field currentSidecarTLS *SidecarTLSStatus ({issuerName, issuerKind}) — mirrors the currentImage pattern
  • Co-drift case (image + TLS in same edit) handled in one plan, one pod cycle, both observers stamp before plan terminates
  • Drop the now-incorrect "Init-only" doc comment on SidecarConfig.TLS in the same PR
  • Enable-only for this PR; disable path (tls: set → nil) deferred to a follow-up

Acceptance criteria

  • status.currentSidecarTLS field + SidecarTLSStatus type added; CRD regenerated via make manifests generate
  • buildRunningPlan computes imageDrift + tlsDrift flags; both feed buildNodeUpdatePlan
  • sidecarTLSDrift returns false when spec.sidecar.tls == nil (enable-only); matrix unit tests cover spec/current/issuer combinations
  • buildNodeUpdatePlan(node, imageDrift, tlsDrift) conditionally prepends cert pre-tasks and conditionally appends ObserveImage / ObserveSidecarTLS
  • nodeUpdateReason returns one of UpdateStarted / TLSToggleStarted / UpdateAndTLSToggleStarted
  • TaskTypeWaitForSidecarTLSSecret + TaskTypeObserveSidecarTLS implemented; both registered in internal/task/task.go deserializer map
  • WaitForSidecarTLSSecret polls for non-empty tls.crt; returns transient error until ready
  • ObserveSidecarTLS stamps status.currentSidecarTLS to match spec.sidecar.tls
  • classifyPlan recognizes TLS-only and TLS+image plans for metrics labels
  • Single-patch status flush still holds — all new mutations accumulate in-memory, one Status().Patch with MergeFromWithOptimisticLock{} at reconcile end
  • "Init-only" doc comment removed from SidecarConfig.TLS
  • Unit tests: drift detector matrix, plan composition across (image-only, tls-only, both), idempotency mid-flight, condition reason discrimination
  • envtest: full reconcile from drift detection to condition clear
  • End-to-end on harbor: enable TLS on a running SeiNode via spec edit; pod cycles; serves on :8443; subsequent reconciles are no-ops

Out of scope

  • Disable path (tls: set → nil): adds DeleteSidecarCert + DeleteRBACProxyConfig cleanup tasks and a spec=nil, current!=nil branch in sidecarTLSDrift. ~30 LOC. Defer to a follow-up issue. Risk: spec.sidecar.tls = nil on a TLS-enabled Running node is silently ignored until SeiNode delete cascades the Cert + ConfigMap. Tracked.
  • Generalize drift detector to other spec.sidecar subfields (Image, Port, Resources). Today none triggers drift handling. Pattern is a tlsDrift-style flag — no plan-shape changes needed when added.
  • Cert rotation as an explicit feature. Issuer swap is covered organically by the drift detector (IssuerName change triggers a NodeUpdate plan).
  • SND-level maxUnavailable tuning for fleet rollouts. SND orchestration already gates concurrent child reconciles — confirmed in design doc §7, but operator validation is a deploy-time concern, not a controller code change.

Design choice: one plan, one condition

A parallel SidecarReprovision plan + a separate SidecarTLSToggleInProgress condition were considered and rejected. The kube-rbac-proxy container shares a pod with seid; there is no "cycle just the sidecar" path. Both plans would share the entire pod-cycle middle, and the implicit co-drift handling ("image takes precedence, regenerates StatefulSet with proxy") leaves status.currentSidecarTLS unstamped, triggering a redundant second pod cycle. Dashboard isolation between image rollout and TLS toggle is delivered via condition.reason. See LLD §0.1.

References

  • docs/design-seinode-sidecar-tls-toggle-lld.md — full LLD with code shapes
  • sei-protocol/platform#545 — per-chain internal CA issuers (now merged); establishes the prod trust hierarchy this work consumes
  • seictl#165 — original "Init-only TLS toggle" gap report
  • internal/planner/planner.go:708-716buildRunningPlan, the function to extend
  • internal/planner/planner.go:743-770buildNodeUpdatePlan, the function to extend with drift flags

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions