Skip to content

Enable TLS on existing Running SeiNodes via in-place plan #262

@bdchatham

Description

@bdchatham

Problem

#258 landed the externalized-cert + immutable-spec model: spec.sidecar.tls is fixed at creation, and toggling TLS on an existing SeiNode requires delete + recreate with PVC retention (LLD §5). The deferral was deliberate at ~10 nodes — but the operational pain is real per-node, and the externalized-cert foundation #258 built actually makes in-place enable much simpler than the original PR #254 attempt was.

LLD §6 explicitly flagged this as "re-open as a separate design if the fleet scale or operational model changes." This issue is that reopen.

Why this is easier than PR #254 was

The original in-place toggle PR was rejected because:

  1. Controller was minting cert-manager Certificate resources → coupled cert lifecycle to plan lifecycle (drift detector + observer + wait task + cert-manager-Ready terminal logic).
  2. IssuerName/IssuerKind were operator-spec-supplied → security gap (Sidecar TLS trust-anchor hardening: pin Issuer + pod-attestation observer #251).
  3. classifyPlan task-list-sniffing got brittle as new plan types accumulated.

Under #258's foundation, all three of those are dissolved:

  • Cert is operator-provisioned externally; controller doesn't create it.
  • No Issuer fields in spec; trust-anchor selection lives entirely outside the controller's blast radius.
  • The preflight already validates the cert; the plan can rely on SidecarTLSSecretReady=True as a precondition rather than reproducing validation logic inline.

The residual surface is a focused planner extension plus one transport-mode consistency fix.

Proposed approach

  1. Relax CEL. Allow only the tls: nil → tls: {secretName: X} transition. Still block disable (set → nil) and swap (X → Y); those are the genuinely irreversible operations that should require delete + recreate. The current rule:

    (!has(oldSelf.sidecar) || !has(oldSelf.sidecar.tls)) ? (!has(self.sidecar) || !has(self.sidecar.tls)) : (...equality...)
    

    becomes:

    (!has(oldSelf.sidecar) || !has(oldSelf.sidecar.tls)) ? true : (has(self.sidecar) && has(self.sidecar.tls) && self.sidecar.tls == oldSelf.sidecar.tls)
    

    First branch flips from "require self also has no TLS" to "accept anything" — that's the enable case. Second branch unchanged.

  2. Reintroduce Status.CurrentSidecarTLS mirror. Just {SecretName string} now — way simpler than PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254's {IssuerName, IssuerKind} struct. Source of truth for what TLS state the live pod is actually serving (vs. what spec wants).

  3. Drift detector. Narrow trigger: Spec.Sidecar.TLS != nil && Status.CurrentSidecarTLS == nil && ConditionSidecarTLSSecretReady == True. Only fires when (a) operator added TLS post-creation, (b) preflight already validated the Secret. Disable / swap remain CEL-blocked so the detector never sees those cases.

  4. Plan composition. Extend buildNodeUpdatePlan to compose ApplyRBACProxyConfig → ApplyStatefulSet → ApplyService → ReplacePod → ObserveSidecarTLS → MarkReady when TLS-enable drift fires. Image drift and TLS-enable drift co-compose in a single pod cycle (same pattern PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 explored, but without the cert-creation pre-tasks).

  5. ObserveSidecarTLS task. Stamps Status.CurrentSidecarTLS = {SecretName: Spec.Sidecar.TLS.SecretName} after rollout convergence. Mirrors ObserveImage.

  6. Transport-mode source switch. This is the load-bearing fix that PR feat(planner): TLS toggle on Running SeiNodes via NodeUpdate plan #254 missed: SidecarURLForNode currently reads Spec.Sidecar.TLS to decide HTTP-vs-HTTPS. Mid-rollout, spec disagrees with the live pod, so controller→sidecar calls fail (controller tries HTTPS against an HTTP-only pod, or vice versa). Switch the function to read from Status.CurrentSidecarTLS so the controller's client uses whatever transport the rollout has actually landed.

  7. classifyPlan label. Add node-update-tls-enable for plans triggered by TLS-enable drift only; node-update (image-only) and the co-drift case (node-update-tls-enable+image-update?) follow.

  8. Tests. Drift-detector matrix, plan composition across (image-only / tls-enable-only / both), CEL allows-enable / blocks-disable / blocks-swap, transport-mode-source switch under each plan phase.

Estimated size: ~250–400 LOC.

Acceptance criteria

  • CEL relaxed; envtest confirms enable allowed, disable + swap rejected
  • Status.CurrentSidecarTLS field + DeepCopy regen
  • sidecarTLSEnableDrift detector in internal/planner/planner.go
  • ObserveSidecarTLS task + registry entry + paramsForTaskType entry
  • buildNodeUpdatePlan composes pre-tasks (RBAC ConfigMap) and observer when TLS drift present
  • SidecarURLForNode reads from status, not spec — verified by a test that exercises the mid-rollout state
  • classifyPlan emits the new label(s)
  • Unit tests for drift matrix + plan composition
  • envtest for CEL transitions
  • End-to-end on harbor: add spec.sidecar.tls.secretName to a Running SeiNode with a pre-provisioned Secret; pod cycles; serves on :8443

Out of scope (deferred)

  • Disable path (tls: set → nil). Still requires delete + recreate. Same as Sidecar TLS disable path on Running SeiNodes #250 was — reopen if it becomes load-bearing.
  • Swap path (tls: A → tls: B). Issuer-rotation use case; same reasoning as disable. Operationally rare.
  • SND-level rolling enable orchestrated by the parent controller. Once per-SeiNode enable works, the SND can drive it by editing the template; this issue does not address SND-level orchestration.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions