test(helm): add Helm e2e harness with lint matrix and label-gated CI #1159

Open
TaylorMutch wants to merge 8 commits into main from tmutch/kube-e2e


@TaylorMutch TaylorMutch commented May 4, 2026

Summary

  • Adds tasks/scripts/helm-e2e.sh and e2e:helm* mise tasks that bootstrap a k3d cluster, build images via docker buildx, deploy via Helm, and run the existing Rust and Python e2e suites against the Kubernetes compute driver
  • Consolidates Helm values overlays from the chart root into deploy/helm/openshell/ci/ and adds a helm:lint matrix that validates the chart against every ci/values-*.yaml variant (cert-manager, gateway, keycloak, skaffold, tls-disabled)
  • Adds helm-lint.yml workflow that runs the lint matrix on PRs touching deploy/helm/**
  • Adds branch-helm-e2e.yml — a label-gated workflow (test:e2e-helm) that runs Helm E2E (rust) and Helm E2E (python) as parallel jobs, and wires the gate into e2e-gate.yml and e2e-label-help.yml

Related Issue

Builds on #1158 (k3d local-dev environment). Subsumes the previously-stacked PR #1162 (label-gated CI workflow), now folded into this branch.

Changes

Helm e2e harness and tasks

  • tasks/scripts/helm-e2e.sh — new script: preflight → reuse/create k3d cluster → docker build gateway+supervisor → k3d image import → helm upgrade --install → wait for PKI secrets → port-forward → register gateway → poll health → run suites → cleanup trap
  • tasks/helm.toml — helm:lint expanded to loop over all ci/values-*.yaml; e2e:helm, e2e:helm:rust, e2e:helm:python, e2e:helm:cert-manager tasks added
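The "wait for PKI secrets" and "poll health" steps in the harness both come down to retrying a command until it succeeds or a budget runs out. The following is a minimal sketch of that pattern, not the code in helm-e2e.sh; the helper name `wait_for` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Success on any attempt ends the wait immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

A call such as `wait_for 60 2 kubectl -n "$NAMESPACE" get secret "$SECRET_NAME"` would then cover the secret wait, where `NAMESPACE` and `SECRET_NAME` are placeholders for whatever the chart actually creates.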

Chart layout

  • deploy/helm/openshell/ci/ — new directory; values-skaffold.yaml, values-cert-manager.yaml, values-gateway.yaml, values-keycloak.yaml moved here from the chart root; values-tls-disabled.yaml added for lint coverage
  • deploy/helm/openshell/.helmignore — simplified to ci/ wildcard
  • deploy/helm/openshell/skaffold.yaml — updated valuesFiles paths to ci/
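With the overlays collected under ci/, the lint matrix reduces to one `helm lint` call per values file. A rough sketch of such a loop follows; it is an illustration, not the task body in tasks/helm.toml. The function name `lint_matrix` and the `HELM` override variable are hypothetical, and the override exists only so the loop can be dry-run without Helm installed.

```shell
#!/usr/bin/env bash
# Sketch: lint a chart once per ci/values-*.yaml overlay.
# HELM is a hypothetical override for dry runs (for example HELM=echo).
lint_matrix() {
  local chart_dir=$1
  local helm_bin=${HELM:-helm}
  local values failed=0 found=0
  for values in "$chart_dir"/ci/values-*.yaml; do
    # An unmatched glob stays literal; skip it.
    [ -e "$values" ] || continue
    found=1
    echo "==> lint with $(basename "$values")"
    # Keep going after a failure so every variant gets reported.
    "$helm_bin" lint "$chart_dir" -f "$values" || failed=1
  done
  if [ "$found" -eq 0 ]; then
    echo "no ci/values-*.yaml overlays under $chart_dir" >&2
    return 1
  fi
  return "$failed"
}
```

Running `lint_matrix deploy/helm/openshell` would lint every overlay and return non-zero if any variant fails, which is the behavior a CI job needs.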

CI workflows

  • .github/workflows/helm-lint.yml — new workflow: triggers on deploy/helm/** path changes, runs mise run helm:lint in the CI container
  • .github/workflows/branch-helm-e2e.yml — new label-gated workflow: gates on test:e2e-helm, runs Helm E2E (rust) and Helm E2E (python) as parallel jobs on linux-amd64-cpu8 (60-min timeout each); privileged container with Docker socket for k3d
  • .github/workflows/e2e-gate.yml — adds Branch Helm E2E to the workflow_run trigger and a helm-e2e gate check
  • .github/workflows/e2e-label-help.yml — extends label handling to post the correct next-step comment when test:e2e-helm is applied

Docs

  • .agents/skills/helm-dev-environment/SKILL.md — updated paths and added helm-e2e.sh to the key files table

Design notes

  • No separate image build jobs in CI: helm-e2e.sh builds gateway and supervisor images internally via docker buildx build --load and imports them into k3d — simpler than the branch-e2e.yml pattern
  • mise install --locked provisions k3d, helm, and kubectl from mise.toml; no CI image changes needed
  • git config safe.directory is required because helm-e2e.sh calls git rev-parse to derive the default cluster name, and GHA container user/UID mismatch causes git to refuse the workspace otherwise
  • Cluster names must fit k3d's 32-character limit: workflow uses helm-e2e-${run_id}-{rust,python} and the local-dev derivation truncates the branch suffix to 18 chars, leaving headroom under the limit
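The local-dev name derivation described in the last note can be sketched as below. This is an assumption-laden illustration: the `helm-e2e-` prefix for local runs, the sanitizing rules, and the function name `derive_cluster_name` are all hypothetical; only the 32-character limit and the 18-character suffix cut come from the notes above.

```shell
#!/usr/bin/env bash
# Sketch: derive a k3d-safe cluster name from a branch name.
# The default prefix and the sanitizing rules are assumptions.
K3D_NAME_MAX=32

derive_cluster_name() {
  local branch=$1 prefix=${2:-helm-e2e-}
  local suffix
  # Lowercase, then replace anything outside [a-z0-9] with '-'.
  suffix=$(printf '%s' "$branch" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9' '-')
  # Keep at most 18 characters of the branch-derived suffix.
  suffix=${suffix:0:18}
  # Drop trailing dashes left by the cut.
  while [ "${suffix%-}" != "$suffix" ]; do suffix=${suffix%-}; done
  local name="${prefix}${suffix}"
  if [ "${#name}" -gt "$K3D_NAME_MAX" ]; then
    echo "cluster name '$name' exceeds $K3D_NAME_MAX characters" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

One plausible call is `derive_cluster_name "$(git rev-parse --abbrev-ref HEAD)"`; the exact `git rev-parse` invocation the script uses is not stated here, so that form is a guess. Inside a GitHub Actions container, `git config --global --add safe.directory "$GITHUB_WORKSPACE"` would need to run first for the reason given in the safe.directory note.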

Testing

  • mise run helm:lint — 6 variants, all passing locally
  • HELM_E2E_KEEP_CLUSTER=1 mise run e2e:helm:rust — full Rust suite passes against a fresh k3d cluster
  • helm-lint.yml workflow verified firing on GHA (PR #1160, "ci: add helm lint workflow triggered on helm chart changes")
  • Apply test:e2e-helm to this PR and verify the label-help comment posts correctly
  • Verify Branch Helm E2E fires and both Helm E2E (rust) and Helm E2E (python) jobs run to green
  • Verify E2E Gate posts a Helm E2E check that goes green once both jobs pass
  • Tests that rely on host.openshell.internal host-network access (graphql_l7, forward_proxy_l7_bypass allow, host_gateway_alias reach/inference tests) are skipped on the Kubernetes path; these require Docker-native networking not available in k3d pods

Checklist

  • Follows conventional commits format
  • mise run helm:lint passes
  • Helm e2e Rust suite passes locally
  • GHA helm lint workflow verified
  • No secrets or credentials committed
  • Skill documentation updated

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@TaylorMutch TaylorMutch marked this pull request as ready for review May 4, 2026 23:58
@TaylorMutch TaylorMutch requested a review from a team as a code owner May 4, 2026 23:58
@TaylorMutch TaylorMutch force-pushed the kube-support/local-dev/tmutch branch from 5900603 to 9d78426 Compare May 5, 2026 17:56
Base automatically changed from kube-support/local-dev/tmutch to main May 5, 2026 20:42
* feat: add kubernetes local-dev environment

* Add support for grpcRoute from Kubernetes Gateway API spec
* Add pkiInitJob to initialize mTLS resources
* Add sshHandshake init job
* Test integration with Envoy Gateway
* Add keycloak integration testing with Skaffold

* docs(helm-dev-environment): document TLS toggle and mTLS port-forward setup

Add a TLS behaviour section explaining that values-skaffold.yaml disables
TLS by default, and a port-forward connection guide covering both plaintext
and mTLS modes with the exact commands to extract client certs from the
cluster PKI secret.

* chore(helm): clarify TLS toggle in values-skaffold.yaml

* chore(helm): remove leftover cert-manager references

* feat(helm): restore cert-manager PKI support alongside pkiInitJob

Re-add the openshell.issuerSelfSigned helper, the mutual-exclusion guard
in pki-hook.yaml, and the certManager condition in the statefulset volume
mount. Add server.disableTls: false to values-cert-manager.yaml so the
overlay correctly overrides the skaffold dev default. Tested end-to-end
with cert-manager issuing mTLS certs and sandbox create over port-forward.

* fix(helm): fix port-forward collision and pki idempotency check

Use port 8090 for direct port-forward to avoid colliding with the k3d
LB binding on 8080 when Envoy Gateway is active.

Check both server and client TLS secrets before skipping PKI generation.
Previously only the server secret was checked, which would silently skip
generation if a partial cleanup left one half of the pair behind. Now
emits a clear error with a recovery command when partial state is detected.
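The two-secret check described above is a three-way decision: both present, both absent, or a partial pair. Here is a small sketch of that decision, kept separate from the cluster lookup so it can be exercised without kubectl. The function names and the `yes`/`no` flag convention are hypothetical and are not taken from the chart's job script.

```shell
#!/usr/bin/env bash
# Sketch: classify PKI state from the presence of both TLS secrets.
# Arguments are "yes"/"no" flags so the decision is testable without a cluster.
pki_state() {
  local server_present=$1 client_present=$2
  if [ "$server_present" = yes ] && [ "$client_present" = yes ]; then
    echo complete   # both halves exist: skip generation
  elif [ "$server_present" = no ] && [ "$client_present" = no ]; then
    echo missing    # nothing exists: generate the pair
  else
    echo partial    # one half left behind: fail with a recovery hint
  fi
}

# Hypothetical lookup: kubectl exits non-zero when the secret does not exist.
secret_present() {
  if kubectl -n "$1" get secret "$2" >/dev/null 2>&1; then echo yes; else echo no; fi
}
```

A caller would feed `secret_present` results for the server and client secrets into `pki_state` and exit with an error on `partial`, instead of silently skipping generation.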

* feat(helm): add lint matrix and Helm e2e test harness

Consolidates values overlays into deploy/helm/openshell/ci/, adds a
helm:lint matrix task that validates all configuration variants, and
introduces a helm-e2e.sh script that creates a k3d cluster, builds
images via docker buildx, deploys via Helm, and runs the Rust and
Python e2e suites. Tests that require Docker-native host networking
(host.openshell.internal SSRF) are skipped on the Kubernetes path.

* ci: add helm lint workflow triggered on helm chart changes

* ci: add helm lint workflow triggered on helm chart changes

* chore: trigger helm lint CI test

* Revert "chore: trigger helm lint CI test"

This reverts commit 6b6b0a5.

* ci: add Branch Helm E2E workflow with test:e2e-helm gate
@TaylorMutch TaylorMutch added the test:e2e-helm Requires Helm end-to-end coverage label May 5, 2026
CI run ids combined with the openshell-helm-e2e- prefix exceeded k3d's
32-character cluster-name limit (e.g. openshell-helm-e2e-25403379605-python
is 37 chars). Shorten the workflow prefix to helm-e2e- and tighten the
local-dev suffix truncation so both paths stay under the limit.
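The arithmetic behind this fix is easy to check by measuring both forms of the name against the 32-character limit. The helper name `fits_k3d_limit` is hypothetical; the two names are the old and new workflow forms with the run id quoted above.

```shell
#!/usr/bin/env bash
# Check candidate cluster names against the 32-character limit cited above.
fits_k3d_limit() {
  [ "${#1}" -le 32 ]
}

for name in openshell-helm-e2e-25403379605-python helm-e2e-25403379605-python; do
  if fits_k3d_limit "$name"; then
    echo "$name: ${#name} chars, ok"
  else
    echo "$name: ${#name} chars, too long"
  fi
done
# prints:
#   openshell-helm-e2e-25403379605-python: 37 chars, too long
#   helm-e2e-25403379605-python: 27 chars, ok
```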
@TaylorMutch TaylorMutch changed the title from "feat(helm): add lint matrix and Helm e2e test harness" to "test(helm): add Helm e2e harness with lint matrix and label-gated CI" May 5, 2026
The Helm e2e jobs were rebuilding gateway and supervisor images from
source inside each container, duplicating the work docker-build.yml
already does on every PR. Add build-gateway and build-supervisor
reusable-workflow calls (linux/amd64 to match the runner) and have the
e2e jobs pull the resulting GHCR images via a new HELM_E2E_IMAGE_TAG
env var. The local-dev buildx path is preserved as the fallback when
the tag is unset, so 'mise run e2e:helm:*' still works without CI.
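The tag-or-fallback behavior described above can be sketched as a single branch on `HELM_E2E_IMAGE_TAG`. This is only an illustration: the registry path, the `:dev` local tag, and the function name `image_plan` are hypothetical placeholders, and the function prints the command it would run instead of executing it.

```shell
#!/usr/bin/env bash
# Sketch: choose between pulling CI-built images and building locally.
# REGISTRY and the local ":dev" tag are hypothetical placeholders.
REGISTRY=${REGISTRY:-ghcr.io/example/openshell}

image_plan() {
  local component=$1 tag=${HELM_E2E_IMAGE_TAG:-}
  if [ -n "$tag" ]; then
    # CI path: reuse the image published by the build job.
    echo "docker pull ${REGISTRY}/${component}:${tag}"
  else
    # Local path: build from source and load into the local daemon.
    echo "docker buildx build --load -t ${component}:dev ."
  fi
}
```

Printing the plan keeps the decision easy to inspect; a real script would run the chosen command for both the gateway and supervisor images.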
When helm-k3s-local.sh runs inside a Docker container that mounts the
host's docker socket (e.g., a GitHub Actions `container:` job), k3d
creates the cluster on the host's daemon and publishes the API server
on `0.0.0.0:<port>` of the host. From inside the CI container that
address is unreachable, so kubectl (and helm OpenAPI validation) fail
with 'dial tcp 0.0.0.0:<port>: connect: connection refused'.

After merging the kubeconfig, detect that we're in a container via
/.dockerenv and rewrite the server URL to the default-route gateway
(which routes to the docker host on standard sibling-container setups).
The API cert isn't signed for the gateway IP, so also mark the cluster
insecure-skip-tls-verify and clear the embedded CA — CI-only path; the
local-dev case where 0.0.0.0 already works is unchanged.
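The rewrite described above has two parts: computing the new server URL and applying it with kubectl. A hedged sketch of both follows. The function names are hypothetical, the gateway address is passed in instead of being detected, and the kubectl sequence is one plausible way to express the change, not necessarily what helm-k3s-local.sh did.

```shell
#!/usr/bin/env bash
# Sketch: swap the 0.0.0.0 host in an API server URL for a reachable address.
rewrite_server_url() {
  local url=$1 gateway=$2
  case $url in
    *//0.0.0.0:*)
      # Keep the scheme and port, replace only the host.
      printf '%s\n' "${url%%//0.0.0.0:*}//${gateway}:${url#*//0.0.0.0:}"
      ;;
    *)
      # Any other host passes through unchanged.
      printf '%s\n' "$url"
      ;;
  esac
}

# Hypothetical wrapper: apply the rewrite only when running inside a container.
apply_rewrite() {
  local cluster=$1 gateway=$2 current
  [ -f /.dockerenv ] || return 0
  current=$(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$cluster\")].cluster.server}")
  kubectl config set-cluster "$cluster" \
    --server="$(rewrite_server_url "$current" "$gateway")" \
    --insecure-skip-tls-verify=true
  # The API cert is not valid for the gateway IP, so clear the embedded CA too.
  kubectl config unset "clusters.${cluster}.certificate-authority-data"
}
```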
PR #1037 added include_str!("../../../providers/*.yaml") in
crates/openshell-providers/src/profiles.rs, but the BUILD_FROM_SOURCE=1
path of Dockerfile.images only COPYs Cargo.toml/Cargo.lock, crates/,
and proto/. With providers/ missing the cargo build inside the rust-
builder stage fails to read the embedded YAML. The release path is
unaffected because it copies pre-built binaries from
deploy/docker/.build/prebuilt-binaries/.

This breaks 'mise run e2e:helm:*' and any other workflow that builds
images from source via this Dockerfile (e.g., the local helm-e2e
harness). Add 'COPY providers/ providers/' alongside the other source
inputs.

The CI container (ghcr.io/nvidia/openshell/ci:latest) does not have the
`ip` command installed, so the kubeconfig-rewrite block exited 127 with
`set -euo pipefail`. Read the default gateway directly from
/proc/net/route instead — that file is always present on Linux and
needs no extra package. Decode the gateway field as a little-endian
32-bit hex string into dotted decimal.
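The decoding step described above needs nothing beyond bash builtins. The sketch below shows one way to do it, assuming a little-endian host and the usual whitespace-separated column layout of /proc/net/route (Iface, Destination, Gateway, ...). The function names are hypothetical, and the route file path is a parameter only so the logic can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Sketch: decode a little-endian hex gateway field into dotted decimal.
hex_le_to_ip() {
  local hex=$1
  # Bytes are stored least-significant first, so read the pairs in reverse.
  printf '%d.%d.%d.%d\n' \
    "0x${hex:6:2}" "0x${hex:4:2}" "0x${hex:2:2}" "0x${hex:0:2}"
}

# Print the gateway of the default route (Destination 00000000), if any.
default_gateway() {
  local route_file=${1:-/proc/net/route}
  local iface dest gw rest
  while read -r iface dest gw rest; do
    if [ "$dest" = "00000000" ]; then
      hex_le_to_ip "$gw"
      return 0
    fi
  done < "$route_file"
  return 1
}
```

For example, a gateway field of `010011AC` decodes to `172.17.0.1`, the usual address of the Docker bridge.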
The previous attempts to make the in-container kubectl reach the host's
k3d API server kept hitting tooling gaps (missing iproute2, gawk-only
strtonum). Step back and follow the conventional pattern instead:

- Drop the `container:` block from the helm-e2e jobs and run on the
  bare runner. Install mise via `curl https://mise.run | sh`.
- Use `helm/kind-action` to provision a kind cluster on the runner.
  Because the workflow steps run on the runner directly, the kind API
  server is reachable through the standard kubeconfig the action writes.
- Add HELM_E2E_SKIP_CLUSTER and HELM_E2E_IMAGE_LOADER env vars to
  helm-e2e.sh so it can drive the existing flow against either a self-
  managed k3d cluster (default; what 'mise run e2e:helm:*' uses locally)
  or a caller-managed kind cluster (CI). Image loading switches between
  'k3d image import' and 'kind load docker-image' accordingly.
- Revert the in-container kubeconfig-rewrite hacks in helm-k3s-local.sh;
  they are no longer needed once CI runs on the bare runner.
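The loader switch described in the list above can be sketched as a `case` on `HELM_E2E_IMAGE_LOADER`. The accepted values (`k3d`, `kind`), the default, and the function name `load_image_cmd` are assumptions; the function prints the command so the choice can be inspected without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: pick the image-loading command from HELM_E2E_IMAGE_LOADER.
# The accepted values ("k3d", "kind") and the k3d default are assumptions.
load_image_cmd() {
  local image=$1 cluster=$2
  case "${HELM_E2E_IMAGE_LOADER:-k3d}" in
    k3d)  echo "k3d image import $image -c $cluster" ;;
    kind) echo "kind load docker-image $image --name $cluster" ;;
    *)
      echo "unknown HELM_E2E_IMAGE_LOADER: ${HELM_E2E_IMAGE_LOADER}" >&2
      return 1
      ;;
  esac
}
```

Defaulting to k3d matches the description of local runs as the default path, while CI would export `HELM_E2E_IMAGE_LOADER=kind` (again, an assumed value) alongside `HELM_E2E_SKIP_CLUSTER`.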
Labels

test:e2e-helm Requires Helm end-to-end coverage