test(helm): add Helm e2e harness with lint matrix and label-gated CI by TaylorMutch · Pull Request #1159 · NVIDIA/OpenShell

TaylorMutch · 2026-05-04T23:57:05Z

Summary

Adds tasks/scripts/helm-e2e.sh and e2e:helm* mise tasks that bootstrap a k3d cluster, build images via docker buildx, deploy via Helm, and run the existing Rust and Python e2e suites against the Kubernetes compute driver
Consolidates Helm values overlays from the chart root into deploy/helm/openshell/ci/ and adds a helm:lint matrix that validates the chart against every ci/values-*.yaml variant (cert-manager, gateway, keycloak, skaffold, tls-disabled)
Adds helm-lint.yml workflow that runs the lint matrix on PRs touching deploy/helm/**
Adds branch-helm-e2e.yml — a label-gated workflow (test:e2e-helm) that runs Helm E2E (rust) and Helm E2E (python) as parallel jobs, and wires the gate into e2e-gate.yml and e2e-label-help.yml

Related Issue

Builds on #1158 (k3d local-dev environment). Subsumes the previously-stacked PR #1162 (label-gated CI workflow), now folded into this branch.

Changes

Helm e2e harness and tasks

tasks/scripts/helm-e2e.sh — new script: preflight → reuse/create k3d cluster → docker build gateway+supervisor → k3d image import → helm upgrade --install → wait for PKI secrets → port-forward → register gateway → poll health → run suites → cleanup trap
tasks/helm.toml — helm:lint expanded to loop over all ci/values-*.yaml; e2e:helm, e2e:helm:rust, e2e:helm:python, e2e:helm:cert-manager tasks added
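
The expanded helm:lint loop could look roughly like the sketch below. It is not the task body from tasks/helm.toml: the function name lint_matrix and the HELM override variable are hypothetical, and linting the chart's default values before the overlays is an assumption.

```shell
# Hypothetical sketch of a lint matrix over ci/values-*.yaml overlays.
# HELM can be overridden (e.g. HELM=echo) for a dry run; that variable is an assumption.
lint_matrix() {
  chart_dir="$1"
  helm_bin="${HELM:-helm}"
  status=0

  # Assumed first variant: the chart with its default values only.
  "$helm_bin" lint "$chart_dir" || status=1

  # One lint run per overlay; keep going after a failure so every variant is reported.
  for values in "$chart_dir"/ci/values-*.yaml; do
    [ -e "$values" ] || continue   # the glob matched nothing
    "$helm_bin" lint "$chart_dir" -f "$values" || status=1
  done

  return "$status"
}
```

Calling `lint_matrix deploy/helm/openshell` would lint the chart once per overlay and return non-zero if any variant fails.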

Chart layout

deploy/helm/openshell/ci/ — new directory; values-skaffold.yaml, values-cert-manager.yaml, values-gateway.yaml, values-keycloak.yaml moved here from the chart root; values-tls-disabled.yaml added for lint coverage
deploy/helm/openshell/.helmignore — simplified to ci/ wildcard
deploy/helm/openshell/skaffold.yaml — updated valuesFiles paths to ci/

CI workflows

.github/workflows/helm-lint.yml — new workflow: triggers on deploy/helm/** path changes, runs mise run helm:lint in the CI container
.github/workflows/branch-helm-e2e.yml — new label-gated workflow: gates on test:e2e-helm, runs Helm E2E (rust) and Helm E2E (python) as parallel jobs on linux-amd64-cpu8 (60-min timeout each); privileged container with Docker socket for k3d
.github/workflows/e2e-gate.yml — adds Branch Helm E2E to the workflow_run trigger and a helm-e2e gate check
.github/workflows/e2e-label-help.yml — extends label handling to post the correct next-step comment when test:e2e-helm is applied

Docs

.agents/skills/helm-dev-environment/SKILL.md — updated paths and added helm-e2e.sh to the key files table

Design notes

No separate image build jobs in CI: helm-e2e.sh builds gateway and supervisor images internally via docker buildx build --load and imports them into k3d — simpler than the branch-e2e.yml pattern
mise install --locked provisions k3d, helm, and kubectl from mise.toml; no CI image changes needed
git config safe.directory is required because helm-e2e.sh calls git rev-parse to derive the default cluster name, and GHA container user/UID mismatch causes git to refuse the workspace otherwise
Cluster names must fit k3d's 32-character limit: workflow uses helm-e2e-${run_id}-{rust,python} and the local-dev derivation truncates the branch suffix to 18 chars, leaving headroom under the limit
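
A minimal sketch of the name derivation under the 32-character limit, for illustration only. The function name derive_cluster_name, the sanitization rules, and passing the prefix as an argument are assumptions; only the 32-character limit and the 18-character suffix truncation come from the note above.

```shell
# Hypothetical helper: build a k3d cluster name that stays within the limit.
K3D_NAME_MAX=32   # k3d cluster-name limit, per the design note
SUFFIX_MAX=18     # branch-suffix truncation length, per the design note

derive_cluster_name() {
  prefix="$1"
  branch="$2"

  # Assumed sanitization: lowercase, map anything outside [a-z0-9-] to '-',
  # then truncate to SUFFIX_MAX characters.
  suffix="$(printf '%s' "$branch" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -c 'a-z0-9-' '-' \
    | cut -c1-"$SUFFIX_MAX")"
  suffix="${suffix%-}"   # drop one trailing '-' left over from truncation

  name="${prefix}${suffix}"
  if [ "${#name}" -gt "$K3D_NAME_MAX" ]; then
    echo "cluster name '${name}' exceeds ${K3D_NAME_MAX} characters" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

With the helm-e2e- prefix, a CI-style suffix such as 25403379605-python yields a 27-character name. With the longer openshell-helm-e2e- prefix described in the commit notes, the same suffix is rejected.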

Testing

mise run helm:lint — 6 variants, all passing locally
HELM_E2E_KEEP_CLUSTER=1 mise run e2e:helm:rust — full Rust suite passes against a fresh k3d cluster
helm-lint.yml workflow verified firing on GHA (PR ci: add helm lint workflow triggered on helm chart changes #1160)
Apply test:e2e-helm to this PR and verify the label-help comment posts correctly
Verify Branch Helm E2E fires and both Helm E2E (rust) and Helm E2E (python) jobs run to green
Verify E2E Gate posts a Helm E2E check that goes green once both jobs pass
Tests that rely on host.openshell.internal host-network access (graphql_l7, forward_proxy_l7_bypass allow, host_gateway_alias reach/inference tests) are skipped on the Kubernetes path; these require Docker-native networking not available in k3d pods
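
For the HELM_E2E_KEEP_CLUSTER=1 run listed above, the cleanup trap in helm-e2e.sh presumably skips cluster deletion. The sketch below shows one way that could work. The cleanup function, the CLUSTER_NAME variable, and the K3D override are hypothetical names, not taken from the script.

```shell
# Hypothetical cleanup handler; assumes the cluster name is held in CLUSTER_NAME.
# K3D can be overridden (e.g. K3D=echo) for a dry run; that variable is an assumption.
cleanup() {
  if [ "${HELM_E2E_KEEP_CLUSTER:-0}" = "1" ]; then
    # Leave the cluster in place so it can be inspected after the run.
    echo "keeping cluster ${CLUSTER_NAME} for inspection"
    return 0
  fi
  "${K3D:-k3d}" cluster delete "${CLUSTER_NAME}"
}

# A real script would register the handler once, near the top:
# trap cleanup EXIT
```

Registering the handler on EXIT means it runs on both success and failure, which is why an opt-out variable is useful when debugging a failed suite.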

Checklist

Follows conventional commits format
mise run helm:lint passes
Helm e2e Rust suite passes locally
GHA helm lint workflow verified
No secrets or credentials committed
Skill documentation updated

copy-pr-bot · 2026-05-04T23:57:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

* feat: add kubernetes local-dev environment
* Add support for grpcRoute from Kubernetes Gateway API spec
* Add pkiInitJob to initialize mTLS resources
* Add sshHandshake init job
* Test integration with Envoy Gateway
* Add keycloak integration testing with Skaffold
* docs(helm-dev-environment): document TLS toggle and mTLS port-forward setup

  Add a TLS behaviour section explaining that values-skaffold.yaml disables TLS by default, and a port-forward connection guide covering both plaintext and mTLS modes with the exact commands to extract client certs from the cluster PKI secret.
* chore(helm): clarify TLS toggle in values-skaffold.yaml
* chore(helm): remove leftover cert-manager references
* feat(helm): restore cert-manager PKI support alongside pkiInitJob

  Re-add the openshell.issuerSelfSigned helper, the mutual-exclusion guard in pki-hook.yaml, and the certManager condition in the statefulset volume mount. Add server.disableTls: false to values-cert-manager.yaml so the overlay correctly overrides the skaffold dev default. Tested end-to-end with cert-manager issuing mTLS certs and sandbox create over port-forward.
* fix(helm): fix port-forward collision and pki idempotency check

  Use port 8090 for direct port-forward to avoid colliding with the k3d LB binding on 8080 when Envoy Gateway is active. Check both server and client TLS secrets before skipping PKI generation. Previously only the server secret was checked, which would silently skip generation if a partial cleanup left one half of the pair behind. Now emits a clear error with a recovery command when partial state is detected.
* feat(helm): add lint matrix and Helm e2e test harness

  Consolidates values overlays into deploy/helm/openshell/ci/, adds a helm:lint matrix task that validates all configuration variants, and introduces a helm-e2e.sh script that creates a k3d cluster, builds images via docker buildx, deploys via Helm, and runs the Rust and Python e2e suites. Tests that require Docker-native host networking (host.openshell.internal SSRF) are skipped on the Kubernetes path.
* ci: add helm lint workflow triggered on helm chart changes
* ci: add helm lint workflow triggered on helm chart changes
* chore: trigger helm lint CI test
* Revert "chore: trigger helm lint CI test"

  This reverts commit 6b6b0a5.
* ci: add Branch Helm E2E workflow with test:e2e-helm gate

CI run ids combined with the openshell-helm-e2e- prefix exceeded k3d's 32-character cluster-name limit (e.g. openshell-helm-e2e-25403379605-python is 37 chars). Shorten the workflow prefix to helm-e2e- and tighten the local-dev suffix truncation so both paths stay under the limit.

The Helm e2e jobs were rebuilding gateway and supervisor images from source inside each container, duplicating the work docker-build.yml already does on every PR. Add build-gateway and build-supervisor reusable-workflow calls (linux/amd64 to match the runner) and have the e2e jobs pull the resulting GHCR images via a new HELM_E2E_IMAGE_TAG env var. The local-dev buildx path is preserved as the fallback when the tag is unset, so 'mise run e2e:helm:*' still works without CI.
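
The tag-or-build fallback described above can be sketched as follows. This is an illustration, not the logic in helm-e2e.sh: the resolve_image function, the REGISTRY placeholder, and the local `dev` tag are all hypothetical. Only the HELM_E2E_IMAGE_TAG variable and the two paths (pull when set, buildx when unset) come from the commit message.

```shell
# Hypothetical: REGISTRY and the ":dev" local tag are placeholders, not values from the PR.
REGISTRY="${REGISTRY:-ghcr.io/example/openshell}"

# Print which action the harness would take for one component image.
resolve_image() {
  component="$1"   # e.g. gateway or supervisor
  if [ -n "${HELM_E2E_IMAGE_TAG:-}" ]; then
    # CI path: reuse the image already published for this tag.
    printf 'pull %s/%s:%s\n' "$REGISTRY" "$component" "$HELM_E2E_IMAGE_TAG"
  else
    # Local-dev fallback: build from source with docker buildx.
    printf 'build %s:dev\n' "$component"
  fi
}
```

Keeping the decision in one function makes the local path the default: nothing needs to be exported for `mise run e2e:helm:*` to keep building from source.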

When helm-k3s-local.sh runs inside a Docker container that mounts the host's docker socket (e.g., a GitHub Actions `container:` job), k3d creates the cluster on the host's daemon and publishes the API server on `0.0.0.0:<port>` of the host. From inside the CI container that address is unreachable, so kubectl (and helm OpenAPI validation) fail with 'dial tcp 0.0.0.0:<port>: connect: connection refused'. After merging the kubeconfig, detect that we're in a container via /.dockerenv and rewrite the server URL to the default-route gateway (which routes to the docker host on standard sibling-container setups). The API cert isn't signed for the gateway IP, so also mark the cluster insecure-skip-tls-verify and clear the embedded CA — CI-only path; the local-dev case where 0.0.0.0 already works is unchanged.

PR #1037 added include_str!("../../../providers/*.yaml") in crates/openshell-providers/src/profiles.rs, but the BUILD_FROM_SOURCE=1 path of Dockerfile.images only copies Cargo.toml/Cargo.lock, crates/, and proto/. With providers/ missing, the cargo build inside the rust-builder stage fails to read the embedded YAML. The release path is unaffected because it copies pre-built binaries from deploy/docker/.build/prebuilt-binaries/. This breaks 'mise run e2e:helm:*' and any other workflow that builds images from source via this Dockerfile (e.g., the local helm-e2e harness). Add 'COPY providers/ providers/' alongside the other source inputs.

The CI container (ghcr.io/nvidia/openshell/ci:latest) does not have the `ip` command installed, so the kubeconfig-rewrite block exited 127 with `set -euo pipefail`. Read the default gateway directly from /proc/net/route instead — that file is always present on Linux and needs no extra package. Decode the gateway field as a little-endian 32-bit hex string into dotted decimal.
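
The little-endian decode described above can be illustrated with plain shell and no iproute2 or gawk. This is a sketch of the technique rather than the code from the commit; the function names hex_to_ipv4 and default_gateway are hypothetical.

```shell
# Convert an 8-digit little-endian hex field (as found in /proc/net/route)
# into dotted-decimal notation.
hex_to_ipv4() {
  hex="$1"
  # Bytes are stored least-significant first, so read the hex pairs right to left.
  b1="$(printf '%s' "$hex" | cut -c7-8)"
  b2="$(printf '%s' "$hex" | cut -c5-6)"
  b3="$(printf '%s' "$hex" | cut -c3-4)"
  b4="$(printf '%s' "$hex" | cut -c1-2)"
  printf '%d.%d.%d.%d\n' "0x$b1" "0x$b2" "0x$b3" "0x$b4"
}

# Print the gateway of the default route (Destination 00000000).
# The optional file argument exists only to make the sketch testable.
default_gateway() {
  route_file="${1:-/proc/net/route}"
  # Columns: Iface Destination Gateway Flags ...; the header line never matches.
  while read -r _iface dest gw _rest; do
    if [ "$dest" = "00000000" ]; then
      hex_to_ipv4 "$gw"
      return 0
    fi
  done < "$route_file"
  return 1
}
```

For example, `hex_to_ipv4 0101A8C0` prints 192.168.1.1.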

The previous attempts to make the in-container kubectl reach the host's k3d API server kept hitting tooling gaps (missing iproute2, gawk-only strtonum). Step back and follow the conventional pattern instead:

- Drop the `container:` block from the helm-e2e jobs and run on the bare runner. Install mise via `curl https://mise.run | sh`.
- Use `helm/kind-action` to provision a kind cluster on the runner. Because the workflow steps run on the runner directly, the kind API server is reachable through the standard kubeconfig the action writes.
- Add HELM_E2E_SKIP_CLUSTER and HELM_E2E_IMAGE_LOADER env vars to helm-e2e.sh so it can drive the existing flow against either a self-managed k3d cluster (default; what 'mise run e2e:helm:*' uses locally) or a caller-managed kind cluster (CI). Image loading switches between 'k3d image import' and 'kind load docker-image' accordingly.
- Revert the in-container kubeconfig-rewrite hacks in helm-k3s-local.sh; they are no longer needed once CI runs on the bare runner.
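
The image-loader switch could be sketched like this. The accepted values `k3d` and `kind` for HELM_E2E_IMAGE_LOADER are an assumption, as is the image_load_cmd helper, which only prints the command it would run so the sketch stays side-effect free.

```shell
# Hypothetical helper: print the image-load command for the selected loader.
# Assumes HELM_E2E_IMAGE_LOADER is "k3d" (default) or "kind".
image_load_cmd() {
  image="$1"
  cluster="$2"
  case "${HELM_E2E_IMAGE_LOADER:-k3d}" in
    k3d)
      # Self-managed k3d cluster (local default).
      printf 'k3d image import %s --cluster %s\n' "$image" "$cluster"
      ;;
    kind)
      # Caller-managed kind cluster (CI).
      printf 'kind load docker-image %s --name %s\n' "$image" "$cluster"
      ;;
    *)
      echo "unsupported HELM_E2E_IMAGE_LOADER: ${HELM_E2E_IMAGE_LOADER}" >&2
      return 1
      ;;
  esac
}
```

A real harness would execute the command instead of printing it. Rejecting unknown loader values early avoids a confusing failure later, when pods cannot find their images.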

TaylorMutch marked this pull request as ready for review May 4, 2026 23:58

TaylorMutch requested a review from a team as a code owner May 4, 2026 23:58

This was referenced May 5, 2026

ci: add helm lint workflow triggered on helm chart changes #1160

Merged

ci: add Branch Helm E2E workflow with test:e2e-helm gate #1162

Merged

TaylorMutch force-pushed the kube-support/local-dev/tmutch branch from 5900603 to 9d78426 Compare May 5, 2026 17:56

TaylorMutch requested review from derekwaynecarr, maxamillion and mrunalp as code owners May 5, 2026 17:56

TaylorMutch force-pushed the tmutch/kube-e2e branch from d15cafe to e362228 Compare May 5, 2026 18:10

Base automatically changed from kube-support/local-dev/tmutch to main May 5, 2026 20:42

test(e2e): Add a Helm specific e2e harness and linting workflow

82f9730

TaylorMutch force-pushed the tmutch/kube-e2e branch from e362228 to 82f9730 Compare May 5, 2026 21:07

TaylorMutch added the test:e2e-helm Requires Helm end-to-end coverage label May 5, 2026

TaylorMutch changed the title ~~feat(helm): add lint matrix and Helm e2e test harness~~ test(helm): add Helm e2e harness with lint matrix and label-gated CI May 5, 2026

TaylorMutch added 5 commits May 5, 2026 15:18

TaylorMutch wants to merge 8 commits into main from tmutch/kube-e2e