test(helm): add Helm e2e harness with lint matrix and label-gated CI #1159
Open
TaylorMutch wants to merge 8 commits into main
This was referenced May 5, 2026
* feat: add kubernetes local-dev environment
* Add support for grpcRoute from Kubernetes Gateway API spec
* Add pkiInitJob to initialize mTLS resources
* Add sshHandshake init job
* Test integration with Envoy Gateway
* Add keycloak integration testing with Skaffold
* docs(helm-dev-environment): document TLS toggle and mTLS port-forward setup
  Add a TLS behaviour section explaining that values-skaffold.yaml disables TLS by default, and a port-forward connection guide covering both plaintext and mTLS modes with the exact commands to extract client certs from the cluster PKI secret.
* chore(helm): clarify TLS toggle in values-skaffold.yaml
* chore(helm): remove leftover cert-manager references
* feat(helm): restore cert-manager PKI support alongside pkiInitJob
  Re-add the openshell.issuerSelfSigned helper, the mutual-exclusion guard in pki-hook.yaml, and the certManager condition in the statefulset volume mount. Add server.disableTls: false to values-cert-manager.yaml so the overlay correctly overrides the skaffold dev default. Tested end-to-end with cert-manager issuing mTLS certs and sandbox create over port-forward.
* fix(helm): fix port-forward collision and pki idempotency check
  Use port 8090 for direct port-forward to avoid colliding with the k3d LB binding on 8080 when Envoy Gateway is active. Check both server and client TLS secrets before skipping PKI generation. Previously only the server secret was checked, which would silently skip generation if a partial cleanup left one half of the pair behind. Now emits a clear error with a recovery command when partial state is detected.
* feat(helm): add lint matrix and Helm e2e test harness
  Consolidates values overlays into deploy/helm/openshell/ci/, adds a helm:lint matrix task that validates all configuration variants, and introduces a helm-e2e.sh script that creates a k3d cluster, builds images via docker buildx, deploys via Helm, and runs the Rust and Python e2e suites. Tests that require Docker-native host networking (host.openshell.internal SSRF) are skipped on the Kubernetes path.
* ci: add helm lint workflow triggered on helm chart changes
* ci: add helm lint workflow triggered on helm chart changes
* chore: trigger helm lint CI test
* Revert "chore: trigger helm lint CI test"
  This reverts commit 6b6b0a5.
* ci: add Branch Helm E2E workflow with test:e2e-helm gate
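The PKI idempotency fix described in the commit list above (check both secrets, fail loudly on partial state) can be sketched as a small decision function. This is only an illustration: the function name `pki_action` and its yes/no arguments are hypothetical, and the real hook presumably derives them from cluster lookups that are not shown in this PR page.

```shell
# Hypothetical helper: decide what the PKI init step should do.
# $1: "yes" if the server TLS secret exists, otherwise "no"
# $2: "yes" if the client TLS secret exists, otherwise "no"
pki_action() {
  case "$1:$2" in
    yes:yes)
      # Both halves of the pair are present: nothing to do.
      echo "skip"
      ;;
    no:no)
      # Clean slate: generate the full pair.
      echo "generate"
      ;;
    *)
      # Exactly one secret exists: refuse to guess, report the problem.
      echo "error: partial PKI state"
      return 1
      ;;
  esac
}
```

Checking only the first argument would reproduce the earlier bug, where a leftover server secret silently skipped generation.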
CI run ids combined with the openshell-helm-e2e- prefix exceeded k3d's 32-character cluster-name limit (e.g. openshell-helm-e2e-25403379605-python is 37 chars). Shorten the workflow prefix to helm-e2e- and tighten the local-dev suffix truncation so both paths stay under the limit.
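To make the length arithmetic concrete, here is a minimal sketch of a limit check plus a local-dev name derivation. The 32-character limit and the `helm-e2e-` workflow prefix come from the commit message, and the 18-character suffix truncation comes from the PR's design notes. The function names, the sanitising rules, and the use of `helm-e2e-` as the local-dev prefix are assumptions made for illustration.

```shell
K3D_MAX_NAME_LEN=32   # limit stated in the commit message

# Succeeds when the candidate cluster name is within the limit.
name_fits() {
  [ "${#1}" -le "$K3D_MAX_NAME_LEN" ]
}

# Hypothetical derivation of a local-dev cluster name from a branch name:
# lowercase it, replace anything outside [a-z0-9] with '-', keep the first
# 18 characters, and drop trailing dashes left over from truncation.
local_cluster_name() {
  suffix=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -c 'a-z0-9' '-' \
    | cut -c1-18 \
    | sed 's/-*$//')
  echo "helm-e2e-$suffix"
}
```

With these definitions, `name_fits "openshell-helm-e2e-25403379605-python"` fails (37 characters), while `name_fits "helm-e2e-25403379605-python"` succeeds (27 characters).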
The Helm e2e jobs were rebuilding gateway and supervisor images from source inside each container, duplicating the work docker-build.yml already does on every PR. Add build-gateway and build-supervisor reusable-workflow calls (linux/amd64 to match the runner) and have the e2e jobs pull the resulting GHCR images via a new HELM_E2E_IMAGE_TAG env var. The local-dev buildx path is preserved as the fallback when the tag is unset, so 'mise run e2e:helm:*' still works without CI.
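The tag-or-fallback behaviour can be pictured as follows. This is a sketch under assumptions: `image_plan` is a hypothetical helper, and the registry prefix is a placeholder because the real GHCR path for the gateway and supervisor images is not given here. Only the `HELM_E2E_IMAGE_TAG` variable name is taken from the commit message.

```shell
# Placeholder registry prefix (assumption, not the project's real path).
REGISTRY="ghcr.io/example/openshell"

# Print what the harness would do for one component image.
# $1: component name, e.g. gateway or supervisor
image_plan() {
  if [ -n "${HELM_E2E_IMAGE_TAG:-}" ]; then
    # CI path: reuse the image that the build workflow already pushed.
    echo "pull ${REGISTRY}/$1:${HELM_E2E_IMAGE_TAG}"
  else
    # Local-dev path: no tag supplied, so build from source.
    echo "build $1"
  fi
}
```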
When helm-k3s-local.sh runs inside a Docker container that mounts the host's docker socket (e.g., a GitHub Actions `container:` job), k3d creates the cluster on the host's daemon and publishes the API server on `0.0.0.0:<port>` of the host. From inside the CI container that address is unreachable, so kubectl (and helm OpenAPI validation) fail with 'dial tcp 0.0.0.0:<port>: connect: connection refused'. After merging the kubeconfig, detect that we're in a container via /.dockerenv and rewrite the server URL to the default-route gateway (which routes to the docker host on standard sibling-container setups). The API cert isn't signed for the gateway IP, so also mark the cluster insecure-skip-tls-verify and clear the embedded CA — CI-only path; the local-dev case where 0.0.0.0 already works is unchanged.
PR #1037 added include_str!("../../../providers/*.yaml") in crates/openshell-providers/src/profiles.rs, but the BUILD_FROM_SOURCE=1 path of Dockerfile.images only copies Cargo.toml/Cargo.lock, crates/, and proto/. With providers/ missing, the cargo build inside the rust-builder stage fails to read the embedded YAML. The release path is unaffected because it copies pre-built binaries from deploy/docker/.build/prebuilt-binaries/. This breaks 'mise run e2e:helm:*' and any other workflow that builds images from source via this Dockerfile (e.g., the local helm-e2e harness). Add 'COPY providers/ providers/' alongside the other source inputs.
The CI container (ghcr.io/nvidia/openshell/ci:latest) does not have the `ip` command installed, so the kubeconfig-rewrite block exited 127 with `set -euo pipefail`. Read the default gateway directly from /proc/net/route instead — that file is always present on Linux and needs no extra package. Decode the gateway field as a little-endian 32-bit hex string into dotted decimal.
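The decoding step described above can be sketched with POSIX tools only. The function names are hypothetical and the script's actual implementation is not shown on this page. The sketch assumes the usual /proc/net/route layout, where column 2 is the destination and column 3 the gateway, both printed as 8 hex digits in little-endian byte order.

```shell
# Print the gateway field of the default route (destination 00000000).
# $1: route table file, defaulting to /proc/net/route
default_gateway_hex() {
  awk '$2 == "00000000" { print $3; exit }' "${1:-/proc/net/route}"
}

# Convert an 8-digit little-endian hex string to dotted decimal.
# The last byte pair in the string is the first octet of the address.
hex_le_to_ip() (
  b1=$(printf '%s' "$1" | cut -c7-8)
  b2=$(printf '%s' "$1" | cut -c5-6)
  b3=$(printf '%s' "$1" | cut -c3-4)
  b4=$(printf '%s' "$1" | cut -c1-2)
  # printf accepts 0x-prefixed constants for %d conversions.
  printf '%d.%d.%d.%d\n' "0x$b1" "0x$b2" "0x$b3" "0x$b4"
)
```

For example, `hex_le_to_ip 0101A8C0` prints `192.168.1.1`. Using `cut` and `printf` avoids both `ip` and the gawk-only `strtonum`.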
The previous attempts to make the in-container kubectl reach the host's k3d API server kept hitting tooling gaps (missing iproute2, gawk-only strtonum). Step back and follow the conventional pattern instead:
 - Drop the `container:` block from the helm-e2e jobs and run on the bare runner. Install mise via `curl https://mise.run | sh`.
 - Use `helm/kind-action` to provision a kind cluster on the runner. Because the workflow steps run on the runner directly, the kind API server is reachable through the standard kubeconfig the action writes.
 - Add HELM_E2E_SKIP_CLUSTER and HELM_E2E_IMAGE_LOADER env vars to helm-e2e.sh so it can drive the existing flow against either a self-managed k3d cluster (default; what 'mise run e2e:helm:*' uses locally) or a caller-managed kind cluster (CI). Image loading switches between 'k3d image import' and 'kind load docker-image' accordingly.
 - Revert the in-container kubeconfig-rewrite hacks in helm-k3s-local.sh; they are no longer needed once CI runs on the bare runner.
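The loader switch might look roughly like the following. This is a sketch, not the script's actual code: `image_load_cmd` is a hypothetical helper, and the accepted values `k3d` and `kind` for HELM_E2E_IMAGE_LOADER are assumed from the two commands named above. It prints the command rather than running it, so it can be tried without a cluster.

```shell
# Print the image-load command for the chosen loader.
# $1: loader (assumed values: k3d or kind)
# $2: image reference
# $3: cluster name
image_load_cmd() {
  case "$1" in
    k3d)
      echo "k3d image import $2 -c $3"
      ;;
    kind)
      echo "kind load docker-image $2 --name $3"
      ;;
    *)
      echo "unknown image loader: $1" >&2
      return 1
      ;;
  esac
}
```

A caller could then run something like `image_load_cmd "${HELM_E2E_IMAGE_LOADER:-k3d}" "$image" "$cluster"`, with k3d as the default to match the local-dev path.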
Summary
- Adds `tasks/scripts/helm-e2e.sh` and `e2e:helm*` mise tasks that bootstrap a k3d cluster, build images via `docker buildx`, deploy via Helm, and run the existing Rust and Python e2e suites against the Kubernetes compute driver
- Consolidates values overlays into `deploy/helm/openshell/ci/` and adds a `helm:lint` matrix that validates the chart against every `ci/values-*.yaml` variant (cert-manager, gateway, keycloak, skaffold, tls-disabled)
- Adds a `helm-lint.yml` workflow that runs the lint matrix on PRs touching `deploy/helm/**`
- Adds `branch-helm-e2e.yml`: a label-gated workflow (`test:e2e-helm`) that runs `Helm E2E (rust)` and `Helm E2E (python)` as parallel jobs, and wires the gate into `e2e-gate.yml` and `e2e-label-help.yml`

Related Issue
Builds on #1158 (k3d local-dev environment). Subsumes the previously-stacked PR #1162 (label-gated CI workflow), now folded into this branch.
Changes
Helm e2e harness and tasks
- `tasks/scripts/helm-e2e.sh`: new script that runs preflight → reuse/create k3d cluster → docker build gateway+supervisor → k3d image import → helm upgrade --install → wait for PKI secrets → port-forward → register gateway → poll health → run suites → cleanup trap
- `tasks/helm.toml`: `helm:lint` expanded to loop over all `ci/values-*.yaml`; `e2e:helm`, `e2e:helm:rust`, `e2e:helm:python`, `e2e:helm:cert-manager` tasks added

Chart layout
- `deploy/helm/openshell/ci/`: new directory; `values-skaffold.yaml`, `values-cert-manager.yaml`, `values-gateway.yaml`, `values-keycloak.yaml` moved here from the chart root; `values-tls-disabled.yaml` added for lint coverage
- `deploy/helm/openshell/.helmignore`: simplified to `ci/` wildcard
- `deploy/helm/openshell/skaffold.yaml`: updated `valuesFiles` paths to `ci/`

CI workflows
- `.github/workflows/helm-lint.yml`: new workflow that triggers on `deploy/helm/**` path changes and runs `mise run helm:lint` in the CI container
- `.github/workflows/branch-helm-e2e.yml`: new label-gated workflow that gates on `test:e2e-helm` and runs `Helm E2E (rust)` and `Helm E2E (python)` as parallel jobs on `linux-amd64-cpu8` (60-min timeout each); privileged container with Docker socket for k3d
- `.github/workflows/e2e-gate.yml`: adds `Branch Helm E2E` to the `workflow_run` trigger and a `helm-e2e` gate check
- `.github/workflows/e2e-label-help.yml`: extends label handling to post the correct next-step comment when `test:e2e-helm` is applied

Docs
- `.agents/skills/helm-dev-environment/SKILL.md`: updated paths and added `helm-e2e.sh` to the key files table

Design notes
- `helm-e2e.sh` builds gateway and supervisor images internally via `docker buildx build --load` and imports them into k3d, which is simpler than the `branch-e2e.yml` pattern
- `mise install --locked` provisions k3d, helm, and kubectl from `mise.toml`; no CI image changes needed
- `git config safe.directory` is required because `helm-e2e.sh` calls `git rev-parse` to derive the default cluster name, and GHA container user/UID mismatch causes git to refuse the workspace otherwise
- CI cluster names use `helm-e2e-${run_id}-{rust,python}` and the local-dev derivation truncates the branch suffix to 18 chars, leaving headroom under k3d's 32-character cluster-name limit

Testing
- `mise run helm:lint`: 6 variants, all passing locally
- `HELM_E2E_KEEP_CLUSTER=1 mise run e2e:helm:rust`: full Rust suite passes against a fresh k3d cluster
- `helm-lint.yml` workflow verified firing on GHA (PR #1160, "ci: add helm lint workflow triggered on helm chart changes")
- Apply `test:e2e-helm` to this PR and verify the label-help comment posts correctly
- `Branch Helm E2E` fires and both `Helm E2E (rust)` and `Helm E2E (python)` jobs run to green
- `E2E Gate` posts a `Helm E2E` check that goes green once both jobs pass
- Tests that require `host.openshell.internal` host-network access (`graphql_l7`, `forward_proxy_l7_bypass` allow, `host_gateway_alias` reach/inference tests) are skipped on the Kubernetes path; these require Docker-native networking not available in k3d pods

Checklist
- `mise run helm:lint` passes
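For reviewers who want to see what the lint matrix amounts to, the loop over `ci/values-*.yaml` could be sketched as below. The helper `lint_matrix_cmds` is hypothetical and only prints the `helm lint` invocations instead of running them, so it works without Helm installed. The actual task in `tasks/helm.toml` may differ, for example by also linting the chart's default values.

```shell
# Print one `helm lint` command per values overlay found under ci/.
# $1: path to the chart directory
lint_matrix_cmds() {
  chart_dir="$1"
  for values in "$chart_dir"/ci/values-*.yaml; do
    # If the glob matched nothing, the literal pattern is left in place.
    [ -e "$values" ] || continue
    echo "helm lint $chart_dir -f $values"
  done
}
```

Running `lint_matrix_cmds deploy/helm/openshell | sh -e` from the repository root would execute the matrix and stop at the first failing overlay.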