
ci: run external e2e and sanity tests on pr on kind #490

Merged: jmclong merged 6 commits into main from dev/jlong/kind-e2e on May 27, 2026


@jmclong commented May 27, 2026

This change adds sanity and external-e2e tests to every PR run, executed on kind clusters (no Azure access). The tests can run on PRs from forks with no permissions needed on the repo.

AI Summary

This pull request introduces several improvements to the Kind-based integration test workflow and related test infrastructure. The most significant changes are the addition of a robust script to pre-create an LVM volume group on each Kind node, new options to configure Kind clusters for tests, and refactoring of test job definitions to support multiple test suites and configurations. There are also targeted test fixes and simplifications to improve reliability and clarity.

Kind Cluster Setup & LVM Volume Group Management:

  • Added a new script hack/kind-setup-vg.sh to automate the creation and teardown of a loop-backed LVM volume group on each Kind node, enabling tests that require persistent storage to run in CI without real NVMe hardware.
  • Updated the Makefile with kind-setup-vg, kind-teardown-vg, and kind-e2e-bootstrap targets to leverage the new script for cluster preparation and cleanup.
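The node preparation described above can be sketched as follows. This is a minimal illustration, not the contents of hack/kind-setup-vg.sh: the helper name `vg_setup_command`, the volume group name `kind-vg`, and the backing file path are hypothetical, and it assumes the LVM tools are already available inside the Kind node container. Only the underlying commands (`truncate`, `losetup --find --show`, `pvcreate`, `vgcreate`, `docker exec`) are standard.

```python
import shlex


def vg_setup_command(node: str,
                     vg_name: str = "kind-vg",                     # hypothetical VG name
                     backing_file: str = "/var/lib/kind-vg.img",   # hypothetical path
                     size: str = "500G") -> list[str]:
    """Build a `docker exec` command that creates a loop-backed VG in one Kind node."""
    script = " && ".join([
        # Sparse backing file: no host disk is consumed until blocks are written.
        f"truncate -s {shlex.quote(size)} {shlex.quote(backing_file)}",
        # Attach the file to the first free loop device and capture its name.
        f"LOOP=$(losetup --find --show {shlex.quote(backing_file)})",
        # Turn the loop device into an LVM physical volume, then a volume group.
        'pvcreate "$LOOP"',
        f'vgcreate {shlex.quote(vg_name)} "$LOOP"',
    ])
    return ["docker", "exec", node, "sh", "-c", script]
```

The returned list can be passed to `subprocess.run(..., check=True)` once per node; node names can be listed with `kind get nodes`.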

Test Runner Enhancements:

  • Extended .github/workflows/scripts/run_tests.py to accept --kind-nodes and --kind-setup-vg arguments, allowing dynamic selection of single/multi-node clusters and optional LVM VG setup. The KindCluster class and cluster creation logic were updated accordingly. The script now also returns an exit code for better CI integration.
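A minimal sketch of how two such options and an exit code could be wired up with `argparse`. The flag names come from the summary above; everything else is an assumption for illustration, including the integer type of `--kind-nodes`, the defaults, and the `run_suite` callback. It is not the actual run_tests.py.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run tests on a Kind cluster (sketch)")
    # Assumed to be a node count; the real script may define it differently.
    parser.add_argument("--kind-nodes", type=int, default=1,
                        help="number of Kind nodes to create")
    parser.add_argument("--kind-setup-vg", action="store_true",
                        help="pre-create the LVM volume group on each node")
    return parser.parse_args(argv)


def main(argv=None, run_suite=lambda args: True) -> int:
    """Return 0 on success and 1 on failure so CI can gate on the result."""
    args = parse_args(argv)
    # run_suite is a hypothetical stand-in for cluster creation plus test execution.
    return 0 if run_suite(args) else 1
```

An entry point would typically end with `sys.exit(main())` so the returned code becomes the process exit status.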

CI Workflow Refactoring:

  • Refactored .github/workflows/test-e2e-pr.yml to define a matrix of Kind-based tests (E2E, external E2E, and sanity), each with specific arguments and Helm overrides. Jobs are now named dynamically, and Kind cluster creation is explicitly separated.

Test Suite Fixes and Improvements:

  • Marked certain LVM tests as "aks" only, clarifying that they require NVMe hardware and should not run on Kind clusters.
  • Improved the sanity test socat patch to use a prebuilt alpine/socat image, eliminating the need for unreliable runtime installation in CI.

Minor Improvements:

  • Added strings import in test/sanity/sanity_suite_test.go for future or existing string manipulation needs.
  • Updated the test-sanity Makefile target to add --fail-fast for quicker feedback on failures.

@jmclong force-pushed the dev/jlong/kind-e2e branch from 9d7175b to 7da907f on May 27, 2026 15:08
jmclong and others added 2 commits May 27, 2026 15:42
Replace the runtime tdnf-install init container with the prebuilt
alpine/socat image and preload it into the kind cluster before
applying the patch. The previous approach hit CoreDNS cold-start
and double-NAT issues on CI, causing init-socat to hang for >10m.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add --fail-fast to test-sanity Makefile target so CI stops on the
  first failing spec instead of running through dozens of cascading
  failures (saves >10 minutes when the driver mis-installs).
- Bump kind-setup-vg.sh default VG_SIZE from 100G to 500G. The file
  is sparse so this does not consume real host disk until written to,
  but it removes capacity-pressure flakes on the external-e2e suite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
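The sparse-file claim in the commit message above can be checked with a small sketch. It assumes a POSIX filesystem that supports sparse files; the helper name `sparse_file_usage` is hypothetical. It extends a temporary file without writing data, then compares the apparent size with the space actually allocated.

```python
import os
import tempfile


def sparse_file_usage(size_bytes: int) -> tuple[int, int]:
    """Create a sparse temp file; return (apparent size, bytes allocated on disk)."""
    with tempfile.NamedTemporaryFile() as f:
        # Extending via truncate sets the length without writing any blocks.
        f.truncate(size_bytes)
        f.flush()
        st = os.fstat(f.fileno())
        # st_blocks is reported in 512-byte units on POSIX systems.
        return st.st_size, st.st_blocks * 512
```

On such a filesystem, `sparse_file_usage(1 << 30)` reports an apparent size of 1 GiB while the allocated figure stays far smaller, which is why raising the default size does not by itself cost host disk.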
@jmclong marked this pull request as ready for review on May 27, 2026 16:03
@jmclong requested review from a team, croomes, and landreasyan as code owners on May 27, 2026 16:03
jmclong and others added 2 commits May 27, 2026 16:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The driver's pod-termination cleanup wipes the kind VG when the
daemonset is restarted mid-suite, which surfaces as CreateVolume RPCs
returning 0 capacity / 'no devices found'. Disable cleanup.enabled,
lvGarbageCollection.enabled, and lvmOrphanCleanup.enabled the same way
the sanity suite already does.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmclong force-pushed the dev/jlong/kind-e2e branch from 0fb088b to a6f79bb on May 27, 2026 17:51
- Switch kind-setup-vg.sh from a tmpfs mount to a sparse file under
  /var, which kind mounts as a real Docker volume on the host fs.
  tmpfs was RAM-backed, so parallel ephemeral suites that allocate
  hundreds of GiB worth of LVs drove the node out of memory and
  crash-looped the driver (probes hit 'connection refused').
- Stop disabling cleanup.lvGarbageCollection and cleanup.lvmOrphanCleanup
  in external-e2e. Only cleanup.enabled (pod-termination VG wipe) needs
  to be off. The orphan controllers are what reclaim LVs left behind
  by flaky DeleteVolume RPCs - without them the VG filled up and the
  scheduler returned 'node(s) did not have enough free storage'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
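The resulting external-e2e override set, with only `cleanup.enabled` turned off, could be rendered as Helm `--set` flags roughly as follows. This is a sketch under assumptions: the helper `helm_set_flags` and the constant name are hypothetical, and it assumes the overrides are passed on the Helm command line. The `cleanup.enabled` key is the one named in the commit message.

```python
def helm_set_flags(overrides: dict[str, object]) -> list[str]:
    """Render a mapping of Helm value paths to `--set key=value` arguments."""
    flags: list[str] = []
    for key, value in sorted(overrides.items()):
        # Helm expects lowercase true/false for boolean values.
        if isinstance(value, bool):
            value = "true" if value else "false"
        flags += ["--set", f"{key}={value}"]
    return flags


# Only the pod-termination VG wipe is disabled; the orphan and garbage-collection
# controllers keep their chart defaults so leaked LVs are still reclaimed.
EXTERNAL_E2E_OVERRIDES = {"cleanup.enabled": False}
```

The flags can be appended to a `helm install` or `helm upgrade` argument list.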
@jmclong force-pushed the dev/jlong/kind-e2e branch from a6f79bb to e6d11f7 on May 27, 2026 20:06
@jmclong merged commit da4d4e8 into main on May 27, 2026
20 checks passed