Summary
VMs cloned via scale_computing.hypercore.vm_clone from a standard Canonical Ubuntu cloud image inherit two independent boot-fragility issues that surface the first time a second VIRTIO_DISK is attached AND the VM is rebooted:
- bootDevices is empty on the cloned VirDomain. HC's BIOS has an implicit auto-fallback to "the only virtio disk" but that fallback gives up once a second virtio is attached → Boot failed: not a bootable disk / No bootable device.
- The cloud image's /etc/default/grub ships with GRUB_DISABLE_LINUX_UUID=true, so the generated /boot/grub/grub.cfg uses root=/dev/vda1 (device-path) instead of root=UUID=… (UUID-based). When a second virtio disk reorders PCI enumeration, the OS disk becomes vdb, the kernel can't find /dev/vda1, and initramfs hangs at Btrfs loaded ….
Combined effect: a cloud-image VM cloned via this module boots correctly for years while it has exactly one virtio disk, then bricks on the first reboot after any tool (CSI driver, manual disk-add, Terraform provider, or even another playbook in the same suite) attaches a second virtio disk.
This is the underlying root cause for a pair of related issues filed in ScaleComputing/k3s-on-hypercore (#7 and #8), but the same exposure exists in every downstream consumer of vm_clone: k3s-ansible-hypercore, ansible_edge_playbooks, and customer playbooks that clone from cloud-image templates.
Reproducer
# Clone an Ubuntu cloud-image VM via this collection (any of the standard
# patterns, e.g. ansible_edge_playbooks/simple_vm_deploy.yml):
ansible-playbook simple_vm_deploy.yml
# Inspect the resulting VM:
curl -sk -u admin:admin "https://<hc-host>/rest/v1/VirDomain/<new-vm-uuid>" \
| jq '.[0] | {name, bootDevices}'
# → bootDevices: [] ← issue #1
# SSH into the new VM:
grep GRUB_DISABLE_LINUX_UUID /etc/default/grub
# → GRUB_DISABLE_LINUX_UUID=true ← issue #2
cat /proc/cmdline
# → root=/dev/vda1 ← issue #2 consequence
# Attach a second VIRTIO_DISK, then reboot:
# - If issue #1 unaddressed: BIOS "No bootable device" — VM unbootable.
# - If issue #1 fixed but #2 unaddressed: kernel loads but initramfs hangs.
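To audit existing VMs for the second issue without reading grub.cfg by hand, the in-guest check can be wrapped in a small helper. This is only a sketch: `root_style` is a hypothetical name, and the set of "stable" prefixes it recognises is an assumption about what counts as safe against disk reordering.

```shell
# Classify how a kernel command line identifies the root filesystem.
# Prints "stable-id", "device-path", or "none".
# $1: a kernel command line, e.g. the contents of /proc/cmdline
root_style() {
    # Pad with spaces so "root=" is matched only as a whole argument.
    case " $1 " in
        *" root=UUID="*|*" root=PARTUUID="*|*" root=LABEL="*)
            echo "stable-id" ;;
        *" root=/dev/disk/by-"*|*" root=/dev/mapper/"*)
            echo "stable-id" ;;
        *" root=/dev/"*)
            # Bare device node such as /dev/vda1: breaks if disks reorder.
            echo "device-path" ;;
        *)
            echo "none" ;;
    esac
}
```

On a guest, `root_style "$(cat /proc/cmdline)"` printing `device-path` would indicate the VM is exposed to the grub issue described above.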
Suggested fixes in this collection
Layer 1: vm_clone should set bootDevices after clone
The most natural fix is to add a parameter (or change defaults) so the cloned VM's bootDevices is populated with the primary disk's UUID immediately after creation. Something like:
- name: Clone the VM
  scale_computing.hypercore.vm_clone:
    vm_name: "{{ inventory_hostname }}"
    source_vm_name: "{{ source_template }}"
    set_boot_devices: yes  # new default-yes parameter
Or simpler: always populate bootDevices with the cloned VM's largest VIRTIO_DISK by default. Users who want to manage boot order manually can override.
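The "largest VIRTIO_DISK" default could be sketched roughly as follows. The input format (one disk per line as UUID, type, capacity in bytes) is an assumption made for illustration, as is the helper name `largest_virtio_disk`; the module itself would work from whatever structure the VirDomain record actually exposes.

```shell
# Pick the UUID of the largest VIRTIO_DISK.
# stdin: one disk per line as "<uuid> <type> <capacity-bytes>" (assumed format).
# Prints the UUID and exits 0, or exits 1 if no VIRTIO_DISK is present.
largest_virtio_disk() {
    awk '
        # Only virtio disks are candidates; keep the first one on ties.
        $2 == "VIRTIO_DISK" && $3 + 0 > max { max = $3 + 0; uuid = $1 }
        END { if (uuid != "") print uuid; else exit 1 }
    '
}
```

Ties resolve to the first disk listed, which is a design choice of this sketch rather than anything the collection specifies. A non-zero exit lets the caller leave bootDevices untouched when the clone has no virtio disk at all.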
Layer 2: optional cloud_init.runcmd injection
If the module accepts cloud_init.user_data, the generated cloud-init should include a runcmd to fix the grub UUID issue at first boot:
runcmd:
  - sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
  - update-grub
Either bake this into the module's default user_data, or document it prominently in the role's README so every downstream playbook can add it.
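For VMs that are already deployed, the same change can be applied by hand. The sketch below wraps the sed call so that grub.cfg is regenerated only when the defaults file actually changed. `enable_grub_uuid` is a hypothetical helper name, and `sed -i` is assumed to be GNU sed, as on Ubuntu.

```shell
# Comment out GRUB_DISABLE_LINUX_UUID=true in a grub defaults file.
# $1: path to the file (defaults to /etc/default/grub)
# Returns 0 if the file was changed, 1 if nothing needed changing.
enable_grub_uuid() {
    grub_defaults="${1:-/etc/default/grub}"
    if grep -q '^GRUB_DISABLE_LINUX_UUID=true' "$grub_defaults"; then
        sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' "$grub_defaults"
        return 0
    fi
    return 1
}

# On the guest, as root, regenerate grub.cfg only when the file changed:
#   enable_grub_uuid && update-grub
```

The regenerated grub.cfg takes effect on the next boot, so it is worth confirming /proc/cmdline after a reboot and before attaching a second disk.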
Workaround for the field today
I've documented both layers + per-VM fix recipes in an internal hypercore-api-notes.md reference. The TL;DR for any playbook today:
- After vm_clone, PATCH /VirDomain/{uuid} with bootDevices: [<primary-disk-uuid>].
- Include the grub runcmd snippet in the cloud-init user data.
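The PATCH step might look like the sketch below. `boot_devices_payload` is a hypothetical helper, and the environment variables and `<...>` placeholders in the commented curl call are stand-ins, not values from a real cluster.

```shell
# Build the JSON body for the bootDevices PATCH from one or more disk UUIDs,
# listed in the desired boot order.
boot_devices_payload() {
    body=''
    for uuid in "$@"; do
        # Append a quoted UUID, comma-separated after the first entry.
        body="${body:+$body,}\"$uuid\""
    done
    printf '{"bootDevices":[%s]}\n' "$body"
}

# Hypothetical invocation; host, credentials and UUIDs are placeholders:
#   curl -sk -u "$HC_USER:$HC_PASS" -X PATCH \
#     -H 'Content-Type: application/json' \
#     -d "$(boot_devices_payload "$PRIMARY_DISK_UUID")" \
#     "https://<hc-host>/rest/v1/VirDomain/<new-vm-uuid>"
```

Re-running the GET from the reproducer afterwards should show a non-empty bootDevices list.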
I've also done both layers manually on an 8-VM reference cluster — works cleanly.
Why this matters
Without these fixes, every k8s-on-HyperCore deployment cloned from a cloud image is a ticking time bomb. The VMs come up fine and run for months or years; then the first time someone adds storage (CSI driver, data disk, NFS export disk, etc.) and the VM reboots, it bricks. That's exactly the wrong failure pattern: silent until it triggers, and recovery-blocking once it does.
Related
- HC platform team issue (BIOS auto-fallback should be smarter or refuse empty bootDevices): github.lab.local/dev/dev#394
- k3s-on-hypercore#7 (bootDevices)
- k3s-on-hypercore#8 (grub UUID)
Suggested labels
bug, priority:high — silent failure mode that affects every multi-disk workload on HC.