vm_clone produces VMs that are silently fragile to multi-disk attach + reboot (no bootDevices, vda-anchored grub) #370

@ddemlow

Summary

VMs cloned via scale_computing.hypercore.vm_clone from a standard Canonical Ubuntu cloud image inherit two independent boot-fragility issues that surface the first time a second VIRTIO_DISK is attached AND the VM is rebooted:

  1. bootDevices is empty on the cloned VirDomain. HC's BIOS has an implicit auto-fallback to "the only virtio disk", but that fallback stops applying once a second virtio disk is attached → Boot failed: not a bootable disk / No bootable device.
  2. The cloud image's /etc/default/grub ships with GRUB_DISABLE_LINUX_UUID=true, so the generated /boot/grub/grub.cfg uses root=/dev/vda1 (device path) instead of root=UUID=… (UUID-based). When a second virtio disk changes PCI enumeration order, the OS disk can become vdb, the kernel can't find its root filesystem at /dev/vda1, and initramfs hangs at Btrfs loaded ….
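
To make issue #2 easy to audit across existing VMs, here is a minimal sketch that classifies the root= argument from a kernel command line (for example the contents of /proc/cmdline). The function name root_arg_kind is hypothetical, and the classification is deliberately coarse: anything that is not a UUID=, PARTUUID=, LABEL=, PARTLABEL= or /dev/disk/by-* reference is reported as a device path.

```python
import re


def root_arg_kind(cmdline: str) -> str:
    """Classify root= as 'stable-id', 'device-path', or 'missing'.

    Hypothetical helper for auditing VMs; not part of the collection.
    """
    match = re.search(r"(?:^|\s)root=(\S+)", cmdline)
    if match is None:
        return "missing"
    value = match.group(1)
    # These forms identify the filesystem itself, so they survive a
    # vda/vdb swap caused by a second virtio disk.
    if re.match(r"(UUID|PARTUUID|LABEL|PARTLABEL)=", value):
        return "stable-id"
    if value.startswith("/dev/disk/by-"):
        return "stable-id"
    # Everything else (e.g. /dev/vda1) is treated as enumeration-dependent.
    return "device-path"


if __name__ == "__main__":
    with open("/proc/cmdline") as fh:
        print(root_arg_kind(fh.read()))
```

A result of device-path on a cloud-image clone suggests the VM is exposed to issue #2.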

Combined effect: a cloud-image VM cloned via this module boots correctly for years while it has exactly one virtio disk, then bricks on the first reboot after any tool (CSI driver, manual disk-add, Terraform provider, or even another playbook in the same suite) attaches a second virtio disk.

This is the underlying root cause for a pair of related issues filed in ScaleComputing/k3s-on-hypercore (#7 and #8), but the same exposure exists in every downstream consumer of vm_clone: k3s-ansible-hypercore, ansible_edge_playbooks, and customer playbooks that clone from cloud-image templates.

Reproducer

# Clone an Ubuntu cloud-image VM via this collection (any of the standard
# patterns, e.g. ansible_edge_playbooks/simple_vm_deploy.yml):
ansible-playbook simple_vm_deploy.yml

# Inspect the resulting VM:
curl -sk -u admin:admin "https://<hc-host>/rest/v1/VirDomain/<new-vm-uuid>" \
  | jq '.[0] | {name, bootDevices}'
# → bootDevices: []   ← issue #1

# SSH into the new VM:
grep GRUB_DISABLE_LINUX_UUID /etc/default/grub
# → GRUB_DISABLE_LINUX_UUID=true     ← issue #2
cat /proc/cmdline
# → root=/dev/vda1    ← issue #2 consequence

# Attach a second VIRTIO_DISK, then reboot:
#   - If issue #1 unaddressed: BIOS "No bootable device" — VM unbootable.
#   - If issue #1 fixed but #2 unaddressed: kernel loads but initramfs hangs.
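
The bootDevices check above can also be scripted. This is a rough sketch that takes the raw JSON body returned by the GET request and reports whether the first VirDomain has an empty bootDevices list. It assumes, as the jq filter '.[0]' in the reproducer implies, that the endpoint returns a list; the function name has_empty_boot_devices is hypothetical.

```python
import json


def has_empty_boot_devices(response_body: str) -> bool:
    """Return True when the first VirDomain in the response has no bootDevices."""
    domains = json.loads(response_body)
    # Assumption: the response is a JSON list, matching jq '.[0]' above.
    vm = domains[0]
    # A missing key and an empty list are both treated as "not set".
    return not vm.get("bootDevices")
```

For example, piping the curl output into a script that calls this function would flag every clone that still relies on the BIOS auto-fallback.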

Suggested fixes in this collection

Layer 1: vm_clone should set bootDevices after clone

The most natural fix is to add a parameter (or change defaults) so the cloned VM's bootDevices is populated with the primary disk's UUID immediately after creation. Something like:

- name: Clone the VM
  scale_computing.hypercore.vm_clone:
    vm_name: "{{ inventory_hostname }}"
    source_vm_name: "{{ source_template }}"
    set_boot_devices: true   # proposed new parameter, default true

Or simpler: always populate bootDevices with the cloned VM's largest VIRTIO_DISK by default. Users who want to manage boot order manually can override.
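
A minimal sketch of the "largest VIRTIO_DISK" selection, operating on a VirDomain object as returned by the REST API. The field names blockDevs, type, capacity and uuid are my assumption about the VirDomain schema and should be checked against the API docs; pick_boot_disk_uuid is a hypothetical name. Note that the largest disk is only a heuristic for the OS disk, which is one reason to keep the manual override.

```python
from typing import Optional


def pick_boot_disk_uuid(vm: dict) -> Optional[str]:
    """Return the UUID of the largest VIRTIO_DISK, or None if there is none.

    Assumes each entry in vm["blockDevs"] has "type", "capacity", "uuid".
    """
    disks = [d for d in vm.get("blockDevs", []) if d.get("type") == "VIRTIO_DISK"]
    if not disks:
        return None
    # Heuristic only: on a fresh clone the OS disk is usually the sole
    # (and therefore largest) virtio disk.
    largest = max(disks, key=lambda d: d.get("capacity", 0))
    return largest["uuid"]
```

Running this immediately after the clone, before anything else attaches a disk, avoids the case where a data disk is larger than the OS disk.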

Layer 2: optional cloud_init.runcmd injection

If the module accepts cloud_init.user_data, the generated cloud-init should include a runcmd to fix the grub UUID issue at first boot:

runcmd:
  - sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
  - update-grub

Either bake this into the module's default user_data, or document it prominently in the role's README so every downstream playbook can add it.
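
Before baking the sed expression into default user_data, it can be checked offline. The sketch below mirrors the same substitution in Python so it can be run against a copy of /etc/default/grub; it does not replace the runcmd, and comment_out_uuid_disable is a hypothetical name.

```python
import re


def comment_out_uuid_disable(grub_defaults: str) -> str:
    """Apply the same edit as the sed command to the text of /etc/default/grub."""
    # MULTILINE makes ^ match at the start of every line, like sed does.
    return re.sub(
        r"^GRUB_DISABLE_LINUX_UUID=true",
        "#GRUB_DISABLE_LINUX_UUID=true",
        grub_defaults,
        flags=re.MULTILINE,
    )
```

Because the pattern is anchored at the start of the line, an already commented line no longer matches, so the edit is safe to apply more than once.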

Workaround for the field today

I've documented both layers + per-VM fix recipes in an internal hypercore-api-notes.md reference. The TL;DR for any playbook today:

  1. After vm_clone, PATCH /rest/v1/VirDomain/{uuid} with bootDevices: [<primary-disk-uuid>].
  2. Include the grub runcmd snippet in the cloud-init user data.
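
Step 1 can be sketched with the Python standard library as follows. The function only builds the request so it can be inspected before sending. The URL path follows the GET in the reproducer; the basic-auth scheme matches the curl -u usage there; the function name and arguments are hypothetical.

```python
import base64
import json
import urllib.request


def build_boot_devices_patch(host: str, vm_uuid: str, disk_uuid: str,
                             user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH that sets bootDevices on one VirDomain."""
    url = f"https://{host}/rest/v1/VirDomain/{vm_uuid}"
    body = json.dumps({"bootDevices": [disk_uuid]}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

Sending it is a call to urllib.request.urlopen(request, context=...), where the SSL context has to be relaxed if the cluster uses a self-signed certificate (the equivalent of curl -k in the reproducer).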

I've also applied both layers manually on an 8-VM reference cluster, and it works cleanly.

Why this matters

Without these fixes, every k8s-on-HyperCore deployment built from cloud-image clones is a ticking time bomb. The VMs come up fine and work fine for months or years, then the first time someone adds storage (CSI driver, data disk, NFS export disk, etc.) and the VM reboots, it bricks. That's exactly the wrong failure pattern: silent until the reboot, and recovery-blocking once it happens.

Related

  • HC platform team issue (BIOS auto-fallback should be smarter or refuse empty bootDevices): github.lab.local/dev/dev#394
  • k3s-on-hypercore#7 (bootDevices)
  • k3s-on-hypercore#8 (grub UUID)

Suggested labels

bug, priority:high. This is a silent failure mode that affects every multi-disk workload cloned from a cloud image on HC.
