vm_clone produces VMs that are silently fragile to multi-disk attach + reboot (no bootDevices, vda-anchored grub)

## Summary

VMs cloned via `scale_computing.hypercore.vm_clone` from a standard Canonical Ubuntu cloud image inherit two independent boot-fragility issues that surface the first time a second `VIRTIO_DISK` is attached AND the VM is rebooted:

1. **`bootDevices` is empty** on the cloned VirDomain. HC's BIOS implicitly falls back to "the only virtio disk", but that fallback gives up once a second virtio disk is attached, and boot fails with `Boot failed: not a bootable disk / No bootable device`.
2. **The cloud image's `/etc/default/grub` ships with `GRUB_DISABLE_LINUX_UUID=true`**, so the generated `/boot/grub/grub.cfg` uses `root=/dev/vda1` (device-path) instead of `root=UUID=…` (UUID-based). When a second virtio disk reorders PCI enumeration, the OS disk becomes `vdb`, the kernel can't find `/dev/vda1`, and initramfs hangs at `Btrfs loaded …`.

Combined effect: a cloud-image VM cloned via this module **boots correctly for years** while it has exactly one virtio disk, then bricks on the first reboot after any tool (CSI driver, manual disk-add, Terraform provider, or even another playbook in the same suite) attaches a second virtio disk.

This is the underlying root cause for a pair of related issues filed in `ScaleComputing/k3s-on-hypercore` (#7 and #8), but the same exposure exists in **every** downstream consumer of `vm_clone`: `k3s-ansible-hypercore`, `ansible_edge_playbooks`, and customer playbooks that clone from cloud-image templates.

## Reproducer

```bash
# Clone an Ubuntu cloud-image VM via this collection (any of the standard
# patterns, e.g. ansible_edge_playbooks/simple_vm_deploy.yml):
ansible-playbook simple_vm_deploy.yml

# Inspect the resulting VM:
curl -sk -u admin:admin "https://<hc-host>/rest/v1/VirDomain/<new-vm-uuid>" \
  | jq '.[0] | {name, bootDevices}'
# → bootDevices: []   ← issue #1

# SSH into the new VM, then run inside the guest:
grep GRUB_DISABLE_LINUX_UUID /etc/default/grub
# → GRUB_DISABLE_LINUX_UUID=true     ← issue #2
cat /proc/cmdline
# → root=/dev/vda1    ← issue #2 consequence

# Attach a second VIRTIO_DISK, then reboot:
#   - If issue #1 unaddressed: BIOS "No bootable device" — VM unbootable.
#   - If issue #1 fixed but #2 unaddressed: kernel loads but initramfs hangs.
```
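To check an existing VM for the second issue, look at the `root=` argument it booted with. The following is a minimal sketch, not part of the collection: `classify_root_arg` is a hypothetical helper name, and the list of prefixes treated as stable identifiers is an assumption that may need adjusting for other images.

```shell
# Classify the root= argument of a kernel command line.
# Prints "stable-id" for UUID/PARTUUID/LABEL/by-* style roots,
# "device-path" for plain /dev/... roots such as /dev/vda1,
# and "unknown" if no root= argument is found.
classify_root_arg() {
  cmdline="$1"
  for arg in $cmdline; do
    case "$arg" in
      root=UUID=*|root=PARTUUID=*|root=LABEL=*|root=/dev/disk/by-*|root=/dev/mapper/*)
        echo "stable-id"
        return 0
        ;;
      root=/dev/*)
        echo "device-path"
        return 0
        ;;
    esac
  done
  echo "unknown"
}

# Inside the guest:
#   classify_root_arg "$(cat /proc/cmdline)"
```

A result of `device-path` means the VM depends on virtio enumeration order and is exposed to the initramfs hang described above.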

## Suggested fixes in this collection

### Layer 1: `vm_clone` should set `bootDevices` after clone

The most natural fix is to add a parameter (or change defaults) so the cloned VM's `bootDevices` is populated with the primary disk's UUID immediately after creation. Something like:

```yaml
- name: Clone the VM
  scale_computing.hypercore.vm_clone:
    vm_name: "{{ inventory_hostname }}"
    source_vm_name: "{{ source_template }}"
    set_boot_devices: true   # proposed new parameter, default true
```

Or simpler: always populate `bootDevices` with the cloned VM's largest VIRTIO_DISK by default. Users who want to manage boot order manually can override.

### Layer 2: optional `cloud_init.runcmd` injection

If the module accepts `cloud_init.user_data`, the generated cloud-init should include a `runcmd` to fix the grub UUID issue at first boot:

```yaml
runcmd:
  - sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
  - update-grub
```

Either bake this into the module's default user_data, or document it prominently in the role's README so every downstream playbook can add it.
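For VMs that are already deployed, the same change can be applied by hand. The sketch below wraps the `sed` from the `runcmd` snippet in an idempotent function, so `update-grub` only needs to run when the file actually changed. The function name `enable_grub_uuid_root` is hypothetical, and the script assumes GNU `sed` (for `-i`) and that the setting lives in `/etc/default/grub` as described above.

```shell
# Comment out GRUB_DISABLE_LINUX_UUID=true in a grub defaults file.
# Argument 1: path to the file (defaults to /etc/default/grub).
# Prints "changed" if the file was modified, "unchanged" otherwise.
enable_grub_uuid_root() {
  grub_defaults="${1:-/etc/default/grub}"
  if grep -q '^GRUB_DISABLE_LINUX_UUID=true' "$grub_defaults"; then
    sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' "$grub_defaults"
    echo "changed"
  else
    echo "unchanged"
  fi
}

# On a real VM, as root, regenerate grub.cfg only when needed:
#   [ "$(enable_grub_uuid_root)" = "changed" ] && update-grub
```

After `update-grub`, the `root=` entry in `/boot/grub/grub.cfg` should be UUID-based; `/proc/cmdline` reflects the change only after the next reboot.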

## Workaround for the field today

I've documented both layers + per-VM fix recipes in an internal `hypercore-api-notes.md` reference. The TL;DR for any playbook today:

1. After `vm_clone`, `PATCH /VirDomain/{uuid}` with `bootDevices: [<primary-disk-uuid>]`.
2. Include the grub `runcmd` snippet in the cloud-init user data.
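Step 1 can be sketched with `curl`, reusing the `/rest/v1/VirDomain/<uuid>` URL shape from the reproducer. This is an illustration under assumptions rather than a verified recipe: `HC_HOST`, `HC_USER`, and `HC_PASS` are hypothetical environment variables, the function names are made up, and the request body assumes the API accepts a bare `bootDevices` list on `PATCH` as described in step 1.

```shell
# Build the JSON body that sets bootDevices to a single disk UUID.
boot_devices_payload() {
  printf '{"bootDevices":["%s"]}' "$1"
}

# PATCH the VirDomain so it boots from the given disk.
# Argument 1: VM UUID. Argument 2: UUID of the primary (OS) disk.
# HC_HOST, HC_USER, HC_PASS are assumed to be set by the caller.
set_boot_device() {
  vm_uuid="$1"
  disk_uuid="$2"
  curl -sk -u "${HC_USER}:${HC_PASS}" \
    -X PATCH \
    -H 'Content-Type: application/json' \
    -d "$(boot_devices_payload "$disk_uuid")" \
    "https://${HC_HOST}/rest/v1/VirDomain/${vm_uuid}"
}

# Example (placeholders, not real UUIDs):
#   set_boot_device "<new-vm-uuid>" "<primary-disk-uuid>"
```

Re-running the `jq` query from the reproducer afterwards should show a non-empty `bootDevices`.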

I've also done both layers manually on an 8-VM reference cluster — works cleanly.

## Why this matters

Without these fixes, **every k8s-on-HyperCore deployment is a ticking time bomb**. The VMs come up fine, work fine for months/years, then the first time someone adds storage (CSI driver, data disk, NFS export disk, etc.) and the VM reboots, it bricks. That's exactly the wrong failure pattern: silent and recovery-blocking.

## Related

- HC platform team issue (BIOS auto-fallback should be smarter or refuse empty `bootDevices`): `github.lab.local/dev/dev#394`
- `k3s-on-hypercore#7` (bootDevices)
- `k3s-on-hypercore#8` (grub UUID)

## Suggested labels

`bug`, `priority:high` — silent failure mode that affects every multi-disk workload on HC.
