Cloud image's GRUB_DISABLE_LINUX_UUID=true causes initramfs hang when a second virtio disk is attached #8

@ddemlow

Summary

VMs produced by this playbook ship with GRUB_DISABLE_LINUX_UUID=true in /etc/default/grub (inherited from the Ubuntu cloud image). That setting tells grub-mkconfig to emit root=/dev/vda1 in the kernel command line instead of root=UUID=<...>. When a second VIRTIO_DISK is later attached to the VM and the VM is rebooted, the new disk takes PCI slot 0, so the kernel probes it first and names it vda. The OS disk becomes vdb, and the initramfs hangs waiting for a /dev/vda1 that does not exist.

This is the initramfs-level twin of the issue I filed as #7 (HC bootDevices empty by default). The two are independent layers of the same disaster:

  1. BIOS-level (issue #7, "VMs created without explicit bootDevices are silently bricked by a second virtio disk attach"): empty bootDevices + two virtio disks → BIOS gives up → "No bootable device".
  2. Initramfs-level (this issue): if you fix the BIOS path via bootDevices, grub still loads with root=/dev/vda1 → kernel boots → initramfs can't find root because the extra disk reordered virtio device naming → boot hangs.

Both have to be fixed for a worker VM with attached non-OS disks to survive a reboot.
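To make the renaming concrete, here is a toy model of the naming step. It is only a sketch: name_virtio_disks is a hypothetical helper, the disk labels are made up, and real kernel enumeration involves more than argument order. It assumes, as described above, that names are handed out in probe order.

```shell
#!/bin/sh
# Toy model: hand out vda, vdb, ... in the order disks are probed.
# name_virtio_disks is a hypothetical helper for illustration only.
name_virtio_disks() {
    letters="abcdefghijklmnopqrstuvwxyz"
    i=1
    for disk in "$@"; do
        letter=$(printf '%s' "$letters" | cut -c "$i")
        printf 'vd%s=%s\n' "$letter" "$disk"
        i=$((i + 1))
    done
}

# Before the attach: only the OS disk is present.
name_virtio_disks os-disk
# prints: vda=os-disk

# After the attach, with the new disk probed first:
name_virtio_disks data-disk os-disk
# prints: vda=data-disk
#         vdb=os-disk
```

A root=/dev/vda1 argument follows the name, so it ends up pointing at the wrong disk. A root=UUID= argument follows the filesystem, whichever name the disk receives.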

Reproducer

# On any VM from this playbook (HC 9.6.x):
grep GRUB_DISABLE_LINUX_UUID /etc/default/grub
# → GRUB_DISABLE_LINUX_UUID=true

cat /proc/cmdline
# → BOOT_IMAGE=/boot/vmlinuz-... root=/dev/vda1 ro nomodeset ...

# /etc/fstab is already correct (LABEL=cloudimg-rootfs) — only grub is the weak link.
# Verified on an HC 9.6.25 cluster: 8 VMs, all 8 had the bad setting.

After attaching a second VIRTIO_DISK and rebooting, the kernel boot gets as far as the log line "Btrfs loaded, crc32c=crc32c-intel, zoned=yes, fsverity=yes" and then hangs: the initramfs is waiting for a /dev/vda1 that doesn't exist.
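For auditing more than a handful of VMs, the /proc/cmdline check can be wrapped in a small classifier. This is a rough sketch rather than a complete taxonomy: classify_root is a hypothetical helper name, and the set of prefixes treated as stable is my assumption about what counts as enumeration-independent.

```shell
#!/bin/sh
# classify_root CMDLINE: report how the root= argument identifies the root fs.
# Hypothetical helper; the pattern lists are assumptions, not exhaustive.
classify_root() {
    case " $1 " in
        *" root=UUID="*|*" root=PARTUUID="*|*" root=LABEL="*|*" root=PARTLABEL="*)
            echo "stable" ;;
        *" root=/dev/disk/by-"*|*" root=/dev/mapper/"*)
            echo "stable" ;;
        *" root=/dev/vd"*|*" root=/dev/sd"*)
            echo "device-name" ;;   # depends on enumeration order
        *" root="*)
            echo "other" ;;
        *)
            echo "missing" ;;
    esac
}

# The command line from the reproducer:
classify_root "BOOT_IMAGE=/boot/vmlinuz-... root=/dev/vda1 ro nomodeset ..."
# prints: device-name

# On a live host:
# classify_root "$(cat /proc/cmdline)"
```

Any host that reports device-name is exposed to the hang described here once a second virtio disk is attached.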

Suggested fix in the playbook

Add a task that runs once at provision time, after the VM has booted but before any non-OS disks could be attached:

- name: Use UUID-based root for resilience against extra virtio disk attachments
  ansible.builtin.lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_DISABLE_LINUX_UUID='
    line: '#GRUB_DISABLE_LINUX_UUID=true  # disabled: vda reorder hazard'
  register: grub_default

# Only regenerate when /etc/default/grub actually changed, and report the change.
- name: Regenerate grub.cfg with root=UUID=
  ansible.builtin.command: update-grub
  when: grub_default is changed

Or, more cleanly, apply it through cloud-init's runcmd:

runcmd:
  - sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
  - update-grub
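Because runcmd or a re-provision may run the same sed more than once, it is worth confirming that the substitution is idempotent. The sketch below exercises it against a scratch file instead of the real /etc/default/grub. The comment_out_uuid_flag wrapper and the scratch file contents are illustrative assumptions, and the in-place flag assumes GNU sed.

```shell
#!/bin/sh
# comment_out_uuid_flag FILE: hypothetical wrapper around the sed expression above.
# Assumes GNU sed (-i without a backup suffix).
comment_out_uuid_flag() {
    sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' "$1"
}

# Scratch copy standing in for /etc/default/grub (contents are illustrative).
scratch=$(mktemp)
printf 'GRUB_DEFAULT=0\nGRUB_DISABLE_LINUX_UUID=true\n' > "$scratch"

comment_out_uuid_flag "$scratch"
comment_out_uuid_flag "$scratch"   # second run is a no-op
cat "$scratch"
# prints: GRUB_DEFAULT=0
#         #GRUB_DISABLE_LINUX_UUID=true
rm -f "$scratch"
```

The leading ^ anchor is what makes the second run harmless: once the line starts with #, the pattern no longer matches.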

Workaround for existing deployments

Per host:

sudo sed -i.bak 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
sudo update-grub
# Verify: grep "root=" /boot/grub/grub.cfg | head -1
#   should now contain root=UUID=<your-rootfs-uuid>

I just applied this to all 8 VMs on the dd-k3s reference cluster — works cleanly on Ubuntu 22.04 cloud-image. grub.cfg now contains root=UUID=ff9efcec-... instead of root=/dev/vda1. The cluster is now defended against the virtio-reorder boot hang at both layers (BIOS bootDevices already patched via #7's workaround, plus this grub change).
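The head -1 check only inspects the first match, so an older kernel or recovery entry could still carry the device path. A slightly stricter check, sketched here with a hypothetical count_device_roots helper, counts every kernel line in a grub.cfg-style file that still passes root=/dev/. The sample file is illustrative and not copied from a real host.

```shell
#!/bin/sh
# count_device_roots FILE: print how many "linux" lines still use root=/dev/...
# Hypothetical helper name; "|| true" keeps the exit status 0 when the count is 0.
count_device_roots() {
    grep -E '^[[:space:]]*linux[[:space:]]' "$1" \
        | grep -cE '[[:space:]]root=/dev/' || true
}

# Illustrative pre-fix sample standing in for /boot/grub/grub.cfg.
sample=$(mktemp)
cat > "$sample" <<'EOF'
menuentry 'Ubuntu' {
        linux   /boot/vmlinuz-... root=/dev/vda1 ro nomodeset
        initrd  /boot/initrd.img-...
}
EOF

count_device_roots "$sample"
# prints: 1
rm -f "$sample"

# On a host after the workaround, the expectation would be 0:
# count_device_roots /boot/grub/grub.cfg
```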

Why this matters together with #7

bootDevices (BIOS layer) + UUID root (initramfs layer) are both needed. Fixing only one leaves the VM vulnerable:

| bootDevices | grub root= | Reboot after 2nd virtio disk attached |
|---|---|---|
| empty | /dev/vda1 | "No bootable device" (BIOS gives up) |
| OS disk pinned | /dev/vda1 | grub loads, initramfs hangs on missing /dev/vda1 |
| empty | UUID | "No bootable device" (BIOS gives up before UUID matters) |
| OS disk pinned | UUID | Boots cleanly ✓ |

Both fixes should land together in the playbook for a complete defense.

Suggested labels

bug, priority:high — same "silent until reboot" hazard as #7.
