KVM: fix UEFI disk-only instance snapshot NVRAM handling#13020
KVM: fix UEFI disk-only instance snapshot NVRAM handling#13020Kunalbehbud wants to merge 6 commits intoapache:4.22from
Conversation
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## 4.22 #13020 +/- ##
============================================
+ Coverage 17.68% 17.74% +0.05%
- Complexity 15793 15866 +73
============================================
Files 5922 5922
Lines 533096 533575 +479
Branches 65209 65262 +53
============================================
+ Hits 94275 94663 +388
- Misses 428181 428201 +20
- Partials 10640 10711 +71
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17485 |
|
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
f657823 to
2bc9051
Compare
|
Rebased this branch on the latest Local verification passed again on top of the updated base: The new GitHub Actions runs for this fork push are currently in |
There was a problem hiding this comment.
Pull request overview
Fixes KVM disk-only instance snapshots for UEFI VMs by capturing/restoring the active NVRAM state via an NVRAM “sidecar” file, and gating these flows on explicit host capabilities to prevent unsafe reverts.
Changes:
- Plumbs UEFI/NVRAM metadata through management ↔ agent commands (create/revert/delete) and persists NVRAM sidecar path in snapshot details.
- Implements NVRAM sidecar copy on snapshot creation and atomic restore on revert; adds sidecar cleanup on delete/merge flows.
- Adds host capability advertising + reconnect-time capability syncing/cleanup, plus targeted unit tests for the new behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| plugins/hypervisors/kvm/src/test/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtDiskOnlyVMSnapshotCommandWrapperTest.java | New unit tests covering NVRAM sidecar backup/restore/cleanup and freeze/suspend sequencing/warnings. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtCreateDiskOnlyVMSnapshotCommandWrapper.java | Copies active NVRAM into a per-snapshot sidecar; adds freeze/suspend + post-snapshot recovery warnings. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtRevertDiskOnlyVMSnapshotCommandWrapper.java | Validates presence of NVRAM sidecar for UEFI reverts and atomically restores active NVRAM from the sidecar. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtDeleteDiskOnlyVMSnapshotCommandWrapper.java | Deletes NVRAM sidecar when present, using either provided primary datastore or root snapshot lookup. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtReadyCommandWrapper.java | Advertises the new host capability for NVRAM-aware disk-only snapshots. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/LibvirtComputingResource.java | Adds host detail advertising at startup and provides getUefiNvramPath(vmUuid) helper. |
| engine/storage/snapshot/src/test/java/org/apache/cloudstack/storage/vmsnapshot/KvmFileBasedStorageVmSnapshotStrategyTest.java | New tests validating NVRAM path persistence, agent command plumbing, and capability gating. |
| engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/vmsnapshot/KvmFileBasedStorageVmSnapshotStrategy.java | Persists NVRAM sidecar metadata, gates create/revert/delete/merge cleanup on capabilities, sends alert on recovery warnings. |
| engine/orchestration/src/test/java/com/cloud/agent/manager/AgentManagerImplTest.java | Adds test ensuring stale NVRAM capability is cleared on reconnect when no longer advertised. |
| engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java | Syncs boolean host capabilities from ReadyAnswer details, including removal of stale DB details when not advertised. |
| core/src/main/java/com/cloud/agent/api/storage/CreateDiskOnlyVmSnapshotCommand.java | Adds vmUuid and uefiEnabled fields for agent-side NVRAM handling decisions. |
| core/src/main/java/com/cloud/agent/api/storage/CreateDiskOnlyVmSnapshotAnswer.java | Adds nvramSnapshotPath to return sidecar path back to management. |
| core/src/main/java/com/cloud/agent/api/storage/RevertDiskOnlyVmSnapshotCommand.java | Adds vmUuid, uefiEnabled, and nvramSnapshotPath to support validated/atomic NVRAM restores. |
| core/src/main/java/com/cloud/agent/api/storage/DeleteDiskOnlyVmSnapshotCommand.java | Adds optional nvramSnapshotPath and primaryDataStore to support sidecar-only cleanup. |
| api/src/main/java/com/cloud/host/Host.java | Introduces HOST_KVM_DISK_ONLY_VM_SNAPSHOT_NVRAM host capability key. |
| plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtConvertInstanceCommandWrapper.java | Fixes a typo in an error message (“calll” → “call”). |
| agent/conf/agent.properties | Removes a trailing whitespace-only line. |
| PendingReleaseNotes | Adds release note about UEFI disk-only snapshots now capturing NVRAM and legacy snapshot revert rejection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…spend Address Copilot's review comment on apache#13020. `JsonParser.parse(...)`, `.get("return")` and `.getAsString()` in `verifyVmFilesystemsFrozen` can throw unchecked exceptions (`JsonSyntaxException`, `IllegalStateException`, `NullPointerException`) on malformed or unexpected QEMU guest-agent responses, bypassing the surrounding `catch (LibvirtException | IOException)` and potentially leaving the guest frozen until the `finally` recovery runs. Parse defensively, validate that `return` is a JSON string, and explicitly treat `{"error": {...}}` guest-agent responses as an IO failure so the outer handler thaws/resumes the guest and surfaces a proper failure answer. Also document in PendingReleaseNotes that taking a disk-only instance snapshot for KVM UEFI VMs now briefly suspends the guest while the NVRAM sidecar is copied. Non-UEFI VMs are unaffected. Add regression tests covering the guest-agent error response and the malformed-JSON path, including verification that the thaw recovery is invoked when freeze verification fails. Signed-off-by: kunal.behbudzade <kunal.behbudzade@btsgrp.com>
|
Addressed the Copilot comment in 142d116. Added two regression tests ( Also updated Local verification on top of 142d116 passed: |
Add the command payload, answer metadata, and host capability plumbing required for KVM disk-only instance snapshots to carry UEFI NVRAM state between management and the KVM agent. Also synchronize host capability booleans on reconnect so stale UEFI/NVRAM support details are removed when an older agent reconnects.
Copy the active UEFI NVRAM file as a sidecar during disk-only instance snapshot creation, restore it on revert, and clean it up during delete and merge flows. Also tighten host capability checks, preserve successful snapshot metadata when post-snapshot thaw or resume fails, and reject reverting UEFI disk-only snapshots that do not contain NVRAM state.
Cover host capability synchronization, UEFI NVRAM sidecar handling across create/revert/delete flows, and the running-VM recovery paths for disk-only instance snapshots.
…spend Address Copilot's review comment on apache#13020. `JsonParser.parse(...)`, `.get("return")` and `.getAsString()` in `verifyVmFilesystemsFrozen` can throw unchecked exceptions (`JsonSyntaxException`, `IllegalStateException`, `NullPointerException`) on malformed or unexpected QEMU guest-agent responses, bypassing the surrounding `catch (LibvirtException | IOException)` and potentially leaving the guest frozen until the `finally` recovery runs. Parse defensively, validate that `return` is a JSON string, and explicitly treat `{"error": {...}}` guest-agent responses as an IO failure so the outer handler thaws/resumes the guest and surfaces a proper failure answer. Also document in PendingReleaseNotes that taking a disk-only instance snapshot for KVM UEFI VMs now briefly suspends the guest while the NVRAM sidecar is copied. Non-UEFI VMs are unaffected. Add regression tests covering the guest-agent error response and the malformed-JSON path, including verification that the thaw recovery is invoked when freeze verification fails. Signed-off-by: kunal.behbudzade <kunal.behbudzade@btsgrp.com>
|
Replaced the GitHub |
5f30788 to
75bf215
Compare
|
Local verification passed again on the current rebased head: No GitHub checks have attached to the current head yet. Could a maintainer please re-run or approve the workflows for this PR head? |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String result = getResultOfQemuCommand(FreezeThawVMCommand.FREEZE, domain); | ||
| if (StringUtils.isBlank(result) || result.startsWith("error")) { | ||
| throw new IOException(String.format("Failed to freeze VM [%s] filesystems before taking the disk-only VM snapshot. Result: %s", vmName, result)); | ||
| } |
There was a problem hiding this comment.
freezeVmFilesystems() treats QEMU guest agent failures as only null/blank or result.startsWith("error"). qemuAgentCommand typically returns JSON like { "error": ... } on failure, which will not be caught here and may allow the snapshot flow to continue as if the freeze succeeded. Consider parsing the JSON response and failing when it contains an error object (or reuse the existing FreezeThawVM logic of checking guest-fsfreeze-status after the freeze).
| } | |
| } | |
| try { | |
| JsonElement resultElement = new JsonParser().parse(result); | |
| if (resultElement.isJsonObject()) { | |
| JsonObject resultObject = resultElement.getAsJsonObject(); | |
| if (resultObject.has("error")) { | |
| throw new IOException(String.format("Failed to freeze VM [%s] filesystems before taking the disk-only VM snapshot. Result: %s", vmName, result)); | |
| } | |
| } | |
| } catch (IOException e) { | |
| throw e; | |
| } catch (RuntimeException e) { | |
| // Keep existing behavior for non-JSON successful responses while still failing on JSON error payloads. | |
| } |
| protected boolean thawVmFilesystemsIfNeeded(Domain domain, String vmName) { | ||
| try { | ||
| String result = getResultOfQemuCommand(FreezeThawVMCommand.THAW, domain); | ||
| if (StringUtils.isBlank(result) || result.startsWith("error")) { | ||
| logger.warn("Failed to thaw VM [{}] filesystems after taking the disk-only VM snapshot. Result: {}", vmName, result); | ||
| return false; | ||
| } | ||
| return true; | ||
| } catch (LibvirtException e) { |
There was a problem hiding this comment.
thawVmFilesystemsIfNeeded() uses the same result.startsWith("error") check, so a JSON QEMU error response (e.g. { "error": ... }) could be treated as success. That can incorrectly suppress the warning/alert path and leave guest filesystems frozen. Parse the JSON and treat responses containing an error field (or lacking an expected return) as a thaw failure.
| } | ||
|
|
||
| String subject = String.format("Disk-only VM snapshot [%s] completed with guest recovery warnings", vmSnapshot.getUuid()); | ||
| String message = String.format("Disk-only VM snapshot [%s] for UEFI VM [%s] completed, but post-snapshot guest recovery reported: %s", |
There was a problem hiding this comment.
notifyGuestRecoveryIssueIfNeeded() hard-codes "for UEFI VM" in the alert message, but the method is invoked unconditionally after snapshot creation. If answer.getDetails() is ever populated for a non-UEFI snapshot path, the alert text will be misleading. Consider either gating this method on isUefiVm(userVm) or making the alert text generic (or include boot type conditionally).
| String message = String.format("Disk-only VM snapshot [%s] for UEFI VM [%s] completed, but post-snapshot guest recovery reported: %s", | |
| String message = String.format("Disk-only VM snapshot [%s] for VM [%s] completed, but post-snapshot guest recovery reported: %s", |
JoaoJandre
left a comment
There was a problem hiding this comment.
Hey @Kunalbehbud, thanks for the contribution. I have some doubts and reviews regarding this change, but overall the code quality looks good.
|
|
||
| VMSnapshotVO vmSnapshotBeingReverted = (VMSnapshotVO) vmSnapshot; | ||
| Long hostId = vmSnapshotHelper.pickRunningHost(vmSnapshotBeingReverted.getVmId()); | ||
| validateHostSupportsUefiNvramAwareDiskOnlySnapshots(hostId, userVm, "revert"); |
There was a problem hiding this comment.
If we need to validate the host support for restore, we should try to find a host with this in mind in the first place.
If the VM has no lastHost set, the pickRunningHost method will return a random host that is up and enabled.
In a scenario where half the hosts have UEFI support and half does not, we might get an error when trying to revert, followed by a success on the following attempt. It makes the system seem unstable.
You could either create a new method to pick hosts that have uefi enabled, or tinker with the existing one, adding a parameter that when set will filter this on the SQL query.
I would apply this to all processes (create, delete, revert).
There was a problem hiding this comment.
Agreed. Host selection now looks for the required capabilities before sending the command. One nuance is stopped UEFI snapshot creation does not fall back away from the VM's last host, because the active NVRAM file is host-local. Revert and sidecar cleanup can still use a capable fallback host.
| filesystemsFrozenByThisWrapper = true; | ||
| verifyVmFilesystemsFrozen(dm, vmName); | ||
| } | ||
| if (shouldSuspendVmForSnapshot(cmd)) { |
There was a problem hiding this comment.
Do we need to freeze the filesystem if the vm is suspended?
Also, if we always need to suspend the VM to create the snapshot when it is using UEFI, this should be explicitly stated in the API and GUI.
What happens if the VM is not suspended?
There was a problem hiding this comment.
The freeze and suspend cover different parts of the consistency window. The guest-agent freeze flushes/quiesces guest filesystems, while suspend prevents concurrent pflash/NVRAM writes while the sidecar is copied. Without the suspend, NVRAM can change during the copy and no longer match the disk snapshot. I added the API and UI warning for this behavior.
| @Override | ||
| public Answer execute(final ReadyCommand command, final LibvirtComputingResource libvirtComputingResource) { | ||
| Map<String, String> hostDetails = new HashMap<String, String>(); | ||
| hostDetails.put(Host.HOST_KVM_DISK_ONLY_VM_SNAPSHOT_NVRAM, Boolean.TRUE.toString()); |
There was a problem hiding this comment.
Isn't this supposed to be configurable?
There was a problem hiding this comment.
I kept this as an advertised agent capability rather than a config option. It reflects whether the KVM agent version understands the NVRAM sidecar flow. Making it configurable could let management send these commands to an agent that does not actually support them, which is unsafe during rolling upgrades.
| * Taking a disk-only instance snapshot for KVM UEFI VMs now briefly suspends | ||
| the guest while the NVRAM sidecar is copied, so that the captured firmware | ||
| state is consistent with the disk snapshot. Non-UEFI VMs are unaffected and | ||
| continue to snapshot live. |
There was a problem hiding this comment.
This should be warned in the command as well. And GUI.
There was a problem hiding this comment.
added this to the createVMSnapshot API description and to the UI confirmation message for the Instance Snapshot action.
Keep stopped UEFI snapshot creation tied to the last host so the agent copies the active host-local NVRAM file, while still allowing capable-host selection for revert and NVRAM sidecar cleanup. Also handle QEMU guest-agent error payloads consistently and add the API/UI warning for the brief suspend used when snapshotting running UEFI guests. Signed-off-by: kunal.behbudzade <kunal.behbudzade@btsgrp.com>
|
Pushed a follow-up commit for the review feedback. The main change is that stopped UEFI disk-only snapshot creation no longer falls back away from the VM's last host. The active NVRAM file is host-local, so if the selected host is not the last host we fail early instead of risking a stale or missing NVRAM sidecar. Capability-aware host selection is still used for revert and sidecar cleanup. I also removed the duplicate sidecar cleanup validation and the test-only freeze helper, handled QEMU guest-agent error payloads in freeze/thaw paths, and added the API/UI warning for the brief suspend used for running KVM UEFI snapshots. Verified locally: |
Description
KVM disk-only instance snapshots do not capture the active UEFI NVRAM state, which makes revert unsafe for UEFI guests.
This PR fixes the KVM disk-only instance snapshot flow to:
quiescevm=trueOlder UEFI disk-only snapshots created without an NVRAM sidecar are intentionally rejected on revert.
This PR only covers
disk-onlyinstance snapshots for KVM UEFI VMs.snapshotMemory=true/ internal snapshots remain out of scope.Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
N/A
How Has This Been Tested?
mvn -pl engine/orchestration,engine/storage/snapshot,plugins/hypervisors/kvm -am -Dtest=AgentManagerImplTest,KvmFileBasedStorageVmSnapshotStrategyTest,LibvirtDiskOnlyVMSnapshotCommandWrapperTest -Dsurefire.failIfNoSpecifiedTests=false testHow did you try to break this feature and the system with this change?