OCPEDGE-2489: fix: pick cluster VM IP from DHCP lease by hostname#61
OCPEDGE-2489: fix: pick cluster VM IP from DHCP lease by hostname#61jaypoulz wants to merge 2 commits intoopenshift-eng:mainfrom
Conversation
libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID vs hostname). Prefer the lease whose Hostname matches the VM (e.g. master-0 for ostest_master_0), then any lease with a hostname, then any lease. Add roles/common/files/resolve_vm_ip.sh and call it from update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan. Made-with: Cursor
|
@jaypoulz: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jaypoulz The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold |
WalkthroughAdds a new Bash script to resolve VM IPs by querying libvirt interfaces and DHCP leases (with optional hostname preference) and updates an Ansible task to call this script instead of inline discovery; also makes SSH host formatting IPv6-safe and increases SSH connect timeout. Changes
Sequence Diagram(s)sequenceDiagram
participant Ansible
participant resolve_vm_ip.sh
participant libvirt as virsh/libvirt
participant DHCP as net-dhcp-leases
participant VM
Ansible->>resolve_vm_ip.sh: invoke with VM_NAME
resolve_vm_ip.sh->>libvirt: virsh domiflist VM_NAME (get interfaces/MACs)
resolve_vm_ip.sh->>DHCP: virsh net-dhcp-leases <network> (per network)
DHCP-->>resolve_vm_ip.sh: lease rows (IP/CIDR, MAC, hostname)
resolve_vm_ip.sh-->>Ansible: resolved IP (CIDR stripped)
Ansible->>Ansible: compute SSH_HOST (wrap IPv6 in brackets)
Ansible->>VM: ssh-keyscan / ssh using SSH_HOST (timeout 15s)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@jaypoulz: This pull request references OCPEDGE-2489 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml (1)
20-25:⚠️ Potential issue | 🟠 MajorGuard the inventory rewrite when IP resolution returns nothing.
Line 25 suppresses every resolver failure. Combined with the filter on Line 34, a bad
resolve_vm_ip.shrun turns into an emptyparsed_vm_entries, and this block later rewrites[cluster_vms]from that empty list. That can silently wipe the existing VM inventory. Please fail or skip the rewrite when no addresses were resolved.Suggested guard
- name: Get VM IP addresses from DHCP leases (hostname preferred) script: resolve_vm_ip.sh {{ item | quote }} register: vm_ips loop: "{{ cluster_vms.stdout_lines }}" changed_when: false failed_when: false + + - name: Abort when no VM IPs were resolved + fail: + msg: >- + Discovered {{ cluster_vms.stdout_lines | length }} cluster VM(s), but + none of their IPs could be resolved from libvirt DHCP leases. + when: >- + (vm_ips.results | map(attribute='stdout') | map('trim') | + reject('equalto', '') | list | length) == 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml` around lines 20 - 25, The current task running resolve_vm_ip.sh (script: resolve_vm_ip.sh ...) registers results to vm_ips but suppresses failures, which can lead to parsed_vm_entries being empty and rewrite the [cluster_vms] inventory from nothing; update the play to guard the rewrite by checking vm_ips.results (or the computed parsed_vm_entries) and skip/fail the inventory update when no IPs were resolved—e.g., after the script task add a conditional task that inspects vm_ips.results to ensure at least one non-empty address (or set failed_when: true on the script when all results are empty), and only allow the subsequent block that rewrites [cluster_vms] (the rewrite task that uses parsed_vm_entries) to run when that check passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh`:
- Line 4: The grep -i "$MAC" pipelines currently cause the script to exit under
set -euo pipefail when a match fails; locate the command substitutions that pipe
into grep -i "$MAC" (the ones that populate the IP lookup fallbacks using the
MAC variable) and replace those grep pipelines with an awk match against the MAC
(case-insensitive) so the pipeline returns 0 even when no match is found and
fallbacks continue; ensure awk extracts the same fields as the original pipeline
(preserve the same capture/field selection) and use tolower comparison or
IGNORECASE in awk to implement case-insensitive matching.
---
Outside diff comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-25: The current task running resolve_vm_ip.sh (script:
resolve_vm_ip.sh ...) registers results to vm_ips but suppresses failures, which
can lead to parsed_vm_entries being empty and rewrite the [cluster_vms]
inventory from nothing; update the play to guard the rewrite by checking
vm_ips.results (or the computed parsed_vm_entries) and skip/fail the inventory
update when no IPs were resolved—e.g., after the script task add a conditional
task that inspects vm_ips.results to ensure at least one non-empty address (or
set failed_when: true on the script when all results are empty), and only allow
the subsequent block that rewrites [cluster_vms] (the rewrite task that uses
parsed_vm_entries) to run when that check passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2eaf1dbb-74e9-4e5e-af69-922b609907bd
📒 Files selected for processing (2)
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.shdeploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml
| #!/usr/bin/env bash | ||
| # Resolve VM IP from virsh net-dhcp-leases. The same MAC can appear twice (anonymous | ||
| # DUID vs hostname); prefer the row whose Hostname matches this VM (e.g. master-0). | ||
| set -euo pipefail |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
tmp="$(mktemp)"
cat >"$tmp" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
LEASES=$'2026-03-30 00:00:00 aa:bb:cc:dd:ee:ff ipv4 192.0.2.10/24 host-0 duid'
MAC='11:22:33:44:55:66'
IP=$(printf '%s\n' "$LEASES" | grep -i "$MAC" | awk '{ print $5; exit }')
echo "survived: $IP"
EOF
if bash "$tmp"; then
echo "unexpected: command substitution survived"
rm -f "$tmp"
exit 1
fi
rm -f "$tmp"
rg -n -C2 'set -euo pipefail|grep -i "\$MAC"' deploy/openshift-clusters/roles/common/files/resolve_vm_ip.shRepository: openshift-eng/two-node-toolbox
Length of output: 782
🏁 Script executed:
cat -n deploy/openshift-clusters/roles/common/files/resolve_vm_ip.shRepository: openshift-eng/two-node-toolbox
Length of output: 2096
The grep pipelines can exit the helper before the fallbacks run.
With set -euo pipefail on Line 4, a grep -i "$MAC" miss on Lines 22, 30, or 38 returns 1 and aborts the whole script from inside the command substitution. This prevents fallback attempts and stops the interface loop before all MACs are tried. Match the MAC in awk instead of piping through grep.
Suggested fix
if [[ -n "$EXPECTED_HOSTNAME" ]]; then
- IP=$(echo "$LEASES" | grep -i "$MAC" | awk -v host="$EXPECTED_HOSTNAME" '$6 == host {
+ IP=$(awk -v mac="$MAC" -v host="$EXPECTED_HOSTNAME" 'tolower($3) == mac && $6 == host {
ip = $5
sub(/\/.*/, "", ip)
print ip
exit
- }')
+ }' <<< "$LEASES")
fi
if [[ -z "$IP" ]]; then
- IP=$(echo "$LEASES" | grep -i "$MAC" | awk '$6 != "-" {
+ IP=$(awk -v mac="$MAC" 'tolower($3) == mac && $6 != "-" {
ip = $5
sub(/\/.*/, "", ip)
print ip
exit
- }')
+ }' <<< "$LEASES")
fi
if [[ -z "$IP" ]]; then
- IP=$(echo "$LEASES" | grep -i "$MAC" | awk '{
+ IP=$(awk -v mac="$MAC" 'tolower($3) == mac {
ip = $5
sub(/\/.*/, "", ip)
print ip
exit
- }')
+ }' <<< "$LEASES")
fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh` at line 4, The
grep -i "$MAC" pipelines currently cause the script to exit under set -euo
pipefail when a match fails; locate the command substitutions that pipe into
grep -i "$MAC" (the ones that populate the IP lookup fallbacks using the MAC
variable) and replace those grep pipelines with an awk match against the MAC
(case-insensitive) so the pipeline returns 0 even when no match is found and
fallbacks continue; ensure awk extracts the same fields as the original pipeline
(preserve the same capture/field selection) and use tolower comparison or
IGNORECASE in awk to implement case-insensitive matching.
Some loop results from the VM IP resolver may not include stdout when a single script invocation fails, which previously caused a hard template failure. Default stdout to an empty string so the play skips that entry. Made-with: Cursor
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml (1)
20-25:⚠️ Potential issue | 🟠 MajorDon't silently rewrite
cluster_vmsto zero hosts on resolver failure.
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.shexits non-zero with empty stdout on its no-result paths, and this block now turns those cases into “skip this VM”. If the resolver breaks globally, the later inventory rewrite still removes the existing[cluster_vms]section and writes back an empty group even thoughvirsh listfound running VMs. Please fail loudly when VMs were discovered but zero IPs were resolved.Also applies to: 27-38
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml` around lines 20 - 25, The current task "Get VM IP addresses from DHCP leases (hostname preferred)" runs resolve_vm_ip.sh with failed_when: false which hides resolver failures and lets the inventory be rewritten empty; change the flow so that you still collect per-VM results in vm_ips but explicitly fail if virsh found VMs (cluster_vms.stdout_lines is non-empty) and none of the vm_ips.results contain a usable IP. Concretely: keep the script task but remove or limit the global failed_when:false behavior, then add a follow-up assertion/fail task that examines vm_ips.results and fails with a clear message when cluster_vms.stdout_lines | length > 0 and vm_ips.results | selectattr('stdout','truthy') | list | length == 0 (apply same check for the other similar block referenced). Ensure the failure path triggers before the inventory rewrite.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-25: The current task "Get VM IP addresses from DHCP leases
(hostname preferred)" runs resolve_vm_ip.sh with failed_when: false which hides
resolver failures and lets the inventory be rewritten empty; change the flow so
that you still collect per-VM results in vm_ips but explicitly fail if virsh
found VMs (cluster_vms.stdout_lines is non-empty) and none of the vm_ips.results
contain a usable IP. Concretely: keep the script task but remove or limit the
global failed_when:false behavior, then add a follow-up assertion/fail task that
examines vm_ips.results and fails with a clear message when
cluster_vms.stdout_lines | length > 0 and vm_ips.results |
selectattr('stdout','truthy') | list | length == 0 (apply same check for the
other similar block referenced). Ensure the failure path triggers before the
inventory rewrite.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7991a4bc-ac01-4a44-b04f-5780bdc69110
📒 Files selected for processing (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml
libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID vs hostname). Prefer the lease whose Hostname matches the VM (e.g. master-0 for ostest_master_0), then any lease with a hostname, then any lease.
Add roles/common/files/resolve_vm_ip.sh and call it from update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan.
This is UNTESTED on IPv4
Made-with: Cursor