Skip to content

OCPEDGE-2489: fix: pick cluster VM IP from DHCP lease by hostname#61

Open
jaypoulz wants to merge 2 commits intoopenshift-eng:mainfrom
jaypoulz:fix-dhcp-lease-hostname
Open

OCPEDGE-2489: fix: pick cluster VM IP from DHCP lease by hostname#61
jaypoulz wants to merge 2 commits intoopenshift-eng:mainfrom
jaypoulz:fix-dhcp-lease-hostname

Conversation

@jaypoulz
Copy link
Copy Markdown
Contributor

@jaypoulz jaypoulz commented Mar 30, 2026

libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID vs hostname). Prefer the lease whose Hostname matches the VM (e.g. master-0 for ostest_master_0), then any lease with a hostname, then any lease.

Add roles/common/files/resolve_vm_ip.sh and call it from update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan.

This is UNTESTED on IPv4

Made-with: Cursor

libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID
vs hostname). Prefer the lease whose Hostname matches the VM (e.g.
master-0 for ostest_master_0), then any lease with a hostname, then any
lease.

Add roles/common/files/resolve_vm_ip.sh and call it from
update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan.

Made-with: Cursor
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 30, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@jaypoulz: This pull request explicitly references no jira issue.

Details

In response to this:

libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID vs hostname). Prefer the lease whose Hostname matches the VM (e.g. master-0 for ostest_master_0), then any lease with a hostname, then any lease.

Add roles/common/files/resolve_vm_ip.sh and call it from update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan.

Made-with: Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from fonta-rh and jerpeter1 March 30, 2026 19:41
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jaypoulz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2026
@jaypoulz
Copy link
Copy Markdown
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 30, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 30, 2026

Walkthrough

Adds a new Bash script to resolve VM IPs by querying libvirt interfaces and DHCP leases (with optional hostname preference) and updates an Ansible task to call this script instead of inline discovery; also makes SSH host formatting IPv6-safe and increases SSH connect timeout.

Changes

Cohort / File(s) Summary
VM IP Resolution Script
deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
New executable script: enumerates VM interfaces via virsh domiflist, queries virsh net-dhcp-leases per network for MAC-to-IP mappings, prefers leases matching a derived hostname pattern, falls back to first lease for a MAC, strips CIDR, prints the resolved IP, exits 0 on success or 1 on failure.
Ansible Inventory Update Task
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml
Replaced inline VM IP discovery with per-VM invocation of resolve_vm_ip.sh; defaulted empty script output to '' before trimming; compute SSH_HOST that wraps IPv6 addresses in brackets for ssh-keyscan and ssh; increased SSH connect timeout from 10s to 15s.

Sequence Diagram(s)

sequenceDiagram
    participant Ansible
    participant resolve_vm_ip.sh
    participant libvirt as virsh/libvirt
    participant DHCP as net-dhcp-leases
    participant VM

    Ansible->>resolve_vm_ip.sh: invoke with VM_NAME
    resolve_vm_ip.sh->>libvirt: virsh domiflist VM_NAME (get interfaces/MACs)
    resolve_vm_ip.sh->>DHCP: virsh net-dhcp-leases <network> (per network)
    DHCP-->>resolve_vm_ip.sh: lease rows (IP/CIDR, MAC, hostname)
    resolve_vm_ip.sh-->>Ansible: resolved IP (CIDR stripped)
    Ansible->>Ansible: compute SSH_HOST (wrap IPv6 in brackets)
    Ansible->>VM: ssh-keyscan / ssh using SSH_HOST (timeout 15s)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@jaypoulz jaypoulz changed the title NO-JIRA: fix: pick cluster VM IP from DHCP lease by hostname OCPEDGE-2489: fix: pick cluster VM IP from DHCP lease by hostname Mar 30, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Mar 30, 2026

@jaypoulz: This pull request references OCPEDGE-2489 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

libvirt net-dhcp-leases can show multiple rows per MAC (anonymous DUID vs hostname). Prefer the lease whose Hostname matches the VM (e.g. master-0 for ostest_master_0), then any lease with a hostname, then any lease.

Add roles/common/files/resolve_vm_ip.sh and call it from update-cluster-inventory; use bracketed IPv6 for ssh/ssh-keyscan.

This is UNTESTED on IPv4

Made-with: Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml (1)

20-25: ⚠️ Potential issue | 🟠 Major

Guard the inventory rewrite when IP resolution returns nothing.

Line 25 suppresses every resolver failure. Combined with the filter on Line 34, a bad resolve_vm_ip.sh run turns into an empty parsed_vm_entries, and this block later rewrites [cluster_vms] from that empty list. That can silently wipe the existing VM inventory. Please fail or skip the rewrite when no addresses were resolved.

Suggested guard
     - name: Get VM IP addresses from DHCP leases (hostname preferred)
       script: resolve_vm_ip.sh {{ item | quote }}
       register: vm_ips
       loop: "{{ cluster_vms.stdout_lines }}"
       changed_when: false
       failed_when: false
+
+    - name: Abort when no VM IPs were resolved
+      fail:
+        msg: >-
+          Discovered {{ cluster_vms.stdout_lines | length }} cluster VM(s), but
+          none of their IPs could be resolved from libvirt DHCP leases.
+      when: >-
+        (vm_ips.results | map(attribute='stdout') | map('trim') |
+         reject('equalto', '') | list | length) == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`
around lines 20 - 25, The current task running resolve_vm_ip.sh (script:
resolve_vm_ip.sh ...) registers results to vm_ips but suppresses failures, which
can lead to parsed_vm_entries being empty and rewrite the [cluster_vms]
inventory from nothing; update the play to guard the rewrite by checking
vm_ips.results (or the computed parsed_vm_entries) and skip/fail the inventory
update when no IPs were resolved—e.g., after the script task add a conditional
task that inspects vm_ips.results to ensure at least one non-empty address (or
set failed_when: true on the script when all results are empty), and only allow
the subsequent block that rewrites [cluster_vms] (the rewrite task that uses
parsed_vm_entries) to run when that check passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh`:
- Line 4: The grep -i "$MAC" pipelines currently cause the script to exit under
set -euo pipefail when a match fails; locate the command substitutions that pipe
into grep -i "$MAC" (the ones that populate the IP lookup fallbacks using the
MAC variable) and replace those grep pipelines with an awk match against the MAC
(case-insensitive) so the pipeline returns 0 even when no match is found and
fallbacks continue; ensure awk extracts the same fields as the original pipeline
(preserve the same capture/field selection) and use tolower comparison or
IGNORECASE in awk to implement case-insensitive matching.

---

Outside diff comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-25: The current task running resolve_vm_ip.sh (script:
resolve_vm_ip.sh ...) registers results to vm_ips but suppresses failures, which
can lead to parsed_vm_entries being empty and rewrite the [cluster_vms]
inventory from nothing; update the play to guard the rewrite by checking
vm_ips.results (or the computed parsed_vm_entries) and skip/fail the inventory
update when no IPs were resolved—e.g., after the script task add a conditional
task that inspects vm_ips.results to ensure at least one non-empty address (or
set failed_when: true on the script when all results are empty), and only allow
the subsequent block that rewrites [cluster_vms] (the rewrite task that uses
parsed_vm_entries) to run when that check passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2eaf1dbb-74e9-4e5e-af69-922b609907bd

📥 Commits

Reviewing files that changed from the base of the PR and between c2c0209 and 7f70935.

📒 Files selected for processing (2)
  • deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh
  • deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml

#!/usr/bin/env bash
# Resolve VM IP from virsh net-dhcp-leases. The same MAC can appear twice (anonymous
# DUID vs hostname); prefer the row whose Hostname matches this VM (e.g. master-0).
set -euo pipefail
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

tmp="$(mktemp)"
cat >"$tmp" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
LEASES=$'2026-03-30 00:00:00 aa:bb:cc:dd:ee:ff ipv4 192.0.2.10/24 host-0 duid'
MAC='11:22:33:44:55:66'
IP=$(printf '%s\n' "$LEASES" | grep -i "$MAC" | awk '{ print $5; exit }')
echo "survived: $IP"
EOF

if bash "$tmp"; then
  echo "unexpected: command substitution survived"
  rm -f "$tmp"
  exit 1
fi
rm -f "$tmp"

rg -n -C2 'set -euo pipefail|grep -i "\$MAC"' deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh

Repository: openshift-eng/two-node-toolbox

Length of output: 782


🏁 Script executed:

cat -n deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh

Repository: openshift-eng/two-node-toolbox

Length of output: 2096


The grep pipelines can exit the helper before the fallbacks run.

With set -euo pipefail on Line 4, a grep -i "$MAC" miss on Lines 22, 30, or 38 returns 1 and aborts the whole script from inside the command substitution. This prevents fallback attempts and stops the interface loop before all MACs are tried. Match the MAC in awk instead of piping through grep.

Suggested fix
   if [[ -n "$EXPECTED_HOSTNAME" ]]; then
-    IP=$(echo "$LEASES" | grep -i "$MAC" | awk -v host="$EXPECTED_HOSTNAME" '$6 == host {
+    IP=$(awk -v mac="$MAC" -v host="$EXPECTED_HOSTNAME" 'tolower($3) == mac && $6 == host {
       ip = $5
       sub(/\/.*/, "", ip)
       print ip
       exit
-    }')
+    }' <<< "$LEASES")
   fi
   if [[ -z "$IP" ]]; then
-    IP=$(echo "$LEASES" | grep -i "$MAC" | awk '$6 != "-" {
+    IP=$(awk -v mac="$MAC" 'tolower($3) == mac && $6 != "-" {
       ip = $5
       sub(/\/.*/, "", ip)
       print ip
       exit
-    }')
+    }' <<< "$LEASES")
   fi
   if [[ -z "$IP" ]]; then
-    IP=$(echo "$LEASES" | grep -i "$MAC" | awk '{
+    IP=$(awk -v mac="$MAC" 'tolower($3) == mac {
       ip = $5
       sub(/\/.*/, "", ip)
       print ip
       exit
-    }')
+    }' <<< "$LEASES")
   fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh` at line 4, The
grep -i "$MAC" pipelines currently cause the script to exit under set -euo
pipefail when a match fails; locate the command substitutions that pipe into
grep -i "$MAC" (the ones that populate the IP lookup fallbacks using the MAC
variable) and replace those grep pipelines with an awk match against the MAC
(case-insensitive) so the pipeline returns 0 even when no match is found and
fallbacks continue; ensure awk extracts the same fields as the original pipeline
(preserve the same capture/field selection) and use tolower comparison or
IGNORECASE in awk to implement case-insensitive matching.

Some loop results from the VM IP resolver may not include stdout when a
single script invocation fails, which previously caused a hard template
failure. Default stdout to an empty string so the play skips that entry.

Made-with: Cursor
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml (1)

20-25: ⚠️ Potential issue | 🟠 Major

Don't silently rewrite cluster_vms to zero hosts on resolver failure.

deploy/openshift-clusters/roles/common/files/resolve_vm_ip.sh exits non-zero with empty stdout on its no-result paths, and this block now turns those cases into “skip this VM”. If the resolver breaks globally, the later inventory rewrite still removes the existing [cluster_vms] section and writes back an empty group even though virsh list found running VMs. Please fail loudly when VMs were discovered but zero IPs were resolved.

Also applies to: 27-38

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`
around lines 20 - 25, The current task "Get VM IP addresses from DHCP leases
(hostname preferred)" runs resolve_vm_ip.sh with failed_when: false which hides
resolver failures and lets the inventory be rewritten empty; change the flow so
that you still collect per-VM results in vm_ips but explicitly fail if virsh
found VMs (cluster_vms.stdout_lines is non-empty) and none of the vm_ips.results
contain a usable IP. Concretely: keep the script task but remove or limit the
global failed_when:false behavior, then add a follow-up assertion/fail task that
examines vm_ips.results and fails with a clear message when
cluster_vms.stdout_lines | length > 0 and vm_ips.results |
selectattr('stdout','truthy') | list | length == 0 (apply same check for the
other similar block referenced). Ensure the failure path triggers before the
inventory rewrite.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml`:
- Around line 20-25: The current task "Get VM IP addresses from DHCP leases
(hostname preferred)" runs resolve_vm_ip.sh with failed_when: false which hides
resolver failures and lets the inventory be rewritten empty; change the flow so
that you still collect per-VM results in vm_ips but explicitly fail if virsh
found VMs (cluster_vms.stdout_lines is non-empty) and none of the vm_ips.results
contain a usable IP. Concretely: keep the script task but remove or limit the
global failed_when:false behavior, then add a follow-up assertion/fail task that
examines vm_ips.results and fails with a clear message when
cluster_vms.stdout_lines | length > 0 and vm_ips.results |
selectattr('stdout','truthy') | list | length == 0 (apply same check for the
other similar block referenced). Ensure the failure path triggers before the
inventory rewrite.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7991a4bc-ac01-4a44-b04f-5780bdc69110

📥 Commits

Reviewing files that changed from the base of the PR and between 7f70935 and e31d6ca.

📒 Files selected for processing (1)
  • deploy/openshift-clusters/roles/common/tasks/update-cluster-inventory.yml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants