Add a standalone monitor skill for persistent job tracking by kaix-nv · Pull Request #1252 · NVIDIA/Model-Optimizer

kaix-nv · 2026-04-14T00:33:43Z

What does this PR do?

Type of change: ?
Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills.

Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time.

Solution: A centralized monitor skill with:

Job registry (.claude/active_jobs.json): single source of truth for all active jobs
Durable recurring cron: polls every 15 min, survives session restarts, self-cleans when all jobs complete
User-initiated mode: works in new conversations by reading the registry
Aggregated reporting: "2 of 4 completed" instead of per-job noise

Usage

After any skill submits a job, the monitor skill automatically:

Registers the job in .claude/active_jobs.json
Sets up a durable cron to poll status every 15 minutes

User can also trigger manually:
User: "check my eval status" → reads registry, reports current state
User: "is the PTQ done?" → finds job, checks status
User: "what jobs are running?" → lists all registered jobs

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

New Features
- Added a monitor skill for registering and durably tracking SLURM jobs, NEL runs, and launcher experiments with periodic background polling and aggregated status reporting.
Documentation
- Updated deployment, evaluation, and PTQ docs to instruct registering jobs with the monitor skill.
- Consolidated diagnostics and simplified SSH/log inspection guidance.

copy-pr-bot · 2026-04-14T00:33:48Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-04-14T00:33:51Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9da80363-18b1-49d4-90b4-8a8154b8d126

📥 Commits

Reviewing files that changed from the base of the PR and between d781534 and 7aadb62.

📒 Files selected for processing (4)

.claude/skills/deployment/SKILL.md
.claude/skills/evaluation/SKILL.md
.claude/skills/monitor/SKILL.md
.claude/skills/ptq/SKILL.md

✅ Files skipped from review due to trivial changes (2)

.claude/skills/evaluation/SKILL.md
.claude/skills/monitor/SKILL.md

🚧 Files skipped from review as they are similar to previous changes (2)

.claude/skills/deployment/SKILL.md
.claude/skills/ptq/SKILL.md

📝 Walkthrough

Walkthrough

Added a new monitor skill spec and updated three existing skill docs to delegate job monitoring to this centralized registry-and-cron-based monitor workflow; SLURM deployment guidance now instructs registering jobs and configuring monitoring via the monitor skill.

Changes

Cohort / File(s)	Summary
New Monitor Skill `.claude/skills/monitor/SKILL.md`	Adds `monitor` skill: centralized registry (`.claude/active_jobs.json`), supports `nel`/`launcher`/`slurm` job types, mandates upsert-on-submit, recurring 15-minute cron polling, state-diff reporting, terminal-state cleanup, and discovery fallback strategies for user queries.
Monitoring Refactor — Deployment `.claude/skills/deployment/SKILL.md`	Replaces direct use of `remote_submit_job`/`remote_poll_job` with instruction to register submitted SLURM jobs and configure monitoring via the `monitor` skill; retains existing SLURM command examples and hostname retrieval.
Monitoring Refactor — Evaluation `.claude/skills/evaluation/SKILL.md`	Reworks Monitoring Progress to require job registration with `monitor` for durable tracking; consolidates diagnostics into a single NEL-focused bash block, removes stepwise `nel logs` streaming and intermediate `nel info --logs` fetch step, and clarifies which skills to use for different monitoring intents.
Monitoring Refactor — PTQ `.claude/skills/ptq/SKILL.md`	Replaces three separate monitoring instructions (launcher blocking/tailing, SLURM polling, local stdout watching) with a single directive to register jobs and use the `monitor` skill for monitoring.

Sequence Diagram

sequenceDiagram
    participant User
    participant Deployment as Deployment Skill
    participant Registry as Job Registry<br/>(.claude/active_jobs.json)
    participant Cron as Cron Scheduler
    participant Monitor as Monitor Skill
    participant Job as Job Backend<br/>(NEL/SLURM/Launcher)

    User->>Deployment: Submit job
    Deployment->>Registry: Upsert/register job entry
    Deployment->>Cron: Ensure 15-min poll scheduled
    Deployment-->>User: Job submitted

    rect rgba(100, 150, 200, 0.5)
    Note over Cron,Monitor: Automatic polling loop (every 15 min)
    Cron->>Monitor: Trigger poll
    Monitor->>Registry: Read active jobs
    loop For each job
        Monitor->>Job: Query status (squeue/sacct, nel status/info, launcher logs)
        Job-->>Monitor: Current status & diagnostics
        Monitor->>Monitor: Diff vs last_status
        Monitor->>Registry: Update last_status
        alt Terminal state detected
            Monitor->>Registry: Remove job entry
        end
        Monitor-->>User: Report meaningful state changes
    end
    end

    alt Registry empty
        Monitor->>Cron: Delete recurring cron
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the main change: introducing a new standalone monitor skill for persistent job tracking across SLURM clusters.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns	✅ Passed	The pull request contains only documentation changes to skill specification files in the .claude/skills/ directory. No Python code files have been modified, so the security anti-patterns check does not apply.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch kaix/monitor-skill

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 202-214: The current decision flow's submit-then-fallback behavior
must be replaced so the runtime/image is chosen based on auth checks before any
SLURM submission; update the text around "Default behavior: use DockerHub image
first." and the numbered "Decision flow" to: validate DockerHub credentials up
front, if valid use vllm/vllm-openai:latest, else set deployment.image to the
NGC image (nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that
describe submitting first and retrying on 401 and ensure the doc enforces "Do
not retry more than once" by requiring credential fix before any additional
submission.
- Around line 31-32: The final user-facing checklist is missing the newly added
"Step 7.5: Check container registry auth (SLURM only)"; update the copied
checklist section so it includes that same line immediately before "Step 8: Run
the evaluation" to keep both checklists in sync, ensuring the phrase "Step 7.5:
Check container registry auth (SLURM only)" appears exactly as in the main
workflow checklist.

In @.claude/skills/monitor/SKILL.md:
- Around line 43-45: Add atomic locking and an idempotent cron marker: wrap
every read/modify/write to the registry file `.claude/active_jobs.json` with an
exclusive file lock (e.g., `flock` on a dedicated lockfile) so cron and
user-triggered checks cannot clobber each other, and ensure the cron setup step
uses a unique cron marker (e.g., a comment/token or a marker file name) so the
"if one isn't already running" logic checks for that marker before installing a
new recurring poller; acquire the lock for the whole mutation sequence (read
registry, modify entries, write back, remove completed jobs) and release after
write, and make the cron install routine idempotent by searching for the marker
before adding the cron entry so duplicates are never created.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 26e86888-6c9a-497f-b376-8ab0c3569a1d

📥 Commits

Reviewing files that changed from the base of the PR and between 202c3d3 and e2f7cd6.

📒 Files selected for processing (6)

.claude/skills/common/slurm-setup.md
.claude/skills/deployment/SKILL.md
.claude/skills/evaluation/SKILL.md
.claude/skills/monitor/SKILL.md
.claude/skills/ptq/SKILL.md
tools/launcher/common/hf/ptq.sh

coderabbitai · 2026-04-14T00:39:24Z

+1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
+2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.
+


⚠️ Potential issue | 🟠 Major

Define atomic registry updates and idempotent cron setup.

Line 54-Line 59 implies concurrent read/modify/write from cron + user-triggered checks, but no locking is specified. Combined with Line 44 (“if one isn't already running”), this can cause duplicate pollers and registry clobbering.

Please specify a lock (e.g., flock around all registry mutations) and a unique cron marker to ensure single-instance scheduling.

Also applies to: 54-59

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.claude/skills/monitor/SKILL.md around lines 43 - 45, Add atomic locking and an idempotent cron marker: wrap every read/modify/write to the registry file `.claude/active_jobs.json` with an exclusive file lock (e.g., `flock` on a dedicated lockfile) so cron and user-triggered checks cannot clobber each other, and ensure the cron setup step uses a unique cron marker (e.g., a comment/token or a marker file name) so the "if one isn't already running" logic checks for that marker before installing a new recurring poller; acquire the lock for the whole mutation sequence (read registry, modify entries, write back, remove completed jobs) and release after write, and make the cron install routine idempotent by searching for the marker before adding the cron entry so duplicates are never created.

Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.

@kaix-nv, that's a valid point — Claude Code's turn-based execution guarantees that only one operation runs at a time (user turn or cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!

✏️ Learnings added

Learnt from: kaix-nv Repo: NVIDIA/Model-Optimizer PR: 1252 File: .claude/skills/monitor/SKILL.md:43-45 Timestamp: 2026-04-14T00:42:02.848Z Learning: In the NVIDIA/Model-Optimizer repository, `.claude/skills/monitor/SKILL.md` uses `.claude/active_jobs.json` as a job registry. No file locking (e.g., `flock`) is needed for this registry because Claude Code processes turns sequentially and crons only fire when idle — registry operations are inherently serialized by the Claude Code runtime. Do not flag concurrent access concerns for this registry.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

github-actions · 2026-04-14T00:42:10Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-19 04:44 UTC

codecov · 2026-04-14T00:50:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.61%. Comparing base (e9a4989) to head (7aadb62).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1252   +/-   ##
=======================================
  Coverage   76.61%   76.61%           
=======================================
  Files         459      459           
  Lines       49153    49153           
=======================================
  Hits        37661    37661           
  Misses      11492    11492

Flag	Coverage Δ
unit	`52.96% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

shengliangxu · 2026-04-16T17:35:26Z

@@ -0,0 +1,102 @@
+---
+name: monitor
+description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job 12345", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.


12345 -> <slurm_job_id> it's been used in other skills.

Edwardf0t1 · 2026-04-19T01:15:55Z

Heads-up on overlap with #1239 — I'm also trimming the "Monitoring Progress" block in .claude/skills/evaluation/SKILL.md, so we'll collide on those lines.

In #1239 I'm vendoring two upstream NEL skills verbatim from NVIDIA-NeMo/Evaluator (commit 01899f8):

launching-evals — one-shot run / check / debug / analyze for a given invocation (stateless, user supplies the ID)
accessing-mlflow — query MLflow runs, compare metrics, fetch artifacts

These are complementary to your monitor skill rather than competing:

Skill	Scope	State
`monitor` (this PR)	Continuous tracking of many jobs across sessions	Stateful — registry + cron
`launching-evals`	One-shot interactive run/debug/analyze for a specific invocation	Stateless
`accessing-mlflow`	Query MLflow for past-run metrics and artifacts	Stateless

Proposed joint replacement for the trimmed Monitoring block so all three angles are covered:

**Monitoring Progress**

After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.

I'd prefer to fold this into #1252 directly (keeps the narrative in one place) — happy to push a commit to your branch or send you a patch. Otherwise I'll rebase #1239 on top of yours. Either works, just want to avoid stepping on each other.

Edwardf0t1

LGTM in general, left a comment.

I think we can land this one first, then I will rebase and work on #1239

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv · 2026-04-19T04:31:39Z

Heads-up on overlap with #1239 — I'm also trimming the "Monitoring Progress" block in .claude/skills/evaluation/SKILL.md, so we'll collide on those lines.

In #1239 I'm vendoring two upstream NEL skills verbatim from NVIDIA-NeMo/Evaluator (commit 01899f8):

launching-evals — one-shot run / check / debug / analyze for a given invocation (stateless, user supplies the ID)

accessing-mlflow — query MLflow runs, compare metrics, fetch artifacts

These are complementary to your monitor skill rather than competing:

Skill Scope State
monitor (this PR) Continuous tracking of many jobs across sessions Stateful — registry + cron
launching-evals One-shot interactive run/debug/analyze for a specific invocation Stateless
accessing-mlflow Query MLflow for past-run metrics and artifacts Stateless
Proposed joint replacement for the trimmed Monitoring block so all three angles are covered:
**Monitoring Progress**

After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.
I'd prefer to fold this into #1252 directly (keeps the narrative in one place) — happy to push a commit to your branch or send you a patch. Otherwise I'll rebase #1239 on top of yours. Either works, just want to avoid stepping on each other.

Good catch. The monitoring block now references all three skills.

PR #1252 (kaix-nv) landed with the joint trim text I proposed in the coordination comment. The Monitoring block in evaluation/SKILL.md now points at all three skills: - `monitor` for durable cross-session job tracking - `launching-evals` for one-shot status / debug / analyze - `accessing-mlflow` for querying past runs All merges auto-resolved (evaluation/SKILL.md, ptq/SKILL.md, deployment/SKILL.md). Nothing from our branch was lost — credentials.md, remote-execution.md reframe, markdownlint `ignores:`, and the vendored skills all survived. The Monitoring-trim item on our plan is now satisfied by #1252 rather than a direct edit on this branch. Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

kaix-nv requested review from Edwardf0t1 and mxinO April 14, 2026 00:33

kaix-nv marked this pull request as ready for review April 14, 2026 00:34

kaix-nv force-pushed the kaix/monitor-skill branch from e2f7cd6 to d781534 Compare April 14, 2026 00:36

kaix-nv changed the title ~~Kaix/monitor skill~~ Add a standalone monitor skill for persistent job tracking Apr 14, 2026

coderabbitai bot reviewed Apr 14, 2026

View reviewed changes

shengliangxu reviewed Apr 16, 2026

View reviewed changes

Edwardf0t1 approved these changes Apr 19, 2026

View reviewed changes

kaix-nv added 2 commits April 18, 2026 21:30

Add monitor skill for persistent job tracking

10d4238

Signed-off-by: Kai Xu <kaix@nvidia.com>

Address feedbacks

7aadb62

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv force-pushed the kaix/monitor-skill branch from d781534 to 7aadb62 Compare April 19, 2026 04:31

kaix-nv enabled auto-merge (squash) April 19, 2026 04:32

kaix-nv merged commit c20f9c4 into main Apr 19, 2026
31 checks passed

kaix-nv deleted the kaix/monitor-skill branch April 19, 2026 04:44

Edwardf0t1 mentioned this pull request Apr 19, 2026

[1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator #1239

Open

		1. Add an entry to `.claude/active_jobs.json`. Create the file if it doesn't exist.
		2. Set up a durable recurring cron (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

Conversation

kaix-nv commented Apr 14, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Apr 14, 2026

Uh oh!

coderabbitai bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kaix-nv Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

shengliangxu Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

kaix-nv Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 commented Apr 19, 2026

Uh oh!

Edwardf0t1 left a comment

Choose a reason for hiding this comment

Uh oh!

kaix-nv commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kaix-nv commented Apr 14, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 14, 2026 •

edited

Loading

coderabbitai bot Apr 14, 2026 •

edited

Loading

github-actions bot commented Apr 14, 2026 •

edited

Loading

codecov bot commented Apr 14, 2026 •

edited

Loading