Skip to content

Add a standalone monitor skill for persistent job tracking#1252

Merged
kaix-nv merged 2 commits intomainfrom
kaix/monitor-skill
Apr 19, 2026
Merged

Add a standalone monitor skill for persistent job tracking#1252
kaix-nv merged 2 commits intomainfrom
kaix/monitor-skill

Conversation

@kaix-nv
Copy link
Copy Markdown
Contributor

@kaix-nv kaix-nv commented Apr 14, 2026

What does this PR do?

Type of change: ?
Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills.

Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time.

Solution: A centralized monitor skill with:

  • Job registry (.claude/active_jobs.json): single source of truth for all active jobs
  • Durable recurring cron: polls every 15 min, survives session restarts, self-cleans when all jobs complete
  • User-initiated mode: works in new conversations by reading the registry
  • Aggregated reporting: "2 of 4 completed" instead of per-job noise

Usage

After any skill submits a job, the monitor skill automatically:

  1. Registers the job in .claude/active_jobs.json
  2. Sets up a durable cron to poll status every 15 minutes

User can also trigger manually:
User: "check my eval status" → reads registry, reports current state
User: "is the PTQ done?" → finds job, checks status
User: "what jobs are running?" → lists all registered jobs

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a monitor skill for registering and durably tracking SLURM jobs, NEL runs, and launcher experiments with periodic background polling and aggregated status reporting.
  • Documentation

    • Updated deployment, evaluation, and PTQ docs to instruct registering jobs with the monitor skill.
    • Consolidated diagnostics and simplified SSH/log inspection guidance.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9da80363-18b1-49d4-90b4-8a8154b8d126

📥 Commits

Reviewing files that changed from the base of the PR and between d781534 and 7aadb62.

📒 Files selected for processing (4)
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/monitor/SKILL.md
  • .claude/skills/ptq/SKILL.md
✅ Files skipped from review due to trivial changes (2)
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/monitor/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/ptq/SKILL.md

📝 Walkthrough

Walkthrough

Added a new monitor skill spec and updated three existing skill docs to delegate job monitoring to this centralized registry-and-cron-based monitor workflow; SLURM deployment guidance now instructs registering jobs and configuring monitoring via the monitor skill.

Changes

Cohort / File(s) Summary
New Monitor Skill
​.claude/skills/monitor/SKILL.md
Adds monitor skill: centralized registry (.claude/active_jobs.json), supports nel/launcher/slurm job types, mandates upsert-on-submit, recurring 15-minute cron polling, state-diff reporting, terminal-state cleanup, and discovery fallback strategies for user queries.
Monitoring Refactor — Deployment
​.claude/skills/deployment/SKILL.md
Replaces direct use of remote_submit_job/remote_poll_job with instruction to register submitted SLURM jobs and configure monitoring via the monitor skill; retains existing SLURM command examples and hostname retrieval.
Monitoring Refactor — Evaluation
​.claude/skills/evaluation/SKILL.md
Reworks Monitoring Progress to require job registration with monitor for durable tracking; consolidates diagnostics into a single NEL-focused bash block, removes stepwise nel logs streaming and intermediate nel info --logs fetch step, and clarifies which skills to use for different monitoring intents.
Monitoring Refactor — PTQ
​.claude/skills/ptq/SKILL.md
Replaces three separate monitoring instructions (launcher blocking/tailing, SLURM polling, local stdout watching) with a single directive to register jobs and use the monitor skill for monitoring.

Sequence Diagram

sequenceDiagram
    participant User
    participant Deployment as Deployment Skill
    participant Registry as Job Registry<br/>(.claude/active_jobs.json)
    participant Cron as Cron Scheduler
    participant Monitor as Monitor Skill
    participant Job as Job Backend<br/>(NEL/SLURM/Launcher)

    User->>Deployment: Submit job
    Deployment->>Registry: Upsert/register job entry
    Deployment->>Cron: Ensure 15-min poll scheduled
    Deployment-->>User: Job submitted

    rect rgba(100, 150, 200, 0.5)
    Note over Cron,Monitor: Automatic polling loop (every 15 min)
    Cron->>Monitor: Trigger poll
    Monitor->>Registry: Read active jobs
    loop For each job
        Monitor->>Job: Query status (squeue/sacct, nel status/info, launcher logs)
        Job-->>Monitor: Current status & diagnostics
        Monitor->>Monitor: Diff vs last_status
        Monitor->>Registry: Update last_status
        alt Terminal state detected
            Monitor->>Registry: Remove job entry
        end
        Monitor-->>User: Report meaningful state changes
    end
    end

    alt Registry empty
        Monitor->>Cron: Delete recurring cron
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: introducing a new standalone monitor skill for persistent job tracking across SLURM clusters.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns ✅ Passed The pull request contains only documentation changes to skill specification files in the .claude/skills/ directory. No Python code files have been modified, so the security anti-patterns check does not apply.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kaix/monitor-skill

Comment @coderabbitai help to get the list of available commands and usage tips.

@kaix-nv kaix-nv requested review from Edwardf0t1 and mxinO April 14, 2026 00:33
@kaix-nv kaix-nv marked this pull request as ready for review April 14, 2026 00:34
@kaix-nv kaix-nv force-pushed the kaix/monitor-skill branch from e2f7cd6 to d781534 Compare April 14, 2026 00:36
@kaix-nv kaix-nv changed the title Kaix/monitor skill Add a standalone monitor skill for persistent job tracking Apr 14, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 202-214: The current decision flow's submit-then-fallback behavior
must be replaced so the runtime/image is chosen based on auth checks before any
SLURM submission; update the text around "Default behavior: use DockerHub image
first." and the numbered "Decision flow" to: validate DockerHub credentials up
front, if valid use vllm/vllm-openai:latest, else set deployment.image to the
NGC image (nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that
describe submitting first and retrying on 401 and ensure the doc enforces "Do
not retry more than once" by requiring credential fix before any additional
submission.
- Around line 31-32: The final user-facing checklist is missing the newly added
"Step 7.5: Check container registry auth (SLURM only)"; update the copied
checklist section so it includes that same line immediately before "Step 8: Run
the evaluation" to keep both checklists in sync, ensuring the phrase "Step 7.5:
Check container registry auth (SLURM only)" appears exactly as in the main
workflow checklist.

In @.claude/skills/monitor/SKILL.md:
- Around line 43-45: Add atomic locking and an idempotent cron marker: wrap
every read/modify/write to the registry file `.claude/active_jobs.json` with an
exclusive file lock (e.g., `flock` on a dedicated lockfile) so cron and
user-triggered checks cannot clobber each other, and ensure the cron setup step
uses a unique cron marker (e.g., a comment/token or a marker file name) so the
"if one isn't already running" logic checks for that marker before installing a
new recurring poller; acquire the lock for the whole mutation sequence (read
registry, modify entries, write back, remove completed jobs) and release after
write, and make the cron install routine idempotent by searching for the marker
before adding the cron entry so duplicates are never created.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 26e86888-6c9a-497f-b376-8ab0c3569a1d

📥 Commits

Reviewing files that changed from the base of the PR and between 202c3d3 and e2f7cd6.

📒 Files selected for processing (6)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/monitor/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • tools/launcher/common/hf/ptq.sh

Comment thread .claude/skills/evaluation/SKILL.md
Comment thread .claude/skills/evaluation/SKILL.md Outdated
Comment on lines +43 to +45
1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Define atomic registry updates and idempotent cron setup.

Line 54-Line 59 implies concurrent read/modify/write from cron + user-triggered checks, but no locking is specified. Combined with Line 44 (“if one isn't already running”), this can cause duplicate pollers and registry clobbering.

Please specify a lock (e.g., flock around all registry mutations) and a unique cron marker to ensure single-instance scheduling.

Also applies to: 54-59

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/monitor/SKILL.md around lines 43 - 45, Add atomic locking and
an idempotent cron marker: wrap every read/modify/write to the registry file
`.claude/active_jobs.json` with an exclusive file lock (e.g., `flock` on a
dedicated lockfile) so cron and user-triggered checks cannot clobber each other,
and ensure the cron setup step uses a unique cron marker (e.g., a comment/token
or a marker file name) so the "if one isn't already running" logic checks for
that marker before installing a new recurring poller; acquire the lock for the
whole mutation sequence (read registry, modify entries, write back, remove
completed jobs) and release after write, and make the cron install routine
idempotent by searching for the marker before adding the cron entry so
duplicates are never created.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaix-nv, that's a valid point — Claude Code's turn-based execution guarantees that only one operation runs at a time (user turn or cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!


✏️ Learnings added
Learnt from: kaix-nv
Repo: NVIDIA/Model-Optimizer PR: 1252
File: .claude/skills/monitor/SKILL.md:43-45
Timestamp: 2026-04-14T00:42:02.848Z
Learning: In the NVIDIA/Model-Optimizer repository, `.claude/skills/monitor/SKILL.md` uses `.claude/active_jobs.json` as a job registry. No file locking (e.g., `flock`) is needed for this registry because Claude Code processes turns sequentially and crons only fire when idle — registry operations are inherently serialized by the Claude Code runtime. Do not flag concurrent access concerns for this registry.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 14, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-19 04:44 UTC

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.61%. Comparing base (e9a4989) to head (7aadb62).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1252   +/-   ##
=======================================
  Coverage   76.61%   76.61%           
=======================================
  Files         459      459           
  Lines       49153    49153           
=======================================
  Hits        37661    37661           
  Misses      11492    11492           
Flag Coverage Δ
unit 52.96% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread .claude/skills/monitor/SKILL.md Outdated
@@ -0,0 +1,102 @@
---
name: monitor
description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job 12345", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12345 -> <slurm_job_id> it's been used in other skills.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@Edwardf0t1
Copy link
Copy Markdown
Contributor

Heads-up on overlap with #1239 — I'm also trimming the "Monitoring Progress" block in .claude/skills/evaluation/SKILL.md, so we'll collide on those lines.

In #1239 I'm vendoring two upstream NEL skills verbatim from NVIDIA-NeMo/Evaluator (commit 01899f8):

  • launching-evals — one-shot run / check / debug / analyze for a given invocation (stateless, user supplies the ID)
  • accessing-mlflow — query MLflow runs, compare metrics, fetch artifacts

These are complementary to your monitor skill rather than competing:

Skill Scope State
monitor (this PR) Continuous tracking of many jobs across sessions Stateful — registry + cron
launching-evals One-shot interactive run/debug/analyze for a specific invocation Stateless
accessing-mlflow Query MLflow for past-run metrics and artifacts Stateless

Proposed joint replacement for the trimmed Monitoring block so all three angles are covered:

**Monitoring Progress**

After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.

I'd prefer to fold this into #1252 directly (keeps the narrative in one place) — happy to push a commit to your branch or send you a patch. Otherwise I'll rebase #1239 on top of yours. Either works, just want to avoid stepping on each other.

Copy link
Copy Markdown
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, left a comment.

I think we can land this one first, then I will rebase and work on #1239

kaix-nv added 2 commits April 18, 2026 21:30
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv
Copy link
Copy Markdown
Contributor Author

kaix-nv commented Apr 19, 2026

Heads-up on overlap with #1239 — I'm also trimming the "Monitoring Progress" block in .claude/skills/evaluation/SKILL.md, so we'll collide on those lines.

In #1239 I'm vendoring two upstream NEL skills verbatim from NVIDIA-NeMo/Evaluator (commit 01899f8):

  • launching-evals — one-shot run / check / debug / analyze for a given invocation (stateless, user supplies the ID)
  • accessing-mlflow — query MLflow runs, compare metrics, fetch artifacts

These are complementary to your monitor skill rather than competing:

Skill Scope State
monitor (this PR) Continuous tracking of many jobs across sessions Stateful — registry + cron
launching-evals One-shot interactive run/debug/analyze for a specific invocation Stateless
accessing-mlflow Query MLflow for past-run metrics and artifacts Stateless
Proposed joint replacement for the trimmed Monitoring block so all three angles are covered:

**Monitoring Progress**

After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.

I'd prefer to fold this into #1252 directly (keeps the narrative in one place) — happy to push a commit to your branch or send you a patch. Otherwise I'll rebase #1239 on top of yours. Either works, just want to avoid stepping on each other.

Good catch. The monitoring block now references all three skills.

@kaix-nv kaix-nv force-pushed the kaix/monitor-skill branch from d781534 to 7aadb62 Compare April 19, 2026 04:31
@kaix-nv kaix-nv enabled auto-merge (squash) April 19, 2026 04:32
@kaix-nv kaix-nv merged commit c20f9c4 into main Apr 19, 2026
31 checks passed
@kaix-nv kaix-nv deleted the kaix/monitor-skill branch April 19, 2026 04:44
Edwardf0t1 added a commit that referenced this pull request Apr 19, 2026
PR #1252 (kaix-nv) landed with the joint trim text I proposed in the
coordination comment. The Monitoring block in evaluation/SKILL.md now
points at all three skills:

- `monitor` for durable cross-session job tracking
- `launching-evals` for one-shot status / debug / analyze
- `accessing-mlflow` for querying past runs

All merges auto-resolved (evaluation/SKILL.md, ptq/SKILL.md,
deployment/SKILL.md). Nothing from our branch was lost —
credentials.md, remote-execution.md reframe, markdownlint `ignores:`,
and the vendored skills all survived.

The Monitoring-trim item on our plan is now satisfied by #1252 rather
than a direct edit on this branch.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants