Add a standalone monitor skill for persistent job tracking#1252
Conversation
ℹ️ No actionable comments were generated in the recent review. 🎉

⚙️ Run configuration: Path `.coderabbit.yaml` · Review profile: CHILL · Plan: Pro Plus

📒 Files selected for processing: 4 · ✅ Skipped from review (trivial changes): 2 · 🚧 Skipped from review (similar to previous changes): 2

📝 Walkthrough: Added a new `monitor` skill.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant User
    participant Deployment as Deployment Skill
    participant Registry as Job Registry<br/>(.claude/active_jobs.json)
    participant Cron as Cron Scheduler
    participant Monitor as Monitor Skill
    participant Job as Job Backend<br/>(NEL/SLURM/Launcher)
    User->>Deployment: Submit job
    Deployment->>Registry: Upsert/register job entry
    Deployment->>Cron: Ensure 15-min poll scheduled
    Deployment-->>User: Job submitted
    rect rgba(100, 150, 200, 0.5)
        Note over Cron,Monitor: Automatic polling loop (every 15 min)
        Cron->>Monitor: Trigger poll
        Monitor->>Registry: Read active jobs
        loop For each job
            Monitor->>Job: Query status (squeue/sacct, nel status/info, launcher logs)
            Job-->>Monitor: Current status & diagnostics
            Monitor->>Monitor: Diff vs last_status
            Monitor->>Registry: Update last_status
            alt Terminal state detected
                Monitor->>Registry: Remove job entry
            end
            Monitor-->>User: Report meaningful state changes
        end
    end
    alt Registry empty
        Monitor->>Cron: Delete recurring cron
    end
```
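The automatic polling loop in the diagram can be sketched as follows. This is an illustrative sketch, not the skill's implementation: the backend query (squeue/sacct, `nel status`, launcher logs) is passed in as a stub, and the field and status names are assumptions.

```python
import json
from pathlib import Path

REGISTRY = Path(".claude/active_jobs.json")
# Assumed terminal states; the real set depends on the backend (SLURM, NEL, ...).
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def poll_once(query_status):
    """One poll cycle: diff each job's status, collect changes, drop finished jobs.

    `query_status(job)` stands in for the real backend query.
    Returns (list of (job_id, old, new) changes, registry_now_empty).
    """
    jobs = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    changes, remaining = [], []
    for job in jobs:
        status = query_status(job)
        if status != job.get("last_status"):
            changes.append((job["job_id"], job.get("last_status"), status))
        job["last_status"] = status
        if status not in TERMINAL:
            remaining.append(job)  # terminal jobs fall out of the registry
    REGISTRY.write_text(json.dumps(remaining, indent=2))
    # An empty registry signals the monitor to delete its recurring cron.
    return changes, len(remaining) == 0
```

When the second return value is true, the poller's last act would be removing its own cron entry, matching the "Registry empty" branch in the diagram.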
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 4 passed
Force-pushed from e2f7cd6 to d781534.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/evaluation/SKILL.md:
- Around line 202-214: The current decision flow's submit-then-fallback behavior
must be replaced so the runtime/image is chosen based on auth checks before any
SLURM submission; update the text around "Default behavior: use DockerHub image
first." and the numbered "Decision flow" to: validate DockerHub credentials up
front, if valid use vllm/vllm-openai:latest, else set deployment.image to the
NGC image (nvcr.io/nvidia/vllm:<YY.MM>-py3) and submit once; remove steps that
describe submitting first and retrying on 401 and ensure the doc enforces "Do
not retry more than once" by requiring credential fix before any additional
submission.
- Around line 31-32: The final user-facing checklist is missing the newly added
"Step 7.5: Check container registry auth (SLURM only)"; update the copied
checklist section so it includes that same line immediately before "Step 8: Run
the evaluation" to keep both checklists in sync, ensuring the phrase "Step 7.5:
Check container registry auth (SLURM only)" appears exactly as in the main
workflow checklist.
In @.claude/skills/monitor/SKILL.md:
- Around line 43-45: Add atomic locking and an idempotent cron marker: wrap
every read/modify/write to the registry file `.claude/active_jobs.json` with an
exclusive file lock (e.g., `flock` on a dedicated lockfile) so cron and
user-triggered checks cannot clobber each other, and ensure the cron setup step
uses a unique cron marker (e.g., a comment/token or a marker file name) so the
"if one isn't already running" logic checks for that marker before installing a
new recurring poller; acquire the lock for the whole mutation sequence (read
registry, modify entries, write back, remove completed jobs) and release after
write, and make the cron install routine idempotent by searching for the marker
before adding the cron entry so duplicates are never created.
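The auth-first image selection suggested in the first item above could be sketched like this. It is a minimal sketch: `choose_image` and its parameters are hypothetical names, the credential check is stubbed as a boolean, and the NGC tag is left as the `<YY.MM>` placeholder from the comment.

```python
def choose_image(dockerhub_auth_ok: bool, ngc_tag: str) -> str:
    """Pick the runtime image *before* any SLURM submission.

    Credentials are validated up front instead of submitting first and
    retrying on a 401, so the job is submitted exactly once.
    `ngc_tag` is the NGC release placeholder (the doc's `<YY.MM>`).
    """
    if dockerhub_auth_ok:
        return "vllm/vllm-openai:latest"
    # DockerHub auth failed: fall back to the NGC image and submit once.
    return f"nvcr.io/nvidia/vllm:{ngc_tag}-py3"
```

The point of the fix is ordering: the branch is taken before `sbatch`, so "do not retry more than once" holds by construction.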
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 26e86888-6c9a-497f-b376-8ab0c3569a1d
📒 Files selected for processing (6)
- .claude/skills/common/slurm-setup.md
- .claude/skills/deployment/SKILL.md
- .claude/skills/evaluation/SKILL.md
- .claude/skills/monitor/SKILL.md
- .claude/skills/ptq/SKILL.md
- tools/launcher/common/hf/ptq.sh
> 1. **Add an entry** to `.claude/active_jobs.json`. Create the file if it doesn't exist.
> 2. **Set up a durable recurring cron** (if one isn't already running) that polls all registered jobs every 15 minutes. The cron prompt should: read the registry, check each job, report state changes to the user, remove completed jobs, and delete itself when the registry is empty.
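The registry upsert in step 1 might look like this sketch. The field names (`job_id`, `skill`, `backend`, `last_status`) are illustrative assumptions; the skill defines the actual schema.

```python
import json
from pathlib import Path

REGISTRY = Path(".claude/active_jobs.json")

def register_job(job_id: str, skill: str, backend: str) -> None:
    """Upsert a job entry, creating the registry file if it doesn't exist."""
    jobs = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    # Upsert: drop any stale entry with the same id before appending.
    jobs = [j for j in jobs if j["job_id"] != job_id]
    jobs.append({"job_id": job_id, "skill": skill,
                 "backend": backend, "last_status": "SUBMITTED"})
    REGISTRY.parent.mkdir(parents=True, exist_ok=True)
    REGISTRY.write_text(json.dumps(jobs, indent=2))
```

Resubmitting the same job then updates its entry in place instead of accumulating duplicates.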
Define atomic registry updates and idempotent cron setup.
Lines 54-59 imply concurrent read/modify/write from cron + user-triggered checks, but no locking is specified. Combined with line 44 ("if one isn't already running"), this can cause duplicate pollers and registry clobbering.
Please specify a lock (e.g., flock around all registry mutations) and a unique cron marker to ensure single-instance scheduling.
Also applies to: 54-59
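If locking were needed, the suggested flock-plus-marker pattern might look like the following sketch (helper names are hypothetical; `fcntl.flock` is the Python equivalent of shell `flock`, and the marker string is an invented token):

```python
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path

LOCKFILE = Path(".claude/active_jobs.lock")
CRON_MARKER = "# modelopt-monitor-poll"  # unique token makes install idempotent

@contextmanager
def registry_lock():
    """Exclusive lock held for the whole read-modify-write sequence."""
    LOCKFILE.parent.mkdir(parents=True, exist_ok=True)
    with open(LOCKFILE, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def install_cron(crontab_lines: list[str], entry: str) -> list[str]:
    """Add the recurring poller only if the marker isn't already present."""
    if any(CRON_MARKER in line for line in crontab_lines):
        return crontab_lines  # already installed; never create a duplicate
    return crontab_lines + [f"{entry} {CRON_MARKER}"]
```

Any registry mutation would then run under `with registry_lock():`, and repeated setup calls leave exactly one cron entry.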
**kaix-nv:** Claude Code processes turns sequentially and crons only fire when idle, so registry operations are inherently serialized. No locking is needed.
@kaix-nv, that's a valid point — Claude Code's turn-based execution guarantees that only one operation runs at a time (user turn or cron turn, never both), so registry mutations are already serialized by the runtime. The concurrency concern doesn't apply in this execution environment. Thanks for the clarification!
✏️ Learnings added
Learnt from: kaix-nv
Repo: NVIDIA/Model-Optimizer PR: 1252
File: .claude/skills/monitor/SKILL.md:43-45
Timestamp: 2026-04-14T00:42:02.848Z
Learning: In the NVIDIA/Model-Optimizer repository, `.claude/skills/monitor/SKILL.md` uses `.claude/active_jobs.json` as a job registry. No file locking (e.g., `flock`) is needed for this registry because Claude Code processes turns sequentially and crons only fire when idle — registry operations are inherently serialized by the Claude Code runtime. Do not flag concurrent access concerns for this registry.
**Codecov Report** ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #1252   +/-   ##
=======================================
  Coverage   76.61%   76.61%
=======================================
  Files         459      459
  Lines       49153    49153
=======================================
  Hits        37661    37661
  Misses      11492    11492
```

☔ View full report in Codecov by Sentry.
```diff
@@ -0,0 +1,102 @@
+---
+name: monitor
+description: Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job 12345", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.
```
`12345` -> `<slurm_job_id>`: that placeholder form has been used in the other skills.
Heads-up on overlap with #1239 — I'm also trimming the "Monitoring Progress" block in evaluation/SKILL.md. In #1239 I'm vendoring two upstream NEL skills (`launching-evals` and `accessing-mlflow`) verbatim.

These are complementary to your `monitor` skill.

Proposed joint replacement for the trimmed Monitoring block so all three angles are covered:

> **Monitoring Progress**
> After job submission, register the job per the **monitor skill** for durable cross-session tracking. For one-off queries (live status, debugging a failed run, analyzing results) use the **launching-evals skill**; for querying past runs in MLflow use **accessing-mlflow**.

I'd prefer to fold this into #1252 directly (keeps the narrative in one place). Happy to push a commit to your branch or send you a patch; otherwise I'll rebase #1239 on top of yours. Either works, just want to avoid stepping on each other.
**Edwardf0t1** left a comment:

LGTM in general, left a comment. I think we can land this one first, then I will rebase and work on #1239.
Signed-off-by: Kai Xu <kaix@nvidia.com>
Good catch. The monitoring block now references all three skills.
Force-pushed from d781534 to 7aadb62.
PR #1252 (kaix-nv) landed with the joint trim text I proposed in the coordination comment. The Monitoring block in evaluation/SKILL.md now points at all three skills:

- `monitor` for durable cross-session job tracking
- `launching-evals` for one-shot status / debug / analyze
- `accessing-mlflow` for querying past runs

All merges auto-resolved (evaluation/SKILL.md, ptq/SKILL.md, deployment/SKILL.md). Nothing from our branch was lost: credentials.md, the remote-execution.md reframe, the markdownlint `ignores:` entry, and the vendored skills all survived. The Monitoring-trim item on our plan is now satisfied by #1252 rather than a direct edit on this branch.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
What does this PR do?
Type of change: ?
Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills.
Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time.
Solution: A centralized monitor skill with:
Usage
After any skill submits a job, the monitor skill automatically:
User can also trigger manually:
User: "check my eval status" → reads registry, reports current state
User: "is the PTQ done?" → finds job, checks status
User: "what jobs are running?" → lists all registered jobs
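Each of these manual queries reduces to a registry read. For example, "what jobs are running?" could be answered with a sketch like this (assuming the registry is a JSON list with `job_id`, `skill`, and `last_status` fields — an illustrative schema, not the skill's fixed one):

```python
import json
from pathlib import Path

def list_active_jobs(registry: Path = Path(".claude/active_jobs.json")) -> list[str]:
    """Summarize every registered job as 'id (skill): status'."""
    if not registry.exists():
        return []  # no registry means nothing is being tracked
    return [f"{j['job_id']} ({j.get('skill', '?')}): {j.get('last_status', 'UNKNOWN')}"
            for j in json.loads(registry.read_text())]
```

"Is the PTQ done?" is the same lookup filtered by `skill` and checked against a terminal status.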
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Documentation