[1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator#1239

Open
Edwardf0t1 wants to merge 18 commits into main from zhiyu/polish-eval-skills

Conversation

@Edwardf0t1
Contributor

@Edwardf0t1 Edwardf0t1 commented Apr 12, 2026

What does this PR do?

Type of change: Documentation / Skills

Polishes evaluation and common skills based on end-to-end experience quantizing + deploying + evaluating LLMs. Vendors the two upstream Claude skills from NeMo Evaluator, splits shared credential setup into its own doc, and applies reviewer feedback.

Changes

New files

  • .claude/skills/launching-evals/ — vendored verbatim from NVIDIA-NeMo/Evaluator @ commit 01899f8. Covers run / check / debug / analyze flows for NEL evaluations.
  • .claude/skills/accessing-mlflow/SKILL.md — vendored verbatim from the same upstream. Queries MLflow runs via the mlflow-mcp MCP server.
  • .claude/scripts/sync-upstream-skills.sh — re-vendors the two skills above at a pinned SHA. Idempotent; re-applies our provenance frontmatter on each run.
  • .claude/skills/common/credentials.md — shared HF / NGC / Docker credential setup, referenced from slurm-setup.md. Generic (not NVIDIA-internal) — public NEL SLURM-executor users rely on the same NGC/HF setup.
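
The idempotent provenance step of the sync script can be sketched as follows. This is a hypothetical illustration, not the script's actual contents: only the pinned SHA comes from this PR, while the `add_provenance` helper name and the `vendored-from` frontmatter key are invented for the sketch.

```shell
# Hypothetical sketch of the sync script's idempotent provenance step.
# Only the pinned SHA (01899f8) comes from this PR; the helper name and
# the "vendored-from" frontmatter key are invented for illustration.
UPSTREAM_SHA="01899f8"

add_provenance() {
  # Prepend provenance frontmatter unless it is already present, so that
  # re-running the sync at the same SHA produces no diff.
  local file="$1"
  grep -q '^vendored-from:' "$file" || {
    local body
    body=$(cat "$file")
    printf -- '---\nvendored-from: NVIDIA-NeMo/Evaluator@%s\n---\n%s\n' \
      "$UPSTREAM_SHA" "$body" > "$file"
  }
}
```

Running the helper twice on the same file leaves it unchanged after the first run, which is the property the PR description calls "idempotent".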

Updated files

  • .claude/skills/common/slurm-setup.md — NGC credential block collapsed to a one-paragraph pointer at credentials.md.
  • .claude/skills/common/remote-execution.md — reframed "Checkpoint and storage availability" as "Staging checkpoints from your workstation". Drops the misleading login-vs-compute framing and the dlcluster-specific row.
  • .claude/skills/common/workspace-management.md — drops stale pointer to the deleted e2e doc.
  • .claude/skills/evaluation/SKILL.md — workspace-integration intro trimmed; NEL CI section removed (content moved to Model-Optimizer-Internal). Monitoring block replaced via the #1252 merge ("Add a standalone monitor skill for persistent job tracking") with joint trim text that routes to monitor, launching-evals, and accessing-mlflow.
  • .claude/skills/ptq/SKILL.md — "Next steps" block refined.
  • .markdownlint-cli2.yaml — excludes vendored upstream skills from markdownlint so they stay byte-identical to upstream.

Deleted files

  • .claude/skills/common/end-to-end-workflow.md — per @kaix-nv and @mxinO: redundant with skill descriptions that already handle cross-skill routing.
  • .claude/skills/evaluation/references/nel-ci-guide.md — per @shengliangxu: NVIDIA-internal. Moved to Model-Optimizer-Internal:agent/nel-ci-guide.md (MR !57).

Review status

| Reviewer | Concern | How addressed |
|---|---|---|
| @shengliangxu | nel-ci-guide.md contains NVIDIA-internal content | Moved to Model-Optimizer-Internal MR !57, with refresh (current cluster list, Sybil/non-Sybil distinction, prerequisites checklist, nel-ci-cli preferred trigger path) |
| @kaix-nv, @mxinO | e2e workflow doc unnecessary | Deleted. Skill descriptions already route between ptq / deployment / evaluation. |
| @kaix-nv (#1252) | Overlap on evaluation/SKILL.md Monitoring section | Coordinated via comment on #1252. @kaix-nv incorporated the joint trim text referencing monitor + launching-evals + accessing-mlflow. Landed via #1252 merge. |
| @mxinO | remote-execution.md compute-node framing is misleading | Reframed as workstation→cluster staging. |
| @mxinO, CodeRabbit, Copilot | slurm-setup NGC creds aren't SLURM-specific; `$oauthtoken` literal clarification; overwrite safety | New credentials.md with NGC / HF / Docker setup. Append-if-missing pattern (`grep -q … || echo … >>`). Explicitly calls out `$oauthtoken` as literal, kept unexpanded via single quotes. |
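
The append-if-missing pattern from the last row can be sketched like this (the file path follows the standard enroot location; the single quotes keep `$oauthtoken` from being expanded, since NGC expects exactly that literal string as the username):

```shell
# Sketch of the append-if-missing credential setup discussed above.
# Single quotes keep $oauthtoken literal: NGC expects that exact username.
CRED_FILE="$HOME/.config/enroot/.credentials"
mkdir -p "$(dirname "$CRED_FILE")"
touch "$CRED_FILE"
# Append only if no nvcr.io entry exists yet, so re-runs never clobber
# or duplicate an existing line (the "overwrite safety" concern).
grep -q 'machine nvcr.io' "$CRED_FILE" || \
  echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' >> "$CRED_FILE"
chmod 600 "$CRED_FILE"
```

Re-running the `grep -q … || echo … >>` line is a no-op once the entry exists, which is what makes it safe to place in setup docs that users may execute repeatedly.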

Related

Motivation

Learnings from running end-to-end PTQ → Deploy → Eval on Devstral-Small-2-24B (FP8 VLM → NVFP4 MLP-only) on dlcluster B100, plus prior NEL CI experience on oci-hsg.

Testing

Validated end-to-end: PTQ (6 min) → vLLM deployment (3 debug iterations) → NEL evaluation (MMLU 77.4%, GSM8K 80%, GPQA 40% on limit_samples=10).

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (documentation / skills only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅
    • .claude/skills/launching-evals/ and .claude/skills/accessing-mlflow/ are vendored verbatim from NVIDIA-NeMo/Evaluator (Apache-2.0). Provenance SHA pinned in each SKILL.md frontmatter and in .claude/scripts/sync-upstream-skills.sh.
  • Did you write any new necessary tests?: N/A (skill documentation)
  • Did you update Changelog?: N/A

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@coderabbitai
Contributor

coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds documentation tying PTQ → Deployment → Evaluation into a single pipeline: workspace conventions and artifact locations, cluster/CI execution guidance (SLURM vs NEL CI), registry/auth instructions, workspace persistence across stages, runtime deployment override patterns, and operational troubleshooting for JET and SLURM.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **End-to-end workflow**<br>`.claude/skills/common/end-to-end-workflow.md` | New doc defining the PTQ → Deployment → Evaluation pipeline, shared workspace layout (`workspaces/<model>/`), expected PTQ/deploy/eval artifact locations, unsupported-model failure modes, and deployment override mechanisms (`deployment.command`, `NEL_DEPLOYMENT_COMMAND`). |
| **Cluster & remote execution**<br>`.claude/skills/common/remote-execution.md`, `.claude/skills/common/slurm-setup.md`, `.claude/skills/evaluation/references/nel-ci-guide.md` | Added checkpoint accessibility checks and cluster storage path mappings with copy examples; Pyxis/Enroot registry auth steps and noted srun 401 failure mode; comprehensive NEL CI guide covering SLURM vs GitLab flows, GPU/tensor-parallel sizing, env-var/secret formats, gated dataset behavior, checkpoint transfer commands, and JET-specific operational snippets. |
| **Workspace & pipeline integration**<br>`.claude/skills/common/workspace-management.md`, `.claude/skills/ptq/SKILL.md` | Documented cross-skill workspace flow persisting PTQ outputs and adding `eval_results/` + `eval_config.yaml`; updated example flow to reuse workspaces; PTQ SKILL now points users to deployment/evaluation next steps and notes carrying PTQ patches forward. |
| **Evaluation guidance updates**<br>`.claude/skills/evaluation/SKILL.md` | Inserted guidance to carry deployment-time runtime patches into NEL config via `deployment.command`; added "NEL CI and Cluster-Specific Notes" checklist referencing nel-ci-guide.md and cluster/CI operational considerations. |
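
The srun 401 failure mode noted above typically traces back to a missing Enroot registry entry. A preflight check can be sketched as follows (a sketch assuming the standard `~/.config/enroot/.credentials` location; the message text is ours):

```shell
# Preflight sketch: a missing nvcr.io entry in the enroot credentials file
# is what typically surfaces later as a 401 during `srun --container-image`
# pulls. The path is the standard enroot location; messages are invented.
if grep -qs 'machine nvcr.io' ~/.config/enroot/.credentials; then
  echo "nvcr.io credentials found"
else
  echo "nvcr.io credentials missing; see slurm-setup.md / credentials.md" >&2
fi
```

Running this on the login node before submitting a container job turns an opaque mid-job 401 into an immediate, actionable message.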

Sequence Diagram(s)

sequenceDiagram
    participant User as User / Agent
    participant PTQ as PTQ Job
    participant WS as Workspace\n(workspaces/<model>)
    participant Deploy as Deployment Job\n(NEL/SLURM/CI)
    participant Serve as Model Server
    participant Eval as Evaluation Job
    participant Storage as Cluster Storage

    User->>PTQ: submit PTQ (produce quantized checkpoint -> WS/output/)
    PTQ->>WS: write checkpoint & artifacts
    User->>Deploy: trigger deployment (consumes WS/output/, may set deployment.command / NEL_DEPLOYMENT_COMMAND)
    Deploy->>Serve: stage model and start serving
    Serve->>Eval: run evaluation tasks (reads from Serve, writes WS/eval_results/)
    Eval->>WS: write eval artifacts and eval_config.yaml
    Eval->>Storage: persist final results

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 4 passed

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Security Anti-Patterns | ✅ Passed | PR contains only documentation changes to markdown files in the .claude/skills/ directory; no Python code or configuration files modified. |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title references polishing evaluation and common skills and mentions both end-to-end workflow testing and vendoring Claude skills from NeMo Evaluator. The changeset adds substantial new documentation files (end-to-end-workflow.md, nel-ci-guide.md) and updates multiple skill files to reflect E2E workflow insights and NEL CI guidance, which aligns well with the stated title. |


@github-actions
Contributor

github-actions bot commented Apr 12, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1239/

Built to branch gh-pages at 2026-04-19 05:33 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@codecov

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.61%. Comparing base (e9a4989) to head (c664a30).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1239   +/-   ##
=======================================
  Coverage   76.61%   76.61%           
=======================================
  Files         459      459           
  Lines       49153    49153           
=======================================
  Hits        37661    37661           
  Misses      11492    11492           
| Flag | Coverage Δ |
|---|---|
| unit | 52.95% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.
- Add common/end-to-end-workflow.md documenting the PTQ → Deploy → Eval
  pipeline, workspace continuity, unsupported model handling, NEL
  deployment.command pattern, and NEL CI vs SLURM executor decision table
- Add cross-skill workspace flow to workspace-management.md
- Add "Next steps" to ptq/SKILL.md pointing to deployment/evaluation
- Add pipeline integration note to evaluation/SKILL.md

Depends on PR #1236 (deployment/references/unsupported-models.md).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review April 12, 2026 20:58
@Edwardf0t1 Edwardf0t1 self-assigned this Apr 12, 2026
@Edwardf0t1 Edwardf0t1 changed the title [1/N] Polish evaluation skills [1/N] Polish evaluation skills and common skills based on an E2E workflow testing Apr 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds end-to-end documentation that connects the PTQ, deployment, and evaluation “skills”, with cluster-specific guidance for running NeMo Evaluator Launcher (NEL) on SLURM vs JET/NEL CI.

Changes:

  • Added a new PTQ → Deploy → Eval workflow doc and cross-skill workspace flow guidance.
  • Added a comprehensive NEL CI / cluster guide (JET vs SLURM executor, storage paths, env var formats, common issues).
  • Updated existing skill docs to point users to the new workflow and evaluation references.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
|---|---|
| .claude/skills/ptq/SKILL.md | Adds a "Next steps" pointer into the end-to-end pipeline doc. |
| .claude/skills/evaluation/SKILL.md | Adds pipeline integration notes and a pointer to the new NEL CI guide. |
| .claude/skills/evaluation/references/nel-ci-guide.md | New detailed guide for NEL CI/JET vs SLURM executor usage, clusters, storage, and troubleshooting. |
| .claude/skills/common/workspace-management.md | Documents how a single workspace is reused across PTQ/deploy/eval stages. |
| .claude/skills/common/slurm-setup.md | Adds enroot/pyxis private registry credential setup for nvcr.io. |
| .claude/skills/common/remote-execution.md | Adds compute-node storage availability guidance to prevent "checkpoint not found" issues. |
| .claude/skills/common/end-to-end-workflow.md | New end-to-end workflow doc tying the three skills together. |


| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
Comment on lines +49 to +54

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
```

```bash
export GITLAB_TOKEN=<your_gitlab_token>

curl -k --request POST \
```

Comment on lines +56 to +64

If the checkpoint is on a workstation, **copy it to cluster storage first**:

```bash
rsync -av /path/to/local/checkpoint \
  <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
```

For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes.

Comment on lines +185 to +193

```bash
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
```

`chmod 777` is required because `svc-jet` (JET service account) runs containers and needs write access.

Comment thread .claude/skills/common/slurm-setup.md Outdated

Comment on lines +60 to +62

```bash
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
```

- Binds to `0.0.0.0` by default — health checks work out of the box
- For NVFP4: `--quantization modelopt_fp4`
- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`)
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.claude/skills/common/slurm-setup.md (1)

62-62: Clarify that $oauthtoken is a literal string.

Users might misinterpret $oauthtoken as a shell variable that needs to be set. Consider adding a brief note that $oauthtoken is NGC's required literal login string (not a variable).

📝 Suggested clarification

```diff
 # To add NGC credentials:
 mkdir -p ~/.config/enroot
-echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
+# Note: '$oauthtoken' is a literal string required by NGC (not a shell variable)
+echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
 chmod 600 ~/.config/enroot/.credentials
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/common/slurm-setup.md at line 62, The line that writes
credentials uses the token string "$oauthtoken" which readers may think is a
shell variable; update the documentation line that currently shows echo 'machine
nvcr.io login $oauthtoken password <NGC_API_KEY>' >
~/.config/enroot/.credentials to explicitly state that $oauthtoken is the
literal NGC login string (not a shell variable) and add a short parenthetical
like "(use the literal string $oauthtoken as NGC requires, do not replace with a
shell variable)" or escape it in the example so readers understand it's not to
be substituted; reference the exact example command text "machine nvcr.io login
$oauthtoken password <NGC_API_KEY>" when making the clarification.
.claude/skills/evaluation/references/nel-ci-guide.md (1)

178-193: Consider noting the security implications of chmod 777.

While chmod 777 is necessary for the JET service account to access these directories (as explained), it makes them world-writable. Consider adding a brief note about the security trade-off, or suggesting users scope these permissions to only the directories actually needed for evaluation runs.

📝 Suggested addition

```diff
 chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+
+# Note: chmod 777 is required for svc-jet service account access, but makes
+# directories world-writable. Limit to only the directories needed for your runs.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md around lines 178 - 193,
Add a short note after the chmod 777 commands explaining the security trade-off
of making those paths world-writable and suggest safer alternatives: prefer
scoping permissions to the service account or group (e.g., chown svc-jet or set
group ownership and use chmod 770/770-like perms), or use filesystem ACLs to
grant only the svc-jet account write access to the listed directories
(/lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface, /.../cache/vllm,
/.../nv-eval-rundirs) instead of blanket chmod 777.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/end-to-end-workflow.md:
- Around line 47-59: The doc currently claims the deployment.command override
works "with both NEL SLURM executor and NEL CI (via NEL_DEPLOYMENT_COMMAND)";
remove or amend this to avoid asserting unsupported CI support: either delete
the "NEL CI (via NEL_DEPLOYMENT_COMMAND)" clause or add a clear note that using
NEL_DEPLOYMENT_COMMAND is a custom/unsupported extension (not part of official
NEL executors) and list only the officially supported executors (e.g., NEL
SLURM, local, Lepton AI); update the text around deployment.command and
NEL_DEPLOYMENT_COMMAND to reflect this change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c8b9ea9-5ab6-43f7-96a8-fb58f0a1804e

📥 Commits

Reviewing files that changed from the base of the PR and between 0357cb9 and 2cb3b39.

📒 Files selected for processing (7)
  • .claude/skills/common/end-to-end-workflow.md
  • .claude/skills/common/remote-execution.md
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/common/workspace-management.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/references/nel-ci-guide.md
  • .claude/skills/ptq/SKILL.md

Comment thread .claude/skills/common/end-to-end-workflow.md Outdated
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…ra escaping

- Add wrapper script pattern for complex deployment commands that break
  Hydra's override parser (put serve.sh in checkpoint dir, reference as
  bash /checkpoint/serve.sh)
- Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs
- Add cross-cluster checkpoint copy via tar pipe
- Add Hydra LexerNoViableAltException and Bad Request to common issues
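
The NEL_CONFIG_BASE64 pattern listed above can be sketched as follows (the variable name comes from the commit message; how the CI side consumes and decodes it is an assumption for illustration):

```shell
# Sketch of shipping a YAML config through a single CI variable.
# NEL_CONFIG_BASE64 is the variable named in the commit message; the
# decode step models what the CI side would do and is an assumption.
cat > /tmp/nel-config.yaml <<'EOF'
deployment:
  served_model_name: my-model
EOF

# Encode without line wraps so the value survives as one CI variable.
NEL_CONFIG_BASE64=$(base64 < /tmp/nel-config.yaml | tr -d '\n')

# CI side: decode back to a file before triggering the run.
echo "$NEL_CONFIG_BASE64" | base64 -d > /tmp/nel-config.decoded.yaml
```

Base64-encoding sidesteps the Hydra/quoting escaping issues that motivated this commit, since the config travels as one opaque token.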

Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond,
LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for
Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI.
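
The cross-cluster tar-pipe copy mentioned in the list above can be sketched like this; here the pipe connects two local directories, while for a real cross-cluster copy the receiving `tar` would run behind `ssh <dest-cluster-login>` (paths and hostnames are placeholders):

```shell
# Tar-pipe copy sketch. Locally the pipe connects two directories; for a
# real cross-cluster copy, replace the right-hand side with something like
#   ssh <dest-cluster-login> 'tar -C /lustre/.../checkpoints -xf -'
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/model-name"
echo "weights" > "$src/model-name/model.safetensors"

# -C changes directory on each side, so only relative paths travel in the
# archive and the checkpoint lands under the destination root.
tar -C "$src" -cf - model-name | tar -C "$dst" -xf -
```

The pipe avoids staging a full tarball on either side, which matters when checkpoints are tens of gigabytes.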

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
.claude/skills/evaluation/references/nel-ci-guide.md (1)

70-70: Prefer least-privilege wording for chmod 777 guidance

Given this is shared infra guidance, consider documenting 777 as a fallback-only option and prefer group ACL/owner-group write where possible to reduce accidental overexposure.

Also applies to: 128-128, 270-275

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md at line 70, Replace the
blanket "chmod -R 777 /lustre/.../checkpoints/model-name" guidance with
least-privilege instructions: set the directory owner/group to svc-jet (chown
svc-jet:svc-jet) and grant group write/execute (chmod g+rwX) or use POSIX ACLs
(setfacl) to grant only svc-jet the required access, and note that 777 should be
documented as a last-resort fallback; apply the same change to the other
occurrences referenced (lines 128 and 270-275) and ensure examples mention using
setfacl or group-owner changes rather than recommending 777.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/references/nel-ci-guide.md:
- Line 87: Remove the insecure TLS bypass from the curl examples by deleting the
"-k" flag in the command that currently appears as "curl -k --request POST" (and
the other occurrence) and either leave the example using standard TLS
verification or replace it with an explicit documented option to set a CA bundle
(e.g., mention "--cacert /path/to/corporate-ca.pem" or a pinned CA path) so the
examples do not send PRIVATE-TOKEN over disabled verification.
- Around line 133-135: The fenced code block containing the JSON snippet with
the key "NEL_DEPLOYMENT_COMMAND" should include a language tag to satisfy
markdownlint MD040; update the block fence to use ```json so the snippet
({"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}) is
marked as JSON, ensuring the markdown linter recognizes the language.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2e6188dc-54d7-48bd-8e55-d7d6b0fc9ed8

📥 Commits

Reviewing files that changed from the base of the PR and between 1b94fc9 and b1be817.

📒 Files selected for processing (1)
  • .claude/skills/evaluation/references/nel-ci-guide.md

Comment thread .claude/skills/evaluation/references/nel-ci-guide.md Outdated
Comment thread .claude/skills/evaluation/references/nel-ci-guide.md Outdated
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
When using NEL_DEPLOYMENT_COMMAND with a custom --served-model-name,
deployment.served_model_name must also be overridden via
NEL_OTHER_OVERRIDES — NEL uses the config field (not the actual serve
command) to set the eval client's model_id. Without this, the client
sends the checkpoint path as model_id, causing 404 errors.
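
Concretely, that pairing looks something like this (a sketch: the variable names come from this commit message, but the exact NEL_OTHER_OVERRIDES syntax and the serve.sh path are assumptions):

```shell
# Sketch of keeping the serve command and the NEL config field in agreement,
# so the eval client's model_id matches what the server actually serves.
# Variable names are from the commit message; the syntax is an assumption.
SERVED_NAME="my-model"
# serve.sh is assumed to pass --served-model-name "$SERVED_NAME" to the server.
export NEL_DEPLOYMENT_COMMAND="bash /checkpoint/serve.sh"
# NEL reads this config field (not the serve command) to set the client model_id.
export NEL_OTHER_OVERRIDES="deployment.served_model_name=${SERVED_NAME}"
```

Deriving both values from one `SERVED_NAME` variable makes the 404-causing mismatch structurally impossible.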

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@@ -0,0 +1,276 @@
# NEL CI Evaluation Guide
Collaborator


this is Nvidia internal, suggest:

1. Put all Nvidia internal stuff under something like `.claude/skills/evaluation/references/nvidia-internal/`
2. Put it in the ModelOpt-Internal repo

Contributor Author

@Edwardf0t1 Edwardf0t1 Apr 19, 2026


Thanks — went with option 2. The file is moved to Model-Optimizer-Internal MR !57 at agent/evaluation_guide.md, and the public copy + the "NEL CI and Cluster-Specific Notes" section in evaluation/SKILL.md are deleted in 290f432. Also took the opportunity to refresh it against current upstream NEL / NEL-CI (newer clusters, Sybil vs non-Sybil, nel-ci-cli as preferred trigger, balanced NEL vs NEL-CI framing).

@Edwardf0t1 Edwardf0t1 requested a review from cjluo-nv April 13, 2026 21:30
Contributor


Is this workflow really necessary? It feels like the existing skills and Claude Code already cover it.

Contributor


+1, we talked about this and decided not to add an e2e workflow, as I remember; the idea is that Claude knows to call multiple skills when the user says something like "quantize and deploy"

Contributor Author


You're right — deleted in 9cb309b. The skill descriptions already handle cross-skill routing ("quantize and deploy" chains ptq → deployment); an e2e prose doc was duplicating that in a form agents don't read.

Contributor Author


Acknowledged, sorry for re-introducing it despite the earlier discussion. Deleted in 9cb309b. The one useful insight ("carry PTQ patches forward to deploy/eval") is preserved as a one-liner in evaluation/SKILL.md.

```yaml
default_cluster: my-cluster
```

### Checkpoint and storage availability
Contributor


I didn't get this part: the compute node should have the same storage access as the login node. The only special case I have seen is dlcluster, where each node needs an IT path export to access user storage. This shouldn't be added publicly, and internally we should use our team's storage path to avoid it.

Contributor Author


Good catch — framing was wrong. Compute nodes on a given cluster do share login-node storage; the real issue is workstation paths aren't on the cluster at all. Reframed as "Staging checkpoints from your workstation" in 7824b24, and dropped the dlcluster-specific row per your note that it's an internal quirk that shouldn't ship publicly.

Comment thread .claude/skills/common/slurm-setup.md Outdated
```bash
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
```
Contributor

@mxinO mxinO Apr 14, 2026


This doesn't seem SLURM-specific. Considering the env setup, we should have a general setup guide to help users set the HF token, Docker login token, NGC login token, etc. Maybe we can create another env-setup.md to handle this in one place? But this sounds beyond a modelopt skill. cc @kaix-nv

Contributor Author


Agreed — split out into .claude/skills/common/credentials.md in 645a545, covering HF_TOKEN, NGC API key (Docker + enroot paths), and Docker Hub. slurm-setup.md keeps only a one-paragraph pyxis pointer at it. Same change also handles CodeRabbit's $oauthtoken-is-literal note and Copilot's append-don't-overwrite concern.

…valuator

Both are vendored verbatim from commit 01899f8 with SHA-pin provenance in
frontmatter. `launching-evals` covers run/monitor/debug/analyze flows for NEL
evaluations; `accessing-mlflow` covers MLflow run querying via mlflow-mcp.

These complement (do not duplicate) our existing `evaluation` skill, which
remains focused on config generation with ModelOpt-specific additions.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Resolves conflict in .claude/skills/ptq/SKILL.md by keeping both additions:
main's new "Post-quantization validation" subsection (with pointer to
references/checkpoint-validation.md), followed by our existing "Next steps"
cross-skill pointer.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
The script re-downloads launching-evals/ and accessing-mlflow/ from
NVIDIA-NeMo/Evaluator at a pinned SHA (DEFAULT_SHA in the script) and
re-applies our provenance frontmatter. Idempotent — running repeatedly
at the same SHA produces no diff.

Also fixes the re-sync path in both SKILL.md frontmatters to the actual
script location (.claude/scripts/).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
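The idempotency claim above rests on the provenance re-stamp not duplicating itself on repeated runs. A minimal sketch of that pattern, assuming a guard grep and an `upstream-sha:` frontmatter field (the field names and layout are illustrative, not the script's actual contents):

```shell
# Hedged sketch: re-apply provenance frontmatter idempotently.
# Field names (vendored-from, upstream-sha) are illustrative placeholders.
add_provenance() {
  local file="$1" sha="$2"
  # Guard: if the stamp is already present, a second run is a no-op.
  grep -q '^upstream-sha:' "$file" && return 0
  printf -- '---\nvendored-from: NVIDIA-NeMo/Evaluator\nupstream-sha: %s\n---\n%s\n' \
    "$sha" "$(cat "$file")" > "$file"
}

f="$(mktemp)"
echo '# SKILL' > "$f"
add_provenance "$f" 01899f8
add_provenance "$f" 01899f8   # second run at the same SHA: no diff
```

Running at the same SHA twice leaves exactly one stamp, which is what makes re-vendoring safe to run from CI or by hand.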
Reviewers on PR #1239 (kaix-nv, mxinO) flagged the e2e workflow doc as
unnecessary: the skill descriptions already route Claude to chain PTQ,
deployment, and evaluation skills, and the content duplicated
workspace-management.md or lived better inside the evaluation skill's
nel-ci-guide.md references.

Removes the file and its three cross-references (evaluation/SKILL.md,
ptq/SKILL.md, workspace-management.md). The "carry PTQ patches forward
to deploy/eval" insight is preserved as a one-liner in evaluation/SKILL.md.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Reviewer @shengliangxu flagged that the NEL CI evaluation guide contains
NVIDIA-internal infrastructure (JET clusters, svc-jet service account,
gitlab-master NEL CI triggers, COMPEVAL_HF_TOKEN, internal lustre paths)
and should not ship in the public repo.

The file has been moved to Model-Optimizer-Internal:agent/nel-ci-guide.md
(see internal MR: zhiyu/add-nel-ci-guide-to-agent). This commit removes
the public copy and the "NEL CI and Cluster-Specific Notes" section from
evaluation/SKILL.md that referenced it.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Two fixes:

1. Exclude vendored upstream skills from markdownlint.

   `.claude/skills/launching-evals/` and `.claude/skills/accessing-mlflow/`
   are vendored verbatim from NVIDIA-NeMo/Evaluator and re-synced via
   .claude/scripts/sync-upstream-skills.sh. Markdownlint wanted to
   reformat them (trailing blank lines, spacing around fences), but
   fixing would violate the "verbatim" property documented in their
   frontmatter. Add an `ignores:` glob to `.markdownlint-cli2.yaml`.

2. Reframe the checkpoint/storage note in `skills/common/remote-execution.md`.

   Reviewer @mxinO noted (PR #1239) that the previous "compute nodes may
   not share the same filesystem as login nodes" framing is misleading —
   compute nodes on a given cluster do share storage with the login
   node. The real issue is that workstation filesystems aren't mounted
   on the cluster at all. Also drops the dlcluster-specific row, which
   @mxinO flagged as an internal quirk that shouldn't ship publicly.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
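The `ignores:` addition described in point 1 might look like this in `.markdownlint-cli2.yaml` (a sketch — the file's other keys are omitted and the exact glob spelling is an assumption):

```yaml
# Sketch: skip vendored upstream skills that must stay byte-identical.
ignores:
  - ".claude/skills/launching-evals/**"
  - ".claude/skills/accessing-mlflow/**"
```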
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Addresses three overlapping review comments on slurm-setup.md:62
from PR #1239:

- @mxinO: NGC/HF/Docker tokens aren't SLURM-specific — wanted a
  general credential setup guide referenced from multiple skills.
- CodeRabbit: `$oauthtoken` needs to be called out as a literal NGC
  login string, not a shell variable to substitute.
- Copilot: the previous snippet overwrote `~/.config/enroot/.credentials`
  unconditionally, clobbering entries for other registries.

New `skills/common/credentials.md` covers HF_TOKEN, NGC API key (Docker
+ enroot paths), and Docker Hub. The NGC/enroot block uses an
append-if-missing pattern (`grep -q ... || echo ... >>`) and spells
out that `$oauthtoken` is a literal, kept unexpanded via single quotes.

`slurm-setup.md` now keeps only the pyxis-specific signpost — one
paragraph pointing at `credentials.md` for the actual setup.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
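The append-if-missing pattern described above can be sketched as follows; `<NGC_API_KEY>` stays a placeholder you substitute by hand, and the exact contents of credentials.md may differ:

```shell
# Hedged sketch of the enroot credential setup from credentials.md.
# CRED_FILE defaults to enroot's netrc-style credentials file.
CRED_FILE="${CRED_FILE:-$HOME/.config/enroot/.credentials}"
mkdir -p "$(dirname "$CRED_FILE")"
touch "$CRED_FILE"

# Append only if no nvcr.io entry exists yet, so entries for other
# registries are kept. Single quotes keep $oauthtoken unexpanded — it is
# a literal NGC login string, not a shell variable to substitute; only
# <NGC_API_KEY> gets replaced by hand.
grep -q '^machine nvcr\.io ' "$CRED_FILE" || \
  echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' >> "$CRED_FILE"
```

Re-running the snippet leaves the file unchanged, which is the clobbering fix Copilot asked for.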
PR #1252 (kaix-nv) landed with the joint trim text I proposed in the
coordination comment. The Monitoring block in evaluation/SKILL.md now
points at all three skills:

- `monitor` for durable cross-session job tracking
- `launching-evals` for one-shot status / debug / analyze
- `accessing-mlflow` for querying past runs

All merges auto-resolved (evaluation/SKILL.md, ptq/SKILL.md,
deployment/SKILL.md). Nothing from our branch was lost —
credentials.md, remote-execution.md reframe, markdownlint `ignores:`,
and the vendored skills all survived.

The Monitoring-trim item on our plan is now satisfied by #1252 rather
than a direct edit on this branch.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 changed the title [1/N] Polish evaluation skills and common skills based on an E2E workflow testing [1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator Apr 19, 2026
Documents the new Claude Code evaluation-related skills and the shared
credentials.md common doc, mirroring the style of the existing PTQ skill
entry in the 0.44 release notes.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
- **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
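The "ask before launching" check above can be partially automated. A minimal sketch, assuming the conventional `models--<org>--<name>` layout of the HF hub cache (the model ids and paths are placeholders):

```shell
# Hedged sketch: test whether a model is already in a local HF hub cache
# before launching with HF_HUB_OFFLINE=1. Relies on the conventional
# hub/models--<org>--<name> directory layout.
hf_cached() {
  local hf_home="$1" model="$2"
  [ -d "$hf_home/hub/models--${model//\//--}" ]
}

demo_home="$(mktemp -d)"                      # stand-in for $HF_HOME
mkdir -p "$demo_home/hub/models--org--model-a"

hf_cached "$demo_home" org/model-a && echo "org/model-a: cached"
hf_cached "$demo_home" org/model-b || \
  echo "org/model-b: not cached — pre-download it on the login node first"
```

A missing cache entry is what surfaces later as vLLM's `LocalEntryNotFoundError`, so failing fast here is cheaper than failing on the compute node.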
Collaborator

Can you change the full lustre path to a mount path or dummy path?


# List available evaluation tasks (by default, only shows tasks from the latest released containers)
uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container gitlab-master.nvidia.com/dl/joc/competitive_evaluation/nvidia-core-evals/ci-llm/long-context-eval:dev-2025-12-16T14-37-1693de28-amd64
Collaborator

Remove internal gitlab containers and PPP

6 participants