[1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator#1239

Open
Edwardf0t1 wants to merge 18 commits into main from zhiyu/polish-eval-skills

Conversation

@Edwardf0t1
Contributor

@Edwardf0t1 Edwardf0t1 commented Apr 12, 2026

What does this PR do?

Type of change: Documentation / Skills

Polishes evaluation and common skills based on end-to-end experience quantizing + deploying + evaluating LLMs. Vendors the two upstream Claude skills from NeMo Evaluator, splits shared credential setup into its own doc, and applies reviewer feedback.

Changes

New files

  • .claude/skills/launching-evals/ — vendored verbatim from NVIDIA-NeMo/Evaluator @ commit 01899f8. Covers run / check / debug / analyze flows for NEL evaluations.
  • .claude/skills/accessing-mlflow/SKILL.md — vendored verbatim from the same upstream. Queries MLflow runs via the mlflow-mcp MCP server.
  • .claude/scripts/sync-upstream-skills.sh — re-vendors the two skills above at a pinned SHA. Idempotent; re-applies our provenance frontmatter on each run.
  • .claude/skills/common/credentials.md — shared HF / NGC / Docker credential setup, referenced from slurm-setup.md. Generic (not NVIDIA-internal) — public NEL SLURM-executor users rely on the same NGC/HF setup.
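
The idempotent provenance step of the sync script can be sketched as follows. This is a hypothetical illustration, not the script's actual contents: only the pinned SHA comes from this PR, while the `add_provenance` helper name and the `vendored-from` frontmatter key are invented for the sketch.

```shell
# Hypothetical sketch of the sync script's idempotent provenance step.
# Only the pinned SHA (01899f8) comes from this PR; the helper name and
# the "vendored-from" frontmatter key are invented for illustration.
UPSTREAM_SHA="01899f8"

add_provenance() {
  # Prepend provenance frontmatter unless it is already present, so that
  # re-running the sync at the same SHA produces no diff.
  local file="$1"
  grep -q '^vendored-from:' "$file" || {
    local body
    body=$(cat "$file")
    printf -- '---\nvendored-from: NVIDIA-NeMo/Evaluator@%s\n---\n%s\n' \
      "$UPSTREAM_SHA" "$body" > "$file"
  }
}
```

Running the helper twice on the same file leaves it unchanged after the first run, which is the property the PR description calls "idempotent".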

Updated files

  • .claude/skills/common/slurm-setup.md — NGC credential block collapsed to a one-paragraph pointer at credentials.md.
  • .claude/skills/common/remote-execution.md — reframed "Checkpoint and storage availability" as "Staging checkpoints from your workstation". Drops the misleading login-vs-compute framing and the dlcluster-specific row.
  • .claude/skills/common/workspace-management.md — drops stale pointer to the deleted e2e doc.
  • .claude/skills/evaluation/SKILL.md — workspace-integration intro trimmed; NEL CI section removed (content moved to Model-Optimizer-Internal). Monitoring block replaced via the #1252 merge ("Add a standalone monitor skill for persistent job tracking") with joint trim text that routes to monitor, launching-evals, and accessing-mlflow.
  • .claude/skills/ptq/SKILL.md — "Next steps" block refined.
  • .markdownlint-cli2.yaml — excludes vendored upstream skills from markdownlint so they stay byte-identical to upstream.

Deleted files

  • .claude/skills/common/end-to-end-workflow.md — per @kaix-nv and @mxinO: redundant with skill descriptions that already handle cross-skill routing.
  • .claude/skills/evaluation/references/nel-ci-guide.md — per @shengliangxu: NVIDIA-internal. Moved to Model-Optimizer-Internal:agent/nel-ci-guide.md (MR !57).

Review status

| Reviewer | Concern | How addressed |
|---|---|---|
| @shengliangxu | nel-ci-guide.md contains NVIDIA-internal content | Moved to Model-Optimizer-Internal MR !57, with refresh (current cluster list, Sybil/non-Sybil distinction, prerequisites checklist, nel-ci-cli preferred trigger path) |
| @kaix-nv, @mxinO | e2e workflow doc unnecessary | Deleted. Skill descriptions already route between ptq / deployment / evaluation. |
| @kaix-nv (#1252) | Overlap on evaluation/SKILL.md Monitoring section | Coordinated via comment on #1252. @kaix-nv incorporated the joint trim text referencing monitor + launching-evals + accessing-mlflow. Landed via #1252 merge. |
| @mxinO | remote-execution.md compute-node framing is misleading | Reframed as workstation→cluster staging. |
| @mxinO, CodeRabbit, Copilot | slurm-setup NGC creds aren't SLURM-specific; `$oauthtoken` literal clarification; overwrite safety | New credentials.md with NGC / HF / Docker setup. Append-if-missing pattern (`grep -q … || echo … >>`). Explicitly calls out `$oauthtoken` as literal, kept unexpanded via single quotes. |
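
The append-if-missing pattern from the last row can be sketched like this (the file path follows the standard enroot location; the single quotes keep `$oauthtoken` from being expanded, since NGC expects exactly that literal string as the username):

```shell
# Sketch of the append-if-missing credential setup discussed above.
# Single quotes keep $oauthtoken literal: NGC expects that exact username.
CRED_FILE="$HOME/.config/enroot/.credentials"
mkdir -p "$(dirname "$CRED_FILE")"
touch "$CRED_FILE"
# Append only if no nvcr.io entry exists yet, so re-runs never clobber
# or duplicate an existing line (the "overwrite safety" concern).
grep -q 'machine nvcr.io' "$CRED_FILE" || \
  echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' >> "$CRED_FILE"
chmod 600 "$CRED_FILE"
```

Re-running the `grep -q … || echo … >>` line is a no-op once the entry exists, which is what makes it safe to place in setup docs that users may execute repeatedly.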

Related

Motivation

Learnings from running end-to-end PTQ → Deploy → Eval on Devstral-Small-2-24B (FP8 VLM → NVFP4 MLP-only) on dlcluster B100, plus prior NEL CI experience on oci-hsg.

Testing

Validated end-to-end: PTQ (6 min) → vLLM deployment (3 debug iterations) → NEL evaluation (MMLU 77.4%, GSM8K 80%, GPQA 40% on limit_samples=10).

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (documentation / skills only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅
    • .claude/skills/launching-evals/ and .claude/skills/accessing-mlflow/ are vendored verbatim from NVIDIA-NeMo/Evaluator (Apache-2.0). Provenance SHA pinned in each SKILL.md frontmatter and in .claude/scripts/sync-upstream-skills.sh.
  • Did you write any new necessary tests?: N/A (skill documentation)
  • Did you update Changelog?: N/A

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@coderabbitai
Contributor

coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds documentation tying PTQ → Deployment → Evaluation into a single pipeline: workspace conventions and artifact locations, cluster/CI execution guidance (SLURM vs NEL CI), registry/auth instructions, workspace persistence across stages, runtime deployment override patterns, and operational troubleshooting for JET and SLURM.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **End-to-end workflow**<br>`.claude/skills/common/end-to-end-workflow.md` | New doc defining the PTQ → Deployment → Evaluation pipeline, shared workspace layout (`workspaces/<model>/`), expected PTQ/deploy/eval artifact locations, unsupported-model failure modes, and deployment override mechanisms (`deployment.command`, `NEL_DEPLOYMENT_COMMAND`). |
| **Cluster & remote execution**<br>`.claude/skills/common/remote-execution.md`, `.claude/skills/common/slurm-setup.md`, `.claude/skills/evaluation/references/nel-ci-guide.md` | Added checkpoint accessibility checks and cluster storage path mappings with copy examples; Pyxis/Enroot registry auth steps and noted srun 401 failure mode; comprehensive NEL CI guide covering SLURM vs GitLab flows, GPU/tensor-parallel sizing, env-var/secret formats, gated dataset behavior, checkpoint transfer commands, and JET-specific operational snippets. |
| **Workspace & pipeline integration**<br>`.claude/skills/common/workspace-management.md`, `.claude/skills/ptq/SKILL.md` | Documented cross-skill workspace flow persisting PTQ outputs and adding `eval_results/` + `eval_config.yaml`; updated example flow to reuse workspaces; PTQ SKILL now points users to deployment/evaluation next steps and notes carrying PTQ patches forward. |
| **Evaluation guidance updates**<br>`.claude/skills/evaluation/SKILL.md` | Inserted guidance to carry deployment-time runtime patches into NEL config via `deployment.command`; added "NEL CI and Cluster-Specific Notes" checklist referencing nel-ci-guide.md and cluster/CI operational considerations. |
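
The srun 401 failure mode noted above typically traces back to a missing Enroot registry entry. A preflight check can be sketched as follows (a sketch assuming the standard `~/.config/enroot/.credentials` location; the message text is ours):

```shell
# Preflight sketch: a missing nvcr.io entry in the enroot credentials file
# is what typically surfaces later as a 401 during `srun --container-image`
# pulls. The path is the standard enroot location; messages are invented.
if grep -qs 'machine nvcr.io' ~/.config/enroot/.credentials; then
  echo "nvcr.io credentials found"
else
  echo "nvcr.io credentials missing; see slurm-setup.md / credentials.md" >&2
fi
```

Running this on the login node before submitting a container job turns an opaque mid-job 401 into an immediate, actionable message.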

Sequence Diagram(s)

sequenceDiagram
    participant User as User / Agent
    participant PTQ as PTQ Job
    participant WS as Workspace\n(workspaces/<model>)
    participant Deploy as Deployment Job\n(NEL/SLURM/CI)
    participant Serve as Model Server
    participant Eval as Evaluation Job
    participant Storage as Cluster Storage

    User->>PTQ: submit PTQ (produce quantized checkpoint -> WS/output/)
    PTQ->>WS: write checkpoint & artifacts
    User->>Deploy: trigger deployment (consumes WS/output/, may set deployment.command / NEL_DEPLOYMENT_COMMAND)
    Deploy->>Serve: stage model and start serving
    Serve->>Eval: run evaluation tasks (reads from Serve, writes WS/eval_results/)
    Eval->>WS: write eval artifacts and eval_config.yaml
    Eval->>Storage: persist final results

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 4 passed

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Security Anti-Patterns | ✅ Passed | PR contains only documentation changes to markdown files in the .claude/skills/ directory; no Python code or configuration files modified. |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title references polishing evaluation and common skills and mentions both end-to-end workflow testing and vendoring Claude skills from NeMo Evaluator. The changeset adds substantial new documentation files (end-to-end-workflow.md, nel-ci-guide.md) and updates multiple skill files to reflect E2E workflow insights and NEL CI guidance, which aligns well with the stated title. |


@github-actions
Contributor

github-actions bot commented Apr 12, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1239/

Built to branch gh-pages at 2026-04-19 05:33 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@codecov

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.61%. Comparing base (e9a4989) to head (c664a30).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1239   +/-   ##
=======================================
  Coverage   76.61%   76.61%           
=======================================
  Files         459      459           
  Lines       49153    49153           
=======================================
  Hits        37661    37661           
  Misses      11492    11492           
| Flag | Coverage Δ |
|---|---|
| unit | 52.95% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.
- Add common/end-to-end-workflow.md documenting the PTQ → Deploy → Eval
  pipeline, workspace continuity, unsupported model handling, NEL
  deployment.command pattern, and NEL CI vs SLURM executor decision table
- Add cross-skill workspace flow to workspace-management.md
- Add "Next steps" to ptq/SKILL.md pointing to deployment/evaluation
- Add pipeline integration note to evaluation/SKILL.md

Depends on PR #1236 (deployment/references/unsupported-models.md).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review April 12, 2026 20:58
@Edwardf0t1 Edwardf0t1 self-assigned this Apr 12, 2026
@Edwardf0t1 Edwardf0t1 changed the title [1/N] Polish evaluation skills [1/N] Polish evaluation skills and common skills based on an E2E workflow testing Apr 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds end-to-end documentation that connects the PTQ, deployment, and evaluation “skills”, with cluster-specific guidance for running NeMo Evaluator Launcher (NEL) on SLURM vs JET/NEL CI.

Changes:

  • Added a new PTQ → Deploy → Eval workflow doc and cross-skill workspace flow guidance.
  • Added a comprehensive NEL CI / cluster guide (JET vs SLURM executor, storage paths, env var formats, common issues).
  • Updated existing skill docs to point users to the new workflow and evaluation references.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
|---|---|
| .claude/skills/ptq/SKILL.md | Adds a "Next steps" pointer into the end-to-end pipeline doc. |
| .claude/skills/evaluation/SKILL.md | Adds pipeline integration notes and a pointer to the new NEL CI guide. |
| .claude/skills/evaluation/references/nel-ci-guide.md | New detailed guide for NEL CI/JET vs SLURM executor usage, clusters, storage, and troubleshooting. |
| .claude/skills/common/workspace-management.md | Documents how a single workspace is reused across PTQ/deploy/eval stages. |
| .claude/skills/common/slurm-setup.md | Adds enroot/pyxis private registry credential setup for nvcr.io. |
| .claude/skills/common/remote-execution.md | Adds compute-node storage availability guidance to prevent "checkpoint not found" issues. |
| .claude/skills/common/end-to-end-workflow.md | New end-to-end workflow doc tying the three skills together. |


| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
Comment on lines +49 to +54

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
```

```bash
export GITLAB_TOKEN=<your_gitlab_token>

curl -k --request POST \
```

Comment on lines +56 to +64

If the checkpoint is on a workstation, **copy it to cluster storage first**:

```bash
rsync -av /path/to/local/checkpoint \
  <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
```

For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes.

Comment on lines +185 to +193

```bash
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
```

`chmod 777` is required because `svc-jet` (JET service account) runs containers and needs write access.

Comment thread .claude/skills/common/slurm-setup.md Outdated

Comment on lines +60 to +62

```bash
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
```

- Binds to `0.0.0.0` by default — health checks work out of the box
- For NVFP4: `--quantization modelopt_fp4`
- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`)
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.claude/skills/common/slurm-setup.md (1)

62-62: Clarify that $oauthtoken is a literal string.

Users might misinterpret $oauthtoken as a shell variable that needs to be set. Consider adding a brief note that $oauthtoken is NGC's required literal login string (not a variable).

📝 Suggested clarification

```diff
 # To add NGC credentials:
 mkdir -p ~/.config/enroot
-echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
+# Note: '$oauthtoken' is a literal string required by NGC (not a shell variable)
+echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
 chmod 600 ~/.config/enroot/.credentials
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/common/slurm-setup.md at line 62, The line that writes
credentials uses the token string "$oauthtoken" which readers may think is a
shell variable; update the documentation line that currently shows echo 'machine
nvcr.io login $oauthtoken password <NGC_API_KEY>' >
~/.config/enroot/.credentials to explicitly state that $oauthtoken is the
literal NGC login string (not a shell variable) and add a short parenthetical
like "(use the literal string $oauthtoken as NGC requires, do not replace with a
shell variable)" or escape it in the example so readers understand it's not to
be substituted; reference the exact example command text "machine nvcr.io login
$oauthtoken password <NGC_API_KEY>" when making the clarification.
.claude/skills/evaluation/references/nel-ci-guide.md (1)

178-193: Consider noting the security implications of chmod 777.

While chmod 777 is necessary for the JET service account to access these directories (as explained), it makes them world-writable. Consider adding a brief note about the security trade-off, or suggesting users scope these permissions to only the directories actually needed for evaluation runs.

📝 Suggested addition

```diff
 chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+
+# Note: chmod 777 is required for svc-jet service account access, but makes
+# directories world-writable. Limit to only the directories needed for your runs.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md around lines 178 - 193,
Add a short note after the chmod 777 commands explaining the security trade-off
of making those paths world-writable and suggest safer alternatives: prefer
scoping permissions to the service account or group (e.g., chown svc-jet or set
group ownership and use chmod 770/770-like perms), or use filesystem ACLs to
grant only the svc-jet account write access to the listed directories
(/lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface, /.../cache/vllm,
/.../nv-eval-rundirs) instead of blanket chmod 777.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/end-to-end-workflow.md:
- Around line 47-59: The doc currently claims the deployment.command override
works "with both NEL SLURM executor and NEL CI (via NEL_DEPLOYMENT_COMMAND)";
remove or amend this to avoid asserting unsupported CI support: either delete
the "NEL CI (via NEL_DEPLOYMENT_COMMAND)" clause or add a clear note that using
NEL_DEPLOYMENT_COMMAND is a custom/unsupported extension (not part of official
NEL executors) and list only the officially supported executors (e.g., NEL
SLURM, local, Lepton AI); update the text around deployment.command and
NEL_DEPLOYMENT_COMMAND to reflect this change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c8b9ea9-5ab6-43f7-96a8-fb58f0a1804e

📥 Commits

Reviewing files that changed from the base of the PR and between 0357cb9 and 2cb3b39.

📒 Files selected for processing (7)
  • .claude/skills/common/end-to-end-workflow.md
  • .claude/skills/common/remote-execution.md
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/common/workspace-management.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/references/nel-ci-guide.md
  • .claude/skills/ptq/SKILL.md

Comment thread .claude/skills/common/end-to-end-workflow.md Outdated
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…ra escaping

- Add wrapper script pattern for complex deployment commands that break
  Hydra's override parser (put serve.sh in checkpoint dir, reference as
  bash /checkpoint/serve.sh)
- Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs
- Add cross-cluster checkpoint copy via tar pipe
- Add Hydra LexerNoViableAltException and Bad Request to common issues
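
The NEL_CONFIG_BASE64 pattern listed above can be sketched as follows (the variable name comes from the commit message; how the CI side consumes and decodes it is an assumption for illustration):

```shell
# Sketch of shipping a YAML config through a single CI variable.
# NEL_CONFIG_BASE64 is the variable named in the commit message; the
# decode step models what the CI side would do and is an assumption.
cat > /tmp/nel-config.yaml <<'EOF'
deployment:
  served_model_name: my-model
EOF

# Encode without line wraps so the value survives as one CI variable.
NEL_CONFIG_BASE64=$(base64 < /tmp/nel-config.yaml | tr -d '\n')

# CI side: decode back to a file before triggering the run.
echo "$NEL_CONFIG_BASE64" | base64 -d > /tmp/nel-config.decoded.yaml
```

Base64-encoding sidesteps the Hydra/quoting escaping issues that motivated this commit, since the config travels as one opaque token.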

Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond,
LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for
Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI.
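
The cross-cluster tar-pipe copy mentioned in the list above can be sketched like this; here the pipe connects two local directories, while for a real cross-cluster copy the receiving `tar` would run behind `ssh <dest-cluster-login>` (paths and hostnames are placeholders):

```shell
# Tar-pipe copy sketch. Locally the pipe connects two directories; for a
# real cross-cluster copy, replace the right-hand side with something like
#   ssh <dest-cluster-login> 'tar -C /lustre/.../checkpoints -xf -'
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/model-name"
echo "weights" > "$src/model-name/model.safetensors"

# -C changes directory on each side, so only relative paths travel in the
# archive and the checkpoint lands under the destination root.
tar -C "$src" -cf - model-name | tar -C "$dst" -xf -
```

The pipe avoids staging a full tarball on either side, which matters when checkpoints are tens of gigabytes.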

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
.claude/skills/evaluation/references/nel-ci-guide.md (1)

70-70: Prefer least-privilege wording for chmod 777 guidance

Given this is shared infra guidance, consider documenting 777 as a fallback-only option and prefer group ACL/owner-group write where possible to reduce accidental overexposure.

Also applies to: 128-128, 270-275

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md at line 70, Replace the
blanket "chmod -R 777 /lustre/.../checkpoints/model-name" guidance with
least-privilege instructions: set the directory owner/group to svc-jet (chown
svc-jet:svc-jet) and grant group write/execute (chmod g+rwX) or use POSIX ACLs
(setfacl) to grant only svc-jet the required access, and note that 777 should be
documented as a last-resort fallback; apply the same change to the other
occurrences referenced (lines 128 and 270-275) and ensure examples mention using
setfacl or group-owner changes rather than recommending 777.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/references/nel-ci-guide.md:
- Line 87: Remove the insecure TLS bypass from the curl examples by deleting the
"-k" flag in the command that currently appears as "curl -k --request POST" (and
the other occurrence) and either leave the example using standard TLS
verification or replace it with an explicit documented option to set a CA bundle
(e.g., mention "--cacert /path/to/corporate-ca.pem" or a pinned CA path) so the
examples do not send PRIVATE-TOKEN over disabled verification.
- Around line 133-135: The fenced code block containing the JSON snippet with
the key "NEL_DEPLOYMENT_COMMAND" should include a language tag to satisfy
markdownlint MD040; update the block fence to use ```json so the snippet
({"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}) is
marked as JSON, ensuring the markdown linter recognizes the language.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2e6188dc-54d7-48bd-8e55-d7d6b0fc9ed8

📥 Commits

Reviewing files that changed from the base of the PR and between 1b94fc9 and b1be817.

📒 Files selected for processing (1)
  • .claude/skills/evaluation/references/nel-ci-guide.md

Comment thread .claude/skills/evaluation/references/nel-ci-guide.md Outdated
Comment thread .claude/skills/evaluation/references/nel-ci-guide.md Outdated
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
When using NEL_DEPLOYMENT_COMMAND with a custom --served-model-name,
deployment.served_model_name must also be overridden via
NEL_OTHER_OVERRIDES — NEL uses the config field (not the actual serve
command) to set the eval client's model_id. Without this, the client
sends the checkpoint path as model_id, causing 404 errors.
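
Concretely, that pairing looks something like this (a sketch: the variable names come from this commit message, but the exact NEL_OTHER_OVERRIDES syntax and the serve.sh path are assumptions):

```shell
# Sketch of keeping the serve command and the NEL config field in agreement,
# so the eval client's model_id matches what the server actually serves.
# Variable names are from the commit message; the syntax is an assumption.
SERVED_NAME="my-model"
# serve.sh is assumed to pass --served-model-name "$SERVED_NAME" to the server.
export NEL_DEPLOYMENT_COMMAND="bash /checkpoint/serve.sh"
# NEL reads this config field (not the serve command) to set the client model_id.
export NEL_OTHER_OVERRIDES="deployment.served_model_name=${SERVED_NAME}"
```

Deriving both values from one `SERVED_NAME` variable makes the 404-causing mismatch structurally impossible.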

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@@ -0,0 +1,276 @@
# NEL CI Evaluation Guide
Collaborator


this is Nvidia internal, suggest:

1. Put all Nvidia internal stuff under something like `.claude/skills/evaluation/references/nvidia-internal/`
2. Put it in the ModelOpt-Internal repo

Contributor Author

@Edwardf0t1 Edwardf0t1 Apr 19, 2026


Thanks — went with option 2. The file is moved to Model-Optimizer-Internal MR !57 at agent/evaluation_guide.md, and the public copy + the "NEL CI and Cluster-Specific Notes" section in evaluation/SKILL.md are deleted in 290f432. Also took the opportunity to refresh it against current upstream NEL / NEL-CI (newer clusters, Sybil vs non-Sybil, nel-ci-cli as preferred trigger, balanced NEL vs NEL-CI framing).

@Edwardf0t1 Edwardf0t1 requested a review from cjluo-nv April 13, 2026 21:30
Contributor


Is this workflow really necessary? It feels like the existing skills and Claude Code already cover it.

Contributor


+1, we talked about this and decided not to add an e2e workflow, as I remember; the idea is that Claude knows to call multiple skills when the user says something like "quantize and deploy"

Contributor Author


You're right — deleted in 9cb309b. The skill descriptions already handle cross-skill routing ("quantize and deploy" chains ptq → deployment); an e2e prose doc was duplicating that in a form agents don't read.

Contributor Author


Acknowledged, sorry for re-introducing it despite the earlier discussion. Deleted in 9cb309b. The one useful insight ("carry PTQ patches forward to deploy/eval") is preserved as a one-liner in evaluation/SKILL.md.

```yaml
default_cluster: my-cluster
```

### Checkpoint and storage availability
Contributor


I didn't get this part: the compute node should have the same storage access as the login node. The only special case I have seen is dlcluster, where each node needs an IT path export to access user storage. This shouldn't be added publicly, and internally we should use our team's storage path to avoid it.

Contributor Author


Good catch — framing was wrong. Compute nodes on a given cluster do share login-node storage; the real issue is workstation paths aren't on the cluster at all. Reframed as "Staging checkpoints from your workstation" in 7824b24, and dropped the dlcluster-specific row per your note that it's an internal quirk that shouldn't ship publicly.

Comment thread .claude/skills/common/slurm-setup.md Outdated
```bash
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
```
Contributor

@mxinO mxinO Apr 14, 2026


This doesn't seem SLURM-specific. Considering the env setup, we should have a general setup guide to help users set the HF token, Docker login token, NGC login token, etc. Maybe we can create another env-setup.md to handle this in one place? But this sounds beyond a modelopt skill. cc @kaix-nv

Contributor Author


Agreed — split out into .claude/skills/common/credentials.md in 645a545, covering HF_TOKEN, NGC API key (Docker + enroot paths), and Docker Hub. slurm-setup.md keeps only a one-paragraph pyxis pointer at it. Same change also handles CodeRabbit's $oauthtoken-is-literal note and Copilot's append-don't-overwrite concern.

…valuator

Both are vendored verbatim from commit 01899f8 with SHA-pin provenance in
frontmatter. `launching-evals` covers run/monitor/debug/analyze flows for NEL
evaluations; `accessing-mlflow` covers MLflow run querying via mlflow-mcp.

These complement (do not duplicate) our existing `evaluation` skill, which
remains focused on config generation with ModelOpt-specific additions.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Resolves conflict in .claude/skills/ptq/SKILL.md by keeping both additions:
main's new "Post-quantization validation" subsection (with pointer to
references/checkpoint-validation.md), followed by our existing "Next steps"
cross-skill pointer.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
The script re-downloads launching-evals/ and accessing-mlflow/ from
NVIDIA-NeMo/Evaluator at a pinned SHA (DEFAULT_SHA in the script) and
re-applies our provenance frontmatter. Idempotent — running repeatedly
at the same SHA produces no diff.

Also fixes the re-sync path in both SKILL.md frontmatters to the actual
script location (.claude/scripts/).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
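The idempotency claim above rests on the provenance re-stamp not duplicating itself on repeated runs. A minimal sketch of that pattern, assuming a guard grep and an `upstream-sha:` frontmatter field (the field names and layout are illustrative, not the script's actual contents):

```shell
# Hedged sketch: re-apply provenance frontmatter idempotently.
# Field names (vendored-from, upstream-sha) are illustrative placeholders.
add_provenance() {
  local file="$1" sha="$2"
  # Guard: if the stamp is already present, a second run is a no-op.
  grep -q '^upstream-sha:' "$file" && return 0
  printf -- '---\nvendored-from: NVIDIA-NeMo/Evaluator\nupstream-sha: %s\n---\n%s\n' \
    "$sha" "$(cat "$file")" > "$file"
}

f="$(mktemp)"
echo '# SKILL' > "$f"
add_provenance "$f" 01899f8
add_provenance "$f" 01899f8   # second run at the same SHA: no diff
```

Running at the same SHA twice leaves exactly one stamp, which is what makes re-vendoring safe to run from CI or by hand.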
Reviewers on PR #1239 (kaix-nv, mxinO) flagged the e2e workflow doc as
unnecessary: the skill descriptions already route Claude to chain PTQ,
deployment, and evaluation skills, and the content duplicated
workspace-management.md or lived better inside the evaluation skill's
nel-ci-guide.md references.

Removes the file and its three cross-references (evaluation/SKILL.md,
ptq/SKILL.md, workspace-management.md). The "carry PTQ patches forward
to deploy/eval" insight is preserved as a one-liner in evaluation/SKILL.md.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Reviewer @shengliangxu flagged that the NEL CI evaluation guide contains
NVIDIA-internal infrastructure (JET clusters, svc-jet service account,
gitlab-master NEL CI triggers, COMPEVAL_HF_TOKEN, internal lustre paths)
and should not ship in the public repo.

The file has been moved to Model-Optimizer-Internal:agent/nel-ci-guide.md
(see internal MR: zhiyu/add-nel-ci-guide-to-agent). This commit removes
the public copy and the "NEL CI and Cluster-Specific Notes" section from
evaluation/SKILL.md that referenced it.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Two fixes:

1. Exclude vendored upstream skills from markdownlint.

   `.claude/skills/launching-evals/` and `.claude/skills/accessing-mlflow/`
   are vendored verbatim from NVIDIA-NeMo/Evaluator and re-synced via
   .claude/scripts/sync-upstream-skills.sh. Markdownlint wanted to
   reformat them (trailing blank lines, spacing around fences), but
   fixing would violate the "verbatim" property documented in their
   frontmatter. Add an `ignores:` glob to `.markdownlint-cli2.yaml`.

2. Reframe the checkpoint/storage note in `skills/common/remote-execution.md`.

   Reviewer @mxinO noted (PR #1239) that the previous "compute nodes may
   not share the same filesystem as login nodes" framing is misleading —
   compute nodes on a given cluster do share storage with the login
   node. The real issue is that workstation filesystems aren't mounted
   on the cluster at all. Also drops the dlcluster-specific row, which
   @mxinO flagged as an internal quirk that shouldn't ship publicly.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
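The `ignores:` addition described in point 1 might look like this in `.markdownlint-cli2.yaml` (a sketch — the file's other keys are omitted and the exact glob spelling is an assumption):

```yaml
# Sketch: skip vendored upstream skills that must stay byte-identical.
ignores:
  - ".claude/skills/launching-evals/**"
  - ".claude/skills/accessing-mlflow/**"
```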
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Addresses three overlapping review comments on slurm-setup.md:62
from PR #1239:

- @mxinO: NGC/HF/Docker tokens aren't SLURM-specific — wanted a
  general credential setup guide referenced from multiple skills.
- CodeRabbit: `$oauthtoken` needs to be called out as a literal NGC
  login string, not a shell variable to substitute.
- Copilot: the previous snippet overwrote `~/.config/enroot/.credentials`
  unconditionally, clobbering entries for other registries.

New `skills/common/credentials.md` covers HF_TOKEN, NGC API key (Docker
+ enroot paths), and Docker Hub. The NGC/enroot block uses an
append-if-missing pattern (`grep -q ... || echo ... >>`) and spells
out that `$oauthtoken` is a literal, kept unexpanded via single quotes.

`slurm-setup.md` now keeps only the pyxis-specific signpost — one
paragraph pointing at `credentials.md` for the actual setup.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
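The append-if-missing pattern described above can be sketched as follows; `<NGC_API_KEY>` stays a placeholder you substitute by hand, and the exact contents of credentials.md may differ:

```shell
# Hedged sketch of the enroot credential setup from credentials.md.
# CRED_FILE defaults to enroot's netrc-style credentials file.
CRED_FILE="${CRED_FILE:-$HOME/.config/enroot/.credentials}"
mkdir -p "$(dirname "$CRED_FILE")"
touch "$CRED_FILE"

# Append only if no nvcr.io entry exists yet, so entries for other
# registries are kept. Single quotes keep $oauthtoken unexpanded — it is
# a literal NGC login string, not a shell variable to substitute; only
# <NGC_API_KEY> gets replaced by hand.
grep -q '^machine nvcr\.io ' "$CRED_FILE" || \
  echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' >> "$CRED_FILE"
```

Re-running the snippet leaves the file unchanged, which is the clobbering fix Copilot asked for.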
PR #1252 (kaix-nv) landed with the joint trim text I proposed in the
coordination comment. The Monitoring block in evaluation/SKILL.md now
points at all three skills:

- `monitor` for durable cross-session job tracking
- `launching-evals` for one-shot status / debug / analyze
- `accessing-mlflow` for querying past runs

All merges auto-resolved (evaluation/SKILL.md, ptq/SKILL.md,
deployment/SKILL.md). Nothing from our branch was lost —
credentials.md, remote-execution.md reframe, markdownlint `ignores:`,
and the vendored skills all survived.

The Monitoring-trim item on our plan is now satisfied by #1252 rather
than a direct edit on this branch.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 changed the title [1/N] Polish evaluation skills and common skills based on an E2E workflow testing [1/N] Polish evaluation skills and common skills based on an E2E workflow testing, vendor two Claude skills from NeMo Evaluator Apr 19, 2026
Documents the new Claude Code evaluation-related skills and the shared
credentials.md common doc, mirroring the style of the existing PTQ skill
entry in the 0.44 release notes.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
- **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
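The "ask before launching" check above can be partially automated. A minimal sketch, assuming the conventional `models--<org>--<name>` layout of the HF hub cache (the model ids and paths are placeholders):

```shell
# Hedged sketch: test whether a model is already in a local HF hub cache
# before launching with HF_HUB_OFFLINE=1. Relies on the conventional
# hub/models--<org>--<name> directory layout.
hf_cached() {
  local hf_home="$1" model="$2"
  [ -d "$hf_home/hub/models--${model//\//--}" ]
}

demo_home="$(mktemp -d)"                      # stand-in for $HF_HOME
mkdir -p "$demo_home/hub/models--org--model-a"

hf_cached "$demo_home" org/model-a && echo "org/model-a: cached"
hf_cached "$demo_home" org/model-b || \
  echo "org/model-b: not cached — pre-download it on the login node first"
```

A missing cache entry is what surfaces later as vLLM's `LocalEntryNotFoundError`, so failing fast here is cheaper than failing on the compute node.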
Collaborator

Can you change the full lustre path to a mount path or dummy path?


# List available evaluation tasks (by default, only shows tasks from the latest released containers)
uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container gitlab-master.nvidia.com/dl/joc/competitive_evaluation/nvidia-core-evals/ci-llm/long-context-eval:dev-2025-12-16T14-37-1693de28-amd64
Collaborator

Remove internal gitlab containers and PPP

6 participants