From 4bf825329b176d08a2b3f6e97ec0e36624aa068e Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 02:35:06 -0700 Subject: [PATCH 01/16] Polish eval skills Signed-off-by: Zhiyu Cheng --- .claude/skills/common/remote-execution.md | 11 +++++++++++ .claude/skills/common/slurm-setup.md | 14 ++++++++++++++ .claude/skills/evaluation/SKILL.md | 13 +++++++++++++ 3 files changed, 38 insertions(+) diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md index 7c99a5c2a9..2e538fa466 100644 --- a/.claude/skills/common/remote-execution.md +++ b/.claude/skills/common/remote-execution.md @@ -28,6 +28,17 @@ clusters: default_cluster: my-cluster ``` +### Checkpoint and storage availability + +Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes: + +| Cluster type | Compute-node storage | NOT accessible from compute nodes | +|-------------|---------------------|----------------------------------| +| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts | +| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths | + +If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically. + See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types. 
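The storage-availability table above lends itself to a quick pre-flight check before submitting a job. A hedged sketch (cluster names and path prefixes are copied from the table; treat the mapping as illustrative, not an authoritative list):

```python
# Illustrative sketch of the compute-node storage table above.
# The prefixes are assumptions drawn from the table, not an exhaustive list.
COMPUTE_NODE_STORAGE = {
    "oci-hsg": ("/lustre/fsw/",),
    "cw": ("/lustre/fsw/",),
    "oci-nrt": ("/lustre/fsw/",),
    "dlcluster": ("/home/omniml_data_", "/home/scratch."),
}


def checkpoint_visible(cluster: str, path: str) -> bool:
    """True if `path` starts with a prefix that cluster's compute nodes can reach."""
    return any(path.startswith(p) for p in COMPUTE_NODE_STORAGE.get(cluster, ()))
```

If the check fails, copy the checkpoint to the target cluster's accessible storage before submitting, as described above.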
---
diff --git a/.claude/skills/common/slurm-setup.md b/.claude/skills/common/slurm-setup.md
index 37b9fbd56a..f26731d883 100644
--- a/.claude/skills/common/slurm-setup.md
+++ b/.claude/skills/common/slurm-setup.md
@@ -51,6 +51,20 @@ srun \
 "
 ```
+### Container registry credentials (pyxis)
+
+If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add if missing:
+
+```bash
+cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
+# To add NGC credentials:
+mkdir -p ~/.config/enroot
+echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
+chmod 600 ~/.config/enroot/.credentials
+```
+
+Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`.
+
 
 Submit and capture the job ID:
 
 ```bash
diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md
index f8eab5561b..714d9fa522 100644
--- a/.claude/skills/evaluation/SKILL.md
+++ b/.claude/skills/evaluation/SKILL.md
@@ -286,6 +286,19 @@ After job submission, you can monitor progress using:
 
 ---
 
+### NEL CI and Cluster-Specific Notes
+
+For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`.
It covers: +- NEL CI GitLab trigger pattern vs NEL SLURM executor +- Cluster-specific GPU counts and storage paths +- Checkpoint availability (compute nodes may not share login node filesystems) +- Environment variable prefixes (`host:`, `lit:`) for SLURM executor +- SGLang must bind `--host 0.0.0.0` for health checks +- Directory setup and `chmod 777` for JET service account access +- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting) + +--- + Direct users with issues to: - **GitHub Issues:** From 2e84f3ba52bb63361d7f8dacc8152887a3160b6a Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 02:37:18 -0700 Subject: [PATCH 02/16] Polish eval skills Signed-off-by: Zhiyu Cheng --- .../evaluation/references/nel-ci-guide.md | 189 ++++++++++++++++++ 1 file changed, 189 insertions(+) create mode 100644 .claude/skills/evaluation/references/nel-ci-guide.md diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md new file mode 100644 index 0000000000..771558f6cd --- /dev/null +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -0,0 +1,189 @@ +# NEL CI Evaluation Guide + +NEL CI is the recommended entry point for running evaluations on NVIDIA JET infrastructure. This guide covers patterns for evaluating quantized checkpoints using both the NEL SLURM executor (direct) and the NEL CI GitLab pipeline. + +Reference repo: `gitlab-master.nvidia.com/dl/JoC/competitive_evaluation/nemo-evaluator-launcher-ci` + +--- + +## 1. 
Two Execution Paths + +| Path | When to use | How it works | +|------|-------------|--------------| +| **NEL SLURM executor** | You have SSH access to the cluster, checkpoint is on cluster storage | `nel run --config config.yaml` from your workstation; NEL SSHes to cluster and submits sbatch jobs | +| **NEL CI GitLab pipeline** | You want managed infrastructure, MLflow export, reproducible configs | Trigger via GitLab API or UI; JET orchestrates everything | + +### NEL SLURM executor + +Best for iterative development and debugging. Run from any machine with SSH access to the cluster: + +```bash +export DUMMY_API_KEY=dummy +export HF_TOKEN= + +nel run --config eval_config.yaml \ + -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10 # test first +``` + +### NEL CI trigger + +Best for production evaluations with MLflow tracking. See the trigger script pattern in section 4. + +--- + +## 2. Cluster Reference + +| Cluster | GPUs/Node | Architecture | Max Walltime | Storage | Notes | +|---------|-----------|-------------|--------------|---------|-------| +| oci-hsg | 4 | GB200 | 4 hours | `/lustre/` | Set `tensor_parallel_size=4` | +| cw | 8 | H100 | — | `/lustre/` | — | +| oci-nrt | 8 | H100 | — | `/lustre/` | Numerics configs | +| dlcluster | 4 (B100 partition) | B100 | 8 hours | `/home/omniml_data_*` | No `/lustre/`; use local NFS paths | + +**Important**: `deployment.tensor_parallel_size` determines how many GPUs are requested. If this exceeds the cluster's GPUs per node, the job fails with a memory allocation error. + +--- + +## 3. Checkpoint Availability + +The checkpoint must be on a filesystem accessible from the cluster's **compute nodes** (not just login nodes). 
+ +| Cluster type | Accessible storage | NOT accessible | +|-------------|-------------------|----------------| +| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation paths (`/home/scratch.*`), NFS mounts from other clusters | +| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` (not available) | + +If the checkpoint is on a workstation, **copy it to cluster storage first**: + +```bash +rsync -av /path/to/local/checkpoint \ + :/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/ +``` + +For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes. + +--- + +## 4. NEL CI Trigger Pattern + +For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, breaking multi-flag commands. + +```bash +export GITLAB_TOKEN= + +curl -k --request POST \ + --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \ + --header "Content-Type: application/json" \ + --data '{ + "ref": "main", + "variables": [ + {"key": "NEL_CONFIG_PATH", "value": "configs/AA/minimax_m2_5_lbd_lax.yaml"}, + {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"}, + {"key": "NEL_CLUSTER", "value": "oci-hsg"}, + {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, + {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, + {"key": "NEL_TASKS", "value": "simple_evals.gpqa_diamond_aa_v3"}, + {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"}, + {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"}, + {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"}, + {"key": "NEL_VLLM_CACHE", "value": "/lustre/.../cache/vllm"}, + {"key": 
"NEL_CLUSTER_OUTPUT_DIR", "value": "/lustre/.../nv-eval-rundirs"} + ] + }' \ + "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline" +``` + +--- + +## 5. Environment Variables + +### SLURM executor format + +Env vars in NEL SLURM configs require explicit prefixes: + +| Prefix | Meaning | Example | +|--------|---------|---------| +| `host:VAR_NAME` | Read from the host environment where `nel run` is executed | `host:HF_TOKEN` | +| `lit:value` | Literal string value | `lit:dummy` | + +```yaml +evaluation: + env_vars: + DUMMY_API_KEY: host:DUMMY_API_KEY + HF_TOKEN: host:HF_TOKEN +``` + +### JET executor format + +JET configs reference JET secrets with `$SECRET_NAME`: + +```yaml +execution: + env_vars: + evaluation: + HF_TOKEN: $COMPEVAL_HF_TOKEN +``` + +### Gated datasets + +Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. Set it at the evaluation level or per-task: + +```yaml +evaluation: + env_vars: + HF_TOKEN: host:HF_TOKEN # SLURM executor + tasks: + - name: simple_evals.gpqa_diamond + env_vars: + HF_TOKEN: host:HF_TOKEN +``` + +--- + +## 6. Serving Framework Notes + +### vLLM + +- Binds to `0.0.0.0` by default — health checks work out of the box +- For NVFP4: `--quantization modelopt_fp4` +- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`) + +### SGLang + +- **Must include `--host 0.0.0.0`** — SGLang defaults to `127.0.0.1` which blocks health checks from the eval client +- Must include `--port 8000` to match NEL's expected port +- For NVFP4: `--quantization modelopt_fp4` + +--- + +## 7. 
Common Issues
+
+| Issue | Cause | Fix |
+|-------|-------|-----|
+| `401 Unauthorized` pulling eval container | NGC credentials not set on cluster | Set up `~/.config/enroot/.credentials` with NGC API key |
+| `PermissionError: /hf-cache/...` | HF cache dir not writable by svc-jet | Set `NEL_HF_HOME` to your own `chmod 777` directory |
+| Health check stuck at `000` | Server binding to localhost | Add `--host 0.0.0.0` to deployment command (SGLang) |
+| `Memory required by task is not available` | TP size exceeds GPUs/node | Set `tensor_parallel_size` to match cluster (4 for oci-hsg, dlcluster B100) |
+| TIMEOUT after eval completes | Walltime too short for eval + MLflow export | Set `execution.walltime=04:00:00` |
+| Gated dataset auth failure | `HF_TOKEN` not passed to eval container | Add `env_vars.HF_TOKEN` at evaluation or task level |
+| `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead |
+| Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first |
+| `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config |
+
+---
+
+## 8. Directory Setup for JET Clusters
+
+Before running evaluations on a JET cluster, create writable directories:
+
+```bash
+ssh <login-node>
+mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
+mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
+mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
+chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
+chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+```
+
+`chmod 777` is required because `svc-jet` (JET service account) runs containers and needs write access.
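One pitfall from the common-issues table earlier in this guide, `NEL_OTHER_OVERRIDES` splitting on spaces, can be caught before triggering a pipeline. A hedged sketch: the space-splitting behavior is as documented in this guide, but the helper itself is illustrative and not part of NEL:

```python
def overrides_are_safe(raw: str) -> bool:
    """Return False if space-splitting would mangle any override value.

    NEL_OTHER_OVERRIDES is parsed as space-separated key=value pairs, so a
    value that itself contains spaces (e.g. multi-flag extra_args) breaks
    apart into stray tokens that lack an '='.
    """
    return all("=" in token for token in raw.split())
```

Run this on the override string before POSTing; anything that fails belongs in `NEL_DEPLOYMENT_COMMAND` or a wrapper script instead.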
From e952bcdcefe6d45c928d3911dfa7a3e2a9517819 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 02:46:14 -0700 Subject: [PATCH 03/16] update Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 771558f6cd..5c71ed7144 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -126,12 +126,16 @@ execution: ### Gated datasets -Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. Set it at the evaluation level or per-task: +Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. + +**NEL CI (JET)**: Handled automatically — the `COMPEVAL_HF_TOKEN` JET secret is pre-configured by the eval platform team. No user action needed; you don't even need personal access to the gated dataset. + +**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at https://huggingface.co/datasets/Idavidrein/gpqa for GPQA). 
```yaml evaluation: env_vars: - HF_TOKEN: host:HF_TOKEN # SLURM executor + HF_TOKEN: host:HF_TOKEN # SLURM executor — reads from your local env tasks: - name: simple_evals.gpqa_diamond env_vars: From 2cb3b39cc5ddc2c0a3fea11af33e7b526603c790 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 13:55:38 -0700 Subject: [PATCH 04/16] Add end-to-end workflow doc and cross-skill references MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add common/end-to-end-workflow.md documenting the PTQ → Deploy → Eval pipeline, workspace continuity, unsupported model handling, NEL deployment.command pattern, and NEL CI vs SLURM executor decision table - Add cross-skill workspace flow to workspace-management.md - Add "Next steps" to ptq/SKILL.md pointing to deployment/evaluation - Add pipeline integration note to evaluation/SKILL.md Depends on PR #1236 (deployment/references/unsupported-models.md). Signed-off-by: Zhiyu Cheng --- .claude/skills/common/end-to-end-workflow.md | 70 +++++++++++++++++++ .claude/skills/common/workspace-management.md | 19 +++++ .claude/skills/evaluation/SKILL.md | 4 +- .claude/skills/ptq/SKILL.md | 2 + 4 files changed, 94 insertions(+), 1 deletion(-) create mode 100644 .claude/skills/common/end-to-end-workflow.md diff --git a/.claude/skills/common/end-to-end-workflow.md b/.claude/skills/common/end-to-end-workflow.md new file mode 100644 index 0000000000..1dae03c2e5 --- /dev/null +++ b/.claude/skills/common/end-to-end-workflow.md @@ -0,0 +1,70 @@ +# End-to-End Workflow: PTQ → Deploy → Eval + +This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy. 
+ +## Pipeline Overview + +```text +PTQ (quantize) → Deployment (serve) → Evaluation (benchmark) +───────────────── ────────────────── ──────────────────────── +hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET) + ↓ ↓ ↓ +NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores + (safetensors) (http://host:8000) (results.yml) +``` + +## Workspace Continuity + +All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside: + +```text +workspaces/model-name-format/ + output/ ← PTQ checkpoint (safetensors + config.json) + eval_results/ ← NEL evaluation artifacts (results.yml per task) + eval_config.yaml ← NEL config for evaluation + scripts/ ← Custom run scripts (if needed) + logs/ ← SLURM job logs +``` + +When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run: + +```bash +ls workspaces/ +``` + +## Unsupported Models + +Models not in the verified support matrices require extra work at each stage: + +| Stage | What can go wrong | Reference | +|-------|-------------------|-----------| +| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` | +| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` | +| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` | + +Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too. 
+ +## NEL Evaluation with Custom Deployments + +When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving: + +```yaml +deployment: + command: >- + pip install "transformers>=5.0.0.dev0" --pre -q && + sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py && + ${deployment.base_command} +``` + +This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`). + +## Decision: NEL SLURM Executor vs NEL CI (JET) + +| Factor | NEL SLURM executor | NEL CI (JET) | +|--------|-------------------|--------------| +| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs | +| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage | +| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets | +| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` | +| **MLflow export** | Manual setup | Automatic | +| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` | diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md index bd32916632..5d85e91186 100644 --- a/.claude/skills/common/workspace-management.md +++ b/.claude/skills/common/workspace-management.md @@ -92,6 +92,21 @@ rsync -a --quiet \ "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT//" ``` +## Cross-Skill Workspace Flow + +Workspaces carry over across the PTQ → Deploy → Eval pipeline. 
Each stage adds to the same directory: + +```text +workspaces/model-name-format/ + output/ ← PTQ: quantized checkpoint + eval_results/ ← Evaluation: NEL artifacts (results.yml per task) + eval_config.yaml ← Evaluation: NEL config + scripts/ ← Deployment/PTQ: custom run scripts + logs/ ← All: SLURM job logs +``` + +See `skills/common/end-to-end-workflow.md` for the full pipeline. + ## Example Flow ```text @@ -104,6 +119,10 @@ User: "deploy the model I just quantized" Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4" → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/ +User: "evaluate the quantized model on MMLU and GSM8K" +Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4" + → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/ + User: "now quantize Llama-3.1-8B with fp8" Agent: ls workspaces/ → no llama → mkdir workspaces/llama-3.1-8b-fp8 diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index 714d9fa522..5174e7befa 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -12,10 +12,12 @@ license: Apache-2.0 You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below. -### Workspace (multi-user / Slack bot) +### Workspace and Pipeline Integration If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications. +This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. 
See `skills/common/end-to-end-workflow.md` for the full pipeline. + ### Workflow ```text diff --git a/.claude/skills/ptq/SKILL.md b/.claude/skills/ptq/SKILL.md index 932f62ec2c..79074dbd6e 100644 --- a/.claude/skills/ptq/SKILL.md +++ b/.claude/skills/ptq/SKILL.md @@ -113,6 +113,8 @@ ls -lh / Report the path and size to the user. +**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over — see `skills/common/end-to-end-workflow.md` for the full PTQ → Deploy → Eval pipeline. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time. + ## Key API Rules - `mtq.register()` classes **must** define `_setup()` and call it from `__init__` From 1b94fc98b13054193d388d69b4ca6079ba7f3e64 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 17:55:40 -0700 Subject: [PATCH 05/16] fix format Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 5c71ed7144..208088ad2d 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -130,7 +130,7 @@ Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` **NEL CI (JET)**: Handled automatically — the `COMPEVAL_HF_TOKEN` JET secret is pre-configured by the eval platform team. No user action needed; you don't even need personal access to the gated dataset. -**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at https://huggingface.co/datasets/Idavidrein/gpqa for GPQA). 
+**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at <https://huggingface.co/datasets/Idavidrein/gpqa> for GPQA). ```yaml evaluation: env_vars: - HF_TOKEN: host:HF_TOKEN # SLURM executor — reads from your local env tasks: - name: simple_evals.gpqa_diamond env_vars: HF_TOKEN: host:HF_TOKEN From b1be817ac2130b0a8b9eaade6063c027adee208f Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 21:41:25 -0700 Subject: [PATCH 06/16] Add NEL CI learnings: wrapper script pattern, cross-cluster copy, Hydra escaping - Add wrapper script pattern for complex deployment commands that break Hydra's override parser (put serve.sh in checkpoint dir, reference as bash /checkpoint/serve.sh) - Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs - Add cross-cluster checkpoint copy via tar pipe - Add Hydra LexerNoViableAltException and Bad Request to common issues Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond, LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI. Signed-off-by: Zhiyu Cheng --- .../evaluation/references/nel-ci-guide.md | 86 ++++++++++++++++++- 1 file changed, 84 insertions(+), 2 deletions(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 208088ad2d..42cde09e23 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -60,13 +60,26 @@ rsync -av /path/to/local/checkpoint \ :/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/ ``` +**Cross-cluster copy** (e.g., dlcluster → oci-hsg): If the two clusters can't SSH to each other directly, pipe through your workstation without staging to disk: + +```bash +ssh user@source-cluster "tar czf - -C /path/to/checkpoint ."
| \ + ssh user@target-cluster "tar xzf - -C /lustre/.../checkpoints/model-name" +``` + +After copying, set permissions for svc-jet: `chmod -R 777 /lustre/.../checkpoints/model-name` + For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes. --- ## 4. NEL CI Trigger Pattern -For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, breaking multi-flag commands. +For JET clusters, trigger evaluations via the GitLab API. + +### Simple deployment (standard models) + +For models that work with stock vLLM/SGLang, use `NEL_DEPLOYMENT_COMMAND` directly: ```bash export GITLAB_TOKEN= @@ -82,7 +95,6 @@ curl -k --request POST \ {"key": "NEL_CLUSTER", "value": "oci-hsg"}, {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, - {"key": "NEL_TASKS", "value": "simple_evals.gpqa_diamond_aa_v3"}, {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"}, {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"}, {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"}, @@ -93,6 +105,74 @@ curl -k --request POST \ "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline" ``` +### Complex deployment (unsupported models needing runtime patches) + +If the model needs runtime patches (e.g., transformers upgrade, framework source fixes), **do NOT put multi-step commands in `NEL_DEPLOYMENT_COMMAND`** — Hydra's override parser will break on nested quotes, `&&`, `$()`, etc. 
+ +Instead, use the **wrapper script pattern**: place a `serve.sh` in the checkpoint directory on the cluster, then point `NEL_DEPLOYMENT_COMMAND` to it. + +**Step 1** — Write wrapper script to the checkpoint directory on the cluster: + +```bash +ssh 'cat > /lustre/.../checkpoint/serve.sh << '"'"'EOF'"'"' +#!/bin/bash +set -e +pip install "transformers>=5.0.0.dev0" "huggingface_hub>=0.32.0" --pre -q +# Patch vLLM for ministral3 support (example) +MISTRAL3_PY=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1) +sed -i "s/old_pattern/new_pattern/" "$MISTRAL3_PY" +exec vllm serve /checkpoint --host 0.0.0.0 --port 8000 \ + --tensor-parallel-size 4 --quantization modelopt_fp4 \ + --trust-remote-code --served-model-name my-model --gpu-memory-utilization 0.9 +EOF +chmod 777 /lustre/.../checkpoint/serve.sh' +``` + +**Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper: + +``` +{"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"} +``` + +This works because the checkpoint is mounted at `/checkpoint` inside the container. The script is Hydra-safe (no special characters in the override value). + +### Custom configs with `NEL_CONFIG_BASE64` + +When using a custom config (not from the repo), use `NEL_CONFIG_BASE64` instead of `NEL_CONFIG_PATH`. 
This requires setting `NEL_UNTRUSTED_EVAL=true`: + +```python +import json, base64, subprocess, os + +with open("my_config.yaml") as f: + config_b64 = base64.b64encode(f.read().encode()).decode() + +payload = { + "ref": "main", + "variables": [ + {"key": "NEL_CONFIG_BASE64", "value": config_b64}, + {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"}, + {"key": "NEL_CLUSTER", "value": "oci-hsg"}, + {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, + {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, + {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}, + {"key": "NEL_UNTRUSTED_EVAL", "value": "true"}, + # ... other variables + ] +} + +# Use Python to construct JSON (avoids shell escaping issues with curl) +token = os.environ["GITLAB_TOKEN"] +subprocess.run( + ["curl", "-k", "--request", "POST", + "--header", f"PRIVATE-TOKEN: {token}", + "--header", "Content-Type: application/json", + "--data", json.dumps(payload), + "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"], +) +``` + +> **Tip**: Use Python (not bash) to construct the JSON payload for `curl`. Shell escaping of base64 strings and nested quotes is error-prone. + --- ## 5. 
Environment Variables @@ -173,6 +253,8 @@ evaluation: | `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead | | Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first | | `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | +| `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` | +| `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation | --- From 7dcede44f3677c8689bba9750ba43a5933ba5d68 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 21:44:14 -0700 Subject: [PATCH 07/16] fix format Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 42cde09e23..1caec64627 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -130,7 +130,7 @@ chmod 777 /lustre/.../checkpoint/serve.sh' **Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper: -``` +```json {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"} ``` From 8176fc7089685f503eb2f32a5c686bc618de5362 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 22:32:52 -0700 Subject: [PATCH 08/16] Add served_model_name mismatch to NEL CI common issues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 
When using NEL_DEPLOYMENT_COMMAND with a custom --served-model-name, deployment.served_model_name must also be overridden via NEL_OTHER_OVERRIDES — NEL uses the config field (not the actual serve command) to set the eval client's model_id. Without this, the client sends the checkpoint path as model_id, causing 404 errors. Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 1 + 1 file changed, 1 insertion(+) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 1caec64627..846d0236c8 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -255,6 +255,7 @@ evaluation: | `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | | `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` | | `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation | +| `The model does not exist` (404) | Eval client uses checkpoint path as model_id instead of served_model_name | Add `deployment.served_model_name=` to `NEL_OTHER_OVERRIDES` to match `--served-model-name` in your serve command | --- From b0748dd9d82092210dc08ffaf9a3d328dd74ccbc Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 18:23:17 -0700 Subject: [PATCH 09/16] Vendor launching-evals and accessing-mlflow skills from NVIDIA-NeMo/Evaluator Both are vendored verbatim from commit 01899f8 with SHA-pin provenance in frontmatter. 
`launching-evals` covers run/monitor/debug/analyze flows for NEL evaluations; `accessing-mlflow` covers MLflow run querying via mlflow-mcp. These complement (do not duplicate) our existing `evaluation` skill, which remains focused on config generation with ModelOpt-specific additions. Signed-off-by: Zhiyu Cheng --- .claude/skills/accessing-mlflow/SKILL.md | 104 ++++++++++ .claude/skills/launching-evals/SKILL.md | 69 +++++++ .../references/analyze-results.md | 57 ++++++ .../benchmarks/swebench-general-info.md | 188 ++++++++++++++++++ .../benchmarks/terminal-bench-general-info.md | 122 ++++++++++++ .../terminal-bench-trace-analysis.md | 145 ++++++++++++++ .../references/check-progress.md | 24 +++ .../references/debug-failed-runs.md | 130 ++++++++++++ .../references/run-evaluation.md | 26 +++ .claude/skills/launching-evals/tests.json | 46 +++++ 10 files changed, 911 insertions(+) create mode 100644 .claude/skills/accessing-mlflow/SKILL.md create mode 100644 .claude/skills/launching-evals/SKILL.md create mode 100644 .claude/skills/launching-evals/references/analyze-results.md create mode 100644 .claude/skills/launching-evals/references/benchmarks/swebench-general-info.md create mode 100644 .claude/skills/launching-evals/references/benchmarks/terminal-bench-general-info.md create mode 100644 .claude/skills/launching-evals/references/benchmarks/terminal-bench-trace-analysis.md create mode 100644 .claude/skills/launching-evals/references/check-progress.md create mode 100644 .claude/skills/launching-evals/references/debug-failed-runs.md create mode 100644 .claude/skills/launching-evals/references/run-evaluation.md create mode 100644 .claude/skills/launching-evals/tests.json diff --git a/.claude/skills/accessing-mlflow/SKILL.md b/.claude/skills/accessing-mlflow/SKILL.md new file mode 100644 index 0000000000..337a027bd9 --- /dev/null +++ b/.claude/skills/accessing-mlflow/SKILL.md @@ -0,0 +1,104 @@ +--- +name: accessing-mlflow +description: Query and browse evaluation results 
stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
+license: Apache-2.0
+# Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8)
+# https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/accessing-mlflow
+# To re-sync: scripts/sync-upstream-skills.sh
+# Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp)
+# configured in the user's Claude Code setup.
+---
+
+# Accessing MLflow
+
+## MCP Server
+
+[mlflow-mcp](https://github.com/kkruglik/mlflow-mcp) gives agents direct access to MLflow — query runs, compare metrics, browse artifacts, all through natural language.
+
+## ID Convention
+
+When the user provides a hex ID (e.g. `71f3f3199ea5e1f0`) without specifying what it is, assume it is an **invocation_id** (not an MLflow run_id). An invocation_id identifies a launcher invocation and is stored as both a tag and a param on MLflow runs. One invocation can produce multiple MLflow runs (one per task). You may need to search across multiple experiments if you don't know which experiment the run belongs to.
+
+## Querying Runs
+
+```python
+# Find runs by invocation_id
+MLflow:search_runs_by_tags(experiment_id, {"invocation_id": "<invocation_id>"})
+
+# Query for example model/task runs
+MLflow:query_runs(experiment_id, "tags.model LIKE '%<model_name>%'")
+MLflow:query_runs(experiment_id, "tags.task_name LIKE '%<task_name>%'")
+
+# Get a config from run's artifacts
+MLflow:get_artifact_content(run_id, "config.yml")
+
+# Get nested stats from run's artifacts
+MLflow:get_artifact_content(run_id, "artifacts/eval_factory_metrics.json")
+```
+
+NOTE: You WILL NOT find PENDING, RUNNING, KILLED, or FAILED runs in MLflow!
Only SUCCESSFUL runs are exported to MLflow. + +## Workflow Tips + +When comparing metrics across runs, fetch the data via MCP, then run the computation in Python for exact results rather than doing math in-context: + +```bash +uv run --with pandas python3 << 'EOF' +import pandas as pd +# ... compute deltas, averages, etc. +EOF +``` + +## Artifacts Structure + +``` +./ +├── artifacts/ +│ ├── config.yml # Fully resolved config used during the evaluation +│ ├── launcher_unresolved_config.yaml # Unresolved config passed to the launcher +│ ├── results.yml # All results in YAML format +│ ├── eval_factory_metrics.json # Runtime stats (latency, tokens count, memory) +│ ├── report.html # Request-Response Pairs samples in HTML format (if enabled) +│ └── report.json # Request-Response Pairs samples in JSON format (if enabled) +└── logs/ + ├── client-*.log # Evaluation client + ├── server-*-N.log # Deployment per node + ├── slurm-*.log # Slurm job + └── proxy-*.log # Request proxy +``` + +## Troubleshooting + +If the MLflow MCP server fails to load or its tools are unavailable: + +1. **`uvx` not found** — install [uv](https://docs.astral.sh/uv/getting-started/installation/): + ```bash + curl -LsSf https://astral.sh/uv/install.sh | sh + ``` +2. 
**MCP server not configured** — add the config and restart the agent:
+
+   **For Claude Code** — add to `.claude/settings.json` (project or user level), under `"mcpServers"`:
+   ```json
+   "MLflow": {
+     "command": "uvx",
+     "args": ["mlflow-mcp"],
+     "env": {
+       "MLFLOW_TRACKING_URI": "https://<mlflow-host>/"
+     }
+   }
+   ```
+
+   **For Cursor** — edit `~/.cursor/mcp.json` (Settings > Tools & MCP > New MCP Server):
+   ```json
+   {
+     "mcpServers": {
+       "MLflow": {
+         "command": "uvx",
+         "args": ["mlflow-mcp"],
+         "env": {
+           "MLFLOW_TRACKING_URI": "https://<mlflow-host>/"
+         }
+       }
+     }
+   }
+   ```
diff --git a/.claude/skills/launching-evals/SKILL.md b/.claude/skills/launching-evals/SKILL.md
new file mode 100644
index 0000000000..34ef50bdd5
--- /dev/null
+++ b/.claude/skills/launching-evals/SKILL.md
@@ -0,0 +1,69 @@
+---
+name: launching-evals
+description: Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.
+license: Apache-2.0
+# Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8)
+# https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/launching-evals
+# To re-sync: scripts/sync-upstream-skills.sh
+---
+
+# NeMo Evaluator Skill
+
+## Quick Reference
+
+### nemo-evaluator-launcher CLI
+
+```bash
+# Run evaluation
+uv run nemo-evaluator-launcher run --config <config>
+uv run nemo-evaluator-launcher run --config <config> -t <task>
+uv run nemo-evaluator-launcher run --config <config> -t <task> -t <task> ...
+uv run nemo-evaluator-launcher run --config <config> -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 ...
+
+# Preview the resolved config and the sbatch script without running the evaluation
+uv run nemo-evaluator-launcher run --config <config> --dry-run
+
+# Check status (--json for machine-readable output)
+uv run nemo-evaluator-launcher status <invocation_id> --json
+
+# Get evaluation run info (output paths, slurm job IDs, cluster hostname, etc.)
+uv run nemo-evaluator-launcher info <invocation_id>
+
+# Copy just the logs (quick — good for debugging)
+uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/
+
+# For artifacts: use `nel info` to discover paths. If remote, SSH to explore and rsync what you need.
+# If local, just read directly from the paths shown by `nel info`.
+# ssh <user>@<host> "ls <remote_rundir>/"
+# rsync -avzP <user>@<host>:<remote_rundir>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>/artifacts/
+
+# List past runs
+uv run nemo-evaluator-launcher ls runs --since 1d
+
+# List available evaluation tasks (by default, only shows tasks from the latest released containers)
+uv run nemo-evaluator-launcher ls tasks
+uv run nemo-evaluator-launcher ls tasks --from_container gitlab-master.nvidia.com/dl/joc/competitive_evaluation/nvidia-core-evals/ci-llm/long-context-eval:dev-2025-12-16T14-37-1693de28-amd64
+```
+
+## Workflow
+
+The complete evaluation workflow is divided into the following steps you should follow IN ORDER.
+
+1. Create or modify a config using the `nel-assistant` skill. If the user provides a past run, use its `config.yml` artifact as a starting point.
+2. Run the evaluation. See `references/run-evaluation.md` when executing this step.
+3. Check progress (while RUNNING). See `references/check-progress.md` when executing this step.
+4. Post-run actions (when terminal state reached):
+   1. When the evaluation status is `SUCCESS`, analyze the results. See `references/analyze-results.md` when executing this step.
+   2.
When the evaluation status is `FAILED`, debug the failed run. See `references/debug-failed-runs.md` when executing this step.
+
+# Key Facts
+
+- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`
+- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
+- **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
+- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub` then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<user>/cache/huggingface hf download <model_id>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
+- **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
+- **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
+- **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`.
When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
+- **Do NOT use `nemo-evaluator-launcher export --dest local`** — it only writes a summary JSON (`processed_results.json`), it does NOT copy actual logs or artifacts despite accepting `--copy_logs` and `--copy-artifacts` flags. `nel info --copy-artifacts` works but copies everything (very slow for large benchmarks). Preferred approach: use `nel info` to discover paths — if local, read directly; if remote, SSH to explore and rsync only what you need. Note that `nel info` prints standard artifacts but benchmarks produce additional artifacts in subdirs — explore to find them.
+
diff --git a/.claude/skills/launching-evals/references/analyze-results.md b/.claude/skills/launching-evals/references/analyze-results.md
new file mode 100644
index 0000000000..fd49d40046
--- /dev/null
+++ b/.claude/skills/launching-evals/references/analyze-results.md
@@ -0,0 +1,57 @@
+# Analyze the results
+
+Copy this checklist and track your progress:
+
+```
+Analysis progress:
+- [ ] Step 1: Gather information
+- [ ] Step 2: Scan logs for runtime problems (per run)
+- [ ] Step 3: Validate config and methodology (per run)
+- [ ] Step 4: Report findings
+```
+
+Steps 2-3 are executed for EACH run separately.
+
+## Step 1: Gather information
+
+**IMPORTANT**: Copy what you need (and only what you need) locally BEFORE analysis — each SSH command requires user approval, so remote one-by-one reads are disruptive, and copying too much is slow.
+
+- Get one or more successful invocation IDs to analyze from the user. You might already have the invocation ID in your memory from the previous step.
+- Get paths: `uv run nemo-evaluator-launcher info <invocation_id>`
+- If artifacts are local, read them directly from the paths shown by `nel info`.
+- If artifacts are remote:
+  - Copy logs: `uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/`
+  - Rsync analysis-relevant artifacts: `rsync -avzP <user>@<host>:<remote_rundir>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>/artifacts/`
+- For MLflow access, see the `accessing-mlflow` skill.
+- Read benchmark-specific analysis notes from `references/benchmarks/` if available for the evaluated benchmarks.
+  - For Terminal Bench agent trace analysis, follow the procedure in `references/benchmarks/terminal-bench-trace-analysis.md`.
+
+## Step 2: Scan logs for runtime problems
+
+Access logs from locally copied files (`./evaluation-results/<invocation_id>/logs/`). Do NOT read logs via SSH — use the local copies from Step 1.
+
+Check logs for silent errors that may invalidate results:
+
+1. **Tool calling failures**: Search `client-*.log` for "failed" tests, `server-*.log` for "invalid tool call"
+2. **Unfinished reasoning**: Check `server-*.log` for `finish_reason: length`, or truncation warnings in `client-*.log`
+3. **API errors**: HTTP status != 200 in `client-*.log`, trace to `server-*.log` or `proxy-*.log`
+4. **Config mismatches**: Compare `config.yml` params with actual values in `server-*.log` startup and `client-*.log` command
+5. **Performance anomalies**: Low throughput, 0% prefix cache hit rate in `server-*.log`
+6. **Cached responses**: Count "Returning cached response" in `client-*.log`
+7. **KV cache preemptions**: Search `server-*.log` for `PreemptionMode.RECOMPUTE`. If found, consider increasing `tensor_parallel_size` (even at the cost of `data_parallel_size`) to relieve KV cache memory pressure.
+
+## Step 3: Validate config and methodology
+
+1. **Methodology consistency**: Verify same benchmark versions, prompt templates, sampling params, and infrastructure across all models. Flag discrepancies.
+2. **HF model card compliance**: Read the model's HuggingFace model card.
Flag any deviations in inference parameters (temperature, top_p, max_new_tokens, deployment args, reasoning flags, etc.). +3. **Reasoning model validation**: Verify temp > 0, top_p > 0, `max_tokens` = null (allow full output length). + NOTE: `use_reasoning: False` in adapter_config does NOT mean reasoning is disabled — it only controls the reasoning interceptor. Whether reasoning is active depends on the model's own controls (deployment args, system prompt, API payload fields, etc.). +4. **Non-reasoning model validation**: Verify `max_tokens` = 16k +5. **Max model length**: Verify `max-model-len` = 131072 (leaderboard-recommended). Long context benchmarks (AA LCR, RULER) and agentic benchmarks may require a longer `max-model-len`. +6. **RULER tasks**: Check thinking disabled, walltime=4h, rope-scaling for Qwen models +7. **AA baseline comparison**: Compare results against Artificial Analysis published scores. Exact match not expected — flag significant deviations. + +## Step 4: Report findings + +Present key metrics from `results.yml` in a table and summarize the metrics from `eval_factory_metrics.json` in a concise manner (include only the most important metrics or anomalies). If multiple runs, include side-by-side comparison of metrics (e.g. accuracy, latency, tokens count, memory). Summarize any issues found. Recommend improvements if applicable. + diff --git a/.claude/skills/launching-evals/references/benchmarks/swebench-general-info.md b/.claude/skills/launching-evals/references/benchmarks/swebench-general-info.md new file mode 100644 index 0000000000..03a423940c --- /dev/null +++ b/.claude/skills/launching-evals/references/benchmarks/swebench-general-info.md @@ -0,0 +1,188 @@ +# SWE-bench + +SWE-bench uses the OpenHands harness. 
+
+## TL;DR
+
+If you only need the run score:
+
+- `artifacts/results.yml`
+- `artifacts/.../swebench_summary.json`
+
+If you need the official per-instance eval result:
+
+- `artifacts/.../output.report.json`
+
+If you need per-instance token usage or rich debug data:
+
+- `artifacts/.../output.jsonl`
+
+If you need the quickest failure triage:
+
+- `artifacts/.../output_errors.jsonl`
+- `artifacts/.../logs/instance_<instance_id>.log`
+
+## Retries, Attempts, Resume
+
+- `max_retries`
+  Inner retry loop inside one attempt. It is used when running an instance throws an exception, for example sandbox/runtime startup failures, conversation crashes, tunnel problems, polling errors, or other hard execution errors. It is not triggered just because the produced patch was bad or the instance scored unresolved; those are handled by the critic and outer attempts. Each retry creates a fresh workspace/runtime rather than continuing the failed environment. Total executions per attempt are `max_retries + 1`.
+
+- `max_attempts`
+  Outer iterative attempts. Attempt 1 runs instances not yet completed in `output.critic_attempt_1.jsonl`. Attempt `N>1` only runs instances that the critic judged failed in attempt `N-1`.
+
+- `critic`
+  Controls whether another outer attempt is scheduled. The critic evaluates the conversation history plus the produced git patch.
+
+- Default behavior
+  SWE-bench here defaults to `PassCritic`, so `max_attempts` is mostly inert unless you switch critics. In practice, most rerun behavior comes from `max_retries`, not `max_attempts`.
+
+- Resume
+  When rerunning into the same output dir, the harness reads existing `output.critic_attempt_N.jsonl` files and uses them as its source of truth. If an instance already has a non-error row for that attempt, it is skipped. If it only has an error row, it is treated as unfinished and is run again. This resume behavior does not depend on the critic.
+ +- Context-window errors + `ContextWindowExceed` is treated as non-recoverable inside the inner retry loop, so the remaining inner retries are skipped immediately. That only answers the inner `max_retries` question. The instance can still run again later if you rerun/resume into the same output dir, because hard-error rows are treated as unfinished even with `PassCritic`. In short: + inner retry = exception handling inside one attempt; + outer attempt = critic says previous output failed; + resume rerun = this attempt only has an error row so far. + This can also produce more raw request-level `400`s in metrics than final hard-failed instances, because a run can hit one `400` and still finish as a soft `status=stuck` case with a partial patch. + +## What Matters + +- `swebench_summary.json` + Single-run summary: `submitted_instances`, `resolved_instances`, `accuracy`. + +- `output.report.json` + Official eval output. Top-level keys: `dataset`, `evaluation_method`, `model_name_or_path`, `resolved`, `resolved_count`, `results`, `total_instances`. Each `results` row has `instance_id`, `resolved`, `error`, `exit_code`. + +- `output.jsonl` + One final JSON row per benchmark instance. In the inspected run, rows included `instance_id`, `error`, `attempt`, `metrics`, `runtime_runs`, `test_result`, `instruction`, and the full `instance` payload. For verification, the useful part is `metrics.accumulated_token_usage.completion_tokens`. + +- `output_errors.jsonl` + Same row shape as `output.jsonl`, but only for hard failures. Read this first when debugging a bad run. + +- `output.swebench.jsonl` + Minimal prediction file for official SWE-bench eval. Fields: `instance_id`, `model_name_or_path`, `model_patch`. + +- `metadata.json` + Run setup/config snapshot. Includes `dataset`, `dataset_split`, `max_iterations`, `conversation_timeout`, `max_attempts`, `max_retries`, `skip_failed_samples`, `workspace_type`, `llm`, `sandbox_config`, `prompt_path`, and `eval_output_dir`. 
+
+- `agent_logs/.../run.json`
+  Compact run summary. Fields: `run_id`, `status`, `duration`, `tasks`, `config`, `metadata`, timestamps.
+
+- `agent_logs/.../tasks.jsonl`
+  Attempt-level task records. Fields: `task_id`, `attempt_id`, `status`, `reward`, `duration`, `termination`, `error`, `trajectory`, `artifacts`, timestamps. `trajectory.usage` has aggregated `prompt_tokens`, `completion_tokens`, `reasoning_tokens`, `content_tokens`.
+
+- `logs/instance_<instance_id>.log`
+  Best per-instance raw text log: sandbox startup, repo/setup steps, tool calls, agent/server messages, and failure traces.
+
+## Live Progress
+
+During a running evaluation, the official result files (`output.report.json`, `swebench_summary.json`, `results.yml`) do not exist yet. Use `tasks.jsonl` for live progress — it is written incrementally as each instance finishes its agent conversation.
+
+### Restart-safe progress tracking
+
+`tasks.jsonl` is **append-only**. When a run is restarted (e.g. after SLURM wall-time kill), errored instances are retried and new entries are appended. The same `task_id` can appear multiple times. Raw line counts will exceed 500 for a 500-task benchmark.
+
+**Always deduplicate by `task_id`** (last entry wins) to get accurate progress. Use the script below for both single-run and multi-restart scenarios.
+ +There are two sources of truth for progress, each useful for different things: + +| File | Best for | Notes | +|------|----------|-------| +| `tasks.jsonl` | Live progress with rich detail (status, duration, termination reason) | Append-only, needs dedup by `task_id` | +| `output.critic_attempt_1.jsonl` | What the harness considers "done" for resume | Instance with non-error row = skipped on next restart; error row = retried | + +**Quick status count** (run from the cluster where the job is running): + +```bash +# Replace TASKS_JSONL with the actual path: +# artifacts/.../agent_logs/.../tasks.jsonl +# +# Deduplicates by task_id (last entry wins), so this works correctly +# even after multiple restarts where tasks.jsonl has >500 lines. +python3 -c " +import json, collections, sys +latest = {} +for line in open(sys.argv[1]): + line = line.strip() + if not line: continue + rec = json.loads(line) + tid = rec.get('task_id', 'unknown') + latest[tid] = rec.get('status', 'unknown') +counts = collections.Counter(latest.values()) +total = len(latest) +for s, c in sorted(counts.items()): print(f' {s}: {c}') +print(f' TOTAL unique: {total}/500') +remaining = 500 - total +print(f' REMAINING: {remaining}') +" TASKS_JSONL +``` + +Expected output while running (even after restarts): +``` + error: 3 + success: 120 + TOTAL unique: 123/500 + REMAINING: 377 +``` + +Note: `success` here means the instance was resolved; `error` means a hard runtime failure (context window exceeded, timeout, etc.); `failure` means an evaluable patch was produced but did not resolve the instance. During a run, `failure` counts only appear after the official SWE-bench eval step rewrites `tasks.jsonl`, so mid-run you mostly see `success` and `error`. + +After a restart, previously-errored instances that now succeed will show as `success` (the latest entry overwrites the old `error` entry in the deduplication). 
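The skip-vs-retry rule that `output.critic_attempt_1.jsonl` drives on restart can be sketched as follows. This is an illustration only, not harness code: the helper name is invented, and it assumes each row carries `instance_id` and `error` fields as described above.

```python
import json

def restart_plan(critic_rows, all_instance_ids):
    # Keep only the latest row per instance, mirroring the last-entry-wins
    # dedup rule used for tasks.jsonl.
    latest = {}
    for line in critic_rows:  # e.g. open(".../output.critic_attempt_1.jsonl")
        line = line.strip()
        if line:
            rec = json.loads(line)
            latest[rec["instance_id"]] = rec
    # Non-error row => considered done, skipped on restart.
    # Error row, or no row at all => treated as unfinished and run again.
    skip = [i for i in all_instance_ids if i in latest and not latest[i].get("error")]
    retry = [i for i in all_instance_ids if i not in skip]
    return skip, retry
```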
+**What NOT to use:**
+- `Progress: N/T evaluated` in client logs — only emitted at the very end, not useful for in-flight monitoring.
+- Raw line count of `tasks.jsonl` — will exceed 500 after restarts due to append-only behavior.
+- `output.critic_attempt_1.jsonl` for progress display — also append-only with duplicates, and has less detail (no `status`/`termination`/`duration`). However, it is the file the harness reads to decide what to skip vs retry on restart.
+
+## Instance IDs
+
+- Format
+  SWE-bench instance IDs are dataset-defined and use `<org>__<repo>-<number>`, for example `django__django-11333`.
+
+- Meaning
+  `django__django` corresponds to repo `django/django`. The trailing number is the benchmark instance number within that repo, not a retry/run suffix added by our harness.
+
+- Canonical key
+  The harness loads `row["instance_id"]` directly from the dataset and uses the full string as the canonical task key for inference and evaluation metadata lookup.
+
+- Practical implication
+  `django__django-11333` and `django__django-16116` are different SWE-bench tasks from the same repo. They can differ in `problem_statement`, `base_commit`, `test_patch`, and expected test outcomes (`FAIL_TO_PASS`, `PASS_TO_PASS`).
+
+- What is `test_patch`?
+  Dataset-provided test-only patch used during evaluation, not scoring input from the model. In `eval_infer.py`, the harness loads `meta["test_patch"]`, applies the model patch first, then applies `test_patch`, then runs the benchmark test script. The prompt template does not include `test_patch`; it only includes `problem_statement` and tells the agent not to modify tests. Practical meaning: the model is expected to change non-test source files, while benchmark-owned test updates/scaffolding are applied afterward during evaluation.
+
+## Failure Modes
+
+SWE-bench does not have a clean TB-style `failure_mode` enum.
Also, conversation termination is not the same thing as the final per-instance outcome: SWE-bench can still collect and evaluate a partial patch after `status=stuck` or even some `status=error` terminations, so an instance can still end up officially resolved.
+
+Where to look:
+
+- `tasks.jsonl`
+  Best lightweight source for final per-instance status and termination reason.
+  Use top-level `status` for the final per-instance outcome (`success` / `failure` / `error`).
+  Use `termination.reason` for how the conversation ended (`finish_tool`, `finished_no_finish_tool`, `status=error`, `status=stuck`, etc.).
+- `output_errors.jsonl`
+  Best source for concrete hard-failure messages.
+- `output.report.json`
+  Best source for official `resolved` / `unresolved`, but its `error` field is not a reliable failure reason.
+
+What to expect:
+
+- `status=success`
+  In top-level `tasks.jsonl.status`, this means the instance was resolved in the final official SWE-bench evaluation. This is assigned after evaluation rewrites `tasks.jsonl`, not merely because the agent called `finish` or the run ended cleanly.
+  Separate note: `run.json` can also say run-level `status=success`, but that only means the overall evaluation process finished cleanly.
+- `status=failure`
+  In top-level `tasks.jsonl.status`, the attempt produced something evaluable, but the final official SWE-bench evaluation did not mark the instance as resolved.
+- `status=error`
+  In top-level `tasks.jsonl.status`, this means a hard runtime failure. This is where agent timeout and similar non-soft errors land.
+  Typical examples:
+  `Run timed out after <N> seconds`; `Remote conversation ended with error`; `Remote conversation not found (404). The runtime may have been deleted.`; `Polling failed with HTTP <status_code>`; `LLMContextWindowExceededError` / `ContextWindowExceededError`.
+ Exception: `MaxIterationsReached` still uses conversation execution status `error`, but OpenHands treats that specific error code as a normal stop and SWE-bench continues with patch collection/eval. + In the inspected Nemotron-Super run, all 5 such cases were context-window exceeded after retries. +- `termination.reason = status=stuck` + This is a conversation end state, not a final per-instance status. Check it in `tasks.jsonl.termination.reason`. + It means OpenHands stopped the conversation after detecting a no-progress pattern after the last user message. + Default triggers: + 4 repeated identical action + observation pairs; 3 repeated identical action + error pairs; 3 consecutive agent-only messages; 6-step alternating repeated action/observation pattern. + After that, SWE-bench may still collect a patch and later mark the instance as top-level `status=success` or `status=failure`. diff --git a/.claude/skills/launching-evals/references/benchmarks/terminal-bench-general-info.md b/.claude/skills/launching-evals/references/benchmarks/terminal-bench-general-info.md new file mode 100644 index 0000000000..b1735e623c --- /dev/null +++ b/.claude/skills/launching-evals/references/benchmarks/terminal-bench-general-info.md @@ -0,0 +1,122 @@ +# Terminal Bench + +Terminal Bench is an agentic benchmark where models interact with a terminal environment to solve tasks. + +## Key files + +- `terminal_bench/agents/terminus_2/terminus_2.py` — main agent implementation +- `terminal_bench/agents/failure_mode.py` — failure mode definitions +- `terminal_bench/harness/harness.py` — harness and result aggregation +- `core_evals/nvidia_terminal_bench/framework.yml` — default config values + +### Key Facts + +- **Task-first ordering**: `task1.1-of-N, task1.2-of-N, ..., task2.1-of-N, ...` — mid-run results are biased toward early tasks. 
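The task-first ordering above can be sketched as a trial-name generator (a hypothetical helper for illustration; the harness does not expose this function):

```python
def trial_order(task_ids, n_samples):
    # All trials of task 1, then all trials of task 2, and so on.
    # A mid-run accuracy therefore reflects mostly the early tasks.
    return [
        f"{task}.{trial}-of-{n_samples}"
        for task in task_ids
        for trial in range(1, n_samples + 1)
    ]
```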
+
+## Failure Modes
+
+All failure modes (see `failure_mode.py`):
+- `UNSET` — no failure mode triggered (task ran to completion)
+- `NONE` — explicitly set: no failure (task solved)
+- `UNSOLVED` — task not completed within constraints
+- `TOKEN_LIMIT_EXCEEDED` — agent hit `max_input_tokens_per_task` (cumulative input tokens across all turns). Shows as `outcome: token_limit_exceeded` in `task_status.json`.
+- `PARSE_ERROR` — harness couldn't parse the **test output** (`post-test.txt`), e.g. pytest output missing `short test summary info`
+- `FATAL_LLM_PARSE_ERROR` — unrecoverable LLM/agent response parse error
+- `CONTEXT_LENGTH_EXCEEDED` — input exceeded model's context window (see [Context Recovery](#context-recovery))
+- `OUTPUT_LENGTH_EXCEEDED` — response truncated by `max_completion_tokens`; agent retries; recorded when all retries exhausted. Shows as `finish_reason: length` in `eval_factory_metrics.json`.
+- `TEST_TIMEOUT` — test verification timed out
+- `AGENT_TIMEOUT` — agent execution timed out (see [Mitigating Agent Timeouts](#mitigating-agent-timeouts))
+- `UNKNOWN_AGENT_ERROR` — unexpected agent error (stops eval on default policy)
+- `AGENT_INSTALLATION_FAILED` — agent setup failed (stops eval on default policy)
+- `UNKNOWN` — unknown harness error (stops eval on default policy)
+
+`failed_samples_policy` (default: `default`) — only stops on "no fair chance" failures: `UNKNOWN`, `UNKNOWN_AGENT_ERROR`, `AGENT_INSTALLATION_FAILED`. All other failures continue with score 0.
+
+## Artifacts
+
+All paths relative to `<output_dir>/<invocation_id>/terminal-bench-hard/`.
+
+### Client logs
+
+`logs/client-*.log` — contains rich/ANSI formatting (binary), always use `grep -a`. Shows live progress (`Running tasks (X/Y, Accuracy: Z%)`) and crash diagnostics.
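The `failed_samples_policy` rule above can be sketched as follows (illustrative only, not the harness implementation; the function name and return strings are invented):

```python
# "No fair chance" failures stop the whole eval under the default policy;
# every other failure mode continues and the trial simply scores 0.
STOP_EVAL = {"UNKNOWN", "UNKNOWN_AGENT_ERROR", "AGENT_INSTALLATION_FAILED"}

def handle_failure(failure_mode):
    if failure_mode in STOP_EVAL:
        return "stop_eval"
    return "score_0_and_continue"
```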
+ +### Run-level artifacts + +Path: `artifacts/terminal-bench/` + +| File | Written | Updated | Content | +|------|---------|---------|---------| +| `tb.lock` | Run start | Never | Full resolved config: invocation args, agent kwargs (`max_episodes`, `temperature`, `max_input_tokens_per_task`), run config (`n_concurrent_trials`, `global_agent_timeout_sec`, `failed_samples_policy`), ECS/sandbox settings. Best for reproducing runs. | +| `run_metadata.json` | Run start | Once at end | `model_name`, `dataset_name`/`dataset_version`, `n_concurrent_trials`, `task_ids`, `start_time`/`end_time`, `accuracy`, `pass_at_k` | +| `task_status.json` | After 1st task | After each task | One entry per task (not per trial). `status` (success/failed), `outcome`, `trial_name`. "Success is sticky" — once a task succeeds, later failures don't overwrite. 48 entries total. | +| `tb_results.json` | After 1st task | After each task | See below | + +**Mid-run**: `task_status.json` and `tb_results.json` grow incrementally. `run_metadata.json` exists but lacks final metrics. + +#### `tb_results.json` details + +The richest single artifact. + +**Per-trial fields:** +- `is_resolved` (bool) — ground truth for whether the task was solved. Use this, not `passed` or `score`. 
+- `failure_mode`, `parser_results` (dict of test name → "passed"/"failed")
+- `instruction` — full task description given to the agent
+- Token usage: `total_input_tokens`, `total_output_tokens`
+- `trajectory_length` — number of agent episodes (turns)
+- Timestamps: `trial_started_at`, `agent_started_at/ended_at`, `test_started_at/ended_at`
+- `recording_path` — asciinema `.cast` file for replaying terminal sessions
+- `error_type`, `error_message` — populated on crashes
+
+**Aggregate fields:**
+- `pass_at_k`, `accuracy`, `n_resolved`, `n_unresolved`
+- `resolved_ids`, `unresolved_ids`
+- `failure_mode_counts`, `error_type_counts`, `token_limit_exceeded_count`
+- `total_input_tokens`, `total_output_tokens` — run-wide totals
+
+Per-trial `artifacts/terminal-bench/{task_name}/{trial_name}/results.json` files are the source — `tb_results.json` aggregates them (same schema).
+
+### Per-trial artifacts
+
+Path: `artifacts/terminal-bench/{task_name}/{trial_name}/`
+
+**Agent logs** (`agent-logs/episode-N/`, N = 0, 1, 2, ...):
+- `prompt.txt` — full prompt sent to the model (system instructions + task + terminal state)
+- `response.txt` — model's raw response (JSON with `analysis`, `plan`, `commands`, `task_complete`)
+- `debug.json` — LiteLLM trace: model, messages, optional_params, `reasoning`/`reasoning_content` (chain-of-thought), token usage, `llm_api_duration_ms`, response headers
+
+**Panes** (`panes/`) — terminal screen snapshots:
+- `pre-agent.txt` — before agent starts (initial prompt)
+- `post-agent.txt` — after agent finishes (all commands and outputs)
+- `post-test.txt` — after test verification. If `failure_mode: parse_error`, check this first; for pytest tasks the summary block may be missing.
+
+Panes are useful for quick triage without reading episode logs.
+
+## Troubleshooting
+
+### Mitigating Agent Timeouts
+
+High `AGENT_TIMEOUT` rates (e.g. 85%+) are caused by inference contention: too many concurrent agent sessions competing for the same vLLM instance.
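Before tuning anything, confirm that timeouts actually dominate by reading the aggregate `failure_mode_counts` from `tb_results.json`. A minimal sketch — the sample JSON and the lowercase key name are assumptions; check your artifact's actual keys:

```bash
# Compute the agent-timeout share from tb_results.json aggregate fields.
# A tiny sample file stands in for a real run's artifact.
f=$(mktemp)
cat > "$f" <<'EOF'
{"failure_mode_counts": {"agent_timeout": 41, "unset": 7}}
EOF
share=$(python3 -c "
import json, sys
counts = json.load(open(sys.argv[1]))['failure_mode_counts']
print(f'{counts.get(\"agent_timeout\", 0) / sum(counts.values()):.0%}')
" "$f")
echo "agent_timeout share: $share"
rm -f "$f"
```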
+ +Two levers reduce contention: **lower parallelism** (fewer concurrent tasks) and **scale inference** (more deployment nodes / data-parallel replicas). Scaling inference has diminishing returns — requesting 32–64 nodes means long queue times and harder Slurm scheduling. The recommended approach combines both: + +**Split into independent single-sample runs with lower parallelism (8x1 pattern):** + +Instead of one run with `n_samples: 8, parallelism: 100`, submit 8 independent runs each with `n_samples: 1` and reduced `parallelism: 24`. This scales horizontally with multiple smaller jobs. + +### Context Recovery + +When the agent's input exceeds the model's context window, terminus_2 has two recovery paths. Both rely on `litellm.get_max_tokens(model_name)` to determine the context limit. + +**Proactive path** (`_check_proactive_summarization`): Fires when `free_tokens < 8000` *before* the API call. Summarizes while the **full** conversation history is still available. This is the healthier path. + +**Reactive path** (on `ContextLengthExceededError`): Fires after the API *rejects* a request: +1. **Unwind** (`_unwind_messages_to_free_tokens`): Drops the most recent user+assistant pairs until `free_tokens >= 4000`. Destructive — removed messages are permanently lost. +2. **Summarize** (`_summarize`): Asks the model (using truncated history) to summarize, generates questions from summary + `capture_pane()`, answers from truncated history, resets `chat._messages` to just 3 messages (original instruction + Q&A). + +**Reactive path flaw**: Unwind drops recent messages *before* summarize runs. The terminal reflects those actions but the summary doesn't contain them. Only `capture_pane()` partially compensates. + +**LiteLLM context limit is often wrong**: `litellm.get_max_tokens()` returns the *advertised* context window, not the deployment limit. For unknown models it falls back to 1M tokens; for `--max-model-len` smaller than default, it reports the full spec. 
When the limit is too high, unwind removes nothing, summarize hits the same error, and recovery is a no-op — propagates as `CONTEXT_LENGTH_EXCEEDED`. + +## Agent Trace Analysis + +See `references/benchmarks/terminal-bench-trace-analysis.md` for analyzing per-task agent traces, extracting behavior patterns, and categorizing failures. diff --git a/.claude/skills/launching-evals/references/benchmarks/terminal-bench-trace-analysis.md b/.claude/skills/launching-evals/references/benchmarks/terminal-bench-trace-analysis.md new file mode 100644 index 0000000000..5e973a1620 --- /dev/null +++ b/.claude/skills/launching-evals/references/benchmarks/terminal-bench-trace-analysis.md @@ -0,0 +1,145 @@ +# Terminal Bench: Agent Trace Analysis + +Analyze agent traces from a terminal-bench evaluation run. + +``` +Trace analysis progress: +- [ ] Step 1: Locate artifacts +- [ ] Step 2: Analyze each task +- [ ] Step 3: Produce summary table +- [ ] Step 4: Episode-level deep dive (optional) +``` + +## Step 1: Locate artifacts + +- Agent logs: `artifacts/terminal-bench/agent_logs/default/tasks.jsonl` +- Per-task artifacts: `artifacts/terminal-bench/{task_name}/{trial_name}/` + - `results.json` - test results and metadata + - `panes/post-agent.txt` - terminal state after agent finished + - `panes/post-test.txt` - terminal state after tests ran + +## Step 2: Analyze each task + +For each task, extract: + +**Metadata:** Task ID, status (success/failure/error), duration (convert to hours and minutes), token usage (input/output), test results breakdown (passed/failed counts). + +**Agent behavior:** +1. **Approach:** What strategy did the agent use? (read-then-write, iterative debugging, single-shot, etc.) +2. **Key Commands:** Summarize the critical shell commands executed from post-agent.txt +3. **Reasoning Quality:** Was the plan coherent? Did it address the task requirements? + +**For successful tasks:** What made the approach work? Was it efficient or did it take unnecessary steps? 
Key success factors (domain knowledge, clean implementation, etc.) + +**For failed tasks:** +- **Failure Mode:** Categorize as: environment/setup issues, algorithm/logic errors, timeout/resource limits, or task misunderstanding +- **Stuck Loops:** Did the agent repeat failed attempts without adapting? +- **Root Cause:** Single-sentence summary of why it failed +- **Missed Opportunities:** What should the agent have done differently? + +Present each task using this format: + +``` +## [Task Name] + +**Status:** PASS/FAIL (X/Y tests) | **Duration:** Hh MMmin | **Tokens:** Xk in / Xk out +**Task:** One-sentence description of what the task required +**Agent Approach:** +1. Step 1 +2. Step 2 +... + +**[For failures only] Why It Failed:** +* Bullet points with specific errors/issues from the logs + +**[For successes] Key Success Factors / [For failures] Root Cause:** +* Summary +``` + +## Step 3: Produce summary table + +| Task | Status | Duration | Tokens | Failure Mode | +|------|--------|----------|--------|--------------| +| ... | PASS/FAIL | Hh MMmin | Xk | - or category | + +## Step 4: Episode-level deep dive (optional) + +When you need to trace exactly where the agent went wrong, check: + +`artifacts/terminal-bench/{task}/*/agent-logs/episode-N/` + +- `response.txt` - Agent's explicit reasoning (`analysis`, `plan` fields) +- `prompt.txt` - What terminal state the agent saw before acting + +Use cases: identify the specific episode where the agent made a wrong decision, check if the plan was reasonable but execution failed, debug loops where the agent repeated the same failing approach. + +Skip this for general pass/fail summaries and performance comparisons. + +## Examples + +### Successful: cross-entropy-method + +``` +Status: PASSED (22/22 tests) | Duration: 28 min | Tokens: 38.5k in / 3.3k out +Task: Implement three core RL methods: PointEnv.step(), CrossEntropyMethod.optimize(), and evaluate_plans_memoized() with caching. +Agent Approach: +1. 
Examined existing code structure (ls -la, cat cross_entropy.py) +2. Wrote complete implementations using a heredoc (cat > cross_entropy.py << 'EOF') +3. Implemented step() with position clipping and goal distance check +4. Implemented memoization with prefix caching (tuple keys for hashability) +5. Implemented cross-entropy optimization with elite selection +6. Ran tests which all passed + +Key Success Factors: +* Clear algorithmic understanding (cross-entropy method, memoization) +* Clean single-shot implementation without debugging loops +* Proper numpy handling (clipping, distance calculation) +``` + +### Successful: oom (cache HuggingFace model) + +``` +Status: PASSED (1/1 test) | Duration: 9 min | Tokens: 4k in / 432 out +Task: Cache the albert/albert-base-v2 model for offline use. +Agent Approach: +1. Ran huggingface-cli download albert/albert-base-v2 +2. Downloaded 12 files (~270MB total) +3. Task complete - straightforward execution + +Key Success Factors: +* Simple task with direct solution +* Minimal steps required (single command) +``` + +### Failed: lean4-proof + +``` +Status: FAILED (6/11 tests) | Duration: 3h 20min | Tokens: 521k in / 15.8k out +Task: Install Lean v4.21.0 with Mathlib and complete 3 formal proofs. +Agent Failures: +1. Version Mismatch Hell: Agent tried to pin Mathlib to lean-4.21 but branch doesn't exist +2. Toolchain Override: Mathlib auto-updated to v4.27.0-rc1, breaking v4.21.0 requirement +3. Incompatible Linter Options: error: Unknown option `linter.unusedTactic` +4. Git Authentication Failures: Multiple failed clones requiring auth +5. Import Ordering Errors: set_option inserted before import + +Root Cause: Agent couldn't reconcile Lean v4.21.0 requirement with Mathlib4's latest versions. Spent 3+ hours in a loop trying various revision formats without success. 
+```
+
+### Failed: feal-differential-cryptanalysis
+
+```
+Status: FAILED (0/1 test) | Duration: 42 min | Tokens: 7.4k in / 1.6k out
+Task: Implement a differential cryptanalysis attack to recover key[5] from a FEAL-like cipher.
+Agent Approach:
+1. Read feal.py to understand the cipher (4-round Feistel)
+2. Wrote attack.py with basic differential attack logic
+
+Why It Failed:
+* The differential characteristic chosen was likely incorrect for this FEAL variant
+* Attack logic assumed key[5] maps to round 4 key directly, but the cipher uses key[round+2] indexing
+* No iterative refinement or multi-round differential propagation analysis
+* Agent stopped after single implementation attempt without testing/debugging
+
+Root Cause: Cryptanalysis tasks require precise differential trail analysis. The agent's heuristic approach didn't account for the specific F-function and key schedule.
+```
diff --git a/.claude/skills/launching-evals/references/check-progress.md b/.claude/skills/launching-evals/references/check-progress.md
new file mode 100644
index 0000000000..3f888fe629
--- /dev/null
+++ b/.claude/skills/launching-evals/references/check-progress.md
@@ -0,0 +1,24 @@
+# Check progress of a running evaluation
+
+Follow the three phases and track your progress in the output.
+
+1. **INPUT** -> EXPLORE -> ACT
+2. ~~INPUT~~ -> **EXPLORE** -> ACT
+3. ~~INPUT~~ -> ~~EXPLORE~~ -> **ACT**
+
+## 1. INPUT
+
+- **Invocation ID**: The evaluation to monitor.
+
+## 2. EXPLORE
+
+1. **Get status & task name**: `uv run nemo-evaluator-launcher status --json`
+2. **Check for benchmark-specific docs**: Read files in `references/benchmarks/` matching the task name (e.g., `terminal-bench-general-info.md` for `terminal-bench-*` tasks). These contain monitoring commands and benchmark-specific context.
+3. **Get output paths from config**: `uv run nemo-evaluator-launcher info <invocation_id>` → find `output_dir` and cluster hostname.
+
+## 3. ACT
+
+1. Report status, slurm job ID, task name from step 2.1
+2. **If RUNNING**: SSH to cluster and check the live progress in the `client-*.log` file. Use the monitoring command from the benchmark docs if one exists.
+3. **If SUCCESS**: Pivot to analyzing results. See `references/analyze-results.md`.
+4. **If FAILED**: Pivot to debugging failed runs. See `references/debug-failed-runs.md`.
diff --git a/.claude/skills/launching-evals/references/debug-failed-runs.md b/.claude/skills/launching-evals/references/debug-failed-runs.md
new file mode 100644
index 0000000000..e94d3bb89f
--- /dev/null
+++ b/.claude/skills/launching-evals/references/debug-failed-runs.md
@@ -0,0 +1,130 @@
+# Debug failed runs
+
+Copy this checklist and track your progress:
+
+```
+Debug progress:
+- [ ] Step 1: Gather from the user
+- [ ] Step 2: Get job info
+- [ ] Step 3: Copy and check logs
+- [ ] Step 4: Apply fix
+- [ ] Step 5: Verify fix
+```
+
+## Step 1: Gather from the user
+
+- **Invocation ID**: The failed run to debug.
+- **Error symptoms** (optional): What the user observed (timeout, OOM, etc.).
+
+## Step 2: Get job info
+
+```bash
+uv run nemo-evaluator-launcher status <invocation_id> --json
+uv run nemo-evaluator-launcher info <invocation_id>
+```
+
+Extract from output:
+
+- **Status**: Job state per task
+- **Logs path**: Remote path to logs directory
+- **Slurm Job ID**: Job ID for log filenames
+- **Hostname**: Cluster login node for SSH
+
+## Step 3: Copy and check logs
+
+**IMPORTANT**: Copy what you need (and only what you need) locally BEFORE analysis — each SSH command requires user approval, so remote one-by-one reads are disruptive, and copying too much is slow.
+
+```bash
+uv run nemo-evaluator-launcher info <invocation_id> --copy-logs /tmp/debug-logs
+```
+
+```bash
+LOGS=/tmp/debug-logs/<invocation_id>/logs
+
+# Check logs in order:
+# 1. slurm log - job-level errors (scheduling, walltime, preemption)
+cat $LOGS/slurm-*.log
+
+# 2. server log - deployment errors (OOM, missing model, bad args, driver mismatch)
+tail -200 $LOGS/server-*-0.log
+grep -i -E '(error|exception|failed|OOM|killed)' $LOGS/server-*-0.log | tail -50
+
+# 3. proxy log - load balancer errors (multi-instance only)
+cat $LOGS/proxy-*.log 2>/dev/null
+
+# 4. client log - evaluation errors (dataset, scorer, timeout, rate limiting)
+tail -200 $LOGS/client-*.log
+```
+
+- **slurm-*.log** — Job-level errors (health check timeouts, account/partition errors, walltime exceeded, preemption)
+- **server-*-N.log** — Deployment errors (CUDA OOM, missing model/checkpoint, bad extra_args, GPU driver mismatch, image pull failure)
+- **proxy-*.log** — HAProxy load balancer errors (only present with multi-instance deployments)
+- **client-*.log** — Evaluation errors (dataset access, scorer errors, timeouts, rate limiting)
+
+**IMPORTANT**: Always check BOTH server AND client logs. Client logs show symptoms (e.g., `unknown_agent_error`, `failed_samples_policy`), server logs show actual cause.
+
+## Step 4: Apply fix
+
+**Common fixes:**
+
+- **CUDA OOM**: Increase `deployment.tensor_parallel_size` to shard across more GPUs. For multi-node: increase `execution.num_nodes` and set `deployment.pipeline_parallel_size`. As last resort: add `--max-model-len <smaller value>` to `deployment.extra_args`. Do NOT quantize as a first fix — scale compute instead.
+- **Missing model/checkpoint**: `FileNotFoundError`, `RepositoryNotFoundError`, or `GatedRepoError: 403` — verify `deployment.checkpoint_path` or `deployment.hf_model_handle`. For gated models, set `HF_TOKEN` via `deployment.env_vars`.
+- **Bad `extra_args`**: `unrecognized arguments` or `unexpected keyword argument` — check flags against deployment engine version. Some flags change between versions (e.g., `--rope-scaling` removed in vLLM > 0.11.0).
+- **Image pull failure**: `manifest not found` or `pyxis: child 1 failed` — verify image tag exists. Drop `:5005` from GitLab container registry URLs.
+- **GPU driver mismatch**: `CUDA driver version is insufficient` — use an older container image matching the host CUDA driver. +- **Health check timeout / connection refused**: Server didn't start — check server logs first. Increase `execution.endpoint_readiness_timeout` (seconds). SLURM default: `null` (falls back to walltime). +- **Server crashed mid-eval**: `Connection reset by peer` — check server logs for OOM. Reduce `parallelism` (concurrent requests). Check SLURM logs for preemption or walltime exceeded. +- **Missing dataset**: `DatasetNotFoundError` or `GatedRepoError: 403` — accept the license on HuggingFace, set `HF_TOKEN` in `evaluation[].env_vars`. +- **Scorer errors**: `ScorerError` or `KeyError` — check model output format, `adapter_config`, and `max_new_tokens`. +- **Timeout**: `TimeoutError` or `Request timed out` — increase `evaluation[].nemo_evaluator_config.config.params.request_timeout`. Reduce `max_new_tokens` or `parallelism` if overloaded. +- **Config validation**: `MissingMandatoryValue` (unfilled `???`), `ValidationError` (type mismatch), `ScannerError` (invalid YAML) — run `--dry-run` to catch these upfront. +- **Walltime exceeded**: `CANCELLED DUE TO TIME LIMIT` — NEL submits paired restart jobs that automatically resume when walltime expires, so this is often expected behavior, not a failure. Only increase `execution.walltime` if the evaluation isn't making progress across restarts. +- **Preemption**: `CANCELLED DUE TO PREEMPTION` — the paired restart job should automatically resume. If it doesn't, use non-preemptible partition, or re-run. +- **Container not found**: Applies to both `deployment.image` and task-level eval container. Drop `:5005` from GitLab registry URLs. 
+- Troubleshooting docs: list files with WebFetch `https://api.github.com/repos/NVIDIA-NeMo/Evaluator/contents/docs/troubleshooting`, then fetch relevant ones from `https://raw.githubusercontent.com/NVIDIA-NeMo/Evaluator/main/docs/troubleshooting/`
+
+**Fix Slurm invalid account/partition:**
+
+```bash
+# Get cluster hostname from nel info
+uv run nemo-evaluator-launcher info <invocation_id>
+
+# Check available accounts on the cluster
+ssh <user>@<hostname> "sacctmgr show user <user> withassoc format=Account%30,Partition%20 --noheader"
+```
+
+**Fix HuggingFace API 429 Rate Limiting:**
+
+Always set `HF_TOKEN` in both `deployment.env_vars` and `evaluation[].env_vars`, even for public models. To pre-cache:
+
+```bash
+ssh <user>@<hostname>
+python3 -m venv .venv && source .venv/bin/activate && pip install -U huggingface_hub
+export HF_HOME=<shared storage path>/huggingface
+export HF_TOKEN=<your HF token>
+huggingface-cli download <org>/<model>
+```
+
+Then set `HF_HUB_OFFLINE: 1` in config's env_vars.
+
+**Correctness warning** — these fixes affect evaluation results:
+- `--max-model-len` — restricts context window, may truncate prompts
+- `temperature` — sampling randomness
+- `top_p` — nucleus sampling threshold
+- `max_new_tokens` — output truncation if too low
+
+## Step 5: Verify fix
+
+```bash
+# 1. Dry-run (validates config without running)
+uv run nemo-evaluator-launcher run --config <config.yaml> --dry-run
+
+# 2. Smoke test (10 samples)
+uv run nemo-evaluator-launcher run --config <config.yaml> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
+
+# 3. Single failing task only
+uv run nemo-evaluator-launcher run --config <config.yaml> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
+
+# 4. Monitor
+uv run nemo-evaluator-launcher status <invocation_id> --json
+```
diff --git a/.claude/skills/launching-evals/references/run-evaluation.md b/.claude/skills/launching-evals/references/run-evaluation.md
new file mode 100644
index 0000000000..5eac18746c
--- /dev/null
+++ b/.claude/skills/launching-evals/references/run-evaluation.md
@@ -0,0 +1,26 @@
+# Run the evaluation
+
+Follow the three phases and track your progress in the output.
+
+1. **INPUT** -> EXPLORE -> ACT
+2. ~~INPUT~~ -> **EXPLORE** -> ACT
+3. ~~INPUT~~ -> ~~EXPLORE~~ -> **ACT**
+
+## 1. INPUT
+
+Gather requirements from the user:
+
+- **Config path**: The YAML config file to run. NOTE: You might already have the config path in your memory from the previous step.
+- **Credentials**: Some tasks require environment variables (e.g., `AWS_ACCESS_KEY_ID`, `HF_TOKEN`). Check if there is a `.env` file in the workspace root. If not, ask the user to create one with the credentials exported in it.
+- **Task filter** (optional): Specific tasks to run via `-t <task_name>`.
+- **Overrides** (optional): Any `-o key=value` overrides.
+- **Dry-run first?** (optional): Preview with `--dry-run` before submitting.
+
+## 2. EXPLORE
+
+- Preview the resolved config and the sbatch script by adding `--dry-run` flag to the final command.
+
+## 3. ACT
+
+1. Submit the evaluation: `uv run nemo-evaluator-launcher run --config <config.yaml> ...`
+   - NEL automatically reads `.env` from the workspace root — no need to source it manually.
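The credentials check in step 1 can be automated with a small pre-flight sketch. The variable list and the generated `.env` below are illustrative, not the actual requirements of any task:

```bash
# Fail fast if the workspace .env is missing an export a task needs.
# Sample .env stands in for the real workspace-root file.
envfile=$(mktemp)
printf 'export HF_TOKEN=dummy\nexport AWS_ACCESS_KEY_ID=dummy\n' > "$envfile"
missing=""
for var in HF_TOKEN AWS_ACCESS_KEY_ID NGC_API_KEY; do
  grep -q "^export ${var}=" "$envfile" || missing="$missing $var"
done
echo "missing:${missing:- none}"
rm -f "$envfile"
```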
diff --git a/.claude/skills/launching-evals/tests.json b/.claude/skills/launching-evals/tests.json new file mode 100644 index 0000000000..6d6935e98a --- /dev/null +++ b/.claude/skills/launching-evals/tests.json @@ -0,0 +1,46 @@ +[ + { + "skills": ["launching-evals"], + "query": "Check the status of evaluation run ", + "files": [], + "expected_behavior": [ + "Runs `uv run nemo-evaluator-launcher status --json` to check job status", + "Reports the Slurm job status (PENDING, RUNNING, SUCCESS, FAILED, KILLED)", + "Does NOT claim server is healthy just because status is RUNNING - acknowledges RUNNING only means Slurm job is running, not that vLLM server started successfully" + ] + }, + { + "skills": ["launching-evals"], + "query": "Check the progress of evaluation.", + "files": [], + "expected_behavior": [ + "Reads the launching-evals skill to understand the evaluation workflow", + "Runs `nemo-evaluator-launcher status --json` to get the invocation ID, task name (terminal-bench-hard), and confirm the run is active", + "Reads the benchmark-specific documentation at `references/benchmarks/terminal-bench-general-info.md` to find the live progress monitoring command", + "Runs `nemo-evaluator-launcher info` to find the `output_dir` path on the cluster", + "SSHs to the cluster and runs the grep command from the benchmark docs to extract the live progress (e.g., 'Running tasks (133/144, Accuracy: 20.30%)')" + ] + }, + { + "skills": ["launching-evals"], + "query": "Check the progress", + "files": [], + "expected_behavior": [ + "Runs `uv run nemo-evaluator-launcher status --json` first to check if job is still running", + "If status is RUNNING, proceeds to check benchmark-specific progress (e.g., grep 'Running tasks' from client logs)", + "If status is FAILED, immediately pivots to debugging: checks BOTH client logs AND server logs before diagnosing root cause", + "Does NOT stop at client log errors (e.g., 'unknown_agent_error') - always checks server logs for the underlying 
cause (e.g., vLLM validation errors, CUDA OOM, context overflow)" + ] + }, + { + "skills": ["launching-evals"], + "query": "Check the status, is it running in deed?", + "files": [], + "expected_behavior": [ + "Runs `uv run nemo-evaluator-launcher status --json` to check job status", + "If status is FAILED, checks BOTH client logs AND server logs before diagnosing root cause", + "Does NOT conclude root cause from client logs alone - client logs often show symptoms (e.g., 'unknown_agent_error', 'failed_samples_policy') while server logs show the actual cause (e.g., CUDA OOM, context length overflow, vLLM validation errors)", + "Provides diagnosis only after reviewing both log sources to avoid misleading the user with incomplete information" + ] + } +] From 03dfca7868086ed6cba530338f84e6c8c29f572b Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 18:41:05 -0700 Subject: [PATCH 10/16] Add sync-upstream-skills.sh to re-vendor upstream NEL skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The script re-downloads launching-evals/ and accessing-mlflow/ from NVIDIA-NeMo/Evaluator at a pinned SHA (DEFAULT_SHA in the script) and re-applies our provenance frontmatter. Idempotent — running repeatedly at the same SHA produces no diff. Also fixes the re-sync path in both SKILL.md frontmatters to the actual script location (.claude/scripts/). 
Signed-off-by: Zhiyu Cheng --- .claude/scripts/sync-upstream-skills.sh | 135 +++++++++++++++++++++++ .claude/skills/accessing-mlflow/SKILL.md | 2 +- .claude/skills/launching-evals/SKILL.md | 2 +- 3 files changed, 137 insertions(+), 2 deletions(-) create mode 100755 .claude/scripts/sync-upstream-skills.sh diff --git a/.claude/scripts/sync-upstream-skills.sh b/.claude/scripts/sync-upstream-skills.sh new file mode 100755 index 0000000000..a42b6c82d5 --- /dev/null +++ b/.claude/scripts/sync-upstream-skills.sh @@ -0,0 +1,135 @@ +#!/usr/bin/env bash +# Re-vendor upstream Claude skills from NVIDIA-NeMo/Evaluator at a pinned SHA. +# +# Scope: only skills we vendor verbatim (launching-evals, accessing-mlflow). +# The `evaluation` skill is a *modified* fork of upstream nel-assistant and is +# NOT managed by this script — update it manually when pulling upstream changes. +# +# Usage: +# .claude/scripts/sync-upstream-skills.sh # re-vendor at the pinned SHA +# UPSTREAM_SHA= .claude/scripts/sync-upstream-skills.sh # bump to a new SHA +# +# Requires: gh, base64, awk. Run from the repo root. +# +# The script overwrites .claude/skills// with upstream contents and +# re-applies our provenance lines into each SKILL.md frontmatter. If you have +# local changes to a vendored skill, they will be lost — that is expected, +# since vendored-verbatim skills should not be modified locally. + +set -euo pipefail + +# Pinned upstream commit. Bump this (or pass UPSTREAM_SHA=...) when syncing. +DEFAULT_SHA="01899f89e8f31116efbca56e8f87fbd8513e24ac" +SHA="${UPSTREAM_SHA:-$DEFAULT_SHA}" +SHORT_SHA="${SHA:0:7}" + +UPSTREAM_REPO="NVIDIA-NeMo/Evaluator" +UPSTREAM_BASE="packages/nemo-evaluator-launcher/.claude/skills" +DEST_BASE=".claude/skills" + +if [[ ! 
-d "$DEST_BASE" ]]; then + echo "error: run from the repo root (expected $DEST_BASE/ to exist)" >&2 + exit 1 +fi + +echo "Syncing upstream skills from $UPSTREAM_REPO @ $SHORT_SHA" + +fetch_tree() { + local skill="$1" + local path="$2" + gh api "repos/$UPSTREAM_REPO/contents/$UPSTREAM_BASE/$skill/$path?ref=$SHA" \ + -q '.[] | "\(.type)\t\(.name)"' +} + +fetch_file() { + local src="$1" + local dst="$2" + mkdir -p "$(dirname "$dst")" + gh api "repos/$UPSTREAM_REPO/contents/$src?ref=$SHA" -q '.content' | base64 -d > "$dst" +} + +fetch_skill_recursive() { + local skill="$1" + local subpath="${2:-}" + local remote="$UPSTREAM_BASE/$skill" + [[ -n "$subpath" ]] && remote="$remote/$subpath" + + local entries + entries=$(gh api "repos/$UPSTREAM_REPO/contents/$remote?ref=$SHA" -q '.[] | "\(.type)\t\(.name)"') + + while IFS=$'\t' read -r type name; do + local rel_path + if [[ -n "$subpath" ]]; then + rel_path="$subpath/$name" + else + rel_path="$name" + fi + + if [[ "$type" == "file" ]]; then + local dst="$DEST_BASE/$skill/$rel_path" + echo " fetch: $dst" + fetch_file "$UPSTREAM_BASE/$skill/$rel_path" "$dst" + elif [[ "$type" == "dir" ]]; then + fetch_skill_recursive "$skill" "$rel_path" + fi + done <<< "$entries" +} + +# Inject our provenance lines into a SKILL.md frontmatter, right after the +# `description:` line. Idempotent — removes any existing provenance block first. 
+inject_provenance() { + local skill="$1" + local extra_note="${2:-}" + local path="$DEST_BASE/$skill/SKILL.md" + + awk -v sha="$SHA" -v short="$SHORT_SHA" -v skill="$skill" -v extra="$extra_note" ' + BEGIN { in_fm = 0; injected = 0; fm_end_seen = 0 } + # Skip any pre-existing provenance or license lines we own + /^license: Apache-2\.0$/ && in_fm && !fm_end_seen { next } + /^# Vendored verbatim/ && in_fm && !fm_end_seen { next } + /^# https:\/\/github\.com\/NVIDIA-NeMo\/Evaluator\/tree\// && in_fm && !fm_end_seen { next } + /^# To re-sync:/ && in_fm && !fm_end_seen { next } + /^# Note: this skill depends on the mlflow-mcp/ && in_fm && !fm_end_seen { next } + /^# configured in the user/ && in_fm && !fm_end_seen { next } + { + print + if ($0 == "---") { + if (in_fm == 0) { in_fm = 1 } + else { in_fm = 0; fm_end_seen = 1 } + } + if (in_fm && !injected && $0 ~ /^description: /) { + print "license: Apache-2.0" + print "# Vendored verbatim from NVIDIA NeMo Evaluator (commit " short ")" + print "# https://github.com/NVIDIA-NeMo/Evaluator/tree/" sha "/packages/nemo-evaluator-launcher/.claude/skills/" skill + print "# To re-sync: .claude/scripts/sync-upstream-skills.sh" + if (extra != "") { + n = split(extra, lines, "\\|") + for (i = 1; i <= n; i++) print "# " lines[i] + } + injected = 1 + } + } + ' "$path" > "$path.tmp" + mv "$path.tmp" "$path" +} + +for skill in launching-evals accessing-mlflow; do + echo "" + echo "== $skill ==" + rm -rf "${DEST_BASE:?}/$skill" + fetch_skill_recursive "$skill" + + case "$skill" in + accessing-mlflow) + inject_provenance "$skill" \ + "Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp)|configured in the user's Claude Code setup." + ;; + *) + inject_provenance "$skill" + ;; + esac +done + +echo "" +echo "Done. Review with: git diff $DEST_BASE/launching-evals $DEST_BASE/accessing-mlflow" +echo "If the SHA changed, update DEFAULT_SHA at the top of this script before committing." 
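The idempotence the commit message claims is easy to see in miniature: strip any provenance line the script owns, then re-insert it, so a second pass is a no-op. The frontmatter below is a stand-in, not a real SKILL.md:

```bash
# Reduced inject_provenance: drop owned lines, re-insert after description:.
f=$(mktemp)
printf -- '---\ndescription: demo skill\n---\nbody\n' > "$f"
inject() {
  awk '
    /^# Vendored / { next }                       # drop any prior copy
    { print }
    /^description: / { print "# Vendored (commit abc1234)" }
  ' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
}
inject; once=$(cat "$f")
inject; twice=$(cat "$f")
[ "$once" = "$twice" ] && echo "idempotent"
rm -f "$f"
```

Running `inject` any number of times leaves the file identical after the first pass, which is the property that makes re-vendoring at the same SHA produce no diff.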
diff --git a/.claude/skills/accessing-mlflow/SKILL.md b/.claude/skills/accessing-mlflow/SKILL.md index 337a027bd9..690716637a 100644 --- a/.claude/skills/accessing-mlflow/SKILL.md +++ b/.claude/skills/accessing-mlflow/SKILL.md @@ -4,7 +4,7 @@ description: Query and browse evaluation results stored in MLflow. Use when the license: Apache-2.0 # Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8) # https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/accessing-mlflow -# To re-sync: scripts/sync-upstream-skills.sh +# To re-sync: .claude/scripts/sync-upstream-skills.sh # Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp) # configured in the user's Claude Code setup. --- diff --git a/.claude/skills/launching-evals/SKILL.md b/.claude/skills/launching-evals/SKILL.md index 34ef50bdd5..47ba236821 100644 --- a/.claude/skills/launching-evals/SKILL.md +++ b/.claude/skills/launching-evals/SKILL.md @@ -4,7 +4,7 @@ description: Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator license: Apache-2.0 # Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8) # https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/launching-evals -# To re-sync: scripts/sync-upstream-skills.sh +# To re-sync: .claude/scripts/sync-upstream-skills.sh --- # NeMo Evaluator Skill From 9cb309b2f16340ca771a717c94c4277276bb2523 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 19:34:48 -0700 Subject: [PATCH 11/16] Delete end-to-end-workflow.md per review feedback Reviewers on PR #1239 (kaix-nv, mxinO) flagged the e2e workflow doc as unnecessary: the skill descriptions already route Claude to chain PTQ, deployment, and evaluation skills, and the content duplicated workspace-management.md or lived better inside the evaluation skill's nel-ci-guide.md references. 
Removes the file and its three cross-references (evaluation/SKILL.md, ptq/SKILL.md, workspace-management.md). The "carry PTQ patches forward to deploy/eval" insight is preserved as a one-liner in evaluation/SKILL.md. Signed-off-by: Zhiyu Cheng --- .claude/skills/common/end-to-end-workflow.md | 70 ------------------- .claude/skills/common/workspace-management.md | 2 - .claude/skills/evaluation/SKILL.md | 2 +- .claude/skills/ptq/SKILL.md | 2 +- 4 files changed, 2 insertions(+), 74 deletions(-) delete mode 100644 .claude/skills/common/end-to-end-workflow.md diff --git a/.claude/skills/common/end-to-end-workflow.md b/.claude/skills/common/end-to-end-workflow.md deleted file mode 100644 index 1dae03c2e5..0000000000 --- a/.claude/skills/common/end-to-end-workflow.md +++ /dev/null @@ -1,70 +0,0 @@ -# End-to-End Workflow: PTQ → Deploy → Eval - -This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy. - -## Pipeline Overview - -```text -PTQ (quantize) → Deployment (serve) → Evaluation (benchmark) -───────────────── ────────────────── ──────────────────────── -hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET) - ↓ ↓ ↓ -NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores - (safetensors) (http://host:8000) (results.yml) -``` - -## Workspace Continuity - -All three stages share the same workspace directory. 
The PTQ output becomes the deployment input, and eval results land alongside: - -```text -workspaces/model-name-format/ - output/ ← PTQ checkpoint (safetensors + config.json) - eval_results/ ← NEL evaluation artifacts (results.yml per task) - eval_config.yaml ← NEL config for evaluation - scripts/ ← Custom run scripts (if needed) - logs/ ← SLURM job logs -``` - -When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run: - -```bash -ls workspaces/ -``` - -## Unsupported Models - -Models not in the verified support matrices require extra work at each stage: - -| Stage | What can go wrong | Reference | -|-------|-------------------|-----------| -| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` | -| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` | -| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` | - -Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too. - -## NEL Evaluation with Custom Deployments - -When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving: - -```yaml -deployment: - command: >- - pip install "transformers>=5.0.0.dev0" --pre -q && - sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py && - ${deployment.base_command} -``` - -This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`). 
- -## Decision: NEL SLURM Executor vs NEL CI (JET) - -| Factor | NEL SLURM executor | NEL CI (JET) | -|--------|-------------------|--------------| -| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs | -| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage | -| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets | -| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` | -| **MLflow export** | Manual setup | Automatic | -| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` | diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md index 5d85e91186..f797e7870e 100644 --- a/.claude/skills/common/workspace-management.md +++ b/.claude/skills/common/workspace-management.md @@ -105,8 +105,6 @@ workspaces/model-name-format/ logs/ ← All: SLURM job logs ``` -See `skills/common/end-to-end-workflow.md` for the full pipeline. - ## Example Flow ```text diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index f254f75e5d..bcaeb8433d 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -16,7 +16,7 @@ You're an expert in NeMo Evaluator Launcher! Guide the user through creating pro If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications. -This skill is often the final stage of the PTQ → Deploy → Eval pipeline. 
If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. See `skills/common/end-to-end-workflow.md` for the full pipeline. +This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. ### Workflow diff --git a/.claude/skills/ptq/SKILL.md b/.claude/skills/ptq/SKILL.md index 73597751ec..8ce8f12477 100644 --- a/.claude/skills/ptq/SKILL.md +++ b/.claude/skills/ptq/SKILL.md @@ -135,7 +135,7 @@ Report the path and size to the user. Validate the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) — this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps. -**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over — see `skills/common/end-to-end-workflow.md` for the full PTQ → Deploy → Eval pipeline. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time. +**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time. 
## Key API Rules From 290f4323a24512fb4e840a4f361ea65f128db1f7 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 19:40:04 -0700 Subject: [PATCH 12/16] Move nel-ci-guide.md to Model-Optimizer-Internal per review feedback Reviewer @shengliangxu flagged that the NEL CI evaluation guide contains NVIDIA-internal infrastructure (JET clusters, svc-jet service account, gitlab-master NEL CI triggers, COMPEVAL_HF_TOKEN, internal lustre paths) and should not ship in the public repo. The file has been moved to Model-Optimizer-Internal:agent/nel-ci-guide.md (see internal MR: zhiyu/add-nel-ci-guide-to-agent). This commit removes the public copy and the "NEL CI and Cluster-Specific Notes" section from evaluation/SKILL.md that referenced it. Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/SKILL.md | 13 - .../evaluation/references/nel-ci-guide.md | 276 ------------------ 2 files changed, 289 deletions(-) delete mode 100644 .claude/skills/evaluation/references/nel-ci-guide.md diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index bcaeb8433d..6f61f1d980 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -319,19 +319,6 @@ After job submission, you can monitor progress using: --- -### NEL CI and Cluster-Specific Notes - -For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`. 
It covers: -- NEL CI GitLab trigger pattern vs NEL SLURM executor -- Cluster-specific GPU counts and storage paths -- Checkpoint availability (compute nodes may not share login node filesystems) -- Environment variable prefixes (`host:`, `lit:`) for SLURM executor -- SGLang must bind `--host 0.0.0.0` for health checks -- Directory setup and `chmod 777` for JET service account access -- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting) - ---- - Direct users with issues to: - **GitHub Issues:** diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md deleted file mode 100644 index 846d0236c8..0000000000 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ /dev/null @@ -1,276 +0,0 @@ -# NEL CI Evaluation Guide - -NEL CI is the recommended entry point for running evaluations on NVIDIA JET infrastructure. This guide covers patterns for evaluating quantized checkpoints using both the NEL SLURM executor (direct) and the NEL CI GitLab pipeline. - -Reference repo: `gitlab-master.nvidia.com/dl/JoC/competitive_evaluation/nemo-evaluator-launcher-ci` - ---- - -## 1. Two Execution Paths - -| Path | When to use | How it works | -|------|-------------|--------------| -| **NEL SLURM executor** | You have SSH access to the cluster, checkpoint is on cluster storage | `nel run --config config.yaml` from your workstation; NEL SSHes to cluster and submits sbatch jobs | -| **NEL CI GitLab pipeline** | You want managed infrastructure, MLflow export, reproducible configs | Trigger via GitLab API or UI; JET orchestrates everything | - -### NEL SLURM executor - -Best for iterative development and debugging. 
Run from any machine with SSH access to the cluster: - -```bash -export DUMMY_API_KEY=dummy -export HF_TOKEN= - -nel run --config eval_config.yaml \ - -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10 # test first -``` - -### NEL CI trigger - -Best for production evaluations with MLflow tracking. See the trigger script pattern in section 4. - ---- - -## 2. Cluster Reference - -| Cluster | GPUs/Node | Architecture | Max Walltime | Storage | Notes | -|---------|-----------|-------------|--------------|---------|-------| -| oci-hsg | 4 | GB200 | 4 hours | `/lustre/` | Set `tensor_parallel_size=4` | -| cw | 8 | H100 | — | `/lustre/` | — | -| oci-nrt | 8 | H100 | — | `/lustre/` | Numerics configs | -| dlcluster | 4 (B100 partition) | B100 | 8 hours | `/home/omniml_data_*` | No `/lustre/`; use local NFS paths | - -**Important**: `deployment.tensor_parallel_size` determines how many GPUs are requested. If this exceeds the cluster's GPUs per node, the job fails with a memory allocation error. - ---- - -## 3. Checkpoint Availability - -The checkpoint must be on a filesystem accessible from the cluster's **compute nodes** (not just login nodes). - -| Cluster type | Accessible storage | NOT accessible | -|-------------|-------------------|----------------| -| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation paths (`/home/scratch.*`), NFS mounts from other clusters | -| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` (not available) | - -If the checkpoint is on a workstation, **copy it to cluster storage first**: - -```bash -rsync -av /path/to/local/checkpoint \ - :/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/ -``` - -**Cross-cluster copy** (e.g., dlcluster → oci-hsg): If the two clusters can't SSH to each other directly, pipe through your workstation without staging to disk: - -```bash -ssh user@source-cluster "tar czf - -C /path/to/checkpoint ." 
| \ - ssh user@target-cluster "tar xzf - -C /lustre/.../checkpoints/model-name" -``` - -After copying, set permissions for svc-jet: `chmod -R 777 /lustre/.../checkpoints/model-name` - -For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes. - ---- - -## 4. NEL CI Trigger Pattern - -For JET clusters, trigger evaluations via the GitLab API. - -### Simple deployment (standard models) - -For models that work with stock vLLM/SGLang, use `NEL_DEPLOYMENT_COMMAND` directly: - -```bash -export GITLAB_TOKEN= - -curl -k --request POST \ - --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \ - --header "Content-Type: application/json" \ - --data '{ - "ref": "main", - "variables": [ - {"key": "NEL_CONFIG_PATH", "value": "configs/AA/minimax_m2_5_lbd_lax.yaml"}, - {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"}, - {"key": "NEL_CLUSTER", "value": "oci-hsg"}, - {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, - {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, - {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"}, - {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"}, - {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"}, - {"key": "NEL_VLLM_CACHE", "value": "/lustre/.../cache/vllm"}, - {"key": "NEL_CLUSTER_OUTPUT_DIR", "value": "/lustre/.../nv-eval-rundirs"} - ] - }' \ - "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline" -``` - -### Complex deployment (unsupported models needing runtime patches) - -If the model needs runtime patches (e.g., transformers upgrade, framework source fixes), **do NOT put multi-step commands in `NEL_DEPLOYMENT_COMMAND`** — Hydra's override parser will break on nested quotes, `&&`, `$()`, etc. 
- -Instead, use the **wrapper script pattern**: place a `serve.sh` in the checkpoint directory on the cluster, then point `NEL_DEPLOYMENT_COMMAND` to it. - -**Step 1** — Write wrapper script to the checkpoint directory on the cluster: - -```bash -ssh 'cat > /lustre/.../checkpoint/serve.sh << '"'"'EOF'"'"' -#!/bin/bash -set -e -pip install "transformers>=5.0.0.dev0" "huggingface_hub>=0.32.0" --pre -q -# Patch vLLM for ministral3 support (example) -MISTRAL3_PY=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1) -sed -i "s/old_pattern/new_pattern/" "$MISTRAL3_PY" -exec vllm serve /checkpoint --host 0.0.0.0 --port 8000 \ - --tensor-parallel-size 4 --quantization modelopt_fp4 \ - --trust-remote-code --served-model-name my-model --gpu-memory-utilization 0.9 -EOF -chmod 777 /lustre/.../checkpoint/serve.sh' -``` - -**Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper: - -```json -{"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"} -``` - -This works because the checkpoint is mounted at `/checkpoint` inside the container. The script is Hydra-safe (no special characters in the override value). - -### Custom configs with `NEL_CONFIG_BASE64` - -When using a custom config (not from the repo), use `NEL_CONFIG_BASE64` instead of `NEL_CONFIG_PATH`. 
This requires setting `NEL_UNTRUSTED_EVAL=true`: - -```python -import json, base64, subprocess, os - -with open("my_config.yaml") as f: - config_b64 = base64.b64encode(f.read().encode()).decode() - -payload = { - "ref": "main", - "variables": [ - {"key": "NEL_CONFIG_BASE64", "value": config_b64}, - {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"}, - {"key": "NEL_CLUSTER", "value": "oci-hsg"}, - {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, - {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, - {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}, - {"key": "NEL_UNTRUSTED_EVAL", "value": "true"}, - # ... other variables - ] -} - -# Use Python to construct JSON (avoids shell escaping issues with curl) -token = os.environ["GITLAB_TOKEN"] -subprocess.run( - ["curl", "-k", "--request", "POST", - "--header", f"PRIVATE-TOKEN: {token}", - "--header", "Content-Type: application/json", - "--data", json.dumps(payload), - "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"], -) -``` - -> **Tip**: Use Python (not bash) to construct the JSON payload for `curl`. Shell escaping of base64 strings and nested quotes is error-prone. - ---- - -## 5. Environment Variables - -### SLURM executor format - -Env vars in NEL SLURM configs require explicit prefixes: - -| Prefix | Meaning | Example | -|--------|---------|---------| -| `host:VAR_NAME` | Read from the host environment where `nel run` is executed | `host:HF_TOKEN` | -| `lit:value` | Literal string value | `lit:dummy` | - -```yaml -evaluation: - env_vars: - DUMMY_API_KEY: host:DUMMY_API_KEY - HF_TOKEN: host:HF_TOKEN -``` - -### JET executor format - -JET configs reference JET secrets with `$SECRET_NAME`: - -```yaml -execution: - env_vars: - evaluation: - HF_TOKEN: $COMPEVAL_HF_TOKEN -``` - -### Gated datasets - -Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. 
- -**NEL CI (JET)**: Handled automatically — the `COMPEVAL_HF_TOKEN` JET secret is pre-configured by the eval platform team. No user action needed; you don't even need personal access to the gated dataset. - -**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at for GPQA). - -```yaml -evaluation: - env_vars: - HF_TOKEN: host:HF_TOKEN # SLURM executor — reads from your local env - tasks: - - name: simple_evals.gpqa_diamond - env_vars: - HF_TOKEN: host:HF_TOKEN -``` - ---- - -## 6. Serving Framework Notes - -### vLLM - -- Binds to `0.0.0.0` by default — health checks work out of the box -- For NVFP4: `--quantization modelopt_fp4` -- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`) - -### SGLang - -- **Must include `--host 0.0.0.0`** — SGLang defaults to `127.0.0.1` which blocks health checks from the eval client -- Must include `--port 8000` to match NEL's expected port -- For NVFP4: `--quantization modelopt_fp4` - ---- - -## 7. 
Common Issues - -| Issue | Cause | Fix | -|-------|-------|-----| -| `401 Unauthorized` pulling eval container | NGC credentials not set on cluster | Set up `~/.config/enroot/.credentials` with NGC API key | -| `PermissionError: /hf-cache/...` | HF cache dir not writable by svc-jet | Set `NEL_HF_HOME` to your own `chmod 777` directory | -| Health check stuck at `000` | Server binding to localhost | Add `--host 0.0.0.0` to deployment command (SGLang) | -| `Memory required by task is not available` | TP size exceeds GPUs/node | Set `tensor_parallel_size` to match cluster (4 for oci-hsg, dlcluster B100) | -| TIMEOUT after eval completes | Walltime too short for eval + MLflow export | Set `execution.walltime=04:00:00` | -| Gated dataset auth failure | `HF_TOKEN` not passed to eval container | Add `env_vars.HF_TOKEN` at evaluation or task level | -| `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead | -| Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first | -| `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | -| `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` | -| `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation | -| `The model does not exist` (404) | Eval client uses checkpoint path as model_id instead of served_model_name | Add `deployment.served_model_name=` to `NEL_OTHER_OVERRIDES` to match `--served-model-name` in your serve command | - ---- - -## 8. 
Directory Setup for JET Clusters - -Before running evaluations on a JET cluster, create writable directories: - -```bash -ssh -mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface -mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm -mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs -chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface -chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm -chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs -``` - -`chmod 777` is required because `svc-jet` (JET service account) runs containers and needs write access. From 7824b24901b91028da1b9f2e375a69013de7d09b Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 20:17:40 -0700 Subject: [PATCH 13/16] Unblock CI and address mxinO review on remote-execution.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two fixes: 1. Exclude vendored upstream skills from markdownlint. `.claude/skills/launching-evals/` and `.claude/skills/accessing-mlflow/` are vendored verbatim from NVIDIA-NeMo/Evaluator and re-synced via .claude/scripts/sync-upstream-skills.sh. Markdownlint wanted to reformat them (trailing blank lines, spacing around fences), but fixing would violate the "verbatim" property documented in their frontmatter. Add an `ignores:` glob to `.markdownlint-cli2.yaml`. 2. Reframe the checkpoint/storage note on `skills/common/remote-execution.md`. Reviewer @mxinO noted (PR #1239) that the previous "compute nodes may not share the same filesystem as login nodes" framing is misleading — compute nodes on a given cluster do share storage with the login node. The real issue is that workstation filesystems aren't mounted on the cluster at all. Also drops the dlcluster-specific row, which @mxinO flagged as an internal quirk that shouldn't ship publicly. 
Signed-off-by: Zhiyu Cheng
---
 .claude/skills/common/remote-execution.md | 13 ++++++-------
 .markdownlint-cli2.yaml                   |  6 ++++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md
index 2e538fa466..be770aef93 100644
--- a/.claude/skills/common/remote-execution.md
+++ b/.claude/skills/common/remote-execution.md
@@ -28,16 +28,15 @@ clusters:
 default_cluster: my-cluster
 ```
 
-### Checkpoint and storage availability
+### Staging checkpoints from your workstation
 
-Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:
+Workstation filesystems (`/home/scratch.*`, local NFS) are **not** mounted on the cluster. If a checkpoint was produced on your workstation, copy it to the cluster's own storage before submitting any job that references it — NEL and SLURM do NOT sync checkpoints automatically.
 
-| Cluster type | Compute-node storage | NOT accessible from compute nodes |
-|-------------|---------------------|----------------------------------|
-| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
-| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |
+```bash
+rsync -av /path/to/local/checkpoint <cluster>:<workspace>/checkpoints/
+```
 
-If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.
+Use the `workspace` path from your cluster config as the destination. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.
 
 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
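After staging with `rsync`, a cheap integrity check is to compare file count and total bytes on both ends rather than re-checksumming a multi-gigabyte checkpoint. A minimal sketch; the helper name and demo paths are illustrative, not part of the skill:

```shell
#!/usr/bin/env bash
# Print "<N> files, <S> bytes" for everything under a checkpoint directory.
summarize_ckpt() {
  find "$1" -type f -printf '%s\n' \
    | awk '{ n++; s += $1 } END { printf "%d files, %d bytes\n", n, s }'
}

# Demo on a scratch directory (stands in for the real checkpoint path):
mkdir -p /tmp/ckpt-demo
printf 'abc' > /tmp/ckpt-demo/model.bin
summarize_ckpt /tmp/ckpt-demo   # -> 1 files, 3 bytes
```

Run the same `find | awk` one-liner over `ssh <cluster>` against the staged path and compare the two summaries; a mismatch usually means an interrupted transfer.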
diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml index 38a1c1f0fe..de3bbba7b3 100644 --- a/.markdownlint-cli2.yaml +++ b/.markdownlint-cli2.yaml @@ -10,3 +10,9 @@ config: MD036: false # no-emphasis-as-heading - allow **bold** as section markers MD041: false # first-line-heading MD059: false # no-hard-tabs + +# Vendored upstream skills — kept byte-identical to upstream via +# .claude/scripts/sync-upstream-skills.sh; do not reformat. +ignores: + - ".claude/skills/launching-evals/**" + - ".claude/skills/accessing-mlflow/**" From 31c4fe842af1d42f3aba21ae1d862a7a3a9ab0a3 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 20:24:39 -0700 Subject: [PATCH 14/16] fix format Signed-off-by: Zhiyu Cheng --- .claude/scripts/sync-upstream-skills.sh | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/.claude/scripts/sync-upstream-skills.sh b/.claude/scripts/sync-upstream-skills.sh index a42b6c82d5..c0fa5f32ac 100755 --- a/.claude/scripts/sync-upstream-skills.sh +++ b/.claude/scripts/sync-upstream-skills.sh @@ -1,4 +1,19 @@ #!/usr/bin/env bash +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + # Re-vendor upstream Claude skills from NVIDIA-NeMo/Evaluator at a pinned SHA. # # Scope: only skills we vendor verbatim (launching-evals, accessing-mlflow). 
From 645a5458c2003c87446fa638f863029892e12be0 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 20:27:00 -0700 Subject: [PATCH 15/16] Split credential setup out of slurm-setup.md into credentials.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses three overlapping review comments on slurm-setup.md:62 from PR #1239: - @mxinO: NGC/HF/Docker tokens aren't SLURM-specific — wanted a general credential setup guide referenced from multiple skills. - CodeRabbit: `$oauthtoken` needs to be called out as a literal NGC login string, not a shell variable to substitute. - Copilot: the previous snippet overwrote `~/.config/enroot/.credentials` unconditionally, clobbering entries for other registries. New `skills/common/credentials.md` covers HF_TOKEN, NGC API key (Docker + enroot paths), and Docker Hub. The NGC/enroot block uses an append-if-missing pattern (`grep -q ... || echo ... >>`) and spells out that `$oauthtoken` is a literal, kept unexpanded via single quotes. `slurm-setup.md` now keeps only the pyxis-specific signpost — one paragraph pointing at `credentials.md` for the actual setup. Signed-off-by: Zhiyu Cheng --- .claude/skills/common/credentials.md | 60 ++++++++++++++++++++++++++++ .claude/skills/common/slurm-setup.md | 12 +----- 2 files changed, 61 insertions(+), 11 deletions(-) create mode 100644 .claude/skills/common/credentials.md diff --git a/.claude/skills/common/credentials.md b/.claude/skills/common/credentials.md new file mode 100644 index 0000000000..dd45445fd4 --- /dev/null +++ b/.claude/skills/common/credentials.md @@ -0,0 +1,60 @@ +# Credentials Setup + +Tokens and registry credentials that ModelOpt workflows need across local and cluster environments. Not SLURM-specific — referenced from PTQ, deployment, evaluation, and slurm-setup skills. + +## HuggingFace token (`HF_TOKEN`) + +Required for gated models (e.g., Llama, Mistral, some Nemotron variants) and gated datasets (e.g., GPQA, HLE). 
+
+Generate at <https://huggingface.co/settings/tokens>, then export:
+
+```bash
+export HF_TOKEN=hf_...
+```
+
+Persist in `~/.bashrc` or a project-local `.env` file. For remote clusters, check whether the cluster's shell config already sets it: `ssh <cluster> 'env | grep -c HF_TOKEN'`.
+
+## NGC API key (for `nvcr.io`)
+
+Required for pulling NGC images (`nvcr.io/nvidia/pytorch:...`, `nvcr.io/nvidia/vllm:...`) via Docker, `srun --container-image`, or enroot.
+
+Generate at <https://ngc.nvidia.com/setup/api-key>.
+
+### Docker
+
+```bash
+docker login nvcr.io -u '$oauthtoken' -p <NGC_API_KEY>
+```
+
+### Enroot (SLURM / pyxis)
+
+Add an entry to `~/.config/enroot/.credentials` on the cluster. The file may already hold credentials for other registries — **append rather than overwrite**:
+
+```bash
+mkdir -p ~/.config/enroot
+CREDS=~/.config/enroot/.credentials
+touch "$CREDS"
+grep -q '^machine nvcr.io ' "$CREDS" || \
+  echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' >> "$CREDS"
+chmod 600 "$CREDS"
+```
+
+> **Note**: `$oauthtoken` is a **literal string** required by NGC, not a shell variable. Do not replace it and do not let your shell expand it — the single quotes above keep it literal.
+
+Without this, `srun --container-image=nvcr.io/...` fails with `401 Unauthorized` when the compute node tries to pull.
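Once the entry is in place, a quick self-check catches the two usual mistakes (missing `nvcr.io` line, wrong file mode) before a job dies at pull time. A hedged sketch: `check_enroot_creds` is an illustrative helper, not part of enroot, and takes the file path as an argument so you can try it on a scratch copy first:

```shell
#!/usr/bin/env bash
# Verify an enroot credentials file: nvcr.io entry present, mode 600.
check_enroot_creds() {
  local creds="${1:-$HOME/.config/enroot/.credentials}"
  [ -f "$creds" ] || { echo "missing: $creds"; return 1; }
  # $oauthtoken is matched literally -- it is part of the file's contents.
  grep -q '^machine nvcr.io login \$oauthtoken password ' "$creds" \
    || { echo "no nvcr.io entry"; return 1; }
  [ "$(stat -c %a "$creds")" = "600" ] || { echo "mode is not 600"; return 1; }
  echo "ok"
}

# Demo on a scratch file instead of the real one:
tmp=$(mktemp)
echo 'machine nvcr.io login $oauthtoken password XXXX' > "$tmp"
chmod 600 "$tmp"
check_enroot_creds "$tmp"   # -> ok
```

Run it without an argument on the cluster to check the real `~/.config/enroot/.credentials` before submitting.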
+ +## Docker Hub login + +Only needed if you hit rate limits pulling public images: + +```bash +docker login +``` + +## Summary + +| Credential | Used for | Set via | +|---|---|---| +| `HF_TOKEN` | Gated HF models / datasets | Env var (`export HF_TOKEN=...`) or `.env` | +| NGC API key | `nvcr.io` image pulls | `docker login` or `~/.config/enroot/.credentials` | +| Docker Hub | Rate-limited public image pulls | `docker login` | diff --git a/.claude/skills/common/slurm-setup.md b/.claude/skills/common/slurm-setup.md index 3126702e44..f7d99c7543 100644 --- a/.claude/skills/common/slurm-setup.md +++ b/.claude/skills/common/slurm-setup.md @@ -53,17 +53,7 @@ srun \ ### Container registry credentials (pyxis) -If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add if missing: - -```bash -cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials" -# To add NGC credentials: -mkdir -p ~/.config/enroot -echo 'machine nvcr.io login $oauthtoken password ' > ~/.config/enroot/.credentials -chmod 600 ~/.config/enroot/.credentials -``` - -Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`. +If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs registry credentials on the cluster in `~/.config/enroot/.credentials`. See `skills/common/credentials.md` for the NGC / Docker / HF token setup. Without this, `srun` fails with `401 Unauthorized` when the compute node pulls. 
Submit and capture the job ID: From c664a30c0b9944cccc8a762dae9ea0eba7148687 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sat, 18 Apr 2026 22:30:13 -0700 Subject: [PATCH 16/16] Add CHANGELOG entry for evaluation skills polish Documents the new Claude Code evaluation-related skills and the shared credentials.md common doc, mirroring the style of the existing PTQ skill entry in the 0.44 release notes. Signed-off-by: Zhiyu Cheng --- CHANGELOG.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.rst b/CHANGELOG.rst index 80dea0e43e..b4e9fe6bfe 100755 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -15,6 +15,7 @@ Changelog - Enable PTQ workflow for the Step3.5-Flash MoE model with NVFP4 W4A4 + FP8 KV cache quantization. See `modelopt_recipes/models/Step3.5-Flash/nvfp4-mlp-only.yaml `_ for more details. - Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md `_ for more details. - [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution. +- [Early Testing] Polish Claude Code evaluation skill (``.claude/skills/evaluation/``) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from `NVIDIA-NeMo/Evaluator `_: ``launching-evals`` (run/check/debug/analyze NEL evaluations) and ``accessing-mlflow`` (query MLflow runs, compare metrics, fetch artifacts). Re-sync at a pinned upstream SHA via ``.claude/scripts/sync-upstream-skills.sh``. Also adds a shared ``skills/common/credentials.md`` covering HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing — use with caution. 
- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml `_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml `_ for usage. **Backward Breaking Changes**