feat: add AIME-2026 benchmark. by xxman-google · Pull Request #2469 · NVIDIA-NeMo/RL

xxman-google · 2026-05-12T05:17:59Z

What does this PR do ?

Add AIME'26 eval benchmark.

Many models released this year may have incorporated data from 2025 into their training. I tested the new benchmark with the Qwen3.5-9B model, and it scored 89.2% averaged over 16 runs.

Issues

N/A

Usage

Launch eval through the following command:

uv run examples/run_eval.py \
data.dataset_name="aime2026" \
data.prompt_file=examples/prompts/cot.txt \
cluster.gpus_per_node=8 \
generation.vllm_cfg.max_model_len=81920 \
generation.model_name="Qwen/Qwen3.5-9B" \
generation.temperature=1.0 \
generation.top_p=0.95 \
generation.top_k=20 \
eval.num_tests_per_prompt=16
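
For reference, the 89.2% figure quoted above is an average over the 16 samples per problem requested by eval.num_tests_per_prompt=16. The snippet below is only a minimal sketch of how such an average can be computed from per-sample grades. The function name `average_accuracy` and the nested-list input layout are hypothetical and chosen for illustration; this is not the metric code used by run_eval.py.

```python
def average_accuracy(correct: list[list[bool]]) -> float:
    """Mean accuracy over runs.

    correct[i][j] is True when sample j of problem i was graded correct.
    This layout is a hypothetical illustration, not NeMo RL's internal format.
    """
    if not correct or not correct[0]:
        raise ValueError("need at least one problem and one run")
    num_problems = len(correct)
    num_runs = len(correct[0])
    if any(len(row) != num_runs for row in correct):
        raise ValueError("every problem needs the same number of runs")
    # Accuracy of each run: fraction of problems solved in that run.
    per_run = [sum(row[j] for row in correct) / num_problems for j in range(num_runs)]
    # Report the mean of the per-run accuracies.
    return sum(per_run) / num_runs


# Toy example: 2 problems, 2 runs each (the command above would give 16 runs).
toy = [[True, False], [True, True]]
print(f"{average_accuracy(toy):.1%}")  # prints 75.0%
```

Because every problem has the same number of samples, averaging the per-run accuracies gives the same number as dividing the total count of correct samples by the total number of samples.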

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-05-12T05:18:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuki-97

thanks for adding this. overall lgtm! one minor comment.

could you also help to add a unit test for the dataset, like those in tests/unit/data/datasets/test_eval_dataset.py?

xxman-google · 2026-05-12T16:37:03Z

Comments addressed. PTAL @yuki-97

yuki-97

lgtm, and I verified offline that test_aime_dataset works (since it's skipped now).

yuki-97 · 2026-05-13T07:36:17Z

/ok to test ba1b343

yuki-97 · 2026-05-13T07:59:09Z

hi @xxman-google, could you help to fix the lint failure? The other failures are not related to this PR, we'll take a look.

Signed-off-by: Xuehan Xiong <xxman@google.com>

xxman-google · 2026-05-14T16:37:30Z

> hi @xxman-google, could you help to fix the lint failure? The other failures are not related to this PR, we'll take a look.

@yuki-97 Done. PTAL again.

yuki-97 · 2026-05-14T16:42:49Z

/ok to test 89cdaa5

yuki-97 · 2026-05-14T16:52:35Z

/ok to test 5c32bac

yuki-97 · 2026-05-14T16:54:33Z

@kajalj22 could you help to take a look? I've tried both Lfast and L1, but both fail while building the container (for different reasons).

xxman-google requested review from a team as code owners May 12, 2026 05:18

github-actions Bot added the Documentation ("Improvements or additions to documentation") and community-request labels May 12, 2026

yuki-97 reviewed May 12, 2026

Comment thread nemo_rl/data/datasets/eval_datasets/__init__.py Outdated

yuki-97 mentioned this pull request May 12, 2026 in feat: add HMMT eval benchmark. #2468 (Open, 4 tasks)

xxman-google force-pushed the xx/aime26 branch from 16cc255 to ba1b343 Compare May 12, 2026 16:36

xxman-google requested a review from a team as a code owner May 12, 2026 16:36

yuki-97 previously approved these changes May 13, 2026

yuki-97 added the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") May 13, 2026

copy-pr-bot Bot temporarily deployed to public May 13, 2026 07:36 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 13, 2026 07:36 Failure

copy-pr-bot Bot temporarily deployed to public May 13, 2026 07:36 Inactive

copy-pr-bot Bot temporarily deployed to public May 13, 2026 07:37 Inactive

copy-pr-bot Bot temporarily deployed to public May 13, 2026 07:40 Inactive

svcnvidia-nemo-ci added the waiting-on-customer label ("Waiting on the original author to respond") May 13, 2026

copy-pr-bot Bot had a problem deploying to nemo-ci May 14, 2026 02:09 Failure

feat: add AIME-2026 benchmark.

89cdaa5

Signed-off-by: Xuehan Xiong <xxman@google.com>

xxman-google dismissed yuki-97’s stale review via 89cdaa5 May 14, 2026 16:36

xxman-google force-pushed the xx/aime26 branch from ba1b343 to 89cdaa5 Compare May 14, 2026 16:36

yuki-97 added the CI:L1 label ("Run doctests, unit tests, and functional tests") and removed the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") May 14, 2026

copy-pr-bot Bot temporarily deployed to public May 14, 2026 16:43 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 14, 2026 16:43 Failure

copy-pr-bot Bot temporarily deployed to public May 14, 2026 16:43 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 14, 2026 16:47 Failure

copy-pr-bot Bot temporarily deployed to public May 14, 2026 16:47 Inactive

Merge branch 'main' into xx/aime26

5c32bac

copy-pr-bot Bot temporarily deployed to public May 14, 2026 16:52 Inactive

copy-pr-bot Bot temporarily deployed to public May 14, 2026 16:53 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 14, 2026 16:53 Failure

copy-pr-bot Bot temporarily deployed to public May 14, 2026 16:53 Inactive

svcnvidia-nemo-ci removed the waiting-on-customer label ("Waiting on the original author to respond") May 14, 2026

copy-pr-bot Bot had a problem deploying to nemo-ci May 15, 2026 14:10 Failure

svcnvidia-nemo-ci added the waiting-on-maintainers label ("Waiting on maintainers to respond") May 16, 2026

xxman-google wants to merge 2 commits into NVIDIA-NeMo:main from xxman-google:xx/aime26

3 participants
