feat: add AIME-2026 benchmark. #2469

Open
xxman-google wants to merge 2 commits into NVIDIA-NeMo:main from xxman-google:xx/aime26

@xxman-google
Contributor

@xxman-google xxman-google commented May 12, 2026

What does this PR do ?

Add AIME'26 eval benchmark.

Many models released this year may have incorporated data from 2025 into their training. I tested the new benchmark with the Qwen3.5-9B model, and it scored 89.2% averaged over 16 runs.

Issues

N/A

Usage

  • Launch eval through the following command:
uv run examples/run_eval.py \
data.dataset_name="aime2026" \
data.prompt_file=examples/prompts/cot.txt \
cluster.gpus_per_node=8 \
generation.vllm_cfg.max_model_len=81920 \
generation.model_name="Qwen/Qwen3.5-9B" \
generation.temperature=1.0 \
generation.top_p=0.95 \
generation.top_k=20 \
eval.num_tests_per_prompt=16
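
For readers wondering how a figure "averaged over 16 runs" can be derived when `eval.num_tests_per_prompt=16` draws 16 samples per prompt, here is a minimal sketch. It is an illustration only: the function name `mean_accuracy` and the nested-list input layout are hypothetical, and it does not reproduce how `run_eval.py` actually aggregates its scores.

```python
def mean_accuracy(correct):
    """Average per-run accuracy over repeated samples.

    correct[i][j] is True when sample j of prompt i was graded correct.
    Hypothetical helper for illustration; not part of NeMo RL.
    """
    if not correct or not correct[0]:
        raise ValueError("need at least one prompt and one sample per prompt")
    num_runs = len(correct[0])
    if any(len(row) != num_runs for row in correct):
        raise ValueError("every prompt needs the same number of samples")
    # Treat sample index j across all prompts as one "run" and score it.
    per_run = [
        sum(row[j] for row in correct) / len(correct)
        for j in range(num_runs)
    ]
    # Report the mean of the per-run accuracies.
    return sum(per_run) / num_runs


if __name__ == "__main__":
    # Two prompts, two samples each (toy data, not real results).
    grades = [[True, True], [True, False]]
    print(mean_accuracy(grades))
```

Because every prompt has the same number of samples, this mean of per-run accuracies equals the overall fraction of correct samples.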

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@xxman-google xxman-google requested review from a team as code owners May 12, 2026 05:18
@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Documentation Improvements or additions to documentation community-request labels May 12, 2026
Contributor

@yuki-97 yuki-97 left a comment

thanks for adding this. overall lgtm! one minor comment.

could you also help to add a unit test for the dataset like those in tests/unit/data/datasets/test_eval_dataset.py?

Comment thread nemo_rl/data/datasets/eval_datasets/__init__.py Outdated
@yuki-97 yuki-97 mentioned this pull request May 12, 2026
4 tasks
@xxman-google xxman-google requested a review from a team as a code owner May 12, 2026 16:36
@xxman-google
Contributor Author

Comments addressed. PTAL @yuki-97

yuki-97 previously approved these changes May 13, 2026
Contributor

@yuki-97 yuki-97 left a comment

lgtm, and offline verified test_aime_dataset works (since it's skipped now).

@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 13, 2026
@yuki-97
Contributor

yuki-97 commented May 13, 2026

/ok to test ba1b343

@yuki-97
Contributor

yuki-97 commented May 13, 2026

hi @xxman-google , could you help to fix the lint fail? other fails are not related to this PR, we'll take a look.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 13, 2026
Signed-off-by: Xuehan Xiong <xxman@google.com>
@xxman-google
Contributor Author

> hi @xxman-google , could you help to fix the lint fail? other fails are not related to this PR, we'll take a look.

@yuki-97 Done. PTAL again.

@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels May 14, 2026
@yuki-97
Contributor

yuki-97 commented May 14, 2026

/ok to test 89cdaa5

@yuki-97
Contributor

yuki-97 commented May 14, 2026

/ok to test 5c32bac

@yuki-97
Contributor

yuki-97 commented May 14, 2026

@kajalj22 could you help to take a look? I've tried both Lfast and L1, but both fail while building the container (for different reasons).

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 14, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 16, 2026
Labels

  • CI:L1: Run doctests, unit tests, and functional tests
  • community-request
  • Documentation: Improvements or additions to documentation
  • waiting-on-maintainers: Waiting on maintainers to respond
