Add NuBench training datasets #888

Merged
sevmag merged 7 commits into graphnet-team:main from sevmag:features/nu_bench_dataset on May 8, 2026

@sevmag (Collaborator) commented May 7, 2026

Closes #887.

Summary

  • Adds NuBenchDataset (in src/graphnet/datasets/nubench_datasets.py), a single ERDAHostedDataset entry point for the NuBench benchmark suite (arXiv:2511.13111).
  • Ships a registry of the available NuBench datasets — cluster, flower_l, flower_s, flower_xl, hexagon, hexagon_ice_le — each pinned to its ERDA hash and corresponding NuBenchDetector subclass.
  • Handles the NuBench split convention automatically: train/val read from the merged_photons pulsemap, test reads from pulses_no_noise.
  • Re-exports NuBenchDataset, NuBenchSpec, FEATURES_NUBENCH, and TRUTH_NUBENCH from graphnet.datasets.
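
A registry of this kind can be sketched as follows. This is a minimal, self-contained illustration, not the code in nubench_datasets.py: `SpecSketch`, `REGISTRY_SKETCH`, `available_datasets_sketch`, and `lookup` are hypothetical names, the hashes are placeholders, and the name-to-detector pairing is an assumption except for hexagon_ice_le, which the usage example pairs with Hexagon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for a registry entry (not the real NuBenchSpec)."""

    erda_hash: str  # placeholder; the real ERDA hashes are not shown here
    detector: str   # detector class name, kept as a string to stay self-contained


# Dataset names come from the PR summary. The detector pairing is assumed,
# apart from hexagon_ice_le -> Hexagon, which appears in the usage example.
REGISTRY_SKETCH = {
    "cluster": SpecSketch("<erda-hash>", "Cluster"),
    "flower_l": SpecSketch("<erda-hash>", "FlowerL"),
    "flower_s": SpecSketch("<erda-hash>", "FlowerS"),
    "flower_xl": SpecSketch("<erda-hash>", "FlowerXL"),
    "hexagon": SpecSketch("<erda-hash>", "Hexagon"),
    "hexagon_ice_le": SpecSketch("<erda-hash>", "Hexagon"),
}


def available_datasets_sketch() -> list:
    """Return the registered dataset names in sorted order."""
    return sorted(REGISTRY_SKETCH)


def lookup(name: str) -> SpecSketch:
    """Resolve a dataset name to its spec, failing loudly on unknown names."""
    try:
        return REGISTRY_SKETCH[name]
    except KeyError:
        raise ValueError(
            f"Unknown dataset {name!r}; choose from {available_datasets_sketch()}"
        ) from None
```

With a layout like this, supporting another dataset means adding one dictionary entry rather than a new class.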

The NuBench detector classes (Cluster, FlowerL, FlowerS, FlowerXL, Hexagon, Triangle) and geometry tables already exist on main; this PR adds the missing Dataset layer so the benchmarks can actually be trained on.

Usage

from graphnet.models.graphs import KNNGraph
from graphnet.models.detector.nubench import Hexagon
from graphnet.datasets import NuBenchDataset

ds = NuBenchDataset(
    name="hexagon_ice_le",
    download_dir="/path/to/nubench_data",
    data_representation=KNNGraph(detector=Hexagon()),
)

Test plan

  • CI passes (lint + unit tests).
  • Smoke-test NuBenchDataset(name="hexagon_ice_le", ...) end-to-end: download from ERDA, build train/val/test splits, iterate a few batches.
  • Verify available_datasets() lists all entries in the registry.

sevmag and others added 7 commits May 7, 2026 14:12
Adds a CuratedDataset subclass for the Hexagon Ice LE benchmark from
the NuBench suite (~8.6M neutrino events). Uses the pre-computed
NuBench train/test selection parquet files rather than random splitting,
and supports both pulses_no_noise and merged_photons pulsemaps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace per-dataset HexagonIceLEDataset class with a single
NuBenchDataset entry point driven by a NuBenchSpec registry, so new
NuBench datasets (e.g. triangle) can be added without a new class.
Pulsemap is selected per split (merged_photons for train/val,
pulses_no_noise for test) to match the NuBench convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
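
The per-split pulsemap selection described in the commit message above might look roughly like the sketch below. The pulsemap names and the train/val/test convention are taken from the PR text. `PULSEMAP_BY_SPLIT` and `pulsemap_for_split` are hypothetical names chosen for illustration, and the actual implementation may be structured differently.

```python
# Pulsemap per split, following the NuBench convention stated in the PR:
# train and val read merged_photons, test reads pulses_no_noise.
PULSEMAP_BY_SPLIT = {
    "train": "merged_photons",
    "val": "merged_photons",
    "test": "pulses_no_noise",
}


def pulsemap_for_split(split: str) -> str:
    """Return the pulsemap name for a split (hypothetical helper)."""
    if split not in PULSEMAP_BY_SPLIT:
        raise ValueError(
            f"Unknown split {split!r}; expected one of {sorted(PULSEMAP_BY_SPLIT)}"
        )
    return PULSEMAP_BY_SPLIT[split]
```

Keeping the mapping in one table makes the convention easy to audit and keeps the choice out of user code.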
@Aske-Rosted (Collaborator) left a comment

I have a single comment, otherwise this seems fine to me.

Comment thread src/graphnet/datasets/nubench_datasets.py
@sevmag sevmag requested a review from Aske-Rosted May 8, 2026 02:46
@sevmag sevmag merged commit b12c211 into graphnet-team:main May 8, 2026
12 checks passed

Development

Successfully merging this pull request may close these issues.

Add NuBench benchmark training datasets
