
Add CCEP ECoG dataset, Localize SOZ task, and SPES models#1027

Open
mkoretsky1 wants to merge 7 commits into sunlabuiuc:master from dtavana:ccep_ecog

Conversation

@mkoretsky1

Contributors

Mathew Koretsky (NetID: mathewk4) (mathewk4@illinois.edu)
Kevin Splinter (NetID: kevints5) (kevints5@illinois.edu)
Darian Tavana (NetID: dtavana2) (dtavana2@illinois.edu)

Type of Contribution

Full Pipeline: Dataset + Task + Model (Option 4)

Original Paper

Norris, J.; Chari, A.; van Blooijs, D.; Cooray, G. K.; Friston, K.; Tisdall, M. M.; and Rosch, R. E. 2024. Localising the Seizure Onset Zone from Single-Pulse Electrical Stimulation Responses with a CNN Transformer. Proceedings of the 9th Machine Learning for Healthcare Conference, volume 252 of Proceedings of Machine Learning Research. PMLR. (https://proceedings.mlr.press/v252/norris24a.html).

Description

Adds a full PyHealth pipeline for electrode-level seizure onset zone (SOZ) localization from CCEP ECoG recordings, based on Norris et al. (2024). Includes a BIDS-compatible dataset class (CCEPECoGDataset), a task class (LocalizeSOZ) that preprocesses raw stimulation EEG into paired divergent/convergent response tensors, and two models (SPESResNet, SPESTransformer) for binary SOZ classification. An end-to-end example script demonstrates 5-fold patient-level cross-validation with per-fold normalization and an ablation study isolating the contribution of trial variability and the hybrid embedding in the transformer model.

Files to Review

  • pyhealth/datasets/ccep_ecog.py - CCEP ECoG dataset implementation
  • pyhealth/tasks/localize_soz.py - Localize SOZ task implementation
  • pyhealth/models/spes.py - SPES models implementation
  • tests/core/test_ccep_ecog.py - Dataset and task test suite
  • tests/core/test_spes.py - Model test suite
  • examples/ccep_ecog_localize_soz_spes.py - End-to-end example demonstrating dataset, task, and model usage
  • pyproject.toml - Added requirements specific to this implementation

@joshuasteier joshuasteier self-requested a review April 19, 2026 21:04
@joshuasteier
Collaborator

Hi @mkoretsky1 , Kevin, and Darian, thanks for putting this together. A few items I'd like to see addressed, plus some smaller things.

Blockers

1. The Destrieux atlas .rda file may be missing.

get_destrieux_lobe in localize_soz.py does:

rda_path = Path(__file__).resolve().parent.parent / "destrieux.rda"
_destrieux_df = pyreadr.read_r(str(rda_path))["destrieux"]

That resolves to pyhealth/destrieux.rda, but I don't see the file in the PR diff. If it's not committed, the except Exception: return "Unknown" fallback silently fires on every electrode and every sample gets electrode_lobe = region_to_index["Unknown"]. Could you confirm the .rda is either added to the PR or document how users obtain it? Either way, please change the silent except to log a warning (once per process) when the file is missing so this failure mode is visible.

2. Bare except: with state leak in process_metric_for_analysis.

try:
    ...
    response_df['distances'] = distances
    response_df = response_df.sort_values(by='distances', ascending=True)
except:
    print("Error with subject", subject)

The bare except: catches KeyboardInterrupt and SystemExit, prints, then the code keeps going with response_df and distances in whatever state they were in at the point of failure. Please change to except Exception as e: with a logger.warning and either return None or re-raise — continuing after a caught exception with partial state is dangerous.
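Concretely, the return-None variant could look like this (a sketch; `attach_distances` is a hypothetical helper wrapping the failing lines, not a function in the PR):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def attach_distances(response_df, distances, subject):
    """Attach a distances column and sort by it. On failure, log a warning
    and return None instead of continuing with partially mutated state."""
    try:
        response_df = response_df.copy()  # avoid mutating the caller's frame
        response_df["distances"] = distances
        return response_df.sort_values(by="distances", ascending=True)
    except Exception as e:
        logger.warning("Error with subject %s: %s", subject, e)
        return None
```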

3. Metadata CSV written into the data root.

output_path = os.path.join(root, "ccep_ecog-metadata-pyhealth.csv")
df.to_csv(output_path, index=False)

BIDS datasets are often on mounted storage with restricted write permissions. Matt McKenna's fix for the same pattern in #981 shipped as:

def __init__(self, root: str = ".", config_path: Optional[str] = ..., **kwargs):
    self._verify_data(root)
    self._tmp_dir = tempfile.mkdtemp(prefix="pyhealth_ccep_ecog_")
    self._index_data(root, self._tmp_dir)
    super().__init__(
        root=self._tmp_dir,
        tables=["ecog"],
        ...
    )

def __del__(self):
    shutil.rmtree(self._tmp_dir, ignore_errors=True)

Could you mirror that pattern? _index_data writes the CSV into the tmp dir, the tmp dir becomes the root passed to super().__init__ so the YAML's file_path resolves correctly, and __del__ cleans up.

4. dataset.set_shuffle(...) will crash on real data.

In get_spes_dataloader:

def get_spes_dataloader(dataset, batch_size, shuffle=False, norm_stats=None):
    dataset.set_shuffle(shuffle)
    return DataLoader(...)

set_shuffle is only defined on InMemorySampleDataset, not on the disk-backed SampleDataset that BaseDataset.set_task(...) returns. Your example path is CCEPECoGDataset(root=...).set_task(LocalizeSOZ(), ...), which returns the disk-backed class. The first time anyone runs the example end-to-end against real ds004080 data, this will raise AttributeError: 'SampleDataset' object has no attribute 'set_shuffle'.

Simplest fix: drop the set_shuffle call and pass shuffle=shuffle directly to DataLoader(...). The in-memory shuffle hook isn't buying you anything the DataLoader can't do.
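That is, something along these lines (a sketch; any extra `DataLoader` arguments from your current call are elided):

```python
from torch.utils.data import DataLoader


def get_spes_dataloader(dataset, batch_size, shuffle=False, norm_stats=None):
    # Shuffling is delegated to the DataLoader itself, which works for both
    # the in-memory and disk-backed SampleDataset variants.
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
```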

Code quality (non-blocking)

  1. process_metric_for_analysis has several commented-out np.save('../data/main/...') lines with hardcoded relative paths. Please remove.

  2. process_run_data and process_metric_for_analysis use print(...) for status and error reporting; the dataset file uses logger. Please convert the print calls to logger.warning / logger.info for consistency.

  3. LocalizeSOZ.__call__ iterates over ("ecog", "train", "eval") splits, but only ecog is a real table in the YAML. The dummy patient in the test deliberately returns empty for train/eval. Is iterating over those two extra splits intentional future-proofing, or vestigial? If vestigial, drop them.

  4. 25 commits with several "Update localize_soz.py" messages. Please squash to a small number of logical commits before merge (dataset, task, models, example + tests, docs is a reasonable split).

  5. LocalizeSOZ tests mock process_run_data and process_for_analysis entirely, so the MNE/BIDS epoching and the 13 mm distance filter are not exercised in tests. A single end-to-end test using a tiny synthetic BrainVision file (maybe 4 channels, 2 stim events) would lock in the actual preprocessing behaviour. Not a blocker, but worth considering as a follow-up.

Minor

  1. SPESResNet._sample_channels samples with replacement when len(valid_rows) < self.input_channels. A one-line comment explaining that duplicated channels are acceptable here would help future readers.

  2. In _index_data, has_soz is set from the first electrodes.tsv found and then breaks. If a patient has multiple sessions with different labeling completeness, the single has_soz flag may not reflect the specific session on each downstream row. Low priority but worth a look.
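On minor item 1, the comment plus fallback might read something like this (a sketch; the standalone helper signature is assumed, not taken from the PR):

```python
import numpy as np


def sample_channels(valid_rows, input_channels, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    # Sample with replacement only when there are fewer valid channels than
    # requested; duplicated channels are acceptable here.
    replace = len(valid_rows) < input_channels
    return rng.choice(valid_rows, size=input_channels, replace=replace)
```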

The model implementations are clean, and the per-fold normalization is done correctly. Once the blockers are addressed I'm happy to approve.

@mkoretsky1
Author

mkoretsky1 commented Apr 19, 2026

@joshuasteier We've fixed the blockers and made some code quality updates. When you have a chance could you please re-review the PR?

The one thing we're unsure about is blocker 1, the Destrieux atlas .rda file. For now, we have hard-coded the brain region indexes into pyhealth/tasks/localize_soz.py. Please let us know if this is sufficient or if you would prefer to have the file itself, and if so, where that file should go in the PyHealth repo. Thanks!
