PTB-XL, ECG-QA datasets and preprocess, resampling tasks #1051

Open
yiyunw3 wants to merge 37 commits into sunlabuiuc:master from jovianw:master

yiyunw3 commented Apr 21, 2026

Contributors: Jovian Wang (jovianw2@illinois.edu), Matthew Pham (mdpham2@illinois.edu), Yiyun Wang (yiyunw3@illinois.edu)
Contribution Type: Dataset + task
Original paper: https://arxiv.org/abs/2410.14464
Original datasets:

  1. ECG-QA: https://huggingface.co/datasets/jialucode/FSL_ECG_QA_Dataset
  2. PTB-XL: https://physionet.org/content/ptb-xl/1.0.1/

Description

This PR is a dataset + task contribution.
We added two new PyHealth datasets, PTB-XL and ECG-QA, along with a preprocessing task, a resampling task, and an ECG-QA example that shows how the datasets can be used.

Our main goal is to reproduce and extend the multimodal meta-learning framework for few-shot ECG question answering described in the paper, by exploring whether including more patient information, such as age and gender, in the ECG questions improves the overall accuracy of the output.

Neither dataset was previously available in PyHealth, so adding them lets reproductions rely on existing PyHealth features and removes much of the setup complexity.

Files to Review

Datasets

  1. ecgqa.py (ecgqa.yaml)
  2. ptbxl.py (ptbxl.yaml)

The following features apply to both datasets:

  1. Provides an option to download the dataset if no local copy is available.
  2. Initializes the BaseDataset class.
  3. Includes validation and metadata.
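
The download-if-missing behavior in item 1 can be pictured roughly as below. This is a minimal sketch, not the code in ecgqa.py or ptbxl.py: `ensure_local_copy`, `fake_download`, and the use of `ptbxl_database.csv` as the file that marks a complete local copy are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_local_copy(root: Path, marker: str, download: Callable[[Path], None]) -> bool:
    """Return True if a download was triggered, False if a local copy was found.

    Hypothetical helper: `marker` is a file whose presence is taken to mean
    the dataset is already available under `root`.
    """
    if (root / marker).exists():
        return False
    root.mkdir(parents=True, exist_ok=True)
    download(root)
    if not (root / marker).exists():
        raise FileNotFoundError(f"{marker} missing after download into {root}")
    return True


def fake_download(root: Path) -> None:
    # Stand-in for a real network fetch; it only writes a tiny marker file.
    (root / "ptbxl_database.csv").write_text("ecg_id,patient_id\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ptbxl"
    first = ensure_local_copy(root, "ptbxl_database.csv", fake_download)   # nothing local yet
    second = ensure_local_copy(root, "ptbxl_database.csv", fake_download)  # local copy found

print(first, second)
```

Passing the downloader in as a callable keeps the check testable without network access.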

Tasks

  1. ptbxl_resampling.py
  2. ecgqa_preprocess.py

The ptbxl_resampling task standardizes PTB-XL data for the FSL ECG QA model. It uses Fourier-based resampling (scipy.signal.resample) to downsample 12-lead ECG signals from 500 Hz to 250 Hz, changing the data shape from (12 x 5000) to (12 x 2500) while preserving waveform morphology. The task output is also formatted for multi-label classification, since a patient can have multiple co-occurring cardiac diagnoses.
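
For readers unfamiliar with the two steps, here is a minimal sketch of the resampling and the multi-label encoding. It is not the code in ptbxl_resampling.py: `downsample_ecg`, `multi_hot`, and the five-class label list are assumptions for illustration, and the input is random noise rather than a real recording.

```python
import numpy as np
from scipy.signal import resample


def downsample_ecg(signal: np.ndarray, src_hz: int = 500, dst_hz: int = 250) -> np.ndarray:
    """Resample a (leads, samples) array along the time axis with the FFT method."""
    n_out = int(round(signal.shape[-1] * dst_hz / src_hz))
    return resample(signal, n_out, axis=-1)


def multi_hot(labels, classes) -> np.ndarray:
    """Encode a set of diagnosis labels as a 0/1 vector over `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    vec = np.zeros(len(classes), dtype=np.float32)
    for name in labels:
        vec[index[name]] = 1.0
    return vec


# Synthetic stand-in for one 10-second, 12-lead recording sampled at 500 Hz.
rng = np.random.default_rng(0)
ecg_500hz = rng.standard_normal((12, 5000))
ecg_250hz = downsample_ecg(ecg_500hz)

# Assumed label vocabulary; the task's actual class list may differ.
classes = ["NORM", "MI", "STTC", "CD", "HYP"]
target = multi_hot(["MI", "CD"], classes)

print(ecg_250hz.shape)  # (12, 2500)
print(target.tolist())  # [0.0, 1.0, 0.0, 1.0, 0.0]
```

A multi-hot target lets one record carry several diagnoses at once, which a single class index cannot express.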

The ecgqa_preprocess task optionally joins the QA dataset with an ECG signal dataset (such as PTB-XL) on patient_id, creating a combined output for efficient few-shot training. It also generates a key for episodic sampling.
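
The join and the episodic key can be sketched as follows. This is an illustration under stated assumptions, not the code in ecgqa_preprocess.py: the function names, the `question_type` field used as the grouping key, and the toy records are all hypothetical.

```python
from collections import defaultdict


def join_qa_with_signals(qa_samples, signals_by_patient=None):
    """Attach a signal to each QA sample by patient_id; the join is optional."""
    if signals_by_patient is None:
        return [dict(qa) for qa in qa_samples]  # no signal dataset given
    joined = []
    for qa in qa_samples:
        signal = signals_by_patient.get(qa["patient_id"])
        if signal is not None:  # drop QA samples that have no matching signal
            joined.append({**qa, "signal": signal})
    return joined


def group_by_episode_key(samples, key_field="question_type"):
    """Bucket samples by a key so an episodic sampler can draw from each bucket."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_field]].append(sample)
    return dict(groups)


# Toy records; the field names and values are assumptions.
qa_samples = [
    {"patient_id": "p1", "question_type": "single-verify", "question": "Is the rhythm regular?", "answer": "yes"},
    {"patient_id": "p2", "question_type": "single-choose", "question": "Which finding is present?", "answer": "none"},
    {"patient_id": "p3", "question_type": "single-verify", "question": "Is the rhythm regular?", "answer": "no"},
]
signals_by_patient = {"p1": [[0.0] * 4] * 12, "p2": [[1.0] * 4] * 12}  # p3 has no signal

joined = join_qa_with_signals(qa_samples, signals_by_patient)
episodes = group_by_episode_key(joined)

print(len(joined))       # 2
print(sorted(episodes))  # ['single-choose', 'single-verify']
```

Grouping by key up front means a few-shot sampler can pick support and query examples per bucket without rescanning the full dataset.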

The output of the two tasks can then be fed directly into the existing training pipelines of the few-shot ECG question answering framework we are interested in.

Example

ecgqa_fsl.py

This example runs through the full preprocessing pipeline, combining the ECG signals from PTB-XL with the questions and answers from the ECG-QA dataset.

The workflow is:

  1. Load and resample the PTB-XL data
  2. Map the signal ids to the ECG-QA patient ids
  3. Load the ECG-QA dataset and combine

Unit tests

  1. test_ecgqa.py
  2. test_ptbxl.py

yiyunw3 and others added 30 commits April 6, 2026 19:09
feat: implement ECG-QA dataset download capability and testing

jovianw commented Apr 21, 2026

🎉
