model + task + dataset implementation, tested #1112
Open
sanyamahesh2 wants to merge 2 commits into sunlabuiuc:master
Contributor: Sanya Mahesh (sanyam2@illinois.edu)
Contribution Type: Dataset + Task + Model (Full Pipeline)
Original Paper: Gélard et al., "BulkRNABert: Cancer prognosis from bulk RNA-seq based language models", bioRxiv 2024. https://doi.org/10.1101/2024.06.18.599483
Description
This PR implements a full PyHealth pipeline including dataset loading and preprocessing, two downstream tasks, and the BulkRNABert model.
The examples script investigates three design choices not ablated in the original paper: binning resolution (B ∈ {32, 64, 128}), frozen backbone vs IA3 vs full fine-tuning, and Cox loss behavior on censored cohorts.
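Two of these design choices can be illustrated with a small, dependency-free sketch. It is not the code in this PR: the function names `bin_expression` and `cox_partial_nll` are hypothetical, the binning scheme (log10(1 + x), max-normalization, uniform bins) is an assumption about how expression values are tokenized, and the Cox loss is a plain partial-likelihood formulation with naive tie handling.

```python
import math


def bin_expression(tpm_values, num_bins):
    """Map raw expression values to integer bin ids in [0, num_bins - 1].

    Assumed scheme (not confirmed by the PR): log10(1 + x), divide by the
    maximum, then split [0, 1] into num_bins uniform bins.
    """
    logged = [math.log10(1.0 + v) for v in tpm_values]
    max_val = max(logged)
    if max_val == 0.0:
        # All-zero profile: every gene falls into the lowest bin.
        return [0] * len(logged)
    # The maximum value would land on num_bins, so clamp it into the top bin.
    return [min(int(v / max_val * num_bins), num_bins - 1) for v in logged]


def cox_partial_nll(risk_scores, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events.

    events[i] is 1 if the event was observed for sample i and 0 if the
    sample is censored. Censored samples only appear in risk sets.
    """
    n_events = sum(events)
    if n_events == 0:
        # Fully censored cohort: no event terms exist, so return 0.0
        # as a guard (an assumption made for this sketch).
        return 0.0
    total = 0.0
    for r_i, t_i, e_i in zip(risk_scores, times, events):
        if not e_i:
            continue
        # Risk set: everyone still under observation at time t_i.
        at_risk = [r_j for r_j, t_j in zip(risk_scores, times) if t_j >= t_i]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(at_risk)
        log_sum = m + math.log(sum(math.exp(r - m) for r in at_risk))
        total += r_i - log_sum
    return -total / n_events


# Bin the same profile at the three resolutions compared in the examples script.
profile = [0.0, 9.0, 99.0, 999.0]
tokens_by_resolution = {b: bin_expression(profile, b) for b in (32, 64, 128)}
```

In this formulation a batch with no observed events contributes nothing to the loss, which is one reason censoring-heavy cohorts deserve a separate look.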
Files to Review
Dataset
Tasks
Model
Unit tests
To run tests:
Enter the Python 3.12 venv:
source venv312/bin/activate
Install dependencies:
pip install -e .
Run tests:
pytest tests/test_bulk_rna_bert.py tests/test_tcga_rnaseq.py -v
Full pipeline/example
Notes
All tests use synthetic data only; no real TCGA download is required
The examples script runs entirely on synthetic data. Swap in real TCGA data by replacing the make_synthetic_data call with your downloaded rna_seq.csv and clinical.csv
conftest.py: if torch.uint16 is missing, it sets torch.uint16 = torch.int16 before any code imports litdata, as a workaround for older PyTorch stacks that lack that dtype
Data available at https://portal.gdc.cancer.gov/
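The conftest.py note above describes a compatibility shim. A minimal sketch of that idea is below. The helper name `alias_missing_dtype` is hypothetical and not taken from this PR, and it takes the module as an argument so it can be exercised with a stand-in object when PyTorch is not installed.

```python
import types


def alias_missing_dtype(torch_module, name="uint16", fallback="int16"):
    """Alias torch_module.<name> to torch_module.<fallback> if <name> is absent.

    Returns True if the alias was installed, False if nothing was changed.
    """
    if hasattr(torch_module, name):
        return False
    setattr(torch_module, name, getattr(torch_module, fallback))
    return True


# In a real conftest.py this would run at import time, before anything
# imports litdata (assumed usage, shown as comments only):
#   import torch
#   alias_missing_dtype(torch)

# Stand-in for an older torch module that has int16 but no uint16.
old_torch = types.SimpleNamespace(int16="int16-dtype")
patched = alias_missing_dtype(old_torch)
```

Because int16 and uint16 differ in signedness, an alias like this is only reasonable as an import-time workaround for tests that never store values in that dtype.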