Dev/keep embeddings by christyanamarie · Pull Request #1065 · sunlabuiuc/PyHealth

christyanamarie · 2026-04-21T21:55:43Z

Contributors: Lookman Olowo (lolowo2), Desmond Fung (dkfung2), Christiana Beard (cmbeard2)

Contribution Type: Model

Paper: Ahmed Elhussein et al., "KEEP: Integrating Medical Ontologies with Clinical Data for Robust Code Embeddings." CHIL 2025.

Description

This PR reproduces the KEEP embedding pipeline and integrates it into PyHealth task/model workflows.

1. KEEP pipeline integration

Adds the KEEP embedding pipeline (ontology/mapping preparation, co-occurrence construction, training, and export).

2. Task/model integration

Connects KEEP artifacts to downstream task/model workflows and includes compatibility fixes touched during integration.

3. Reproducibility support

Adds or updates examples, tests, and documentation so reviewers can validate behavior end-to-end.

Core library changes

File	Why review
pyhealth/medcode/pretrained_embeddings/keep_emb/run_pipeline.py	KEEP pipeline entrypoints, stage orchestration, training loop, and embedding export
pyhealth/medcode/pretrained_embeddings/keep_emb/*	Ontology integration, ICD/SNOMED mapping behavior, and helper utilities
pyhealth/tasks/* (files changed in this PR)	Task-level compatibility and code_mapping/NDC extraction updates
pyhealth/models/grasp.py	GRASP integration with pretrained embeddings and related model-path updates
pyhealth/models/concare.py; pyhealth/models/rnn.py	Stability/edge-case fixes touched during integration

Tests

File	Why review
tests/*/keep*	Core KEEP coverage and regressions
Test files updated for tasks/models in this PR	NDC extraction, mapping flow, and edge-case behavior validation

Examples and notebooks

File	Why review
examples/keep	End-to-end KEEP reproducibility scripts/notebooks
examples/grasp	Downstream usage of pretrained embeddings in mortality workflows
examples/concaresweep*	Experiment/sweep reproducibility updates

Suggested review order

Core library changes
Tests
Examples and notebooks

Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Adds optional code_mapping parameter to SequenceProcessor that maps granular medical codes to grouped vocabularies (e.g. ICD9CM→CCSCM) before building the embedding table. Resolves the functional gap from the 1.x→2.0 rewrite where code_mapping was removed. Ref sunlabuiuc#535 Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
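To illustrate the idea, here is a minimal standalone sketch of mapping granular codes to a grouped vocabulary before the vocabulary is built. It does not use PyHealth's SequenceProcessor or CrossMap; `build_vocab`, the toy `ICD9_TO_CCS` table, and its entries are hypothetical and only show why the vocabulary shrinks.

```python
from typing import Dict, Iterable, List, Optional

# Toy mapping table (illustrative entries, not real CrossMap output).
ICD9_TO_CCS: Dict[str, List[str]] = {
    "4280": ["108"],
    "4281": ["108"],
    "25000": ["49"],
}


def build_vocab(
    sequences: Iterable[List[str]],
    code_mapping: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, int]:
    """Assign an index to every token, optionally after mapping each code."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for code in seq:
            # With a mapping, unmapped codes contribute no token in this sketch.
            targets = code_mapping.get(code, []) if code_mapping else [code]
            for tok in targets:
                vocab.setdefault(tok, len(vocab))
    return vocab


visits = [["4280", "4281"], ["25000", "V5861"]]
raw_vocab = build_vocab(visits)                   # one entry per raw code
grouped_vocab = build_vocab(visits, ICD9_TO_CCS)  # entries per grouped code
```

In this toy case the raw vocabulary has four code entries and the grouped one has two, which is the effect the embedding table benefits from.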

Two identical notebooks for A/B testing code_mapping impact on mortality prediction. Only difference is the schema override in Step 2. Both use seed=42 for reproducible splits. Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>

…mapping event.drug returns drug names (e.g. "Aspirin") which produce zero matches in CrossMap NDC→ATC; event.ndc returns actual NDC codes enabling 3/3 feature mapping for mortality and readmission tasks. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Checks that mortality and readmission task processors build vocabulary from NDC codes (numeric strings) rather than drug names (e.g. "Aspirin"), confirming the event.drug -> event.ndc fix works correctly. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

…e docs - Fix event.drug -> event.ndc in MortalityPredictionMIMIC4 (line 282) - Update readmission task docstrings to reflect NDC extraction Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

DrugRecommendationMIMIC3 used prescriptions/drug (drug names) via Polars column select; changed to prescriptions/ndc to match MIMIC-4 variant and enable NDC->ATC code mapping. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

RNNLayer: clamp sequence lengths to min 1 so pack_padded_sequence does not crash on all-zero masks, matching TCNLayer (tcn.py:186). ConCare: guard covariance divisor with max(n-1, 1) to prevent ZeroDivisionError when attention produces single-element features. Both edge cases are triggered when code_mapping collapses vocabularies and some patients have all codes map to <unk>, producing all-zero embeddings and all-zero masks. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
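The two guards described above can be sketched in plain Python. This is a stand-in for the tensor code, not the PyHealth implementation: `clamp_lengths` and `sample_variance` are hypothetical helpers, and variance is used as the one-dimensional special case of the covariance divisor.

```python
from typing import List


def clamp_lengths(mask_rows: List[List[int]]) -> List[int]:
    """Sequence length per row, clamped to at least 1 for all-zero masks."""
    return [max(sum(row), 1) for row in mask_rows]


def sample_variance(values: List[float]) -> float:
    """Unbiased variance with the divisor guarded as max(n - 1, 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / max(n - 1, 1)


lengths = clamp_lengths([[1, 1, 0], [0, 0, 0]])  # second row is all padding
single = sample_variance([3.0])                   # n == 1 would divide by zero unguarded
```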

Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

…mapping Baseline notebook runs GRASP with raw ICD-9/NDC codes. Code_mapping notebook collapses vocab via ICD9CM→CCSCM, ICD9PROC→CCSPROC, NDC→ATC for trainable embeddings on full MIMIC-III. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

- ConCare FinalAttentionQKV: bare .squeeze() removed the batch dim when batch_size=1, causing an IndexError in softmax. Use .squeeze(-1) and .squeeze(1) to target only the intended dimensions.
- ConCare cov(): division by zero when x.size(1)==1. Guard with max().
- GRASP grasp_encoder: remove the stale torch.squeeze(hidden_t, 0) that collapsed [1, hidden] to [hidden] with batch_size=1. Both RNNLayer and ConCareLayer already return [batch, hidden].
- GRASP random_init: clamp num_centers to num_points to prevent a ValueError when cluster_num > batch_size.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>


Allow tasks to accept a code_mapping dict that upgrades input_schema entries so SequenceProcessor maps raw codes (e.g. ICD9CM) to grouped vocabularies (e.g. CCSCM) at fit/process time. This avoids manual schema manipulation after task construction. - Add code_mapping parameter to BaseTask.__init__() - Thread **kwargs + super().__init__() through all task subclasses with existing __init__ methods (4 readmission tasks, 1 multimodal mortality task) - Add 17 tests covering SequenceProcessor mapping and task-level code_mapping initialization Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Replace manual task.input_schema override with the new code_mapping parameter on MortalityPredictionMIMIC3(). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

# Conflicts: # examples/mortality_prediction/mortality_mimic3_grasp_with_code_mapping.ipynb

Mirrors the GRASP+ConCare mortality notebook pipeline exactly (same tables, split, seed, metrics) but sweeps 72 configurations of embedding_dim, hidden_dim, cluster_num, lr, and weight_decay. Results are logged to sweep_results.csv. Supports --root for pointing at local MIMIC-III, --code-mapping, --dev, and --monitor. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Smaller ConCare configs (embedding_dim=8/16) may learn slower and need more epochs before plateauing. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-authored-by: ddhangdd <43976109+ddhangdd@users.noreply.github.com>

…Option A)

Add SNOMED as a first-class code_mapping target in PyHealth's medcode system, enabling any model to use KEEP embeddings via code_mapping=("ICD9CM", "SNOMED") + pretrained_emb_path="keep_snomed.txt".

SNOMED vocabulary:
- Add pyhealth/medcode/codes/snomed.py (InnerMap subclass)
- Register in medcode/__init__.py
- SNOMED data generated locally from Athena OMOP download (IHTSDO licensing restricts GCS redistribution). Users need only SNOMED + ICD9CM + ICD10CM vocabularies from https://athena.ohdsi.org/

Medcode file generation:
- Add generate_medcode_files.py: produces SNOMED.csv, ICD9CM_to_SNOMED.csv, and ICD10CM_to_SNOMED.csv in PyHealth's medcode cache from Athena data

Embedding export:
- Add export_embeddings.py: exports keep_snomed.txt (primary, Option A) and keep_icd9.txt / keep_icd10.txt (fallback, Option B)
- Round-trip tested with init_embedding_with_pretrained()

Also fixes author order across all keep_emb modules.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

… support Add mortality_mimic3_grasp_keep.py example demonstrating the full KEEP pipeline integrated with GRASP. Verified end-to-end: Athena parsing, SNOMED graph, Node2Vec, co-occurrence, GloVe, export, SNOMED code_mapping, GRASP with pretrained embeddings — all complete without errors. - Add examples/mortality_prediction/mortality_mimic3_grasp_keep.py with USE_KEEP toggle for quick comparison of KEEP vs random embeddings - Fix GRASP.__init__: wire pretrained_emb_path through to EmbeddingModel (was silently ignored — EmbeddingModel always got pretrained_emb_path=None) - Fix author order in build_omop_graph, build_cooccurrence, train_node2vec Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

Update mortality_mimic3_grasp_keep.py with proven hyperparams from GRASP+GRU sweep: batch_size=256, hidden_dim=32, cluster_num=8, lr=1e-3, wd=1e-4, monitor=pr_auc (better for 12% imbalanced data). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

Add run_pipeline.py with run_keep_pipeline() convenience function that wraps all KEEP stages (graph, mappings, medcode files, patient extraction, rollup, co-occurrence, Node2Vec, GloVe, export) into a single call. Simplify mortality_mimic3_grasp_keep.py example: Step 2 goes from 40 lines of pipeline glue to 3 lines. Add hardware info, compute tracking (CodeCarbon + pynvml), loss landscape visualization, and per-run artifact saving (config.json, results.json, loss_landscape.png) to Trainer's output folder. Update keep_emb/__init__.py to export all public functions. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

run_keep_pipeline() now checks for .csv first, falls back to .csv.gz. pandas read_csv handles gzip decompression transparently. This allows users to transfer compressed Athena files (~115MB vs ~924MB) to remote compute environments. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
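A minimal sketch of the ".csv first, then .csv.gz" lookup, assuming one file per Athena table in a single directory. `resolve_table_path` is a hypothetical helper name, not the function used in run_pipeline.py.

```python
import tempfile
from pathlib import Path


def resolve_table_path(athena_dir: Path, table: str) -> Path:
    """Return <table>.csv if present, otherwise fall back to <table>.csv.gz."""
    plain = athena_dir / f"{table}.csv"
    if plain.exists():
        return plain
    gz = athena_dir / f"{table}.csv.gz"
    if gz.exists():
        return gz
    raise FileNotFoundError(
        f"Neither {plain.name} nor {gz.name} found in {athena_dir}"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CONCEPT.csv.gz").write_bytes(b"")        # only the compressed file exists
    only_gz = resolve_table_path(root, "CONCEPT").name
    (root / "CONCEPT.csv").write_text("concept_id\n")  # now both exist
    both = resolve_table_path(root, "CONCEPT").name
```

The resolved path can be passed straight to pandas.read_csv, whose default compression="infer" picks gzip from the .gz extension.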

…code

Fix three compounding bugs in train_glove.py that made regularization effectively zero, preventing KEEP from anchoring to the Node2Vec ontology.

Critical fixes:
- Lambda scaling: change .mean() to .sum() in the reg loss. With .mean(), the effective lambda was divided by batch_size*embedding_dim (102,400x smaller than intended). Now matches the reference's sum-over-batch.
- Row+col regularization: compute the reg loss for both i_indices and j_indices per batch, matching the reference (was row-only).
- L2 path: use torch.norm(p=2) (linear L2), not .pow(2) (squared L2).
- Default lambda: 1e-5 (matches the reference LAMBD=0.00001, not the paper's Table 6 value of 1e-3, which uses a different normalization convention).

Other fixes:
- Return a (glove_loss, reg_loss) tuple for separate logging
- Add a paper vs code deviation table to the KeepGloVe docstring
- Sort polyhierarchy parents in generate_medcode_files for reproducibility
- Update the run_pipeline.py lambda default to 1e-5

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
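The lambda-scaling effect is plain arithmetic and can be checked with a toy calculation. The values batch_size=1024 and embedding_dim=100 are an assumption chosen because their product is 102,400; the commit message states only the product.

```python
# Assumed shapes: the commit message gives only the product 102,400.
batch_size, embedding_dim = 1024, 100
lambd = 1e-5

# Toy case: every element has the same squared deviation from the reference.
per_element = 0.25
n_elements = batch_size * embedding_dim

reg_sum = lambd * (per_element * n_elements)                 # sum reduction
reg_mean = lambd * (per_element * n_elements) / n_elements   # mean reduction

# How much weaker the mean-reduced penalty is for the same lambda.
ratio = reg_sum / reg_mean
```

With these shapes the mean-reduced penalty is weaker by a factor equal to the number of elements, which is why the same lambda behaves so differently under the two reductions.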

…epts

Two paper-faithfulness fixes to build_hierarchy_graph, matching the KEEP paper's Appendix A.1.1.

Root node:
- Before: BFS from 187 standard SNOMED Condition roots → 65,375 nodes (broader than the paper; includes body temperature observations, pain findings, family history, etc.)
- After: BFS from the single concept 4274025 "Disease" → 68,303 nodes (matches paper Appendix A.1.1 exactly)

Orphan rescue via CONCEPT_ANCESTOR:
- Some Condition concepts have direct "Is a" parents only in the Observation domain (e.g., "DRESS syndrome", "Drug-induced HF"). Our domain filter strips those edges, orphaning the concept.
- The paper's CONCEPT_ANCESTOR approach naturally includes them via transitive closure. We replicate this with an optional rescue step: load CONCEPT_ANCESTOR, find descendants of the root not yet in the graph, and add an edge to the closest in-graph ancestor (tie-break by smaller concept_id).
- Rescued 93 orphans in real Athena data (68,303 → 68,396 nodes)
- Matches keep-mimic4's node count exactly

Changes:
- Add root_concept_id parameter (default 4274025) to build_hierarchy_graph
- Add ancestor_csv parameter for optional orphan rescue
- Add _rescue_orphans() internal function
- Update run_pipeline.py to pass CONCEPT_ANCESTOR.csv automatically
- Add 9 new tests (5 single-root, 4 orphan rescue) using synthetic fixtures

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
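The rescue rule (closest in-graph ancestor, ties broken by the smaller concept_id) can be sketched as follows. This is an illustration, not the `_rescue_orphans()` implementation: the function name, the in-memory `ancestors` structure, and the toy concept ids are all assumptions standing in for rows of CONCEPT_ANCESTOR.

```python
from typing import Dict, List, Set, Tuple


def rescue_orphans_sketch(
    graph_nodes: Set[int],
    root_descendants: Set[int],
    ancestors: Dict[int, List[Tuple[int, int]]],
) -> List[Tuple[int, int]]:
    """Return (orphan, chosen_parent) edges for descendants missing from the graph.

    ancestors[c] lists (ancestor_id, levels_of_separation) pairs for concept c.
    """
    edges = []
    for orphan in sorted(root_descendants - graph_nodes):
        in_graph = [
            (dist, anc)
            for anc, dist in ancestors.get(orphan, [])
            if anc in graph_nodes
        ]
        if not in_graph:
            continue  # nothing in the graph to attach this concept to
        # min() over (distance, concept_id) picks the closest, then the smaller id.
        _, parent = min(in_graph)
        edges.append((orphan, parent))
    return edges


nodes = {1, 2, 3}
descendants = {1, 2, 3, 9}              # concept 9 is reachable only transitively
anc = {9: [(1, 2), (3, 1), (2, 1)]}     # concepts 2 and 3 are both one level away
new_edges = rescue_orphans_sketch(nodes, descendants, anc)
```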

Rewrite build_hierarchy_graph to match the paper's algorithm (Appendix A.1.1). Replaces our two-pass approach (BFS + orphan rescue) with a single-pass approach that mirrors the paper.

Before:
1. Load all SNOMED Condition concepts (105K candidates)
2. Load "Is a" edges from CONCEPT_RELATIONSHIP
3. BFS from root, track depth -> 68,303 nodes
4. Look up missing nodes via CONCEPT_ANCESTOR
5. Rescue 93 orphans (add nodes + edges)
Total: 68,396 nodes, 152,194 edges

After:
1. Load CONCEPT_ANCESTOR, get descendants of root within depth
2. Intersect with SNOMED Condition standard concepts -> 68,396 nodes
3. Load "Is a" edges, filter to node set
4. Rescue 17 orphan edges (node set already complete)
Total: 68,396 nodes, 152,222 edges (28 more direct edges than before)

Empirical equivalence verified: node sets are identical between the two implementations (0 nodes differ). The paper-faithful approach finds 28 more direct "Is a" edges because it doesn't depend on BFS reachability.

Changes:
- build_hierarchy_graph: rewrite to use CONCEPT_ANCESTOR for the node set
- _rescue_orphan_edges: simplified; only adds edges, since nodes are already in the graph
- _bfs_fallback: new function for when CONCEPT_ANCESTOR is unavailable
- Keeps the ancestor_csv parameter for backward compatibility
- All 60 existing tests pass unchanged

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

Previously, when an ICD code mapped to multiple SNOMED concepts in Athena's "Maps to" relationships, only the first target was kept and the alternatives were silently dropped. This affected ~27% of ICD-10 codes (18,851 codes on real Athena data): combination codes like "A01.04 Typhoid arthritis" that encode multiple atomic clinical concepts.

Change: Dict[str, int] -> Dict[str, List[int]] throughout the pipeline.

Files updated:
- build_omop_graph.build_icd_to_snomed: returns a sorted list per code
- build_all_mappings: signature updated
- build_cooccurrence.extract_patient_codes_from_df: dense expansion (each ICD occurrence counts as an occurrence of ALL its SNOMED targets)
- export_embeddings.export_icd: averages embeddings for multi-target codes
- generate_medcode_files.generate_crossmap_csv: emits one row per (ICD, SNOMED) pair; PyHealth's CrossMap aggregates these into lists

Verified on real Athena data:
- ICD-9: 2,063 multi-target codes (12.1%)
- ICD-10: 18,851 multi-target codes (26.7%), in line with the paper's ~24%

Tests: 66/66 passing (6 new multi-target tests added)

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
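A small sketch of what the Dict[str, List[int]] change implies for counting and export. The helper names `expand_codes` and `icd_embedding`, the SNOMED ids, and the vectors are hypothetical; only the dense-expansion and averaging rules come from the description above.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical mapping: one combination code with two targets, one simple code.
icd_to_snomed: Dict[str, List[int]] = {
    "A01.04": [101, 202],
    "I10": [303],
}


def expand_codes(icd_codes: List[str], mapping: Dict[str, List[int]]) -> Counter:
    """Dense expansion: each ICD occurrence counts once for every SNOMED target."""
    counts: Counter = Counter()
    for code in icd_codes:
        counts.update(mapping.get(code, []))
    return counts


def icd_embedding(
    code: str,
    mapping: Dict[str, List[int]],
    snomed_vectors: Dict[int, List[float]],
) -> List[float]:
    """Average the SNOMED vectors of a (possibly multi-target) ICD code."""
    vecs = [snomed_vectors[t] for t in mapping[code] if t in snomed_vectors]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


counts = expand_codes(["A01.04", "I10", "A01.04"], icd_to_snomed)
vectors = {101: [1.0, 0.0], 202: [0.0, 1.0], 303: [2.0, 2.0]}
avg = icd_embedding("A01.04", icd_to_snomed, vectors)
```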

…ful)

Implements the implicit count filter visible in G2Lab's "_ct_filter" file suffix. Drops SNOMED concepts with zero patient observations after dense rollup, ensuring Stage 1 (Node2Vec) and Stage 2 (GloVe) operate on the same vocabulary. Without this filter, ~92% of exported embeddings would be Node2Vec-only (no Stage 2 refinement) because GloVe's co-occurrence matrix only sees observed concepts.

Why this matters for KEEP:
- Every exported embedding has undergone both stages (semantic consistency)
- Paper reports ~5,686 concepts at depth 5 (we would have 68K without the filter)
- Node2Vec walks stay in clinically meaningful regions
- GloVe co-occurrence matrix shrinks from 68K*68K (18GB) to ~5K*5K (130MB)

Changes:
- Add apply_count_filter(): drops concepts with zero diagonal entries, rebuilds graph/matrix/indices
- Wire into run_pipeline.py as Step 5 (between co-occurrence and Node2Vec)
- Add 4 tests covering drop logic, reindexing, and subgraph isolation

Tests: 70/70 passing

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
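The drop-and-reindex step can be sketched on a nested-list matrix. This is not the apply_count_filter() implementation: the function name and signature are assumptions, the diagonal is assumed to hold each concept's own observation count, and the graph rebuild is omitted.

```python
from typing import Dict, List, Tuple


def apply_count_filter_sketch(
    cooc: List[List[int]], code_to_idx: Dict[int, int]
) -> Tuple[List[List[int]], Dict[int, int]]:
    """Drop concepts whose diagonal count is zero and reindex the survivors."""
    keep = [i for i in range(len(cooc)) if cooc[i][i] > 0]
    filtered = [[cooc[i][j] for j in keep] for i in keep]
    old_to_new = {old: new for new, old in enumerate(keep)}
    new_index = {
        code: old_to_new[idx]
        for code, idx in code_to_idx.items()
        if idx in old_to_new
    }
    return filtered, new_index


cooc = [
    [3, 0, 1],
    [0, 0, 0],  # concept never observed in any patient
    [1, 0, 2],
]
index = {1001: 0, 1002: 1, 1003: 2}  # hypothetical concept ids
small, small_index = apply_count_filter_sketch(cooc, index)
```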

Flip defaults to paper-faithful (L2, lambda=1e-3, AdamW, per-element mean reduction) while keeping the G2Lab code-faithful variant available via parameters. We do not know with certainty which variant produced the paper's published AUPRC numbers; exposing both lets users run ablations and lets reviewers verify either claim.

Default behavior (paper-faithful):
    reg_distance = "l2" (paper Equation 4: ||w - w_n2v||^2)
    reg_reduction = "mean" (paper's per-element normalization)
    optimizer = "adamw" (paper Algorithm 1)
    lambd = 1e-3 (paper Table 6)

Code-faithful variant (one-line switch):
    train_keep(..., reg_distance="cosine", reg_reduction="sum", lambd=1e-5, optimizer="adagrad")

Changes:
- KeepGloVe: use_cosine_reg (bool) replaced with reg_distance (str) and reg_reduction (str). forward() handles all 4 combinations.
- train_keep: adds optimizer parameter ("adamw" or "adagrad")
- run_pipeline.run_keep_pipeline: forwards all four parameters
- Default lambd changed from 1e-5 to 1e-3 (paper Table 6)
- L2 path uses squared L2 to match paper Equation 4 exactly
- Added 9 new tests covering paper/code defaults, parameter validation, and the empirical difference between L2 and cosine distance

Tests: 79/79 passing

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

…alth) Add explicit docstring note to extract_patient_codes_from_df explaining why KEEP's embedding training intentionally does NOT apply the censoring rule (2nd-occurrence date < outcome date) that appears in G2Lab's create_cohort_sentence.py. Rationale: - KEEP embeddings are population-level, task-agnostic - Paper Appendix A.4 uses complete patient history for embedding training - Downstream prediction censoring is handled by PyHealth task processors (MortalityPredictionMIMIC3/4, ReadmissionPredictionMIMIC4) — same mechanism works for both MIMIC-III and MIMIC-IV - Adding censoring to embedding training would discard signal for no benefit Also add keep-implementation-comparison.md to version control (was previously untracked) with updated censoring section reflecting this decision. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

Implement the paper's intrinsic evaluation methodology (Appendix B.1) so users can validate their KEEP embeddings against paper Table 2 targets before running expensive extrinsic evaluations.

Paper Table 2 targets (UK Biobank KEEP):
    Resnik correlation: median ~0.68
    Co-occurrence correlation: median ~0.62

Methodology per paper Appendix B.1: for each code, compare cosine similarity against Resnik semantic similarity (IC of lowest common ancestor) and co-occurrence counts. Use K1=10 most-similar + K2=150 random concepts per source, repeat 250 times, report median correlation.

New public API (exposed via keep_emb module):
- compute_information_content(graph, patient_code_sets=None)
- resnik_similarity(graph, a, b, ic)
- resnik_correlation(embeddings, node_ids, graph, ...)
- cooccurrence_correlation(embeddings, node_ids, cooc, code_to_idx, ...)
- evaluate_embeddings(...)  # convenience wrapper for both
- load_keep_embeddings(path)  # load keep_snomed.txt back into numpy
- apply_count_filter (also re-exported from build_cooccurrence)

Usage on H200 after running the pipeline:

    results = evaluate_embeddings(
        keep_embeddings, node_ids, graph,
        cooc_matrix=cooc_matrix, code_to_idx=code_to_idx,
        num_runs=250,
    )
    # results["resnik"]["median"] -> compare to paper's 0.68
    # results["cooccurrence"]["median"] -> compare to paper's 0.62

Add 14 tests with a synthetic SNOMED hierarchy + embeddings that verify IC computation, Resnik semantic similarity (siblings closer than cousins), correlation directionality (aligned > random), and round-trip file loading.

Tests: 93/93 passing

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
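For readers unfamiliar with Resnik similarity, here is a toy sketch on a six-concept hierarchy. It is not the keep_emb API: the concept names are invented, and information content is derived from descendant counts, which is an assumption (the real compute_information_content also accepts patient code sets).

```python
import math
from typing import Dict, Set

# Toy "Is a" hierarchy: child -> parents (hypothetical concept names).
PARENTS: Dict[str, Set[str]] = {
    "disease": set(),
    "heart": {"disease"},
    "lung": {"disease"},
    "mi": {"heart"},
    "hf": {"heart"},
    "copd": {"lung"},
}


def ancestors_of(node: str) -> Set[str]:
    """All ancestors of a node, including the node itself."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content() -> Dict[str, float]:
    """IC(c) = -log p(c), with p(c) the share of concepts at or below c."""
    below = {c: 0 for c in PARENTS}
    for node in PARENTS:
        for anc in ancestors_of(node):
            below[anc] += 1
    total = len(PARENTS)
    return {c: -math.log(n / total) for c, n in below.items()}


def resnik(a: str, b: str, ic: Dict[str, float]) -> float:
    """IC of the most informative common ancestor of a and b."""
    common = ancestors_of(a) & ancestors_of(b)
    return max(ic[c] for c in common)


ic = information_content()
siblings = resnik("mi", "hf", ic)    # common ancestor "heart"
cousins = resnik("mi", "copd", ic)   # only the root is shared
```

Siblings share a more specific ancestor than cousins, so their Resnik score is higher, which is the property the synthetic tests above check.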

…ring Append a verified executive summary to keep-implementation-comparison.md showing what each of paper/G2Lab/Desmond/ours actually delivered. Every row verified by running inspect.signature() on our actual functions, not aspirational. Summary of our port: - Graph: CONCEPT_ANCESTOR-driven, root=4274025, depth<=5, orphan rescue - ICD mapping: Dict[str, List[int]] multi-target - Co-occurrence: 2-occurrence filter, dense rollup, count filter - GloVe: L2 + lambda=1e-3 + AdamW + mean reduction (paper-faithful) - All paper-vs-code deviations exposed as configurable parameters - 93 tests passing on synthetic fixtures Also clean up stale docstring in train_glove.py that still said "we default to Adagrad to match the code" — we now default to AdamW (paper Algorithm 1) with Adagrad as configurable alternative. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

Update mortality_mimic3_grasp_keep.py to expose the paper vs code variants and run intrinsic eval against paper Table 2 targets. Configuration: - USE_KEEP: True/False toggle for KEEP vs random embeddings - KEEP_VARIANT: "paper" (default) or "code" — selects the 4-knob preset (reg_distance, reg_reduction, optimizer, lambd) - RUN_INTRINSIC_EVAL: computes Resnik correlation after pipeline KEEP_VARIANTS dict provides two presets: paper: L2 + lambda=1e-3 + AdamW + mean reduction code: Cosine + lambda=1e-5 + Adagrad + sum reduction After training, loads exported keep_snomed.txt back, rebuilds graph, and runs resnik_correlation against paper's 0.68 target. Results saved to run's results.json for ablation comparison. Smoke test (synthetic MIMIC dev mode, 113 concepts): KEEP variant: paper Resnik correlation (median): 0.5996 (paper target: 0.68) On real MIMIC-III + full pipeline on H200, expect to approach 0.68. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

… variant - Add USE_LOCAL_MIMIC + LOCAL_MIMIC_ROOT to point at local MIMIC-III - Add DEV_MODE toggle (dev=True runs small pipeline for smoke tests) - Add USE_KEEP_CACHE + KEEP_CACHE_ROOT for variant-specific embedding reuse - Update "paper" variant to match paper Algorithm 1 verbatim: reg_distance="l2", reg_reduction="sum", optimizer="adamw", lambd=1e-3 (was mean+adamw, which diverged; sum+adamw now works and produces Resnik 0.8135, exceeding paper target 0.68) Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

run_pipeline.py now saves cooc_matrix.npy and cooc_index.json alongside keep_snomed.txt, enabling paper Table 2 co-occurrence correlation eval (target 0.62) without rebuilding the matrix from MIMIC. Example script computes both Resnik and co-occurrence correlations after the pipeline finishes and saves both to results.json. Gracefully skips co-occ when cached embeddings predate this change (no cooc_matrix.npy). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

…che dir - Rename cache root from "keep_output" to "keep_emb_output" for clarity (matches the keep_emb module name and distinguishes from generic output/) - After GRASP training completes, copy the three KEEP artifacts (keep_snomed.txt, cooc_matrix.npy, cooc_index.json) into the run's output/<timestamp>/ directory so each run is self-contained for sharing via Google Drive backup or teammate handoff. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

- run_pipeline.py:
  - Extract detect_mimic_schema() and extract_diagnoses_for_schema() as module-level helpers. Auto-detects MIMIC-III (icd9_code column) vs MIMIC-IV (icd_code + icd_version columns) from the diagnosis DataFrame, then dispatches to the correct ICD-to-SNOMED routing.
  - MIMIC-IV path routes each row through icd9_map or icd10_map based on the icd_version value, standardizes codes per version, and preserves the existing multi-target SNOMED expansion.
  - Replace the inline extraction block in run_keep_pipeline with a single helper call. Shorter, testable, backward compatible.
- example script (mortality_mimic3_grasp_keep.py):
  - Add MIMIC_VERSION toggle ("mimic3" or "mimic4") + LOCAL_MIMIC_ROOTS dict. Swaps dataset class (MIMIC3Dataset <-> MIMIC4EHRDataset), task class (MortalityPredictionMIMIC3 <-> MortalityPredictionMIMIC4), data path, and table-name casing in one flag.
- tests/core/test_keep_pipeline_detection.py (new, 11 tests):
  - detect_mimic_schema: column-based routing, edge cases (empty, extra columns, icd_version as definitive signal).
  - extract_diagnoses_for_schema: MIMIC-III path, MIMIC-IV dual-ICD routing, cross-version same-concept merging, min_occurrences interaction, missing icd10_map raises a clear error.
  - All 11 new tests pass; 104 total KEEP tests pass, with no regressions.

Paper-faithful reproduction: the KEEP paper (Elhussein et al., CHIL 2025) uses MIMIC-IV + UK Biobank. The pipeline now supports MIMIC-IV natively while preserving MIMIC-III compatibility.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
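The column-based detection and per-row routing can be sketched without pandas. The function names, the case-insensitive column match, and the choice to raise ValueError on unrecognized input are assumptions for illustration; the real helpers operate on a DataFrame and may behave differently at the edges.

```python
from typing import Dict, Iterable, List, Optional


def detect_schema_sketch(columns: Iterable[str]) -> str:
    """Guess the MIMIC version from diagnosis-table column names."""
    cols = {c.lower() for c in columns}
    # icd_version is treated as the definitive MIMIC-IV signal.
    if "icd_version" in cols and "icd_code" in cols:
        return "mimic4"
    if "icd9_code" in cols:
        return "mimic3"
    raise ValueError(f"Unrecognized diagnosis columns: {sorted(cols)}")


def route_icd(
    code: str,
    icd_version: int,
    icd9_map: Dict[str, List[int]],
    icd10_map: Optional[Dict[str, List[int]]] = None,
) -> List[int]:
    """Send one MIMIC-IV row through the map that matches its ICD version."""
    if icd_version == 9:
        return icd9_map.get(code, [])
    if icd_version == 10:
        if icd10_map is None:
            raise ValueError("MIMIC-IV rows with icd_version=10 need an icd10_map")
        return icd10_map.get(code, [])
    raise ValueError(f"Unsupported icd_version: {icd_version}")


v3 = detect_schema_sketch(["subject_id", "hadm_id", "icd9_code"])
v4 = detect_schema_sketch(["subject_id", "icd_code", "icd_version"])
targets = route_icd("I10", 10, {"4019": [7]}, {"I10": [7]})  # toy maps
```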

…eanup - Flip DEV_MODE default from False to True so fresh users don't accidentally launch a multi-hour full pipeline run. Opt in to full runs by flipping to False. - Remove outdated reference to docs/plans/keep/keep-implementation-comparison.md from the KEEP_VARIANTS docstring — the comparison doc has evolved beyond the scope that comment implied. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

CPU default silently made GloVe training 100x+ slower on GPU hosts. Switch default to cuda — CPU-only users pass device="cpu" explicitly.

…e reg_reduction knob

Library (run_pipeline.py, train_glove.py):
- Add resolve_device() helper for "auto"|"cuda"|"mps"|"cpu" selection. Default device changes from "cpu" to "auto" (picks cuda > mps > cpu).
- Remove the reg_reduction parameter across KeepGloVe, train_keep(), and run_keep_pipeline(). Always uses sum per paper Eq 4 (Σᵢ₌₁^V); the parameter was mathematically coupled to λ (mean == 1/V × sum) and had no independent use case. Mean reduction silently scaled the reg term by ~1/8000 on MIMIC-IV and caused Resnik to drop to 0.26.

New standalone example (keep_emb/examples/train_keep.py):
- Embedding-only entry point (no downstream task)
- Flat Trainer-style output: output/keep_emb_output/<timestamp>/
- Writes config.json (inputs) + results.json (outputs)
- Toggles: ENABLE_COMPUTE_TRACKING, SAVE_COOC_ARTIFACTS, DEVICE
- Sweep-friendly: MIN_OCCURRENCES, KEEP_VARIANT, DEVICE as config vars

Mortality example (renamed mimic3 -> mimic4_grasp_keep.py):
- Handles both MIMIC-III and MIMIC-IV via the MIMIC_VERSION toggle
- Add DEVICE, MIN_OCCURRENCES, KEEP_CACHE_RUN_ID configs
- Cache layout matches train_keep.py (flat, timestamped)
- Cache lookup reads config.json (falls back to legacy manifest.json)
- config.json now records mimic_version, min_occurrences, dev_mode, device_config; results.json records keep_embeddings_used (resolved cache path, distinct from the config's requested path)

Tests (+25):
- test_keep_device_resolution.py: 10 tests (passthrough + auto resolution with mocks so it works on any machine)
- test_keep_train_example.py: 15 tests (import smoke, config validity, library defaults locked to paper variant, mortality example structural checks via AST to avoid running training)
- test_keep_glove.py: updated assertions after reg_reduction removal

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
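The "auto" selection order (cuda > mps > cpu) with passthrough for explicit choices can be sketched as below. The signature is hypothetical: availability is passed in as booleans so the example runs without PyTorch, whereas a real helper would presumably query torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def resolve_device_sketch(
    requested: str, cuda_available: bool, mps_available: bool
) -> str:
    """Resolve "auto" to the best available backend; pass explicit names through."""
    allowed = ("auto", "cuda", "mps", "cpu")
    if requested not in allowed:
        raise ValueError(f"device must be one of {allowed}, got {requested!r}")
    if requested != "auto":
        return requested  # explicit choice is passed through unchanged
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


on_gpu_host = resolve_device_sketch("auto", cuda_available=True, mps_available=False)
on_mac = resolve_device_sketch("auto", cuda_available=False, mps_available=True)
forced_cpu = resolve_device_sketch("cpu", cuda_available=True, mps_available=True)
```

Injecting the availability flags mirrors the mock-based tests mentioned above: the resolution logic can be checked on any machine.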

…nostic

- run_pipeline.py: save node2vec_embeddings.npy and keep_embeddings.npy aligned to cooc rows; add export_node2vec_text flag that writes node2vec_snomed.txt in the same format as keep_snomed.txt for use as a downstream pretrained_emb_path baseline; default output_dir moved to output/embeddings/keep/run.
- examples/train_keep.py: add EXPORT_NODE2VEC_SNOMED_TXT toggle, move OUTPUT_ROOT under output/embeddings/keep/ (leaves room for sibling baseline methods), compute Node2Vec-only Resnik + cooc baselines and KEEP lift over them, compute per-concept embedding drift (||keep - n2v|| / ||n2v||), add stage2_effectiveness verdict (embeddings_frozen / moved_but_no_signal / partial / strong) to results.json.
- examples/mortality_prediction: align KEEP_CACHE_ROOT with new namespaced path and update manifest->config cache-lookup comments.

Answers the "are KEEP embeddings actually different from Node2Vec, or is stage 2 effectively frozen" question with measurable drift + N2V baseline comparison numbers on every run.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
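The per-concept drift metric named above, ||keep - n2v|| / ||n2v||, can be sketched in a few lines. The function names and the toy vectors are illustrative assumptions. The thresholds behind the stage2_effectiveness verdicts are not stated in this commit message, so the sketch computes drift only and does not classify it.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def relative_drift(keep_vec, n2v_vec):
    """Relative movement of a KEEP vector away from its Node2Vec starting point."""
    base = l2_norm(n2v_vec)
    if base == 0.0:
        raise ValueError("Node2Vec vector has zero norm; relative drift is undefined")
    diff = [k - n for k, n in zip(keep_vec, n2v_vec)]
    return l2_norm(diff) / base


# Toy rows standing in for two concepts; row i of each list refers to the same concept.
n2v_rows = [[3.0, 4.0], [3.0, 4.0]]
keep_rows = [[3.0, 4.0], [6.0, 8.0]]  # first concept unchanged, second doubled

drifts = [relative_drift(k, n) for k, n in zip(keep_rows, n2v_rows)]
mean_drift = sum(drifts) / len(drifts)
```

A drift of 0.0 means the second stage left that concept's vector where Node2Vec put it. A drift of 1.0 means the vector moved by a distance equal to its original norm.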

Renames test_forward_returns_scalar_loss to test_forward_and_backward_pass and adds .backward() + .grad assertions on emb_u/emb_v weights. Guards against accidental detach / no_grad / non-leaf parameter bugs and explicitly covers the "gradient computation" bullet of the PyHealth Model test rubric (previously only implicitly covered by test_loss_decreases).

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
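The kind of check this commit describes might look like the following sketch, which requires PyTorch. `ToyGloVe` is a hypothetical stand-in with two embedding tables named emb_u and emb_v, as in the commit message. It is not the PR's KeepGloVe class, and its loss is a plain squared error rather than the regularized objective.

```python
import torch
import torch.nn as nn


class ToyGloVe(nn.Module):
    """Hypothetical GloVe-style model used only to illustrate the test pattern."""

    def __init__(self, vocab_size: int, dim: int) -> None:
        super().__init__()
        self.emb_u = nn.Embedding(vocab_size, dim)
        self.emb_v = nn.Embedding(vocab_size, dim)

    def forward(self, i: torch.Tensor, j: torch.Tensor, log_cooc: torch.Tensor) -> torch.Tensor:
        # Dot product of paired embeddings, fit to log co-occurrence counts.
        dot = (self.emb_u(i) * self.emb_v(j)).sum(dim=1)
        return ((dot - log_cooc) ** 2).mean()


def check_forward_and_backward() -> float:
    torch.manual_seed(0)
    model = ToyGloVe(vocab_size=5, dim=4)
    i = torch.tensor([0, 1, 2])
    j = torch.tensor([1, 2, 3])
    log_cooc = torch.log(torch.tensor([3.0, 1.0, 2.0]))

    loss = model(i, j, log_cooc)
    assert loss.dim() == 0  # forward returns a scalar loss

    loss.backward()
    for emb in (model.emb_u, model.emb_v):
        assert emb.weight.is_leaf           # catches non-leaf parameter bugs
        assert emb.weight.grad is not None  # catches detach / no_grad bugs
        assert torch.any(emb.weight.grad != 0)
    return float(loss)


loss_value = check_forward_and_backward()
```

Checking `.grad` directly fails immediately if the graph is cut, whereas a loss-decreases test only catches that indirectly.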

Adds a dedicated Sphinx page for the KEEP pretrained embedding pipeline and wires it into the models toctree next to GRASP (its primary consumer via pretrained_emb_path). Registers the new SNOMED vocabulary class under MedCode's Diagnosis codes section.

- docs/api/models/pyhealth.models.KEEP.rst: new page with overview, quick-start, and API reference grouped as End-to-end pipeline / Stage 1 (SNOMED hierarchy + Node2Vec) / Stage 2 (co-occurrence + regularized GloVe) / Export & intrinsic eval.
- docs/api/models.rst: toctree entry for KEEP placed after GRASP.
- docs/api/medcode.rst: autoclass entry for pyhealth.medcode.SNOMED.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <desmondfung123@gmail.com>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

Turns the paper title in the KEEP reference page into an external link to https://arxiv.org/abs/2510.05049 so readers can jump straight to the source.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <desmondfung123@gmail.com>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

ddhangdd · 2026-04-22T02:08:11Z

Under "Core library changes":

Row 1: pyhealth/medcode/pretrained_embeddings/keep.py → should be pyhealth/medcode/pretrained_embeddings/keep_emb/run_pipeline.py
Row 2: pyhealth/medcode/pretrained_embeddings/keep → should be pyhealth/medcode/pretrained_embeddings/keep_emb/*

Overrides the toctree display label so the left sidebar matches the naming convention of neighboring entries (pyhealth.models.GRASP, pyhealth.models.MedLink, ...) while the page itself keeps its descriptive 'KEEP — Pretrained Medical-Code Embeddings' H1.

Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <desmondfung123@gmail.com>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>

cleanup

capccode and others added 30 commits February 19, 2026 22:19

feat: migrate GRASP model from PyHealth 1.0 to 2.0 API

ff02b37

Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

feat: add GRASP mortality prediction notebook and fix cluster_num

d5221d1

Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Merge branch 'sunlabuiuc:master' into refactor/grasp-model

1ffc97a

Merge branch 'sunlabuiuc:master' into feature/code-mapping

5060a2c

Merge branch 'sunlabuiuc:master' into refactor/grasp-model

8226dbc

docs: add docstrings to SequenceProcessor class and fit method

95422ae

Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

docs: add docstrings, type hints, and fix test dims for GRASP module

f29d262

Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>

Merge branch 'fix/drug-task-ndc-extraction' into dev/grasp-full-pipeline

f568edf

Merge branch 'feature/code-mapping' into dev/grasp-full-pipeline

d55c55c

Merge branch 'refactor/grasp-model' into dev/grasp-full-pipeline

f544819

Merge branch 'fix/grasp-squeeze-batch-one' into dev/grasp-full-pipeline

c90ed34

Merge branch 'feature/code-mapping' into dev/grasp-full-pipeline

5ee12a9

Merge branch 'feat/grasp-notebooks' into dev/grasp-full-pipeline

6fe71dd

# Conflicts: # examples/mortality_prediction/mortality_mimic3_grasp_with_code_mapping.ipynb

Merge branch 'sunlabuiuc:master' into dev/grasp-full-pipeline

77ef668

Initial plan

af2fef6

capccode and others added 25 commits April 7, 2026 01:55

fix(keep): default run_keep_pipeline device to cuda

9be53d1

CPU default silently made GloVe training 100x+ slower on GPU hosts. Switch default to cuda — CPU-only users pass device="cpu" explicitly.

ddhangdd force-pushed the dev/keep-embeddings branch from 56cdccf to be47a40 Compare April 22, 2026 01:43

ddhangdd and others added 2 commits April 22, 2026 13:44

chore(docs)delete internal comparison doc

95e540e

cleanup

christyanamarie wants to merge 73 commits into sunlabuiuc:master from lookman-olowo:dev/keep-embeddings


5 participants
