Adds optional code_mapping parameter to SequenceProcessor that maps granular medical codes to grouped vocabularies (e.g. ICD9CM→CCSCM) before building the embedding table. Resolves the functional gap from the 1.x→2.0 rewrite where code_mapping was removed. Ref sunlabuiuc#535 Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
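A minimal sketch of the idea, assuming the mapping is applied to each visit's codes before the vocabulary is built. The helper names (`map_codes`, `build_vocab`) and the toy mapping table are hypothetical and are not the SequenceProcessor API; dropping unmapped codes is also an assumption of this sketch.

```python
from typing import Dict, Iterable, List

# Hypothetical toy table standing in for an ICD9CM -> CCSCM cross-map.
TOY_MAPPING: Dict[str, List[str]] = {
    "428.0": ["108"],
    "428.1": ["108"],
    "250.00": ["49"],
}


def map_codes(codes: Iterable[str], mapping: Dict[str, List[str]]) -> List[str]:
    """Replace each raw code with its grouped codes (unmapped codes are dropped here)."""
    mapped: List[str] = []
    for code in codes:
        mapped.extend(mapping.get(code, []))
    return mapped


def build_vocab(sequences: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign an index to every distinct token, after two reserved tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for code in seq:
            vocab.setdefault(code, len(vocab))
    return vocab


visits = [["428.0", "250.00"], ["428.1"], ["999.9"]]
raw_vocab = build_vocab(visits)
grouped_vocab = build_vocab(map_codes(v, TOY_MAPPING) for v in visits)
print(len(raw_vocab), len(grouped_vocab))
```

The grouped vocabulary is smaller because several granular codes share one group, which is what shrinks the embedding table.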
Two identical notebooks for A/B testing code_mapping impact on mortality prediction. Only difference is the schema override in Step 2. Both use seed=42 for reproducible splits. Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
…mapping event.drug returns drug names (e.g. "Aspirin") which produce zero matches in CrossMap NDC→ATC; event.ndc returns actual NDC codes enabling 3/3 feature mapping for mortality and readmission tasks. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Checks that mortality and readmission task processors build vocabulary from NDC codes (numeric strings) rather than drug names (e.g. "Aspirin"), confirming the event.drug -> event.ndc fix works correctly. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
…e docs - Fix event.drug -> event.ndc in MortalityPredictionMIMIC4 (line 282) - Update readmission task docstrings to reflect NDC extraction Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
DrugRecommendationMIMIC3 used prescriptions/drug (drug names) via Polars column select; changed to prescriptions/ndc to match MIMIC-4 variant and enable NDC->ATC code mapping. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
RNNLayer: clamp sequence lengths to min 1 so pack_padded_sequence does not crash on all-zero masks, matching TCNLayer (tcn.py:186). ConCare: guard covariance divisor with max(n-1, 1) to prevent ZeroDivisionError when attention produces single-element features. Both edge cases are triggered when code_mapping collapses vocabularies and some patients have all codes map to <unk>, producing all-zero embeddings and all-zero masks. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
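The two guards can be illustrated without PyTorch. This is a plain-Python sketch under the assumption that lengths are derived by summing each mask row; the function names are hypothetical. In the tensor code the equivalent clamp would presumably be something like `lengths.clamp(min=1)`, since `pack_padded_sequence` rejects zero lengths.

```python
from typing import List


def clamped_lengths(mask_rows: List[List[int]], min_length: int = 1) -> List[int]:
    """Sequence length per row, never below min_length (guards all-zero masks)."""
    return [max(sum(row), min_length) for row in mask_rows]


def sample_variance(values: List[float]) -> float:
    """Variance with the divisor guarded as max(n - 1, 1), so n == 1 does not divide by zero."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / max(n - 1, 1)


masks = [[1, 1, 0], [0, 0, 0]]   # second patient: every code fell out of the vocabulary
print(clamped_lengths(masks))
print(sample_variance([3.0]))
```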
…mapping Baseline notebook runs GRASP with raw ICD-9/NDC codes. Code_mapping notebook collapses vocab via ICD9CM→CCSCM, ICD9PROC→CCSPROC, NDC→ATC for trainable embeddings on full MIMIC-III. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
- ConCare FinalAttentionQKV: bare .squeeze() removed batch dim when batch_size=1, causing IndexError in softmax. Use .squeeze(-1) and .squeeze(1) to target only the intended dimensions.
- ConCare cov(): division by zero when x.size(1)==1. Guard with max().
- GRASP grasp_encoder: remove stale torch.squeeze(hidden_t, 0) that collapsed [1, hidden] to [hidden] with batch_size=1. Both RNNLayer and ConCareLayer already return [batch, hidden].
- GRASP random_init: clamp num_centers to num_points to prevent ValueError when cluster_num > batch_size.
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Allow tasks to accept a code_mapping dict that upgrades input_schema entries so SequenceProcessor maps raw codes (e.g. ICD9CM) to grouped vocabularies (e.g. CCSCM) at fit/process time. This avoids manual schema manipulation after task construction. - Add code_mapping parameter to BaseTask.__init__() - Thread **kwargs + super().__init__() through all task subclasses with existing __init__ methods (4 readmission tasks, 1 multimodal mortality task) - Add 17 tests covering SequenceProcessor mapping and task-level code_mapping initialization Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Replace manual task.input_schema override with the new code_mapping parameter on MortalityPredictionMIMIC3(). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
# Conflicts: # examples/mortality_prediction/mortality_mimic3_grasp_with_code_mapping.ipynb
Mirrors the GRASP+ConCare mortality notebook pipeline exactly (same tables, split, seed, metrics) but sweeps 72 configurations of embedding_dim, hidden_dim, cluster_num, lr, and weight_decay. Results are logged to sweep_results.csv. Supports --root for pointing at local MIMIC-III, --code-mapping, --dev, and --monitor. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Smaller ConCare configs (embedding_dim=8/16) may learn slower and need more epochs before plateauing. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com> Co-authored-by: ddhangdd <43976109+ddhangdd@users.noreply.github.com>
…Option A)
Add SNOMED as a first-class code_mapping target in PyHealth's medcode
system, enabling any model to use KEEP embeddings via
code_mapping=("ICD9CM", "SNOMED") + pretrained_emb_path="keep_snomed.txt".
SNOMED vocabulary:
- Add pyhealth/medcode/codes/snomed.py (InnerMap subclass)
- Register in medcode/__init__.py
- SNOMED data generated locally from Athena OMOP download (IHTSDO
licensing restricts GCS redistribution). Users need only SNOMED +
ICD9CM + ICD10CM vocabularies from https://athena.ohdsi.org/
Medcode file generation:
- Add generate_medcode_files.py: produces SNOMED.csv, ICD9CM_to_SNOMED.csv,
and ICD10CM_to_SNOMED.csv in PyHealth's medcode cache from Athena data
Embedding export:
- Add export_embeddings.py: exports keep_snomed.txt (primary, Option A)
and keep_icd9.txt / keep_icd10.txt (fallback, Option B)
- Round-trip tested with init_embedding_with_pretrained()
Also fixes author order across all keep_emb modules.
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
… support Add mortality_mimic3_grasp_keep.py example demonstrating the full KEEP pipeline integrated with GRASP. Verified end-to-end: Athena parsing, SNOMED graph, Node2Vec, co-occurrence, GloVe, export, SNOMED code_mapping, GRASP with pretrained embeddings — all complete without errors. - Add examples/mortality_prediction/mortality_mimic3_grasp_keep.py with USE_KEEP toggle for quick comparison of KEEP vs random embeddings - Fix GRASP.__init__: wire pretrained_emb_path through to EmbeddingModel (was silently ignored — EmbeddingModel always got pretrained_emb_path=None) - Fix author order in build_omop_graph, build_cooccurrence, train_node2vec Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Update mortality_mimic3_grasp_keep.py with proven hyperparams from GRASP+GRU sweep: batch_size=256, hidden_dim=32, cluster_num=8, lr=1e-3, wd=1e-4, monitor=pr_auc (better for 12% imbalanced data). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Add run_pipeline.py with run_keep_pipeline() convenience function that wraps all KEEP stages (graph, mappings, medcode files, patient extraction, rollup, co-occurrence, Node2Vec, GloVe, export) into a single call. Simplify mortality_mimic3_grasp_keep.py example: Step 2 goes from 40 lines of pipeline glue to 3 lines. Add hardware info, compute tracking (CodeCarbon + pynvml), loss landscape visualization, and per-run artifact saving (config.json, results.json, loss_landscape.png) to Trainer's output folder. Update keep_emb/__init__.py to export all public functions. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
run_keep_pipeline() now checks for .csv first, falls back to .csv.gz. pandas read_csv handles gzip decompression transparently. This allows users to transfer compressed Athena files (~115MB vs ~924MB) to remote compute environments. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
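The lookup order described above can be sketched as follows. `resolve_table_path` is a hypothetical helper name, not the function used in run_pipeline.py; the resulting path can be handed to `pandas.read_csv`, which infers gzip compression from the `.gz` extension by default.

```python
import tempfile
from pathlib import Path


def resolve_table_path(directory, stem: str) -> Path:
    """Return <stem>.csv if it exists, otherwise <stem>.csv.gz; raise if neither is present."""
    directory = Path(directory)
    for suffix in (".csv", ".csv.gz"):
        candidate = directory / f"{stem}{suffix}"
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"Neither {stem}.csv nor {stem}.csv.gz found in {directory}")


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CONCEPT.csv.gz").write_bytes(b"")
    print(resolve_table_path(tmp, "CONCEPT").name)
```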
…code
Fix three compounding bugs in train_glove.py that made regularization effectively zero, preventing KEEP from anchoring to Node2Vec ontology.
Critical fixes:
- Lambda scaling: change .mean() to .sum() in reg loss. With .mean(), effective lambda was divided by batch_size*embedding_dim (102,400x smaller than intended). Now matches reference's sum-over-batch
- Row+col regularization: compute reg loss for both i_indices and j_indices per batch, matching reference (was row-only)
- L2 path: use torch.norm(p=2) (linear L2) not .pow(2) (squared L2)
- Default lambda: 1e-5 (matches reference LAMBD=0.00001, not paper Table 6's 1e-3 which uses different normalization convention)
Other fixes:
- Return (glove_loss, reg_loss) tuple for separate logging
- Add paper vs code deviation table to KeepGloVe docstring
- Sort polyhierarchy parents in generate_medcode_files for reproducibility
- Update run_pipeline.py lambda default to 1e-5
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
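The lambda-scaling point is plain arithmetic: a mean over a batch_size × embedding_dim block equals the sum divided by the number of elements. The sketch below uses a small toy block; the 1024 × 100 factorization of 102,400 is an assumption, since the message only states the product.

```python
from typing import List


def reduce_sum(block: List[List[float]]) -> float:
    """Sum of every element in a batch_size x embedding_dim block."""
    return sum(sum(row) for row in block)


def reduce_mean(block: List[List[float]]) -> float:
    """Per-element mean of the same block."""
    n_elements = sum(len(row) for row in block)
    return reduce_sum(block) / n_elements


# Toy squared differences: batch_size=4, embedding_dim=3, every element 0.25.
block = [[0.25] * 3 for _ in range(4)]
ratio = reduce_sum(block) / reduce_mean(block)
print(ratio)          # equals batch_size * embedding_dim for this block

# Assumed factorization of the factor quoted in the commit message.
assumed_batch_size, assumed_embedding_dim = 1024, 100
print(assumed_batch_size * assumed_embedding_dim)
```

With a fixed lambda, switching from mean to sum therefore strengthens the regularizer by exactly that element count.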
…epts
Two paper-faithfulness fixes to build_hierarchy_graph matching the KEEP
paper Appendix A.1.1.
Root node:
- Before: BFS from 187 standard SNOMED Condition roots → 65,375 nodes
(broader than paper, includes body temperature observations,
pain findings, family history, etc.)
- After: BFS from single concept 4274025 "Disease" → 68,303 nodes
(matches paper Appendix A.1.1 exactly)
Orphan rescue via CONCEPT_ANCESTOR:
- Some Condition concepts have direct "Is a" parents only in Observation
domain (e.g., "DRESS syndrome", "Drug-induced HF"). Our domain filter
strips those edges, orphaning the concept.
- The paper's CONCEPT_ANCESTOR approach naturally includes them via
transitive closure. We replicate this via an optional rescue step:
load CONCEPT_ANCESTOR, find descendants of root not yet in graph,
add edge to closest in-graph ancestor (tie-break by smaller concept_id).
- Rescued 93 orphans in real Athena data (68,303 → 68,396 nodes)
- Matches keep-mimic4's node count exactly.
Changes:
- Add root_concept_id parameter (default 4274025) to build_hierarchy_graph
- Add ancestor_csv parameter for optional orphan rescue
- Add _rescue_orphans() internal function
- Update run_pipeline.py to pass CONCEPT_ANCESTOR.csv automatically
- Add 9 new tests (5 single-root, 4 orphan rescue) using synthetic fixtures
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
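The orphan-rescue rule (closest in-graph ancestor, ties broken by the smaller concept_id) can be sketched as below. `pick_rescue_parent` is a hypothetical name, and the input is assumed to be (ancestor_concept_id, min_levels_of_separation) pairs read from CONCEPT_ANCESTOR for one orphan; the concept ids are invented.

```python
from typing import Iterable, Optional, Set, Tuple


def pick_rescue_parent(
    ancestors: Iterable[Tuple[int, int]],
    in_graph: Set[int],
) -> Optional[int]:
    """Return the nearest ancestor already in the graph, or None if there is none.

    Tuples are compared as (distance, concept_id), so equal distances
    fall back to the smaller concept_id.
    """
    candidates = [
        (distance, ancestor_id)
        for ancestor_id, distance in ancestors
        if ancestor_id in in_graph and distance > 0   # distance 0 is the concept itself
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# Invented ids: 99 is closest but not in the graph; 7 and 10 tie at distance 2.
orphan_ancestors = [(10, 2), (7, 2), (3, 5), (99, 1)]
print(pick_rescue_parent(orphan_ancestors, in_graph={3, 7, 10}))
```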
Rewrite build_hierarchy_graph to match the paper's algorithm (Appendix A.1.1). Replaces our two-pass approach (BFS + orphan rescue) with a single-pass approach that mirrors the paper.
Before:
1. Load all SNOMED Condition concepts (105K candidates)
2. Load "Is a" edges from CONCEPT_RELATIONSHIP
3. BFS from root, track depth -> 68,303 nodes
4. Look up missing nodes via CONCEPT_ANCESTOR
5. Rescue 93 orphans (add nodes + edges)
Total: 68,396 nodes, 152,194 edges
After:
1. Load CONCEPT_ANCESTOR, get descendants of root within depth
2. Intersect with SNOMED Condition standard concepts -> 68,396 nodes
3. Load "Is a" edges, filter to node set
4. Rescue 17 orphan edges (node set already complete)
Total: 68,396 nodes, 152,222 edges (28 more direct edges than before)
Empirical equivalence verified: node sets are identical between the two implementations (0 nodes differ). The paper-faithful approach finds 28 more direct "Is a" edges because it doesn't depend on BFS reachability.
Changes:
- build_hierarchy_graph: rewrite to use CONCEPT_ANCESTOR for node set
- _rescue_orphan_edges: simplified; only adds edges, nodes already in graph
- _bfs_fallback: new function for when CONCEPT_ANCESTOR is unavailable
- Keeps ancestor_csv parameter for backward compatibility
- All 60 existing tests pass unchanged
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Previously, when an ICD code mapped to multiple SNOMED concepts in Athena's "Maps to" relationships, only the first target was kept and alternatives were silently dropped. This affected ~27% of ICD-10 codes (18,851 codes on real Athena data): combination codes like "A01.04 Typhoid arthritis" that encode multiple atomic clinical concepts.
Change: Dict[str, int] -> Dict[str, List[int]] throughout the pipeline.
Files updated:
- build_omop_graph.build_icd_to_snomed: returns sorted list per code
- build_all_mappings: signature updated
- build_cooccurrence.extract_patient_codes_from_df: dense expansion (each ICD occurrence counts as an occurrence of ALL its SNOMED targets)
- export_embeddings.export_icd: averages embeddings for multi-target codes
- generate_medcode_files.generate_crossmap_csv: emits one row per (ICD, SNOMED) pair; PyHealth's CrossMap aggregates these into lists
Verified on real Athena data:
- ICD-9: 2,063 multi-target codes (12.1%)
- ICD-10: 18,851 multi-target codes (26.7%); matches paper's ~24%
Tests: 66/66 passing (6 new multi-target tests added)
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
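A small sketch of the two behaviours a Dict[str, List[int]] mapping enables: dense expansion for co-occurrence counting and embedding averaging on export. The function names, concept ids, and vectors are all hypothetical illustrations, not the pipeline's own code or data.

```python
from typing import Dict, List, Optional

# Invented SNOMED ids; "A01.04" stands in for a combination code with two targets.
ICD_TO_SNOMED: Dict[str, List[int]] = {"A01.04": [111, 222], "I10": [333]}
SNOMED_VECTORS: Dict[int, List[float]] = {
    111: [1.0, 0.0],
    222: [3.0, 2.0],
    333: [0.5, 0.5],
}


def expand_to_snomed(icd_codes: List[str], mapping: Dict[str, List[int]]) -> List[int]:
    """Dense expansion: every ICD occurrence counts once for each of its SNOMED targets."""
    expanded: List[int] = []
    for code in icd_codes:
        expanded.extend(mapping.get(code, []))
    return expanded


def average_embedding(
    icd_code: str,
    mapping: Dict[str, List[int]],
    vectors: Dict[int, List[float]],
) -> Optional[List[float]]:
    """Average the vectors of all SNOMED targets that have an embedding."""
    targets = [t for t in mapping.get(icd_code, []) if t in vectors]
    if not targets:
        return None
    dim = len(vectors[targets[0]])
    return [sum(vectors[t][k] for t in targets) / len(targets) for k in range(dim)]


print(expand_to_snomed(["A01.04", "I10"], ICD_TO_SNOMED))
print(average_embedding("A01.04", ICD_TO_SNOMED, SNOMED_VECTORS))
```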
…ful)
Implements the implicit count filter visible in G2Lab's "_ct_filter" file suffix. Drops SNOMED concepts with zero patient observations after dense rollup, ensuring Stage 1 (Node2Vec) and Stage 2 (GloVe) operate on the same vocabulary. Without this filter, ~92% of exported embeddings would be Node2Vec-only (no Stage 2 refinement) because GloVe's co-occurrence matrix only sees observed concepts.
Why this matters for KEEP:
- Every exported embedding has undergone both stages (semantic consistency)
- Paper reports ~5,686 concepts at depth 5 (we would have 68K without filter)
- Node2Vec walks stay in clinically-meaningful regions
- GloVe co-occurrence matrix shrinks from 68K*68K (18GB) to ~5K*5K (130MB)
Changes:
- Add apply_count_filter(): drops concepts with zero diagonal entries, rebuilds graph/matrix/indices
- Wire into run_pipeline.py as Step 5 (between co-occurrence and Node2Vec)
- Add 4 tests covering drop logic, reindexing, and subgraph isolation
Tests: 70/70 passing
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
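The drop-and-reindex step can be sketched on a nested-list matrix. This is a simplified stand-in with a hypothetical name and signature: it assumes the diagonal holds each concept's own observation count and it omits the graph rebuild that the real apply_count_filter() is described as doing.

```python
from typing import Dict, List, Tuple


def count_filter_sketch(
    cooc: List[List[int]],
    node_ids: List[int],
) -> Tuple[List[List[int]], List[int], Dict[int, int]]:
    """Keep only concepts whose diagonal entry is non-zero, then rebuild the index."""
    keep = [i for i in range(len(node_ids)) if cooc[i][i] > 0]
    filtered = [[cooc[i][j] for j in keep] for i in keep]
    kept_ids = [node_ids[i] for i in keep]
    code_to_idx = {concept_id: k for k, concept_id in enumerate(kept_ids)}
    return filtered, kept_ids, code_to_idx


# Invented ids; concept 20 was never observed, so its row and column are dropped.
matrix = [
    [2, 0, 1],
    [0, 0, 0],
    [1, 0, 3],
]
filtered, kept_ids, code_to_idx = count_filter_sketch(matrix, [10, 20, 30])
print(filtered, kept_ids, code_to_idx)
```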
Flip defaults to paper-faithful (L2, lambda=1e-3, AdamW, per-element
mean reduction) while keeping G2Lab code-faithful variant available
via parameters. We do not know with certainty which variant produced
the paper's published AUPRC numbers — exposing both lets users run
ablations and lets reviewers verify either claim.
Default behavior (paper-faithful):
reg_distance = "l2" (paper Equation 4: ||w - w_n2v||^2)
reg_reduction = "mean" (paper's per-element normalization)
optimizer = "adamw" (paper Algorithm 1)
lambd = 1e-3 (paper Table 6)
Code-faithful variant (one-line switch):
train_keep(..., reg_distance="cosine", reg_reduction="sum",
lambd=1e-5, optimizer="adagrad")
Changes:
- KeepGloVe: use_cosine_reg (bool) replaced with reg_distance (str)
and reg_reduction (str). forward() handles all 4 combinations.
- train_keep: adds optimizer parameter ("adamw" or "adagrad")
- run_pipeline.run_keep_pipeline: forwards all four parameters
- Default lambd changed from 1e-5 to 1e-3 (paper Table 6)
- L2 path uses squared L2 to match paper Equation 4 exactly
- Added 9 new tests covering paper/code defaults, parameter validation,
and empirical difference between L2 and cosine distance
Tests: 79/79 passing
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
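To make the reg_distance choice concrete, here is a plain-Python sketch of the two distances for a single embedding row. The "l2" branch follows the squared form quoted from paper Equation 4; defining the cosine branch as 1 minus cosine similarity is an assumption, as is the function name.

```python
import math
from typing import List


def reg_distance_sketch(w: List[float], w_n2v: List[float], kind: str = "l2") -> float:
    """Distance between a trained row and its Node2Vec anchor."""
    if kind == "l2":
        # Squared L2, i.e. ||w - w_n2v||^2.
        return sum((a - b) ** 2 for a, b in zip(w, w_n2v))
    if kind == "cosine":
        # Assumed form: 1 - cosine similarity (ignores vector magnitude).
        dot = sum(a * b for a, b in zip(w, w_n2v))
        norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w_n2v))
        return 1.0 - dot / norm
    raise ValueError(f"reg_distance must be 'l2' or 'cosine', got {kind!r}")


# Same direction, different magnitude: L2 penalizes it, cosine does not.
print(reg_distance_sketch([2.0, 0.0], [1.0, 0.0], "l2"))
print(reg_distance_sketch([2.0, 0.0], [1.0, 0.0], "cosine"))
```

The example shows why the two settings are not interchangeable: cosine distance lets embedding norms drift freely, while L2 anchors both direction and magnitude.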
…alth) Add explicit docstring note to extract_patient_codes_from_df explaining why KEEP's embedding training intentionally does NOT apply the censoring rule (2nd-occurrence date < outcome date) that appears in G2Lab's create_cohort_sentence.py. Rationale: - KEEP embeddings are population-level, task-agnostic - Paper Appendix A.4 uses complete patient history for embedding training - Downstream prediction censoring is handled by PyHealth task processors (MortalityPredictionMIMIC3/4, ReadmissionPredictionMIMIC4) — same mechanism works for both MIMIC-III and MIMIC-IV - Adding censoring to embedding training would discard signal for no benefit Also add keep-implementation-comparison.md to version control (was previously untracked) with updated censoring section reflecting this decision. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Implement the paper's intrinsic evaluation methodology (Appendix B.1)
so users can validate their KEEP embeddings against paper Table 2
targets before running expensive extrinsic evaluations.
Paper Table 2 targets (UK Biobank KEEP):
Resnik correlation: median ~0.68
Co-occurrence correlation: median ~0.62
Methodology per paper Appendix B.1:
For each code, compare cosine similarity against Resnik semantic
similarity (IC of lowest common ancestor) and co-occurrence counts.
Use K1=10 most-similar + K2=150 random concepts per source, repeat
250 times, report median correlation.
New public API (exposed via keep_emb module):
- compute_information_content(graph, patient_code_sets=None)
- resnik_similarity(graph, a, b, ic)
- resnik_correlation(embeddings, node_ids, graph, ...)
- cooccurrence_correlation(embeddings, node_ids, cooc, code_to_idx, ...)
- evaluate_embeddings(...) # convenience wrapper for both
- load_keep_embeddings(path) # load keep_snomed.txt back into numpy
- apply_count_filter (also re-exported from build_cooccurrence)
Usage on H200 after running the pipeline:
results = evaluate_embeddings(
keep_embeddings, node_ids, graph,
cooc_matrix=cooc_matrix, code_to_idx=code_to_idx,
num_runs=250,
)
# results["resnik"]["median"] -> compare to paper's 0.68
# results["cooccurrence"]["median"] -> compare to paper's 0.62
Add 14 tests with synthetic SNOMED hierarchy + embeddings that verify
IC computation, Resnik semantic similarity (siblings closer than
cousins), correlation directionality (aligned > random), and round-trip
file loading.
Tests: 93/93 passing
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
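Resnik similarity (the information content of the most informative common ancestor) can be sketched on a toy hierarchy. The helper names and signatures below are hypothetical and differ from the keep_emb functions listed above; IC is assumed here to be the negative log of the share of nodes under a concept, which is only one common intrinsic definition.

```python
import math
from typing import Dict, List, Set

# Toy "is-a" hierarchy: child -> parents.
PARENTS: Dict[str, List[str]] = {
    "a": ["root"], "b": ["root"],
    "a1": ["a"], "a2": ["a"], "b1": ["b"],
}
NODES = ["root", "a", "b", "a1", "a2", "b1"]


def ancestors_inclusive(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """The node itself plus everything reachable by following parent links."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(nodes: List[str], parents: Dict[str, List[str]]) -> Dict[str, float]:
    """IC(c) = -log(descendants of c, including c, divided by total nodes)."""
    counts = {n: 0 for n in nodes}
    for n in nodes:
        for ancestor in ancestors_inclusive(n, parents):
            counts[ancestor] += 1
    return {n: -math.log(counts[n] / len(nodes)) for n in nodes}


def resnik(a: str, b: str, parents: Dict[str, List[str]], ic: Dict[str, float]) -> float:
    """IC of the most informative ancestor shared by a and b."""
    common = ancestors_inclusive(a, parents) & ancestors_inclusive(b, parents)
    return max(ic[c] for c in common) if common else 0.0


ic = information_content(NODES, PARENTS)
print(resnik("a1", "a2", PARENTS, ic), resnik("a1", "b1", PARENTS, ic))
```

Siblings share the specific ancestor "a", while cousins share only the root, so the first value is larger; this is the ordering the synthetic tests above are described as checking.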
…ring Append a verified executive summary to keep-implementation-comparison.md showing what each of paper/G2Lab/Desmond/ours actually delivered. Every row verified by running inspect.signature() on our actual functions, not aspirational. Summary of our port: - Graph: CONCEPT_ANCESTOR-driven, root=4274025, depth<=5, orphan rescue - ICD mapping: Dict[str, List[int]] multi-target - Co-occurrence: 2-occurrence filter, dense rollup, count filter - GloVe: L2 + lambda=1e-3 + AdamW + mean reduction (paper-faithful) - All paper-vs-code deviations exposed as configurable parameters - 93 tests passing on synthetic fixtures Also clean up stale docstring in train_glove.py that still said "we default to Adagrad to match the code" — we now default to AdamW (paper Algorithm 1) with Adagrad as configurable alternative. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Update mortality_mimic3_grasp_keep.py to expose the paper vs code
variants and run intrinsic eval against paper Table 2 targets.
Configuration:
- USE_KEEP: True/False toggle for KEEP vs random embeddings
- KEEP_VARIANT: "paper" (default) or "code" — selects the 4-knob
preset (reg_distance, reg_reduction, optimizer, lambd)
- RUN_INTRINSIC_EVAL: computes Resnik correlation after pipeline
KEEP_VARIANTS dict provides two presets:
paper: L2 + lambda=1e-3 + AdamW + mean reduction
code: Cosine + lambda=1e-5 + Adagrad + sum reduction
After training, loads exported keep_snomed.txt back, rebuilds graph,
and runs resnik_correlation against paper's 0.68 target. Results
saved to run's results.json for ablation comparison.
Smoke test (synthetic MIMIC dev mode, 113 concepts):
KEEP variant: paper
Resnik correlation (median): 0.5996 (paper target: 0.68)
On real MIMIC-III + full pipeline on H200, expect to approach 0.68.
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
… variant - Add USE_LOCAL_MIMIC + LOCAL_MIMIC_ROOT to point at local MIMIC-III - Add DEV_MODE toggle (dev=True runs small pipeline for smoke tests) - Add USE_KEEP_CACHE + KEEP_CACHE_ROOT for variant-specific embedding reuse - Update "paper" variant to match paper Algorithm 1 verbatim: reg_distance="l2", reg_reduction="sum", optimizer="adamw", lambd=1e-3 (was mean+adamw, which diverged; sum+adamw now works and produces Resnik 0.8135, exceeding paper target 0.68) Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
run_pipeline.py now saves cooc_matrix.npy and cooc_index.json alongside keep_snomed.txt, enabling paper Table 2 co-occurrence correlation eval (target 0.62) without rebuilding the matrix from MIMIC. Example script computes both Resnik and co-occurrence correlations after the pipeline finishes and saves both to results.json. Gracefully skips co-occ when cached embeddings predate this change (no cooc_matrix.npy). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
…che dir - Rename cache root from "keep_output" to "keep_emb_output" for clarity (matches the keep_emb module name and distinguishes from generic output/) - After GRASP training completes, copy the three KEEP artifacts (keep_snomed.txt, cooc_matrix.npy, cooc_index.json) into the run's output/<timestamp>/ directory so each run is self-contained for sharing via Google Drive backup or teammate handoff. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
- run_pipeline.py:
- Extract detect_mimic_schema() and extract_diagnoses_for_schema()
as module-level helpers. Auto-detects MIMIC-III (icd9_code column)
vs MIMIC-IV (icd_code + icd_version columns) from the diagnosis
DataFrame, then dispatches to the correct ICD-to-SNOMED routing.
- MIMIC-IV path routes each row through icd9_map or icd10_map based
on the icd_version value, standardizes codes per version, and
preserves existing multi-target SNOMED expansion.
- Replace inline extraction block in run_keep_pipeline with a single
helper call. Shorter, testable, backward compatible.
- example script (mortality_mimic3_grasp_keep.py):
- Add MIMIC_VERSION toggle ("mimic3" or "mimic4") + LOCAL_MIMIC_ROOTS
dict. Swaps dataset class (MIMIC3Dataset <-> MIMIC4EHRDataset),
task class (MortalityPredictionMIMIC3 <-> MortalityPredictionMIMIC4),
data path, and table-name casing in one flag.
- tests/core/test_keep_pipeline_detection.py (new, 11 tests):
- detect_mimic_schema: column-based routing, edge cases (empty,
extra columns, icd_version as definitive signal).
- extract_diagnoses_for_schema: MIMIC-III path, MIMIC-IV dual-ICD
routing, cross-version same-concept merging, min_occurrences
interaction, missing icd10_map raises clear error.
- All 11 new tests pass; 104 total KEEP tests pass — no regressions.
Paper-faithful reproduction: the KEEP paper (Elhussein et al., CHIL
2025) uses MIMIC-IV + UK Biobank. Pipeline now supports MIMIC-IV
natively while preserving MIMIC-III compatibility.
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
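The column-based routing can be sketched as follows. The function names and signatures are hypothetical (the real helpers take the diagnosis DataFrame), and lower-casing column names is an assumption made here so that upper-case MIMIC-III headers also match.

```python
from typing import Dict, Iterable, List, Optional


def detect_schema_from_columns(columns: Iterable[str]) -> str:
    """Return "mimic4" or "mimic3" based on which ICD columns are present."""
    cols = {c.lower() for c in columns}
    # icd_version is treated as the definitive MIMIC-IV signal.
    if "icd_version" in cols and "icd_code" in cols:
        return "mimic4"
    if "icd9_code" in cols:
        return "mimic3"
    raise ValueError(f"Cannot detect MIMIC schema from columns: {sorted(cols)}")


def pick_icd_map(
    icd_version,
    icd9_map: Dict[str, List[int]],
    icd10_map: Optional[Dict[str, List[int]]],
) -> Dict[str, List[int]]:
    """Choose the ICD-to-SNOMED map for one MIMIC-IV row."""
    version = str(icd_version).strip()
    if version == "9":
        return icd9_map
    if version == "10":
        if icd10_map is None:
            raise ValueError("MIMIC-IV rows with icd_version=10 require an icd10_map")
        return icd10_map
    raise ValueError(f"Unexpected icd_version: {icd_version!r}")


print(detect_schema_from_columns(["SUBJECT_ID", "HADM_ID", "ICD9_CODE"]))
print(detect_schema_from_columns(["subject_id", "icd_code", "icd_version"]))
```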
…eanup - Flip DEV_MODE default from False to True so fresh users don't accidentally launch a multi-hour full pipeline run. Opt in to full runs by flipping to False. - Remove outdated reference to docs/plans/keep/keep-implementation-comparison.md from the KEEP_VARIANTS docstring — the comparison doc has evolved beyond the scope that comment implied. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
CPU default silently made GloVe training 100x+ slower on GPU hosts. Switch default to cuda — CPU-only users pass device="cpu" explicitly.
…e reg_reduction knob
Library (run_pipeline.py, train_glove.py):
- Add resolve_device() helper for "auto"|"cuda"|"mps"|"cpu" selection. Default device changes from "cpu" to "auto" (picks cuda > mps > cpu).
- Remove reg_reduction parameter across KeepGloVe, train_keep(), and run_keep_pipeline(). Always uses sum per paper Eq 4 (Σᵢ₌₁^V); the parameter was mathematically coupled to λ (mean == 1/V × sum) and had no independent use case. Mean reduction silently scaled the reg term by ~1/8000 on MIMIC-IV and caused Resnik to drop to 0.26.
New standalone example (keep_emb/examples/train_keep.py):
- Embedding-only entry point (no downstream task)
- Flat Trainer-style output: output/keep_emb_output/<timestamp>/
- Writes config.json (inputs) + results.json (outputs)
- Toggles: ENABLE_COMPUTE_TRACKING, SAVE_COOC_ARTIFACTS, DEVICE
- Sweep-friendly: MIN_OCCURRENCES, KEEP_VARIANT, DEVICE as config vars
Mortality example (renamed mimic3 -> mimic4_grasp_keep.py):
- Handles both MIMIC-III and MIMIC-IV via MIMIC_VERSION toggle
- Add DEVICE, MIN_OCCURRENCES, KEEP_CACHE_RUN_ID configs
- Cache layout matches train_keep.py (flat, timestamped)
- Cache lookup reads config.json (falls back to legacy manifest.json)
- config.json now records mimic_version, min_occurrences, dev_mode, device_config; results.json records keep_embeddings_used (resolved cache path, distinct from the config's requested path)
Tests (+25):
- test_keep_device_resolution.py: 10 tests (passthrough + auto resolution with mocks so it works on any machine)
- test_keep_train_example.py: 15 tests (import smoke, config validity, library defaults locked to paper variant, mortality example structural checks via AST to avoid running training)
- test_keep_glove.py: updated assertions after reg_reduction removal
Co-Authored-By: Colton Loew <colton.loew@gmail.com>
Co-Authored-By: ddhangdd <dfung2@wisc.edu>
Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com>
Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
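The "auto" resolution order can be sketched without importing PyTorch by passing availability in as flags. The function name and signature are hypothetical and are not the real resolve_device(); in practice the flags would presumably come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
VALID_DEVICES = {"auto", "cuda", "mps", "cpu"}


def resolve_device_sketch(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Pass explicit choices through; resolve "auto" as cuda > mps > cpu."""
    if requested not in VALID_DEVICES:
        raise ValueError(f"device must be one of {sorted(VALID_DEVICES)}, got {requested!r}")
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


print(resolve_device_sketch("auto", cuda_available=False, mps_available=True))
```

Injecting the flags keeps the selection logic testable on any machine, in the same spirit as the mocked tests mentioned above.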
…nostic - run_pipeline.py: save node2vec_embeddings.npy and keep_embeddings.npy aligned to cooc rows; add export_node2vec_text flag that writes node2vec_snomed.txt in the same format as keep_snomed.txt for use as a downstream pretrained_emb_path baseline; default output_dir moved to output/embeddings/keep/run. - examples/train_keep.py: add EXPORT_NODE2VEC_SNOMED_TXT toggle, move OUTPUT_ROOT under output/embeddings/keep/ (leaves room for sibling baseline methods), compute Node2Vec-only Resnik + cooc baselines and KEEP lift over them, compute per-concept embedding drift (||keep - n2v|| / ||n2v||), add stage2_effectiveness verdict (embeddings_frozen / moved_but_no_signal / partial / strong) to results.json. - examples/mortality_prediction: align KEEP_CACHE_ROOT with new namespaced path and update manifest->config cache-lookup comments. Answers the "are KEEP embeddings actually different from Node2Vec, or is stage 2 effectively frozen" question with measurable drift + N2V baseline comparison numbers on every run. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
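The per-concept drift quoted above, ||keep - n2v|| / ||n2v||, is simple to compute. This is a plain-Python sketch with a hypothetical function name; the handling of a zero-norm Node2Vec vector is an assumption, and no verdict thresholds are shown because the message does not state them.

```python
import math
from typing import List


def relative_drift(keep_vec: List[float], n2v_vec: List[float]) -> float:
    """Distance moved during stage 2, relative to the Node2Vec vector's norm."""
    moved = math.sqrt(sum((k - n) ** 2 for k, n in zip(keep_vec, n2v_vec)))
    base = math.sqrt(sum(n * n for n in n2v_vec))
    if base == 0.0:
        # Assumed convention for a degenerate anchor vector.
        return 0.0 if moved == 0.0 else float("inf")
    return moved / base


print(relative_drift([3.0, 4.0], [3.0, 4.0]))   # unchanged embedding
print(relative_drift([6.0, 8.0], [3.0, 4.0]))   # moved by its own norm
```

A drift near zero across all concepts would indicate that stage 2 left the Node2Vec embeddings effectively frozen.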
Renames test_forward_returns_scalar_loss to test_forward_and_backward_pass and adds .backward() + .grad assertions on emb_u/emb_v weights. Guards against accidental detach / no_grad / non-leaf parameter bugs and explicitly covers the "gradient computation" bullet of the PyHealth Model test rubric (previously only implicitly covered by test_loss_decreases). Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <dfung2@wisc.edu> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Adds a dedicated Sphinx page for the KEEP pretrained embedding pipeline and wires it into the models toctree next to GRASP (its primary consumer via pretrained_emb_path). Registers the new SNOMED vocabulary class under MedCode's Diagnosis codes section. - docs/api/models/pyhealth.models.KEEP.rst: new page with overview, quick-start, and API reference grouped as End-to-end pipeline / Stage 1 (SNOMED hierarchy + Node2Vec) / Stage 2 (co-occurrence + regularized GloVe) / Export & intrinsic eval. - docs/api/models.rst: toctree entry for KEEP placed after GRASP. - docs/api/medcode.rst: autoclass entry for pyhealth.medcode.SNOMED. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <desmondfung123@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Turns the paper title in the KEEP reference page into an external link to https://arxiv.org/abs/2510.05049 so readers can jump straight to the source. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <desmondfung123@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Overrides the toctree display label so the left sidebar matches the naming convention of neighboring entries (pyhealth.models.GRASP, pyhealth.models.MedLink, ...) while the page itself keeps its descriptive 'KEEP — Pretrained Medical-Code Embeddings' H1. Co-Authored-By: Colton Loew <colton.loew@gmail.com> Co-Authored-By: ddhangdd <desmondfung123@gmail.com> Co-Authored-By: lookman-olowo <lookmanolowo@hotmail.com> Co-Authored-By: christiana-beard <christyanamarie116@gmail.com>
Contributors: Lookman Olowo (lolowo2), Desmond Fung (dkfung2), Christiana Beard (cmbeard2)
Contribution Type: Model
Paper: Ahmed Elhussein et al., "KEEP: Integrating Medical Ontologies with Clinical Data for Robust Code Embeddings." CHIL 2025.
Description
This PR reproduces the KEEP embedding pipeline and integrates it into PyHealth task/model workflows.
1. KEEP pipeline integration
2. Task/model integration
3. Reproducibility support
Core library changes
Tests
Examples and notebooks
Suggested review order