Normalize taxon_by handling to match area_by/period_by in cohort grouping#997
Normalize taxon_by handling to match area_by/period_by in cohort grouping#997shauryam2807 wants to merge 11 commits into
Conversation
|
Hello |
| # Copy the specified area_by column to a new "area" column. | ||
| df_samples["area"] = df_samples[area_by] | ||
|
|
||
| # Copy the specified taxon_by column to a new "taxon" column, |
There was a problem hiding this comment.
The issue is that "taxon" is already a column. It is why it is treated differently.
|
Ah @jonbrenas , I see! Because df_samples already has a taxon column with real data, overwriting it destroys that data. To fix #808, should we normalize it into a new column name like "cohort_taxon", or should we ditch the normalization approach altogether since taxon can't be treated exactly like area and period? |
|
Thanks @shauryam2807, I think using a new column name is the best solution. |
|
Hello Changes in this update:
On a separate note — as a GSoC 2026 applicant who has applied for the MalariaGEN project, I just wanted to reiterate how much I've been enjoying contributing to the codebase! While waiting for the final results, I plan to continue addressing issues and helping out wherever I can. Please let me know if this looks good to merge or if there's anything else you'd like me to address here! |
|
Hello |
| ) | ||
| period_str = df_cohorts["period"].astype(str) | ||
| df_cohorts["label"] = area_str + "_" + taxon_clean + "_" + period_str | ||
| # Create a label using the normalized "taxon" column. |
There was a problem hiding this comment.
Why was the non-default case dropped?
| ds_out["cohort_taxon"] = "cohorts", df_cohorts[coh_col] | ||
| else: | ||
| ds_out[f"cohort_{coh_col}"] = "cohorts", df_cohorts[coh_col] | ||
| ds_out[f"cohort_{coh_col}"] = "cohorts", df_cohorts[coh_col] |
There was a problem hiding this comment.
I think the comment "# Other functions expect cohort_taxon, e.g. plot_frequencies_interactive_map()" is still true
| @@ -21,8 +21,6 @@ | |||
| "funestus": TAXON_PALETTE[0], | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
The global variable definitions are still needed
| """Create a test DataFrame with intermediate and unassigned taxon values.""" | ||
| return pd.DataFrame( | ||
| { | ||
| "sample_id": ["S1", "S2", "S3", "S4"], |
There was a problem hiding this comment.
Why is "sample_id" needed?
Fixes #808
Problem
_prep_samples_for_cohort_grouping()normalizesarea_by→"area"andperiod_by→"period"by copying user-specified columns into standard internal columns. However,taxon_byis treated inconsistently — it keeps the original column name instead of being normalized to"taxon".This forces
_build_cohorts_from_sample_grouping()to accepttaxon_byas a parameter and maintain two separate label-generation code paths (one for the default"taxon"column, one for custom columns).Re: discussion in #694 (comment): #694 (comment)
Solution (Option A, per @jonbrenas)
Normalize
taxon_byto a standard"taxon"column in_prep_samples_for_cohort_grouping(), consistent witharea_byandperiod_by. This:taxon_byin_build_cohorts_from_sample_grouping()*_byparameters consistent in behaviorChanges
malariagen_data/anoph/frq_base.py— Add taxon column normalization in_prep_samples_for_cohort_grouping(); removetaxon_byparam from_build_cohorts_from_sample_grouping(); simplify label logicmalariagen_data/anoph/snp_frq.py— Update groupby and removetaxon_byfrom_build_cohorts_from_sample_grouping()callsmalariagen_data/anoph/cnv_frq.py— Samemalariagen_data/anoph/hap_frq.py— Sametests/anoph/test_frq_base.py— Add tests for taxon normalizationTesting
taxon_by="taxon"works unchanged (backward compat) ✅taxon_bycreates standard"taxon"column ✅ruff checkpasses ✅