Normalize taxon_by handling to match area_by/period_by in cohort grouping by shauryam2807 · Pull Request #997 · malariagen/malariagen-data-python

shauryam2807 · 2026-03-01T10:30:06Z

Fixes #808

Problem

_prep_samples_for_cohort_grouping() normalizes area_by → "area" and period_by → "period" by copying user-specified columns into standard internal columns. However, taxon_by is treated inconsistently — it keeps the original column name instead of being normalized to "taxon".

This forces _build_cohorts_from_sample_grouping() to accept taxon_by as a parameter and maintain two separate label-generation code paths (one for the default "taxon" column, one for custom columns).

Re: discussion in #694 (comment)

Solution (Option A, per @jonbrenas)

Normalize taxon_by to a standard "taxon" column in _prep_samples_for_cohort_grouping(), consistent with area_by and period_by. This:

Removes the need for taxon_by in _build_cohorts_from_sample_grouping()
Simplifies label generation to a single code path
Makes all three *_by parameters consistent in behavior

Changes

malariagen_data/anoph/frq_base.py — Add taxon column normalization in _prep_samples_for_cohort_grouping(); remove taxon_by param from _build_cohorts_from_sample_grouping(); simplify label logic
malariagen_data/anoph/snp_frq.py — Update groupby and remove taxon_by from _build_cohorts_from_sample_grouping() calls
malariagen_data/anoph/cnv_frq.py — Same
malariagen_data/anoph/hap_frq.py — Same
tests/anoph/test_frq_base.py — Add tests for taxon normalization

Testing

Default taxon_by="taxon" works unchanged (backward compat) ✅
Custom taxon_by creates standard "taxon" column ✅
Original custom column is preserved ✅
Label generation works correctly ✅
ruff check passes ✅

shauryam2807 · 2026-04-10T06:59:57Z

Hi @jonbrenas,
When you get a chance, could you take a look at this PR?
Let me know if anything needs improvement and I'll update it right away.

jonbrenas · 2026-04-10T07:04:35Z

    # Copy the specified area_by column to a new "area" column.
    df_samples["area"] = df_samples[area_by]

+    # Copy the specified taxon_by column to a new "taxon" column,


The issue is that "taxon" is already a column. It is why it is treated differently.
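The concern can be reproduced with a small pandas sketch. The DataFrame and the `custom_taxon` column name below are hypothetical and not taken from the repository; only the assignment into "taxon" mirrors the normalization proposed in the diff.

```python
import pandas as pd

# Hypothetical sample metadata: "taxon" already holds real data, and the
# user groups by a different column ("custom_taxon" is an illustrative name).
df_samples = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "taxon": ["gambiae", "coluzzii", "arabiensis"],
        "custom_taxon": ["group_a", "group_a", "group_b"],
    }
)
original_taxon = df_samples["taxon"].copy()

taxon_by = "custom_taxon"
# Normalizing into "taxon" replaces the existing column in place.
df_samples["taxon"] = df_samples[taxon_by]

# True when the original taxon values are no longer in the DataFrame.
overwritten = not df_samples["taxon"].equals(original_taxon)
print(overwritten)
```

With the default `taxon_by="taxon"` the assignment is a no-op, so the data loss only shows up when a custom column is passed.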

shauryam2807 · 2026-04-10T07:17:04Z

Ah @jonbrenas, I see! Because df_samples already has a taxon column with real data, overwriting it destroys that data. To fix #808, should we normalize it into a new column name like "cohort_taxon", or should we drop the normalization approach altogether, since taxon can't be treated exactly like area and period?

jonbrenas · 2026-04-10T09:38:14Z

Thanks @shauryam2807, I think using a new column name is the best solution.

shauryam2807 · 2026-04-10T10:55:32Z

Hi @jonbrenas,
Thanks for the suggestion! I've updated the implementation to use new column names (cohort_taxon, cohort_area, cohort_period) inside _prep_samples_for_cohort_grouping, ensuring the original user metadata columns are never overwritten.

Changes in this update:

_prep_samples_for_cohort_grouping: Now writes to cohort_taxon, cohort_area, and cohort_period instead of overwriting taxon, area, and period.
_build_cohorts_from_sample_grouping: Renames these back to taxon, area, and period after aggregation, keeping the downstream API (datasets, plots, etc.) fully unchanged.
Downstream Updates: Updated groupby() calls in snp_frq.py, hap_frq.py, and cnv_frq.py to use the new internal column names.
Tests: Updated test_frq_base.py to match the new column names and behavior.
Linting & Formatting:
- Fixed duplicate constant definitions in af1.py that were causing the pre-commit linting step to fail.
- Applied ruff format across all modified files to ensure CI passes.
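The flow described in this update can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: the function names `prep_samples` and `build_cohorts`, the sample data, and the use of `size()` as the aggregation are assumptions. Only the `cohort_*` column names and the rename-back step come from the description above.

```python
import pandas as pd


def prep_samples(df_samples, area_by, period_by, taxon_by):
    """Copy the grouping columns into internal cohort_* columns (hypothetical helper)."""
    df = df_samples.copy()
    df["cohort_area"] = df[area_by]
    df["cohort_period"] = df[period_by]
    df["cohort_taxon"] = df[taxon_by]
    return df


def build_cohorts(df_prepped):
    """Aggregate by the internal columns, then restore the public names (hypothetical helper)."""
    internal = ["cohort_taxon", "cohort_area", "cohort_period"]
    df_cohorts = df_prepped.groupby(internal).size().reset_index(name="size")
    return df_cohorts.rename(
        columns={
            "cohort_taxon": "taxon",
            "cohort_area": "area",
            "cohort_period": "period",
        }
    )


# Illustrative data; the column names and values are made up for this sketch.
df_samples = pd.DataFrame(
    {
        "taxon": ["gambiae", "gambiae", "coluzzii"],
        "custom_taxon": ["group_a", "group_a", "group_b"],
        "admin1_iso": ["ML-2", "ML-2", "BF-09"],
        "year": [2014, 2014, 2012],
    }
)

df_prepped = prep_samples(
    df_samples, area_by="admin1_iso", period_by="year", taxon_by="custom_taxon"
)
df_cohorts = build_cohorts(df_prepped)
```

In this sketch the original `taxon` column in `df_prepped` is left untouched, while `df_cohorts` still exposes `taxon`, `area`, and `period` to downstream code.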

On a separate note — as a GSoC 2026 applicant who has applied for the MalariaGEN project, I just wanted to reiterate how much I've been enjoying contributing to the codebase! While waiting for the final results, I plan to continue addressing issues and helping out wherever I can.

Please let me know if this looks good to merge or if there's anything else you'd like me to address here!

shauryam2807 · 2026-04-16T05:04:55Z

Hi @jonbrenas, if there are any changes you would like to suggest, I would be very grateful.
Thank you.

jonbrenas · 2026-05-28T12:33:05Z

-        )
-        period_str = df_cohorts["period"].astype(str)
-        df_cohorts["label"] = area_str + "_" + taxon_clean + "_" + period_str
+    # Create a label using the normalized "taxon" column.


Why was the non-default case dropped?

jonbrenas · 2026-05-28T12:34:50Z

-                ds_out["cohort_taxon"] = "cohorts", df_cohorts[coh_col]
-            else:
-                ds_out[f"cohort_{coh_col}"] = "cohorts", df_cohorts[coh_col]
+            ds_out[f"cohort_{coh_col}"] = "cohorts", df_cohorts[coh_col]


I think the comment "# Other functions expect cohort_taxon, e.g. plot_frequencies_interactive_map()" is still true

jonbrenas · 2026-05-28T12:36:15Z

@@ -21,8 +21,6 @@
    "funestus": TAXON_PALETTE[0],
 }



The global variable definitions are still needed

jonbrenas · 2026-05-28T12:38:22Z

    """Create a test DataFrame with intermediate and unassigned taxon values."""
    return pd.DataFrame(
        {
+            "sample_id": ["S1", "S2", "S3", "S4"],


Why is "sample_id" needed?

.

fb564e1

shauryam2807 mentioned this pull request Mar 1, 2026

Reassess use of "private" columns in prep_samples_for_cohort_grouping and build_cohorts_from_sample_grouping #808

Open

shauryam2807 added 6 commits March 5, 2026 19:57

Merge branch 'master' into GH808-normalize-taxon-by

5595f16

Fix malformed AST in test_frq_base.py from merge conflict

9cbb143

Auto-format _build_cohorts_from_sample_grouping signature after taxon_by removal

4c7274b

Merge branch 'master' into GH808-normalize-taxon-by

93fbf70

Merge branch 'master' into GH808-normalize-taxon-by

58a73cf

Update frq_base.py

4020070

jonbrenas requested changes Apr 10, 2026

shauryam2807 added 3 commits April 10, 2026 15:44

Use new column names (cohort_taxon, cohort_area, cohort_period) to avoid overwriting user metadata

9cc67ac

Fix pre-commit lint violation: remove duplicate constant assignment in af1.py

d23b1ac

Fix ruff formatting: line length and trailing comma

4f4f35f

shauryam2807 requested a review from jonbrenas April 10, 2026 18:28

Merge branch 'master' into GH808-normalize-taxon-by

a3f46d7

jonbrenas requested changes May 28, 2026

shauryam2807 wants to merge 11 commits into malariagen:master from shauryam2807:GH808-normalize-taxon-by
