Ensure all variant data CSV namespaces work independently #710

@bencap

Problem

The variant data CSV endpoint (/score-sets/{urn}/variants/data) supports several namespaces (scores, counts, vep, gnomad, clingen) that can be requested individually or in combination. However, namespaces that depend on MappedVariant data (clingen, vep) only return values when include_post_mapped_hgvs=true or gnomad is also requested. Otherwise, all their values resolve to NA.

The root cause is in get_score_set_variants_as_csv() in src/mavedb/lib/score_sets.py. The function has four hardcoded query branches that decide whether to join MappedVariant and/or GnomADVariant:

  1. gnomad in namespaces and include_post_mapped_hgvs → joins both
  2. include_post_mapped_hgvs only → joins MappedVariant
  3. gnomad in namespaces only → joins GnomADVariant (via MappedVariant)
  4. else → selects only Variant, no joins

The clingen and vep namespaces both read from MappedVariant (clingen_allele_id and vep_functional_consequence respectively), but neither is checked when deciding whether to join MappedVariant. When they're the only namespace requested, the query falls into branch 4, mappings stays None, and all values come back as NA.

The existing test (test_download_clingen_file_in_variant_data_path) masks this by always including include_post_mapped_hgvs=true.
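The four branches described above can be mirrored in a few lines of plain Python to make the gap visible. This is only a sketch of the decision logic as the issue describes it, not the actual code in src/mavedb/lib/score_sets.py; the function name current_joins and the string labels are hypothetical.

```python
def current_joins(namespaces, include_post_mapped_hgvs):
    """Return the models the current branching would join (illustrative only)."""
    if "gnomad" in namespaces and include_post_mapped_hgvs:
        # Branch 1: joins both
        return {"MappedVariant", "GnomADVariant"}
    elif include_post_mapped_hgvs:
        # Branch 2: joins MappedVariant
        return {"MappedVariant"}
    elif "gnomad" in namespaces:
        # Branch 3: GnomADVariant is reached via MappedVariant
        return {"MappedVariant", "GnomADVariant"}
    else:
        # Branch 4: only Variant is selected, no joins
        return set()


# clingen and vep are never consulted, so requesting them alone hits branch 4.
print(current_joins(["clingen"], include_post_mapped_hgvs=False))
print(current_joins(["clingen"], include_post_mapped_hgvs=True))
```

With only clingen or vep requested, no branch condition mentions them, so nothing is joined and the MappedVariant-backed columns have no data to read from.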

Expected behavior

Every namespace should work independently. These should all return populated data:

GET /score-sets/{urn}/variants/data?namespaces=clingen
GET /score-sets/{urn}/variants/data?namespaces=vep
GET /score-sets/{urn}/variants/data?namespaces=clingen&namespaces=vep
GET /score-sets/{urn}/variants/data?namespaces=scores&namespaces=clingen
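For scripting these requests, the repeated namespaces parameter can be built with the standard library. This is a small sketch; variant_data_url is a hypothetical helper and "URN" is a placeholder rather than a real score set URN.

```python
from urllib.parse import urlencode


def variant_data_url(urn, namespaces):
    """Build the endpoint path with one namespaces=... pair per namespace."""
    query = urlencode([("namespaces", ns) for ns in namespaces])
    return f"/score-sets/{urn}/variants/data?{query}"


print(variant_data_url("URN", ["clingen", "vep"]))
# /score-sets/URN/variants/data?namespaces=clingen&namespaces=vep
```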

Proposed fix

Replace the four hardcoded query branches with a single composable query that determines which joins are needed based on the full set of requested namespaces:

  • Needs MappedVariant: clingen in namespaces, vep in namespaces, or include_post_mapped_hgvs is True
  • Needs GnomADVariant: gnomad in namespaces

This reduces the branching from four cases to a single query that conditionally adds joins, making it straightforward to add future namespaces (e.g. ClinVar) without further combinatorial explosion.
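A minimal sketch of the composable approach, using the standard-library sqlite3 module so it runs on its own. The table and column names (variants, mapped_variants, gnomad_variants, allele_frequency) and the build_query helper are assumptions for illustration and do not reflect the real MaveDB schema or its SQLAlchemy models. The sketch also assumes that a gnomad request implies the MappedVariant join, since the issue notes GnomADVariant is reached via MappedVariant.

```python
import sqlite3

# Namespaces whose values are read from MappedVariant (per the issue).
MAPPED_NAMESPACES = {"clingen", "vep"}


def build_query(namespaces, include_post_mapped_hgvs=False):
    """Build one SELECT and add joins only when a requested namespace needs them."""
    requested = set(namespaces)
    needs_gnomad = "gnomad" in requested
    needs_mapping = (
        needs_gnomad
        or include_post_mapped_hgvs
        or bool(requested & MAPPED_NAMESPACES)
    )

    columns = ["v.id"]
    joins = []
    if needs_mapping:
        columns.append("m.clingen_allele_id")
        joins.append("LEFT JOIN mapped_variants m ON m.variant_id = v.id")
    if needs_gnomad:
        columns.append("g.allele_frequency")
        joins.append("LEFT JOIN gnomad_variants g ON g.mapped_variant_id = m.id")

    return f"SELECT {', '.join(columns)} FROM variants v {' '.join(joins)} ORDER BY v.id"


# Hypothetical toy schema: variant 1 is mapped, variant 2 is not.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE variants (id INTEGER PRIMARY KEY);
    CREATE TABLE mapped_variants (
        id INTEGER PRIMARY KEY, variant_id INTEGER, clingen_allele_id TEXT);
    CREATE TABLE gnomad_variants (
        id INTEGER PRIMARY KEY, mapped_variant_id INTEGER, allele_frequency REAL);
    INSERT INTO variants VALUES (1), (2);
    INSERT INTO mapped_variants VALUES (10, 1, 'CA123');
    INSERT INTO gnomad_variants VALUES (100, 10, 0.01);
    """
)

print(conn.execute(build_query(["clingen"])).fetchall())
# [(1, 'CA123'), (2, None)]
```

Adding a future namespace in this style means adding one more boolean and one more conditional join, rather than doubling the number of branches.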

Changes needed

  1. Refactor query logic in get_score_set_variants_as_csv() — Compute needs_mapping and needs_gnomad booleans from the inputs, build one query with conditional joins, and extract results into variants, mappings, and gnomad_data lists uniformly.

  2. Add tests for independent namespace requests — Test ?namespaces=clingen and ?namespaces=vep without include_post_mapped_hgvs=true or gnomad, asserting populated (non-NA) values.

  3. Update existing test: test_download_clingen_file_in_variant_data_path should drop the include_post_mapped_hgvs=true flag to verify standalone behavior.
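One way the new tests could assert populated (non-NA) values is a small helper that scans the returned CSV for namespace columns containing only NA. The sketch below is self-contained and does not use the project's test client or fixtures; the helper name na_only_columns, the "clingen." column prefix, and the sample CSV text are all hypothetical.

```python
import csv
import io


def na_only_columns(csv_text, prefix):
    """Return columns starting with `prefix` whose every value is the string 'NA'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = [c for c in (reader.fieldnames or []) if c.startswith(prefix)]
    return [c for c in columns if all(row[c] == "NA" for row in rows)]


# Hypothetical responses: before the fix every clingen value is NA.
before_fix = "accession,clingen.clingen_allele_id\nv1,NA\nv2,NA\n"
after_fix = "accession,clingen.clingen_allele_id\nv1,CA123\nv2,NA\n"

print(na_only_columns(before_fix, "clingen."))  # ['clingen.clingen_allele_id']
print(na_only_columns(after_fix, "clingen."))   # []
```

A test for ?namespaces=clingen could then assert that this helper returns an empty list for the response body, which fails under the current branching and passes once MappedVariant is joined.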

Relevant files

  • src/mavedb/lib/score_sets.py: get_score_set_variants_as_csv() query logic and variant_to_csv_row()
  • src/mavedb/routers/score_sets.py: /variants/data endpoint
  • src/mavedb/models/mapped_variant.py: MappedVariant model
  • tests/routers/test_score_set.py: existing ClinGen/gnomAD CSV tests

Labels: app: backend, type: bug
