
Decide on preferred form (and/or formalize current handling) for reference-identical HGVS variants #93

@bencap

Background

Recent work in #53 added support for positionless reference-identical variants (ACC:p.=, ACC:c.=, ACC:g.=) to the VRS mapping pipeline. These are represented as VRS Allele objects whose state is a ReferenceLengthExpression (RLE) spanning the full sequence length, since ga4gh/hgvs-tools cannot translate .= expressions.

The pipeline correctly distinguishes the two valid HGVS forms:

| Form | Example | Current routing | VRS output |
| --- | --- | --- | --- |
| Positionless | NP_001234.1:p.= | endswith(".=") → RLE path | Full-sequence RLE Allele |
| Positional | NP_001234.1:p.Ala13= | Normal translate_hgvs_to_vrs path | Position-specific LSE Allele |

These are genuinely different VRS alleles, even though both express "this variant matches the reference."
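
To make the difference concrete, here is a minimal sketch of the routing and of the two allele shapes. It is not the pipeline's actual code: `route_hgvs`, `full_sequence_rle_allele`, `positional_lse_allele`, and the `SQ.placeholder` accession are hypothetical names, and the dict layout (including setting `repeatSubunitLength` to the sequence length) is an assumed approximation of VRS 2.x.

```python
def route_hgvs(hgvs: str) -> str:
    """Return which path a variant string would take (illustrative only)."""
    if hgvs.endswith(".="):
        return "rle"        # positionless: full-sequence RLE Allele
    return "translate"      # everything else: normal translate_hgvs_to_vrs path


def _location(refget_accession: str, start: int, end: int) -> dict:
    # Interbase coordinates: start is 0-based, end is exclusive.
    return {
        "type": "SequenceLocation",
        "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": refget_accession,
        },
        "start": start,
        "end": end,
    }


def full_sequence_rle_allele(refget_accession: str, sequence_length: int) -> dict:
    """Assumed shape of the allele produced for ACC:p.= / c.= / g.=."""
    return {
        "type": "Allele",
        "location": _location(refget_accession, 0, sequence_length),
        "state": {
            "type": "ReferenceLengthExpression",
            "length": sequence_length,
            # Assumption: the whole span is treated as one subunit.
            "repeatSubunitLength": sequence_length,
        },
    }


def positional_lse_allele(refget_accession: str, position: int, residue: str) -> dict:
    """Assumed shape of the allele for e.g. p.Ala13= (position is 1-based)."""
    return {
        "type": "Allele",
        "location": _location(refget_accession, position - 1, position),
        "state": {"type": "LiteralSequenceExpression", "sequence": residue},
    }


positionless = full_sequence_rle_allele("SQ.placeholder", 300)
positional = positional_lse_allele("SQ.placeholder", 13, "A")
```

The two objects share a sequence reference but differ in both location and state type, which is why nothing downstream treats them as the same allele.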

The Problem

MaveDB scoreset submitters can use either form, and the choice is often a matter of convention rather than biological intent:

  • A scoreset with a single wildtype control row might submit p.= (no specific position).
  • A saturating mutagenesis experiment might submit one row per tested position — p.Ala1=, p.Gly2=, etc. — using the positional form.

Both are biologically valid, but they imply different things to a VRS consumer: the positionless form says "the whole sequence is reference-identical," while the positional form says "this one residue specifically is reference-identical." A consumer performing variant deduplication or lookup will not recognize these as the same variant.

Furthermore, although both forms assert the same thing (no change from the reference), the VRS digests computed for them differ, which breaks a fundamental contract of the specification.
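
A toy illustration of the digest mismatch follows. The `sha512t24u` function is the GA4GH truncated-digest scheme (SHA-512, first 24 bytes, base64url), but `toy_digest` deliberately simplifies the real VRS serialization rules, and the two dicts and the sequence length of 300 are made-up stand-ins, so the printed values are not real VRS identifiers.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def toy_digest(obj: dict) -> str:
    """Digest of a canonical JSON dump (simplified; not the VRS serializer)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return sha512t24u(canonical.encode("utf-8"))


# Hypothetical, stripped-down stand-ins for the two ref-identical alleles.
positionless_allele = {
    "location": {"start": 0, "end": 300},
    "state": {"type": "ReferenceLengthExpression", "length": 300},
}
positional_allele = {
    "location": {"start": 12, "end": 13},
    "state": {"type": "LiteralSequenceExpression", "sequence": "A"},
}

digest_positionless = toy_digest(positionless_allele)
digest_positional = toy_digest(positional_allele)
print(digest_positionless == digest_positional)  # False
```

Because the digest is computed over location and state, any consumer that keys on it will file the two forms under separate identifiers.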

We should decide whether to have a preference and normalize one form to the other, or to accept both and document the distinction clearly.

Options

Option A — Prefer positionless (p.=), normalize positional ref-identical to RLE

  • Positional p.Ala13= rows are normalized to positionless RLE with a warning logged.
  • Consumers always get a single canonical "reference" allele per transcript.
  • Downside: lossy — discards position information that the submitter explicitly provided.
    A scoreset with 200 p.AlaN= rows collapses to one repeated allele.
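
A rough sketch of what Option A's normalization could look like for protein variants. The function name `normalize_to_positionless` and the regular expression are hypothetical, and the pattern only covers the simple three-letter form (for example p.Ala13=), not every positional ref-identical expression HGVS allows.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical pattern: accession, "p.", three-letter residue, position, "=".
_POSITIONAL_PROTEIN_REF = re.compile(r"^(?P<acc>[^:]+):p\.[A-Z][a-z]{2}\d+=$")


def normalize_to_positionless(hgvs: str) -> str:
    """Rewrite ACC:p.Xxx123= to ACC:p.=; leave every other string untouched."""
    match = _POSITIONAL_PROTEIN_REF.match(hgvs)
    if match is None:
        return hgvs
    normalized = f"{match.group('acc')}:p.="
    logger.warning(
        "Normalized %s to %s; position information was discarded", hgvs, normalized
    )
    return normalized


# The lossy case described above: 200 positional rows collapse to one string.
rows = [f"NP_001234.1:p.Ala{n}=" for n in range(1, 201)]
distinct_after = {normalize_to_positionless(row) for row in rows}
print(len(distinct_after))  # 1
```

The warning is the only trace of the original position, so the raw submitted string would need to be kept alongside the normalized one if we want to recover it later.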

Option B — Prefer positional, reject or warn on positionless

  • p.= is flagged as ambiguous (no position anchor) and mapped with an error_message,
    or normalized to a per-codon set of ref-identical alleles using the transcript sequence.
  • Preserves per-position specificity throughout the pipeline.
  • Downside: generating per-codon alleles for a positionless p.= is expensive and
    may not match submitter intent; rejecting it drops a valid representation.
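
For the per-codon normalization variant of Option B, the expansion itself is simple, as the sketch below suggests. `expand_positionless` is a hypothetical helper, and it assumes the caller already has the protein sequence for the accession (fetching it is outside this sketch). It handles only the 20 standard residues.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def expand_positionless(hgvs: str, protein_sequence: str) -> list[str]:
    """Expand ACC:p.= into one positional ref-identical string per residue."""
    suffix = ":p.="
    if not hgvs.endswith(suffix):
        return [hgvs]  # not a positionless protein variant; pass through
    accession = hgvs[: -len(suffix)]
    return [
        f"{accession}:p.{THREE_LETTER[residue]}{position}="
        for position, residue in enumerate(protein_sequence, start=1)
    ]


expanded = expand_positionless("NP_001234.1:p.=", "AG")
print(expanded)  # ['NP_001234.1:p.Ala1=', 'NP_001234.1:p.Gly2=']
```

Output size grows linearly with sequence length, and each resulting string still has to go through translation, which is where the cost noted above comes from.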

Option C — Accept both as-is, document the distinction

  • No normalization. Both forms pass through to distinct VRS alleles as they do today.
  • Consumers must handle both RLE and LSE ref-identical alleles.
  • Downside: inconsistent output across scoresets that express the same thing differently.
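
If we go with Option C, consumers need some way to recognize both shapes as reference-identical. One possible consumer-side check is sketched below. `is_reference_identical` is a hypothetical helper operating on plain dicts, it assumes the consumer can obtain the reference sequence, and it assumes an RLE whose length equals the location span means "same as reference."

```python
def is_reference_identical(allele: dict, reference_sequence: str) -> bool:
    """Return True if the allele's state reproduces the reference at its location."""
    start = allele["location"]["start"]
    end = allele["location"]["end"]
    state = allele["state"]

    if state["type"] == "ReferenceLengthExpression":
        # Assumption: length equal to the span means no change from reference.
        return state["length"] == end - start
    if state["type"] == "LiteralSequenceExpression":
        # Compare the literal against the reference slice (interbase coordinates).
        return state["sequence"] == reference_sequence[start:end]
    return False


reference = "MAG"  # made-up three-residue protein for illustration
rle_allele = {
    "location": {"start": 0, "end": 3},
    "state": {"type": "ReferenceLengthExpression", "length": 3},
}
lse_allele = {
    "location": {"start": 1, "end": 2},
    "state": {"type": "LiteralSequenceExpression", "sequence": "A"},
}
substitution = {
    "location": {"start": 1, "end": 2},
    "state": {"type": "LiteralSequenceExpression", "sequence": "G"},
}
```

With these inputs the function returns True for `rle_allele` and `lse_allele` and False for `substitution`. It does not make the two ref-identical alleles equal to each other; it only lets a consumer group them under a shared "wildtype" label.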

Suggested Starting Point

Option C reflects current behavior, is the lowest-risk path, and eliminates any possibility of altering submitter intent. At minimum, it should be documented clearly.

Any decision towards options A or B will inform whether we add normalization logic to _create_post_mapped_hgvs_strings / _construct_vrs_allele and whether we update the MaveDB submission guidelines to recommend one form.

Affected Code

  • vrs_map.py: _construct_vrs_allele (.endswith(".=") RLE branch), _create_post_mapped_hgvs_strings (short-circuit block), _hgvs_variant_is_valid
  • lookup.py: translate_ref_identical_to_vrs
  • annotate.py: _get_hgvs_string, _get_vrs_ref_allele_seq, _annotate_allele_mapping, _annotate_haplotype_mapping
