Background
Recent work in #53 added support for positionless reference-identical variants (ACC:p.=, ACC:c.=, ACC:g.=) to the VRS mapping pipeline. These are represented as VRS Allele objects whose state is a ReferenceLengthExpression spanning the full sequence length, since ga4gh/hgvs-tools cannot translate .= expressions.
The pipeline correctly distinguishes the two valid HGVS forms:
| Form | Example | Current routing | VRS output |
|---|---|---|---|
| Positionless | NP_001234.1:p.= | endswith(".=") → RLE path | Full-sequence RLE Allele |
| Positional | NP_001234.1:p.Ala13= | Normal translate_hgvs_to_vrs path | Position-specific LSE Allele |
These are genuinely different VRS alleles, even though both express "this variant matches the reference."
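The routing distinction can be sketched as follows. This is a minimal illustration, not the pipeline's code: `classify_ref_identical` is a hypothetical helper, and only the `endswith(".=")` check is taken from the current behavior described above.

```python
def classify_ref_identical(hgvs_string: str) -> str:
    """Classify an HGVS string as 'positionless', 'positional', or 'other'.

    Hypothetical helper for illustration only. Predicted forms such as
    p.(Ala13=) are not handled here and fall through to 'other'.
    """
    if hgvs_string.endswith(".="):
        # e.g. NP_001234.1:p.= -> RLE path (full-sequence allele)
        return "positionless"
    if hgvs_string.endswith("="):
        # e.g. NP_001234.1:p.Ala13= -> normal translation path
        return "positional"
    return "other"


for example in ("NP_001234.1:p.=", "NP_001234.1:p.Ala13=", "NP_001234.1:p.Ala13Gly"):
    print(example, "->", classify_ref_identical(example))
```

The order of the two checks matters: every positionless string also ends with "=", so the more specific ".=" test has to run first.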
The Problem
MaveDB scoreset submitters can use either form, and the choice is often a matter of convention rather than biological intent:
- A scoreset with a single wildtype control row might submit p.= (no specific position).
- A saturating mutagenesis experiment might submit one row per tested position (p.Ala1=, p.Gly2=, etc.) using the positional form.
Both are biologically valid, but they imply different things to a VRS consumer: the positionless form says "the whole sequence is reference-identical," while the positional form says "this one residue specifically is reference-identical." A consumer performing variant deduplication or lookup will not recognize these as the same variant.
Furthermore, although both forms describe the same biological state, the resulting alleles have different VRS digests, which breaks a fundamental contract of the specification.
We should decide whether to have a preference and normalize one form to the other, or to accept both and document the distinction clearly.
Options
Option A — Prefer positionless (p.=), normalize positional ref-identical to RLE
- Positional p.Ala13= rows are normalized to positionless RLE with a warning logged.
- Consumers always get a single canonical "reference" allele per transcript.
- Downside: lossy. It discards position information that the submitter explicitly provided. A scoreset with 200 p.AlaN= rows collapses to one repeated allele.
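A rough sketch of what Option A's normalization could look like, assuming it operates on HGVS strings before allele construction. The function name, the regular expression, and the logging call are all illustrative assumptions rather than existing pipeline code, and the pattern only covers simple single-position forms.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern: accession, a c./g./p. prefix, an optional residue name,
# a position, and a trailing "=" (e.g. "p.Ala13=" or "c.76=").
_POSITIONAL_REF = re.compile(r"^(?P<acc>[^:]+):(?P<kind>[cgp])\.[A-Za-z]*\d+=$")


def normalize_to_positionless(hgvs_string: str) -> str:
    """Rewrite a positional ref-identical string to its positionless form.

    Hypothetical helper. Strings that do not match the pattern, including
    already positionless ones, are returned unchanged.
    """
    match = _POSITIONAL_REF.match(hgvs_string)
    if match is None:
        return hgvs_string
    normalized = f"{match['acc']}:{match['kind']}.="
    logger.warning(
        "Normalizing %s to %s; position information is discarded",
        hgvs_string,
        normalized,
    )
    return normalized


# 200 positional rows collapse to a single positionless string.
rows = [f"NP_001234.1:p.Ala{n}=" for n in range(1, 201)]
distinct = {normalize_to_positionless(row) for row in rows}
print(len(rows), "rows ->", len(distinct), "distinct form(s)")
```

The collapse at the end makes the lossy downside concrete: after normalization the original positions cannot be recovered from the output alone.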
Option B — Prefer positional, reject or warn on positionless
- p.= is flagged as ambiguous (no position anchor) and mapped with an error_message, or normalized to a per-codon set of ref-identical alleles using the transcript sequence.
- Preserves per-position specificity throughout the pipeline.
- Downside: generating per-codon alleles for a positionless p.= is expensive and may not match submitter intent; rejecting it drops a valid representation.
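To illustrate the cost of the expansion variant of Option B, here is a minimal sketch for the protein case. It assumes the reference protein sequence is already available as one-letter codes; `expand_positionless` and `THREE_LETTER` are hypothetical names, and a real implementation would also need to cover c. and g. accessions.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def expand_positionless(accession: str, protein_sequence: str) -> list[str]:
    """Expand ACC:p.= into one positional ref-identical string per residue.

    Hypothetical helper: positions are 1-based, and a KeyError is raised
    for any character that is not one of the 20 standard residues.
    """
    return [
        f"{accession}:p.{THREE_LETTER[residue]}{position}="
        for position, residue in enumerate(protein_sequence, start=1)
    ]


expanded = expand_positionless("NP_001234.1", "AG")
print(expanded)
```

The output grows linearly with sequence length, so a single p.= row on a long protein becomes hundreds of alleles, each of which would then go through the normal translation path.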
Option C — Accept both as-is, document the distinction
- No normalization. Both forms pass through to distinct VRS alleles as they do today.
- Consumers must handle both RLE and LSE ref-identical alleles.
- Downside: inconsistent output across scoresets that express the same thing differently.
Suggested Starting Point
Option C reflects current behavior, is the lowest-risk path, and avoids altering submitter intent. At minimum, the distinction between the two forms should be documented clearly.
Any decision towards options A or B will inform whether we add normalization logic to _create_post_mapped_hgvs_strings / _construct_vrs_allele and whether we update the MaveDB submission guidelines to recommend one form.
Affected Code
vrs_map.py: _construct_vrs_allele (.endswith(".=") RLE branch), _create_post_mapped_hgvs_strings (short-circuit block), _hgvs_variant_is_valid
lookup.py: translate_ref_identical_to_vrs
annotate.py: _get_hgvs_string, _get_vrs_ref_allele_seq, _annotate_allele_mapping, _annotate_haplotype_mapping