Include mapped variant data and README in public data dump #664

@bencap

Description

The public data dump script (src/mavedb/scripts/export_public_data.py) currently exports metadata (main.json), score/count CSVs, and a license file. It does not include mapped variant data (VRS alleles, mapped HGVS, etc.), even though this data is available via GET /api/v1/score-sets/{urn}/mapped-variants.

We should include mapped variant JSON in the data dump so that downstream consumers have access to post-mapped VRS representations without needing to call the live API.

Proposed Changes
1. Add mapped variant data to the dump
For each published score set that has completed mapping, export its mapped variant data (the same payload returned by GET /api/v1/score-sets/{urn}/mapped-variants) as a JSON file in the archive, e.g. mapped/tmp:00000001-a-1.mapped-variants.json

Each file should contain the current mapped variants for that score set, including pre_mapped and post_mapped VRS allele JSON, HGVS columns, and VRS version metadata.
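
A minimal sketch of the per-score-set export, assuming the dump is written with Python's zipfile module and that the caller has already fetched the current mapped variants as a list of JSON-serializable dicts. The function names here are hypothetical and are not taken from export_public_data.py; only the mapped/{urn}.mapped-variants.json layout comes from the example above.

```python
import io
import json
import zipfile
from typing import Any, Iterable, Mapping, Optional


def mapped_variants_archive_path(urn: str) -> str:
    """Archive path for one score set, following the layout proposed in this issue."""
    return f"mapped/{urn}.mapped-variants.json"


def write_mapped_variants(
    archive: zipfile.ZipFile,
    urn: str,
    mapped_variants: Iterable[Mapping[str, Any]],
) -> Optional[str]:
    """Write one score set's mapped variants into the archive as JSON.

    Assumes `mapped_variants` already contains only the current records
    (pre_mapped / post_mapped VRS JSON, HGVS columns, VRS version metadata).
    Returns the archive path, or None if there was nothing to write.
    """
    records = [dict(mv) for mv in mapped_variants]
    if not records:
        # Score sets without completed mapping get no file in the dump.
        return None
    path = mapped_variants_archive_path(urn)
    archive.writestr(path, json.dumps(records, indent=2))
    return path
```

Skipping score sets with no mapped variants keeps the archive free of empty files, so the presence of a file under mapped/ can itself signal that mapping completed.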

2. Add a README to the archive
Add a README.md (or README.txt) to the root of the dump archive that documents:

  • What is included in the dump (metadata JSON, score CSVs, count CSVs, mapped variant JSON, license)
  • The structure/layout of the archive directory
  • A brief description of each file type and its format
  • Any caveats (e.g. only CC0-licensed published data is included, only current mapped variants are exported)
  • A link back to MaveDB and the API documentation for further reference
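
One possible way to generate the README from the items listed above, sketched under the assumption that the export script assembles it from plain Python data. build_readme and its parameters are hypothetical; the actual file paths, caveat wording, and link URLs would be supplied by the script.

```python
from typing import Mapping, Sequence, Tuple


def build_readme(
    contents: Sequence[Tuple[str, str]],
    caveats: Sequence[str],
    links: Mapping[str, str],
) -> str:
    """Render a Markdown README for the root of the dump archive.

    contents: (archive path or pattern, description of the file type and format)
    caveats:  limitations consumers should know about
    links:    label -> URL, e.g. MaveDB and the API documentation
    """
    lines = ["# MaveDB public data dump", "", "## Contents", ""]
    lines += [f"- `{path}`: {description}" for path, description in contents]
    lines += ["", "## Caveats", ""]
    lines += [f"- {caveat}" for caveat in caveats]
    lines += ["", "## Further reference", ""]
    lines += [f"- {label}: {url}" for label, url in links.items()]
    return "\n".join(lines) + "\n"
```

The returned string could then be added to the archive with something like archive.writestr("README.md", text). Building the contents list from the same data that drives the export would help keep the README in step with what the dump actually contains.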

Labels

  • app: backend (Task implementation touches the backend)
  • type: enhancement (Enhancement to an existing feature)
  • type: maintenance (Maintaining this project)
