The public data dump script (src/mavedb/scripts/export_public_data.py) currently exports metadata (main.json), score/count CSVs, and a license file. It does not include mapped variant data (VRS alleles, mapped HGVS, etc.), even though this data is available via GET /api/v1/score-sets/{urn}/mapped-variants.
We should include mapped variant JSON in the data dump so that downstream consumers have access to post-mapped VRS representations without needing to call the live API.
Proposed Changes
1. Add mapped variant data to the dump
For each published score set that has completed mapping, export its mapped variant data (the same payload returned by GET /api/v1/score-sets/{urn}/mapped-variants) as a JSON file in the archive, e.g.: mapped/tmp:00000001-a-1.mapped-variants.json
Each file should contain the current mapped variants for that score set, including pre_mapped and post_mapped VRS allele JSON, HGVS columns, and VRS version metadata.
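A minimal sketch of what the export step could look like, assuming the dump is written through a zipfile.ZipFile and that the caller has already fetched each score set's mapped variant payload. The function names (mapped_variants_member_name, write_mapped_variants) and the (urn, payload) input shape are hypothetical and are not taken from export_public_data.py; a payload of None stands in for a score set without completed mapping.

```python
import io
import json
import zipfile
from typing import Any, Iterable, List, Optional, Tuple


def mapped_variants_member_name(urn: str) -> str:
    """Archive path for one score set, following the example layout above."""
    return f"mapped/{urn}.mapped-variants.json"


def write_mapped_variants(
    archive: zipfile.ZipFile,
    score_sets: Iterable[Tuple[str, Optional[Any]]],
) -> List[str]:
    """Write one JSON member per mapped score set; return the names written.

    `score_sets` yields hypothetical (urn, payload) pairs. A payload of None
    represents a score set with no completed mapping and is skipped.
    """
    written: List[str] = []
    for urn, payload in score_sets:
        if payload is None:
            continue
        name = mapped_variants_member_name(urn)
        # Serialize the payload unchanged so the file matches the API response.
        archive.writestr(name, json.dumps(payload, indent=2))
        written.append(name)
    return written
```

Skipping unmapped score sets, rather than writing empty files, keeps the mapped/ directory a direct index of what has mapping data. Note that URNs contain a colon, which is not a valid filename character on Windows, so the member naming may need an escaped form if the archive is expected to be extracted there.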
2. Add a README to the archive
Add a README.md (or README.txt) to the root of the dump archive that documents:
- What is included in the dump (metadata JSON, score CSVs, count CSVs, mapped variant JSON, license)
- The structure/layout of the archive directory
- A brief description of each file type and its format
- Any caveats (e.g. only CC0-licensed published data is included, only current mapped variants are exported)
- A link back to MaveDB and the API documentation for further reference
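The README could be generated by the script itself so that it stays in sync with what is actually exported. The following is a sketch under assumptions: render_readme and add_readme are hypothetical helpers, the heading text is illustrative, and the archive paths, descriptions, and URLs are supplied by the caller rather than hard-coded here.

```python
import io
import zipfile
from typing import Mapping, Sequence, Tuple


def render_readme(
    contents: Sequence[Tuple[str, str]],
    caveats: Sequence[str],
    links: Mapping[str, str],
) -> str:
    """Build README text from (path, description) pairs, caveats, and links."""
    lines = ["# MaveDB public data dump", "", "## Contents", ""]
    lines += [f"- `{path}`: {description}" for path, description in contents]
    lines += ["", "## Caveats", ""]
    lines += [f"- {caveat}" for caveat in caveats]
    lines += ["", "## Further reference", ""]
    lines += [f"- {label}: {url}" for label, url in links.items()]
    return "\n".join(lines) + "\n"


def add_readme(archive: zipfile.ZipFile, text: str, name: str = "README.md") -> None:
    """Write the README at the root of the archive."""
    archive.writestr(name, text)
```

Passing the contents list in from the export code means a new file type added to the dump only needs one extra entry to appear in the README.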