In Section B (Table 11) of the paper, the pretraining dataset is described as multi-species. However, the version available for download appears to contain only the raw DNA sequences, without any labels indicating the species of origin for each sequence.
Is there a way to obtain the species labels for the pretraining sequences, or could a mapping between sequences and their source species be released?