TECHNOLOGY

Methods and Technologies:

Sampling
Our ultimate goal is to generate and assemble genomes as consistently as possible with related efforts such as the Vertebrate Genomes Project (VGP), Darwin Tree of Life (DToL), ERGA and EBP-Nor.

Yggdrasil makes informal agreements with sample providers after they have spoken with the initiative's Project Coordinator, Tom Gilbert (tgilbert@sund.ku.dk).

For high-quality genomes, sampling must be done in ways that best preserve the DNA. Before sending your samples, please contact Tom Gilbert (tgilbert@sund.ku.dk) to discuss the species to be sampled. You will be provided with sampling criteria for your specific samples.

We also require that metadata be registered for all specimens and sent to the Project Coordinator before sequencing begins. Metadata include the geolocation and time of collection, the collector, how the species was identified and by whom, and photos of the specimen.
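As an illustration only, the metadata fields listed above could be collected in a small record and checked for completeness before being sent. The class name, field names, checks and example values below are hypothetical and do not represent an official Yggdrasil format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class SpecimenMetadata:
    """Hypothetical record for one specimen; field names are illustrative only."""
    latitude: float             # geolocation, decimal degrees
    longitude: float            # geolocation, decimal degrees
    collected_at: datetime      # time of collection
    collector: str              # who collected the specimen
    identification_method: str  # how the species was identified
    identified_by: str          # who identified the species
    photo_files: List[str] = field(default_factory=list)  # photos of the specimen

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means nothing is missing."""
        issues = []
        if not -90.0 <= self.latitude <= 90.0:
            issues.append("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            issues.append("longitude out of range")
        for name in ("collector", "identification_method", "identified_by"):
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        if not self.photo_files:
            issues.append("no specimen photos listed")
        return issues


# Made-up example values, for illustration only.
record = SpecimenMetadata(
    latitude=55.68,
    longitude=12.57,
    collected_at=datetime(2024, 5, 14, 10, 30),
    collector="A. Collector",
    identification_method="morphology",
    identified_by="B. Taxonomist",
    photo_files=["specimen_01.jpg"],
)
print(record.problems())  # an empty list means the record is complete
```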

For a general overview of the sampling standard that we are using, please take a look at the Earth Biogenome Project's guidelines.

Sequencing
All Yggdrasil genomes are principally based on PacBio long-read sequencing coupled with Arima Hi-C, Illumina-based scaffolding. Where material is available, this is ideally complemented with transcriptome sequencing of 3-4 tissues to aid annotation.

Additionally, we are exploring the use of the PromethION P2 Solo from Oxford Nanopore Technologies to further polish genome assemblies where required.

Assembly
The assembly process follows the VGP pipeline and includes quality control and filtering of raw data, contig assembly, purging of haplotypic duplications, scaffolding, manual curation and gap closing. We perform extensive quality control at every intermediate stage of the assembly. We aim to generate high-quality diploid assemblies that meet the Earth Biogenome Project's minimum standard of 6.C.Q40 (https://www.earthbiogenome.org/assembly-standards).
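As a rough illustration of what the 6.C.Q40 label measures, the sketch below computes a contig N50 and a Phred-scaled quality value (QV). It assumes the usual reading of the EBP notation: "6" for a contig N50 of at least 10^6 bp, "C" for chromosome-scale scaffolds, and "Q40" for a consensus QV of at least 40, that is, roughly one error per 10,000 bases. The function names are hypothetical, the "C" criterion is not checked because it cannot be reduced to a single length threshold, and this is not the quality control used in the VGP pipeline.

```python
import math
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no sequences were given


def qv(errors_per_base: float) -> float:
    """Phred-scaled quality value: QV = -10 * log10(error rate)."""
    return -10 * math.log10(errors_per_base)


def meets_contig_and_qv(contig_lengths: Iterable[int], errors_per_base: float) -> bool:
    """Check only the '6' (contig N50 >= 1 Mb) and 'Q40' (QV >= 40) parts of 6.C.Q40."""
    return n50(contig_lengths) >= 1_000_000 and qv(errors_per_base) >= 40


# Toy contig lengths: half of the 10 bases is reached within the two longest contigs.
print(n50([2, 4, 1, 3]))  # 3
```

In practice the error rate would come from a dedicated tool that estimates consensus accuracy; here it is simply passed in as a number.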

All raw data (both genomic and transcriptomic) and final assemblies are released at NCBI under the Yggdrasil BioProject PRJNA955268. After release, assemblies are submitted for annotation by the NCBI Eukaryotic Genome Annotation Pipeline.

By combining HiFi and Hi-C sequencing data, we can create haplotype-resolved assemblies, meaning we can separate reads into the two parental haplotypes without having access to parental data. In diploid or polyploid organisms, this adds another level of information and produces more accurate assemblies than a primary and alternate assembly would.

Testing by us, and earlier by Darwin Tree of Life and the Vertebrate Genomes Project, among others, has shown that the combination of HiFi and Hi-C, at appropriate coverage, usually generates assemblies that fulfill the Earth Biogenome Project's assembly standards. There are other ways to reach these standards, for example by combining Oxford Nanopore Technologies and Illumina sequencing data, but these are often less straightforward and involve more steps to a final assembly than the strategy we outline here.
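For readers unfamiliar with the term, sequencing coverage is the total number of sequenced bases divided by the genome size. The short sketch below shows that arithmetic. The function names and the example numbers are hypothetical and are not Yggdrasil's target values.

```python
def coverage(total_sequenced_bases: int, genome_size_bp: int) -> float:
    """Average depth: how many times, on average, each base of the genome was read."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_sequenced_bases / genome_size_bp


def bases_needed(target_coverage: float, genome_size_bp: int) -> float:
    """Total sequenced bases required to reach a target average depth."""
    return target_coverage * genome_size_bp


# Hypothetical example: 30 Gb of reads from a 1 Gb genome gives 30x coverage.
print(coverage(30_000_000_000, 1_000_000_000))  # 30.0
```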

GoaT registry
When a decision is taken to sequence a species, and initial QC has confirmed that the available sample is sufficient, our intention to sequence the species is indicated on the Yggdrasil website, under the 'Species' section, and in the Genomes on a Tree (GoaT) registry under the code name YGG. GoaT provides direct or estimated values for over 70 taxon attributes and over 30 assembly attributes across 1.5 million eukaryotic species.