Supergene for supercolour
Welcome to the update page of “Supergene for supercolour”!
- The genome of the grove snail has been assembled!
- The genome of the grove snail has been sequenced!
- How big is the genome of the grove snail?
The genome of the grove snail has been assembled!
In the past few months we’ve been very busy with the assembly of the genome of the grove snail. It was quite a challenge because the genome of this species is very large (3.5 billion letters)! Previous studies (e.g. “The karyotype of the land snail Cepaea nemoralis” by C. Page, 1978) have shown that grove snail DNA is packed into 22 chromosomes. Ideally, we would assemble the genome into 22 long sequences that correspond to the chromosomes. But given the low sequencing coverage we started with, that is practically impossible.
As mentioned before, we produced 7.2 million “long reads” using PacBio sequencing technology. Approximately 30% of these reads were shorter than 5000 letters. Such short pieces usually hamper the assembly process, so we decided to use only the longer reads (4.8 million, with a total length of 74 billion letters). How can we put all these reads together?
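As a rough illustration of the length filter described above, here is a minimal Python sketch. It is not the pipeline used for the project: the function names and the toy reads are made up for this example, and real reads would come from a sequencing file rather than a list of strings. Only the 5000-letter threshold is taken from the text.

```python
from typing import Iterable, List, Tuple

MIN_READ_LENGTH = 5000  # threshold mentioned in the text


def filter_long_reads(reads: Iterable[str], min_length: int = MIN_READ_LENGTH) -> List[str]:
    """Keep only reads that are at least `min_length` letters long."""
    return [read for read in reads if len(read) >= min_length]


def summarise(reads: List[str]) -> Tuple[int, int]:
    """Return (number of reads, total number of letters)."""
    return len(reads), sum(len(read) for read in reads)


# Toy reads (hypothetical): two are too short, two are long enough.
toy_reads = ["A" * 1200, "C" * 5000, "G" * 12000, "T" * 4999]
kept_reads = filter_long_reads(toy_reads)

print("before filtering:", summarise(toy_reads))
print("after filtering: ", summarise(kept_reads))
```

Many assemblers expose a minimum read length setting of their own, so in practice a separate filtering step like this may not be needed.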
The principle is quite simple, but one needs a very powerful computer or, even better, a “high performance cluster” of hundreds of such computers (we work on such a cluster at LUMC). There are computer tools that look for overlapping parts of long reads to stitch them together. Sometimes it’s quite straightforward. For example, reads 1 to 3 in the figure below have long overlapping parts, so it’s quite easy to make a consensus out of them (a so-called “contig”). It becomes trickier when the reads have long repetitive parts such as GGGG and CCCCC in read 4. This read can be placed either at the start or at the end of the contig. When the programme encounters such reads and can’t decide where to place them, it will simply stop. Unfortunately, any genome with a high proportion of such repetitive sequences will be assembled into many short contigs instead of long chromosomes.
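The stitching idea can be sketched with a toy “greedy overlap” procedure in Python. This is a deliberately simplified illustration and not how the assemblers mentioned on this page work internally: it assumes error-free reads, ignores the opposite DNA strand, and always merges the best overlap instead of stopping at ambiguous repeats. The function names and the three short reads are invented for the example.

```python
from typing import List, Optional, Tuple


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if the longest such overlap is shorter than `min_overlap`.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best: Optional[Tuple[int, int, int]] = None  # (overlap, left index, right index)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size and (best is None or size > best[0]):
                    best = (size, i, j)
        if best is None:
            return contigs  # nothing left to merge
        size, i, j = best
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Three toy reads that overlap by four letters each.
toy_reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]
print(greedy_assemble(toy_reads))  # ['ATGCGTACCTGATTAG']
```

With real data the all-against-all comparison is far too slow and reads contain errors, which is one reason dedicated assembly programmes and a computing cluster are needed.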
Many assembly programmes have been developed, each using a slightly different algorithm to stitch the reads together. We’ve used Canu (https://github.com/marbl/canu/releases), WTDBG2 (https://github.com/ruanjue/wtdbg2), and Flye (https://github.com/fenderglass/Flye) with our dataset. In addition, we used a handy tool, Quickmerge (https://github.com/mahulchak/quickmerge), to make a consensus out of the three resulting assemblies. To correct the errors in the contigs we used Pilon (https://github.com/broadinstitute/pilon/wiki) together with highly accurate Illumina “short reads” (see previous post).
Our assembly of the snail genome consists of 28 538 contigs. The shortest contig is 1000 letters long and the longest is 3.5 million letters. One of the most widely used statistics to describe genome assembly quality is N50 (a length-weighted median of contig lengths), and in general the higher this number the better. Our assembly has an N50 of 333 110 bp. This means that half of the assembled genome sits in contigs of 333 110 letters or longer.
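For readers who want to see how N50 is computed, here is a small Python sketch. The contig lengths are invented toy numbers, not values from the snail assembly, and the function name is our own. Contigs are sorted from longest to shortest and added up until half of the total length is reached; the length of the contig that crosses that halfway point is the N50.

```python
import statistics
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Smallest contig length L such that contigs of length >= L
    together contain at least half of all assembled letters."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy assembly (hypothetical lengths).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]

print("N50:   ", n50(toy_lengths))                # 70
print("median:", statistics.median(toy_lengths))  # 40
```

The toy example also shows why N50 differs from the ordinary median of contig lengths: long contigs count for more because they hold more of the genome.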
Another essential statistic for genome assemblies is the percentage of BUSCO genes present (https://busco.ezlab.org/). A complete BUSCO (Benchmarking Universal Single-Copy Orthologs) dataset consists of 978 highly conserved genes found in all animals. Our assembly contains 907 of these genes and is therefore 92.7% complete.
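The completeness percentage is simple arithmetic, shown here as a short Python sketch using the two counts from the text. The function name is hypothetical. The BUSCO programme itself reports a more detailed breakdown; this only reproduces the single percentage quoted above.

```python
def busco_completeness(genes_found: int, genes_in_dataset: int) -> float:
    """Percentage of the BUSCO gene set that was found in the assembly."""
    if genes_in_dataset <= 0:
        raise ValueError("genes_in_dataset must be positive")
    return 100.0 * genes_found / genes_in_dataset


# Counts taken from the text: 907 of 978 genes found.
completeness = busco_completeness(907, 978)
print(f"{completeness:.1f}% complete")  # 92.7% complete
```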
The N50 and BUSCO statistics tell us that our snail genome assembly is not bad at all, especially given that we started with quite a low read coverage. Now the real work is about to begin! The next step is the annotation of the genome, i.e. finding which DNA sequences represent the actual genes. And, of course, finding the supergene for colour variation in Cepaea shells! (to be continued…)
The genome of the grove snail has been sequenced!
It took a while, but we have finally sequenced the genome of the grove snail Cepaea nemoralis. This means that we revealed the order of nucleotide bases (letters A, T, G and C), all 3.5 billion of them! If such a genome were a book with 3.5 thousand letters per page, it would contain one million pages! Previous studies of such large genomes, for example in humans, have shown that a significant fraction is made up of fragments found in multiple copies, so-called repetitive sequences. Such repeated fragments from different parts of the genome may appear identical, which makes the assembly a complicated task. Even the human genome is not perfectly assembled for this reason.
We have used two modern, or “next-generation sequencing”, technologies: Illumina and PacBio. Illumina is a “short-read” technology with a high degree of accuracy. The principle is quite simple: first, the DNA is cut into small pieces, usually around 500 bases. These fragments are attached to the solid surface of a tiny flow cell and exposed to a solution of fluorescently labelled nucleotides (A, T, G and C), each with a different colour. When the nucleotides become incorporated into the DNA they emit a signal detected by the camera. As each of the four bases has its own colour, the sequence of the DNA can be determined. Usually up to 150 letters can be “read” with high confidence by the Illumina machine. These “short reads” contain very few mistakes, in contrast to “long reads”.
There are two kinds of “long-read sequencing” technologies: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), each with its own (dis)advantages. The working principle of PacBio, which we used in our project, is similar to that of Illumina, except that the DNA is not cut into small pieces at the beginning. In fact, the DNA has to be treated with great care to avoid degradation, as longer fragments will result in longer DNA sequences, or “long reads”. High-quality DNA can produce sequences of hundreds of thousands of base pairs, but there is one disadvantage to this process: the reads usually contain many errors. This is true for both PacBio and ONT. It is therefore recommended to combine long reads with highly accurate Illumina short reads, so that the errors in the DNA sequences can be corrected.
The DNA of a single individual was used for our genome sequencing project. This snail was bred in the laboratory of our collaborator Dr. Angus Davison at the University of Nottingham in the United Kingdom (more information on his research here http://www.angusdavison.org/). The DNA has been sequenced with both PacBio and Illumina. The former technology produced about 7.2 million long reads with a total length of almost 80 billion nucleotides. This means that the coverage of our dataset is about 23 times, i.e. each letter in the genome has been “read” 23 times on average. The average length of the PacBio reads was about 11 thousand nucleotides, and the longest fragment was almost 262 thousand letters! As the coverage is not that high (the usually recommended 30–50 times was beyond our budget), the Illumina sequencing dataset will come in handy. For the same snail we obtained 800 million Illumina short reads of 150 bases, i.e. a dataset of 120 billion letters.
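The coverage figures follow from dividing the total number of sequenced letters by the genome size. A minimal Python sketch with the rounded numbers from the text is given below; the variable and function names are our own, and because “almost 80 billion” is rounded up here, the PacBio result is approximate.

```python
GENOME_SIZE = 3_500_000_000  # about 3.5 billion letters


def coverage(total_bases: int, genome_size: int = GENOME_SIZE) -> float:
    """Average number of times each letter of the genome was read."""
    return total_bases / genome_size


# PacBio: a total of almost 80 billion letters (rounded).
pacbio_bases = 80_000_000_000
# Illumina: 800 million reads of 150 letters each.
illumina_bases = 800_000_000 * 150

print(f"PacBio coverage:   {coverage(pacbio_bases):.1f}x")
print(f"Illumina coverage: {coverage(illumina_bases):.1f}x")
```

The Illumina total works out to 120 billion letters, which matches the dataset size stated above.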
The next step is the assembly of the grove snail genome. But how can one put together all these millions of DNA fragments? Fortunately, we are not the only scientists working on a genome sequencing project. There are many powerful computer algorithms and tools available to analyse such an enormous amount of complex data. But the assembly of a large genome remains a challenge! (to be continued…)
How big is the genome of the grove snail?
In order to identify the supergene for colour polymorphism in the snail Cepaea nemoralis we first need to sequence its genome. Several sequencing techniques are available nowadays, for example Illumina short-read sequencing. The costs of sequencing a complete genome, however, are still quite high, especially because one needs to sequence at high coverage to achieve good results. At least 30x coverage (i.e. each letter of the genome must be sequenced 30 times) is recommended to reduce the number of errors.
Hence, before starting on such an expensive endeavour, we would like to know how many letters there are in the genome of our snail. This is because we want to sequence it at exactly 30x coverage, not more (or it would become too expensive) and not less (or there will be too many errors). That’s why we need to determine the genome size.
It is, of course, possible that someone else has already done this before. We check the Animal Genome Size Database, a website which contains information on 6222 animal species. Genome size is indicated as the C-value, or the amount of DNA in picograms (1 pg = 10⁻¹² gram) in the nucleus of a haploid cell (haploid = contains one set of chromosomes). Unfortunately, there is no data on Cepaea nemoralis in this database. But we can find information on four other species of Helicidae, the snail family Cepaea belongs to. All four of these species have quite large genomes, between 2.84 and 4.00 pg. This is similar to the human genome, which is 3.50 pg!
Well, we need to measure the size of the Cepaea genome ourselves. The most widely used technique for this purpose is “flow cytometry”. It works by comparing the amount of DNA in your organism of interest with that in a control animal with a known genome size. In our case we can use the zebrafish, as its C-value is known to be 1.7 pg. We travel to Didam (NL) where we visit “Plant Cytometry Services”. This company will measure the DNA amounts in our snail samples.
The tissues of snails and zebrafish are crushed together to release the cell nuclei. Then propidium iodide is added to the mixture. This chemical binds to nuclear DNA. The mixture is then passed through a very fine filter. Now we have a solution with nuclei of the snail and the zebrafish to be analysed by the flow cytometer. The nuclei pass through a very thin tube where they are irradiated with UV light. Propidium iodide emits a fluorescent signal which is measured by the instrument. The strength of the fluorescent signal depends on the amount of DNA in the nuclei.
Fluorescence is shown in a graph such as this one. There are two peaks: one for the zebrafish (left) and one for the snail (right). The software calculates the ratio between the two peaks; in this case it’s 2.11 (snail) : 1.00 (zebrafish). The analysis is repeated three times and the average ratio is 2.06 : 1.00. Now we can calculate the amount of DNA in the snail: 2.06 x 1.7 (zebrafish C-value) = 3.50 pg. The genome of Cepaea nemoralis is just as large as the human genome!
Now we can also calculate the number of bases (letters A, G, T, C), as 1 pg is equal to 978 million bases. The snail has approximately 3.50 x 978 = 3423 million letters in its genome!
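The two calculations above (from fluorescence ratio to C-value, and from picograms to letters) can be written as a short Python sketch. The function and constant names are our own; the input numbers are the ones quoted in the text.

```python
ZEBRAFISH_C_VALUE_PG = 1.7   # C-value of the reference animal, in picograms
MEGABASES_PER_PG = 978       # 1 pg of DNA is about 978 million bases


def c_value_from_ratio(sample_to_reference_ratio: float,
                       reference_c_value_pg: float = ZEBRAFISH_C_VALUE_PG) -> float:
    """Estimate the sample C-value (pg) from the fluorescence peak ratio."""
    return sample_to_reference_ratio * reference_c_value_pg


def picograms_to_megabases(c_value_pg: float) -> float:
    """Convert an amount of DNA in picograms to millions of bases."""
    return c_value_pg * MEGABASES_PER_PG


# Average snail : zebrafish ratio from the three measurements.
snail_c_value = round(c_value_from_ratio(2.06), 2)  # 3.5 pg
snail_megabases = picograms_to_megabases(snail_c_value)

print(f"C-value:     {snail_c_value:.2f} pg")
print(f"Genome size: {snail_megabases:.0f} million letters")
```

The C-value is rounded to two decimals before the conversion, mirroring the step-by-step calculation in the text.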
The challenge now is to find the supergene for supercolour among all these letters! (to be continued…)
Colour polymorphic snail reveals the evolution of supergene architecture
Three researchers from Naturalis Biodiversity Centre (Leiden, The Netherlands) have been awarded a grant
from the Netherlands Organisation for Scientific Research (NWO) for the project “Evolution of supergenes and the
genetic basis of snail colour polymorphism”. Dr. Suzanne Saenko, Dr. Dick Groenenberg, and Prof. Menno
Schilthuizen will study the supergene that controls shell colour polymorphism in a classical
model for ecological genetics and climate-induced evolutionary change, the land snail Cepaea nemoralis.
A supergene is a cluster of several genes, each of which affects a different morphological or behavioural trait.
Because of tight physical linkage within supergenes, multiple phenotypic characters are inherited as a single locus.
Supergenes are thought to be crucial for the maintenance of highly discrete adaptive phenotypes which can
eventually lead to reproductive isolation and speciation. Multiple complex polymorphisms are presumably
controlled by supergenes, but the molecular evidence for this phenomenon is still scarce and the emergence of
such genetic architecture is surprisingly poorly understood. To help fill this scientific gap, the researchers will
sequence and assemble the genome of C. nemoralis, identify the individual components of its supergene through
linkage mapping, and investigate their role in shell coloration through studies of gene expression and function.
Information about the progress will be posted on our website regularly.