Next generation shotgun sequencing and the challenges of de novo genome assembly

The structure of genomes varies greatly throughout the three domains of life. Eubacterial and archaebacterial genomes are arranged into a single circular chromosome. Eukaryote genomes are arranged into multiple linear chromosomes within the nucleus, and can be made challenging by features such as introns, low complexity repetitive elements, large genome sizes and extensive sequence duplications. These problems are generally more substantial in the plant and animal kingdoms, which tend to have a large number of repetitive regions and are prone to ploidy changes. The sequencing of whole genomes has traditionally been restricted to small genomes or genomes of model organisms which are of particular global interest, because of the high cost of sequencing and the complexity of assembling all the sequence reads. Sequencing of other eukaryote genomes (so called ‘non-model organisms’), which may be of biological or evolutionary importance, has to date been considered unfeasible within the resources of the average research laboratory. This scenario is changing.


Introduction
The structure of genomes varies greatly throughout the three domains of life.Eubacterial and archaebacterial genomes are arranged into a single circular chromosome.Eukaryote genomes are arranged into multiple linear chromosomes within the nucleus, and can be made challenging by features such as introns, low complexity repetitive elements, large genome sizes and extensive sequence duplications.These problems are generally more substantial in the plant and animal kingdoms, which tend to have a large number of repetitive regions and are prone to ploidy changes.The sequencing of whole genomes has traditionally been restricted to small genomes or genomes of model organisms which are of particular global interest, because of the high cost of sequencing and the complexity of assembling all the sequence reads.Sequencing of other eukaryote genomes (so called 'non-model organisms'), which may be of biological or evolutionary importance, has to date been considered unfeasible within the resources of the average research laboratory.This scenario is changing.
In the last five years, the development of next generation sequencing technology has seen a major increase in efficiency (the number of base pairs sequenced per day) as well as a vast decrease in cost per base, with prices approximately halving every five months since 2007. 1 Unfortunately, the read length (the number of nucleotides sequenced off one fragment) has seen a large decrease compared to that of the original Sanger methods. 2 This reduction has posed a major challenge for genome assembly as it makes the definition of areas that have repeated elements for long stretches extremely difficult.The severity of this problem increases with the size of the genome and the number of variable repeat regions.It was for this reason that de novo sequencing of the genomes of complex non-model organisms using solely next generation sequencing technology initially looked dubious. 3Improvements in next generation sequencing technology to increase sequence read lengths, together with the development of new software tools to assemble the sequence reads, has, however, made the sequencing of the genomes of non-model organisms feasible for the first time, as demonstrated recently by the sequencing and assembly of the giant panda 4 and strawberry genomes. 5

Next generation sequencing technology
The phrase 'next generation sequencing' is a general term applied to sequencing platforms that use post-Sanger technology to sequence large numbers of DNA fragments in parallel. 6Roche, through their '454' sequencing platform, developed the first viable high-throughput sequencing technique. 7The genomic DNA is randomly sheared into fragments of approximately 800 bp to 1000 bp in length, although a shorter fragment works better if maximum read length is not required.Adaptor primer sequences are ligated onto the ends of the genome fragments, which are then hybridised to beads covered in one of the complementary primers.The beads are diluted in an oil emulsion, such that each oil droplet contains only one bead.The DNA fragment attached to each bead is amplified by the polymerase chain reaction (PCR) in a process known as emulsion PCR or emPCR.Each bead is then aliquoted into a welled plate, such that there is only one bead in each well.The actual sequencing commences with the plates undergoing a series of washes.Each wash contains one of the four nucleotides.The incorporation of these nucleotides into a Sequencing the genomes of the many scientifically fascinating plants and animals found in South Africa is fast becoming a viable option as a result of the rapid and sustained drop in the cost of next generation sequencing over the last five years.However, the processing and assembly of the sequence data produced is not trivial.There are several factors which need to be taken into consideration when planning a strategy to assemble genome sequence data de novo.This paper reviews the advances and the challenges in two of the most rapidly developing areas of the field: the sequencing technology and the software programs used to assemble de novo the sequence data generated by these technologies into a genome.sequence yields a light reaction which can be measured.The number of nucleotides incorporated into a sequence per wash is deduced by the strength of the signal.This method is referred to as pyrosequencing or sequencing by synthesis.The 454 platform, GS FLX+, typically gives a mode read length of 700 bp (Table 1) and is capable of reaching 1 kb.Unfortunately, the 454 platform is not accurate when it comes to sequencing long homogeneous nucleotide runs, resulting in a high chance of a measured insertion or deletion in these regions.
Illumina's platform (originally owned by Solexa) was the next product out on the market.In this method, the genomic DNA is fragmented to a desired size (less than a 1 kb) and adaptors are ligated to either end in a similar manner to Roche's method.These fragments are then chemically attached to a glass slide covered in complementary oligonucleotides of both primers called a flow cell.A method called bridge amplification (Figure 1) is then used to amplify the DNA fragments. 8This amplification results in copies of the original fragment being arranged in clusters around the flow cell.These clusters are then sequenced one nucleotide at a time by washing with a buffer containing fluorescently tagged, reversibly terminating nucleotides. 9The reversibly terminating nucleotides allow all four nucleotides to be washed over at once as only one can be incorporated into the sequence while the fluorescent tag is still attached; each nucleotide has a different colour tag.The colour of each cluster is then recorded, the fluorescent tag removed and the next wash commenced. 10The fragment can be sequenced from both of the primers, resulting in two sequences with a known number of nucleotides between them, depending on the initial fragment size (Table 1).This method is called a paired-end read and can be used to partially compensate for the short read lengths in de novo assembly.Read lengths vary between 100 bp and 150 bp depending on the platform, with a length of 250 bp to be introduced on the MiSeq platform in 2012.
SOLiD sequencing, the third well-established next generation sequencing method, was developed by Applied Biosystems and then later by Life Technologies.The DNA fragments are amplified in a similar manner to that of Roche's system (bead emulsion PCR), 11 but the actual sequencing happens in a very different manner.Instead of sequencing by synthesis, SOLiD uses the binding of eight-base oligoprimers to differentiate between nucleotides on the DNA fragment.Although an oligoprimer consists of eight nucleotides, six nucleotides are degenerate, leaving only two that are specific to the query sequence (the two closest to the 3' end; Figure 2a).Thus 16 oligoprimers are supplied in each wash (all possible dinucleotide combinations).The nucleotides closest to the 5' end are marked with one of four fluorescent dyes (with four dinucleotide combinations per colour; Figure 2b).
In the first wash of the SOLiD sequencing procedure, the appropriate oligoprimer is ligated to an initial universal primer (n) that is complementary to the primer ligated on the DNA fragment and used for amplification (Figure 3).Once the fluorescent signal is recorded, the last three nucleotides  The mechanism of Illumina's bridge amplification.The process shown is repeated until DNA fragments form dense clusters.The green and red regions show the two primer sequences ligated to the blue DNA fragments.Primers that are complementary to these regions, and which are attached to the platform, are denoted in the cognate colour with spots.Once amplification is complete, the fragments can be sequenced from either of the two platform primers (i.e. the primers indicated by the spotted regions) or from both to produce paired-end reads.
(which contain the fluorescent dye) are cleaved off and another wash of oligoprimers is applied.This process is repeated until the DNA fragment is completely covered by oligo remnants.The now double-stranded DNA is denatured and the oligoprimer remnants are discarded.The entire process is repeated four more times, but each time with a slightly smaller initial primer (n-1, n-2, etc.).In this way, each nucleotide is analysed twice by dinucleotide probes (Figure 3) and results in a sequence (referred to as colour space) that is directly translatable to genetic code as long as the first nucleotide is known, which it must be because the second sequencing reaction is anchored with an initial universal primer at n-1 (Figure 3).The advantage of this drastic departure from the more conventional sequencing technologies is that it is possible to differentiate between a single nucleotide polymorphism and a sequencing error in the colour space sequence.This ability is because a single nucleotide polymorphism affects two of the colours in the colour space (measured twice), whereas a sequencing error changes only one of the colours.The platform is also capable of paired-end reads, but the second read of each pair is considerably shorter than the primary read (35 bp instead of 75 bp; Table 1).These shorter read lengths make SOLiD's application to the de novo assembly of larger genomes limited.
In addition to SOLiD, Life Technologies have also acquired the rights to a second sequencing technology.Known as Ion Torrent, this platform is very similar to Roche's 454 in many respects.Fragments are amplified by emulsion PCR before being sequenced by applying the nucleotides one at a time and measuring the signal strength emitted. 12What makes Ion Torrent different from the rest is the fact that it measures the change in pH from an incorporated nucleotide rather than a light signal, resulting in very rapid sequencing (Table 1).The measurement of pH change is accomplished using semiconductor chips and an ion-sensitive layer.Read lengths are currently 200 bp, but lengths of 400 bp are expected to be available soon.Unfortunately, because the reaction is not limited to the incorporation of a single nucleotide, it is subject to the same error in regions of homogeneous nucleotide runs that 454 is.
All of the aforementioned technologies amplify the DNA before sequencing.In addition to using time and money, this approach introduces an amplification bias against 'g-c' or 'at' rich areas.However, there are a few sequencing platforms that do not amplify the target DNA.These single-molecule sequencers have been dubbed the 'third generation' (making amplification-dependent sequencers the 'second generation').
Notably among these third-generation sequencers is the PacBio RS from Pacific Bioscience.The PacBio RS rapidly sequences long individual fragments (10 kb) using a system it calls SMRT (single-molecule real-time) technology; (Table 1).
In this system, DNA polymerases attached to a detection surface are each given a single DNA strand.This strand is then sequenced by measuring the signal given off whenever a phospholinked nucleotide is incorporated into the strand.Each nucleotide is tagged with a different colour which is cleaved when it is added to the existing strand.There is therefore no need for washes or terminating nucleotides.The nucleotide suspension is simply applied to the plate and the sequencing commences.This process therefore results in much longer read lengths than the other currently available platforms, but at a much higher error rate (15%). 6n an effort to try and account for this error rate, in addition to sequencing long single strands, Pacific Bioscience offer a service where they take a shorter piece of double stranded DNA (2 kb for example), and add hairpin adaptors onto the end.This addition has the effect of circularising the DNA, which allows it to be sequenced repeatedly by the DNA polymerase, dramatically reducing the error rate.This is known as circular consensus sequencing.Pacific Bioscience will not be the only sequencer with 10-kb read lengths for long -soon the sequencers GridION and MinION from Oxford Nanopore Technologies will join them. 13r lack of these long read lengths, the original three technologies (Illumina, 454 and SOLiD) are at least able to compensate with the creation of mate-paired reads.Matepaired reads can be thought of as paired-end reads with a very long gap between them (several kilobases).However, matepaired reads require a more complicated library construction than do normal genomic reads.With mate-paired reads, the genome is initially fragmented to allow for large gaps of a desired length.These fragments are then circularised with biotinylated primers (Figure 4).The circular DNA is then re-fragmented and the biotinylated regions are purified out, resulting in two sections of DNA that were once separated by kilobases of DNA on the same sequencable fragment.The incorporation of mate-paired reads is required for definition of the longer repetitive elements in a eukaryotic genome and is necessary for analyses into structural variation.
For more information on the chemistry of the different techniques, we recommend the reading of Metzker's 12 review and for more information about individual platforms offered by each company and their associated prices, we refer you to Glenn 6 .

De novo genome assembly
Once the genome has been sequenced, the millions of short sequence reads need to be assembled into an informative model of the genome.However, the extremely large number of reads can make the computation very taxing.A good assembly algorithm therefore needs to be as efficient as possible within a computer's processing power, without sacrificing too much accuracy.
The compromise between speed and accuracy is commonly helped through the creation of k-mers. 14K-mers are created by breaking down the existing sequences into shorter sequences of length 'k'.These allow the data to be summarised and compacted in such a way that commonalities can be identified and links drawn between sequences.K-mer information is used to varying degrees in different algorithm strategies and programs.Overlap assembly strategies such as Greedy and Overlap/Layout/Consensus use k-mer data to efficiently find the best overlap amongst the sequences, whereas Eulerian strategies use the k-mers to create a de Bruijn graph. 15

Greedy algorithm
The first short-read genome assembly programs (for example SSAKE, 16 SHARCGS 17 and VCAKE 18 ) were based on the Greedy algorithm (Table 2). 14This algorithm works in an intuitive manner by finding high scoring overlap sequences and joining them together.In this way, contigs are produced and extended.However, the localised view that this procedure maximises, opens it up to several pitfalls.For example, repetitive elements within the genome, if not treated with care, can cause hybrid contigs to be formed (Figure 5a). 3 To address this issue, most Greedy algorithms either stop extension if several good hits are found that do not agree with each other (i.e. they do not share sufficient commonalities) or do not take into account elements in the overlap that are overly represented within the data set (suggesting it is found multiple times in the genome). 15The local maximisation also tends to lead to solutions that are not globally optimal (Figure 5b).This tendency to get caught in local maxima limits the use of this algorithm to small simple genomes. 19

Overlap/Layout/Consensus algorithm
The Overlap/Layout/Consensus method is a three-step process that effectively combines the overlap idea of the Greedy algorithms with a global view point to prevent local maxima from being pursued (an Overlap/Layout/ Consensus algorithm would have successfully put together the correct sequence in Figure 5b for example). 15The first step of the method identifies overlaps between all the sequences.This pairwise alignment, while streamlined with the use of indexing and k-mers, can still be computationally expensive in larger data sets.The sequences are then mapped out on a graph as nodes and connected according to the overlap information (layout step).This step is kept computationally efficient as actual sequence composition is not considered. 14deally a single path, called the Hamiltonian path, is identified that goes through all the sequences exactly once (or rather one path per chromosome).In reality, however, this single path is prevented by insufficient coverage, sequencing errors, genetic polymorphism and unresolvable elements caused by repeated elements.Once the optimum path has been derived, sequence identity is decided by compiling evidence for each nucleotide in the consensus step.Examples of Overlap/Layout/Consensus programs are Newbler, 7 Edena 20 and CABOG 21 (Table 2).
The computational requirements of this strategy are proportional to the number of sequence reads in the data set.These programs have therefore not done well with the exponential increase in next generation sequencing data.While programs like Newbler and CABOG have been used to assemble large genomes, these assemblies were supplemented by long Sanger reads.Such programs would struggle to scale up to the hundreds of millions of reads that would be necessary for a large de novo assembly consisting purely of next generation sequencing data. 19The strategy should not, however, be dismissed yet.A new take on the method, called SGA (String Graph Assembly), was recently developed by Simpson and Durbin 22 ; SGA uses a highly effective compression technique that allows for the assembly of genomes with minimal RAM usage.

De Bruijn algorithm
The second type of graph assembly technique is based on the formation of de Bruijn graphs (also known as an Eulerian algorithm). 23These graphs are formed from the creation of k-mers of the original sequences.These k-mers are then connected to k-mers that they overlap with at k-1 sites.In this way, the algorithm avoids the computationally difficult global pairwise overlap step used in the Greedy and Overlap/Layout/Consensus strategies as the overlap is implicit (Figure 6a).However, the conversion of sequences into k-mers does result in the loss of some information.This loss can cause the algorithm to create links between two unrelated sequences that it then cannot dismantle again without consulting the original reads (Figure 6b). 14The larger the value of 'k' and the smaller the genome, the fewer false associations will be made in the de Bruijn graph.The value of 'k' cannot be made too large, however, as sequences still need to share 'k-1' nucleotides in the real overlap in order to be associated.
Once the complete de Bruijn graph has been created from the k-mers, a path that uses every node (called the Eulerian path) must be drawn through the system.As a consequence of collapsed nodes (as seen in Figure 6b), there are normally several possible paths through the graph.In order to separate the true path from the false ones, sequence information needs to be added back into the network (Figure 7). 14Once the best path has been identified, the contig sequence can be read from the k-mer identities.Examples of de Bruijn graph programs are Euler, 23 Velvet, 24 ABySS, 25 AllPaths 26 and SOAP de novo 27 (Table 2).These programs are appealing because, theoretically, the computational requirements increase with genome size and not with the number of reads.The fact that computational requirements are independent of the number of reads is important because (without sequencing error) a genome will be broken down into the same de Bruijn graph, regardless of the coverage.Of these programs, Allpaths LG and SOAP de novo have been designed specifically to deal with large genomes and were ranked first and second, respectively, in an open assembly competition hosted by an independent third party, with SGA being placed third. 28OAP de novo was also the program used to assemble the giant panda genome.2) 4) 6) 2) 6) 6) 2) 3) 5) 1) 2) 6) 3) 4) 5)

Dealing with sequence errors
Many of the problems faced by genome assembly algorithms, particularly those using a de Bruijn graph, are convoluted by the existence of sequencing errors. 14Although low, different technologies have different error rates (reported on their respective websites).For example, SOLiD sequencing has an initial accuracy of 99.94% for each nucleotide.This accuracy is reduced to 99% by the 35th base pair.In comparison, the 454 sequencer's accuracy decreases to 99% only at the 400th base pair.Reports on the Illumina platform indicate that 80% of nucleotides are 99.9% accurate.Even with these levels of accuracy, sequencing a genome will result in anywhere between hundreds of thousands to tens of millions of incorrect bases.The probability of any one nucleotide being incorrect is reported with each call by the sequencer.Although this information is useful for determining the overall quality of a sequencing run, the inclusion of it in the actual genome assembly is a computationally costly endeavour. 19Because of the cost, most assembly programs choose to ignore data that include the probability of the error (e.g.Edena, Velvet, SOAP de novo and SSAKE).
Theoretically, Greedy algorithms are well protected from errors in a high-coverage situation as sequences without error should be incorporated preferentially to those with an error (because of a higher overlap score).Overlap algorithms in general also avoid using error probabilities by not requiring a 100% sequence similarity, allowing the error to be passed over and resolved later in the consensus step.The de Bruijn graph method on the other hand, does not have one of these inherent mechanisms to deal with sequence errors. 14If unaddressed, sequence errors can cause excessive complexity in the de Bruijn graph by adding so-called 'bubbles' (which occur when an otherwise linear path has two possible central sequences) and 'spurs' (which occur when a sequence has two possible starts or ends) to an otherwise unambiguous path (Figure 8) or by causing collapsed nodes (as in Figure 6b).The creation of bubbles and spurs is the most common outcome of an error as 4 k should be much larger than the genome, meaning novel k-mer creation is the most likely outcome of an incorrect base.The creation of novel k-mers makes sequence errors especially troublesome in de Bruijn graphs, as one error can potentially result in 'k' new k-mers in a graph.However, not all bubbles and spurs are the result of sequencing errors; bubbles and spurs can also be caused by single nucleotide polymorphisms, microsatellites and tandemly duplicated areas.
Methods for dealing with errors normally revolve around this concept of original k-mer creation.The Euler method, for example, plots the k-mers versus their frequency in the data, 23 creating a bimodal graph, with one peak representing the real k-mers observed and a second peak (lower on the graph) representing the new k-mers created by sequencing errors.A point between these modes is chosen and everything above this point is trusted and everything below the point is distrusted.A similar mechanism is employed in many programs.Suspicious k-mers can either be altered to a more likely k-mer (Euler) or discarded (Velvet).This approach does, however, mean that true k-mers with a very low coverage will be adjusted or disregarded and certain errors that happened to result in a common k-mer will be a b accepted.Once a graph has been assembled, there are several simplification algorithms that trim spurs and collapse bubbles to make informative contigs. 14This necessary step does unfortunately result in a loss of real information like single nucleotide polymorphisms and occasionally even entire exons. 29

Conclusion
Given their differences, the question arises as to which assembler and sequencing platform should be used under which conditions.This question is not an easy one to answer.The main variables to consider are the budget for sequencing, the computer processing power and the genome size.The decision is made more difficult by the fact that sequencer technology does not stand still -new assembly programs are constantly being released and old programs are being updated.Generally though: • SOLiD's advantage of being able to identify single nucleotide polymorphisms does not make up for its shorter read length which makes the de novo assembly of complex genomes difficult.• Sequencing on Roche's 454 is expensive, but it has a good read length, which is especially important for the de novo assembly of complex genomes.• Pacific Bioscience has surpassed 454 in read length for a similar cost.• Illumina's platform provides a good balance between read length and cost; and is the most widely used for such projects.
• Paired-end reads on the Illumina or SOLiD platforms are worth the extra cost.• Ion Torrent offers fast sequencing turnover combined with a reasonable read length; however, the lack of paired-end reads limits it to smaller genomes.• Greedy algorithms should not be used unless computer processing power is limited and the genome is simple.• With the exception of SGA, the de Bruijn graph programs cope better with the large numbers of short reads that Illumina or SOLiD will produce.• Having a portion of your coverage as either mate-paired reads or long third-generation reads will significantly decrease the fragmentation of the final assembly.• Extreme guanidine-cytosine contents can affect the amplification efficiency of DNA and thus negatively impact second-generation sequencing.• Check to see whether a potential reference genome exists before attempting de novo assembly (http://www.ncbi.nlm.nih.gov/genome/browse/).
If you can afford it, a good strategy is to combine data from two different platforms (for example Illumina and Pacific Bioscience), which would compensate for each method's deficiencies.This strategy gives the best chance of a complete and accurate genome description.However, even under the best conditions, the de novo assembly of large genomes will be fractured and incomplete to some degree.Nevertheless, this deficiency should not deter researchers from taking advantage of these new sequencing technologies to assemble the genomes of diverse plants and animals, even if they are incomplete, to answer important biological and evolutionary questions.Genome sequences, even fractured ones, are a wealth of potential information.

FIGURE 2 :
FIGURE 2: SOLiD oligo primer design.(a) The structure of an 8-mer oligoprimer used for SOLiD sequencing.The cleavage site (dotted line) allows for the removal of the fluorescent signal (star) which is necessary before the next oligoprimer can be added.(b) A chart of how colours are allocated to SOLiD primers based on the first two nucleotides.The order in which the chart is read (rows or columns first) is irrelevant: B (blue), G (green), O (orange), R (red).

FIGURE 3 :
FIGURE 3: SOLiD sequencing procedure.Every nucleotide is measured twice with the use of repeated washes and primer application in SOLiD sequencing to get a final sequence read.Note that the first nucleotide in the final read (A in bold) is from the initial primer and has a known identity.This known identity is required to decipher the colour space.

FIGURE 4 :
FIGURE 4: Creation of mate-paired libraries for Illumina.A long DNA fragment is labelled with biotinylated deoxyribonucleotide triphosphate on either end, before being circularised.The circularisation brings the two furthest ends of the DNA fragment (dark and light grey) next to each other.The DNA is then re-fragmented and the piece containing the two furthest ends is purified out using the biotin markers.The new fragment (which is a reasonable length to sequence) has primers ligated onto the ends and is ready for amplification and sequencing as normal.Mate-paired read construction by the other sequencing companies follows a similar pattern.

FIGURE 5 :
FIGURE 5: Limitations of the Greedy algorithm.(a) Repetitive elements (AT n in this case) common throughout the genome lead to large overlap scores and linkage of unrelated sequences in a Greedy algorithm.The first potential overlapping sequence in the diagram starts farthest into a repetitive element and is therefore given the highest score and added to the contig (overlapping nucleotides are underlined).(b) Duplicated regions (represented in green) can prevent the Greedy algorithm from finding the true sequence by having a large overlapping region (shown in red) as a result of a common ancestral past.Here, fragments 2 and 6 have a larger overlap with each other than with either fragment 3 or fragment 5, causing a mis-assembly despite the fact that there was only one path through the whole sequence.

FIGURE 7 :FIGURE 6 : 2 FIGURE 8 :
FIGURE 7: Paired-end reads make up for short read lengths during assembly.(a) An ambiguous path is untangled using a paired-end read that spans the uncertain area.(b) The length of a paired-end read is used to untangle an ambiguous de Bruijn graph.

TABLE 1 :
A summary of five of the predominant sequencing platforms for de novo sequencing: 454 FLX +, HiSeq2000, SOLiD, Ion Torrent and PacBio RS.

TABLE 2 :
A summary of freely available programs that can be used for de novo genome assembly, including the conditions under which the programs should be used.