CN104239750A

CN104239750A - High-throughput sequencing data-based genome de novo assembly method

Info

Publication number: CN104239750A
Application number: CN201410421844.3A
Authority: CN
Inventors: 郑洪坤; 刘敏
Original assignee: BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Current assignee: BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Priority date: 2014-08-25
Filing date: 2014-08-25
Publication date: 2014-12-24
Anticipated expiration: 2034-08-25
Also published as: CN104239750B

Abstract

The invention provides a genome de novo assembly method based on high-throughput sequencing data, comprising the following steps: (1) building a de Bruijn graph from the high-throughput sequencing data, and performing sequencing data error correction and super read assembly on the basis of the corrected de Bruijn graph; (2) using the super reads to assemble primary contigs; (3) retrieving the primary contigs and reads of specific local regions, assembling them locally, and merging all local assembly results; (4) ordering the contigs by a subgraph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds. Correcting the de Bruijn graph eliminates errors introduced by high-throughput sequencing and improves data accuracy; building super reads increases the effective read length and markedly increases contig length; local assembly greatly improves the handling of repetitive sequences.

Description

Genome de novo assembly method based on high-throughput sequencing data

Technical field

The present invention relates to a genome assembly method, and in particular to a de novo genome assembly method based on short sequencing reads.

Background art

With the rapid development of second-generation sequencing technologies and the rapid decline in sequencing costs, de novo genome sequencing is increasingly favored by researchers. However, reconstructing the original genome from a large volume of short reads still faces huge challenges, and the most critical step is contig assembly. The de Bruijn graph is the core of graph-theoretic assembly algorithms and of current mainstream de novo assemblers; it builds an Eulerian graph from the overlap information between kmers and is the foundation of contig construction. The present invention is therefore also based on the de Bruijn graph.

Current contig assembly algorithms all build the de Bruijn graph only once, with a relatively fixed kmer size. Although some multi-kmer assembly algorithms exist, they too build each graph only once and then merge the results. General assembly software merely filters and error-corrects the short reads used in assembly and does not process these raw short reads a second time, which largely limits the upper bound on the kmer size in de Bruijn graph construction. Consequently, in assembly methods that do not further process the short reads, the kmer size is relatively small, more branches arise in the de Bruijn graph, the graph complexity increases greatly, and assembly quality decreases.

In addition, a major feature of animal and plant genomes is a high proportion of repetitive sequence, which produces many alternative sites and branches during assembly and thus raises its difficulty. Two mainstream strategies currently handle part of these cases: one uses large-insert libraries to span the repeats, estimates the size of the repeat region, and then chooses a repeat path of appropriate length; the other first avoids the repeat regions and returns to assemble them after a preliminary assembly is complete. Strategically, the second approach is more effective for complex genomes because it localizes a global problem and greatly reduces assembly difficulty.

Summary of the invention

In view of the deficiencies of the prior art, the object of the present invention is to provide a genome de novo assembly method based on high-throughput sequencing data, named GNOVO. First, the method corrects the sequencing errors inherent in high-throughput sequencing through data error correction, and at the same time assembles shorter reads into super reads with a greater read length through super read assembly, thereby partly overcoming the problem of read lengths being too short. Second, through local assembly, repetitive sequences on the whole genome are turned into locally single-copy sequences, which greatly reduces the difficulty of handling repeats and increases the length of the assembled contigs.

To achieve this object, the main steps of the genome de novo assembly method based on high-throughput sequencing data (GNOVO) of the present invention are:

1) build a de Bruijn graph from the high-throughput sequencing data (using a smaller kmer), perform graph error correction, and correct the sequencing data on the basis of the corrected de Bruijn graph; the error correction principle is shown in Fig. 4;

2) perform super read assembly on the basis of the corrected de Bruijn graph;

3) rebuild the de Bruijn graph from the super reads (using a larger kmer), perform graph error correction, and split the corrected de Bruijn graph to obtain primary contigs;

4) retrieve the primary contigs of a specific local region according to the mate-pair link information, collect the local reads according to the alignment information of the sequencing data, and perform local assembly; merge all local assembly results and perform error correction followed by splitting, thereby obtaining contigs;

5) build a scaffold link graph from the mate-pair link information, partition the contigs with a subgraph partitioning algorithm, and order the contigs within each local subgraph by simulated annealing to obtain the final scaffolds.

The flow chart of the GNOVO assembly principle is shown in Fig. 1.

Steps 1-4 all use the de Bruijn graph as the core structure. In GNOVO the de Bruijn graph is stored as a hash table, and it is constructed as follows:

1) Allocate and initialize the hash table according to the genome size and the kmer size;

2) Iterate over the reads and number each one, starting from 0.

3) Extract all kmers in order from the 5' end to the 3' end and store them in the hash table. If a kmer already exists, only its path information, i.e. its predecessor and successor, needs to be stored. If the kmer does not exist, a new kmer node is created and its path information is stored as well.

4) When the first kmer of a read is stored and it is not yet in the hash table, its true predecessor kmer node does not exist, at least so far. A new end sequencing tip node is therefore created to stand in for the true predecessor kmer node and serve as the backtracking predecessor of this kmer node.

5) When a non-first kmer is stored, if the kmer already exists and the backtracking predecessor of its node is a tip node, the tip node is removed and the backtracking predecessor of this kmer node is set to the preceding kmer node. Because this kmer is not the first kmer of the current read, it must have a true predecessor, namely the preceding kmer. The tip node can therefore be replaced by the true predecessor kmer node, which reduces the number of tip nodes and saves memory.
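The construction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not GNOVO's implementation: a plain dict stands in for the pre-allocated hash table, reverse-complement kmers are ignored, and the names `KmerNode`, `TIP` and `build_graph` are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sentinel standing in for an "end sequencing tip" placeholder node.
TIP = object()


@dataclass
class KmerNode:
    preds: set = field(default_factory=set)   # true predecessor kmers
    succs: set = field(default_factory=set)   # successor kmers
    back: object = None                       # backtracking predecessor (kmer or TIP)
    count: int = 0                            # number of times the kmer was seen


def build_graph(reads, k):
    """Build a hash-based de Bruijn graph (forward strand only, an assumption)."""
    graph = {}
    for _read_id, read in enumerate(reads):        # step 2: reads numbered from 0
        prev = None
        for i in range(len(read) - k + 1):         # step 3: 5' -> 3'
            kmer = read[i:i + k]
            node = graph.get(kmer)
            is_new = node is None
            if is_new:
                node = graph[kmer] = KmerNode()
            node.count += 1
            if prev is None:
                # step 4: a new first kmer has no true predecessor yet
                if is_new:
                    node.back = TIP
            else:
                node.preds.add(prev)
                graph[prev].succs.add(kmer)
                # step 5: replace a tip placeholder by the true predecessor
                if node.back is None or node.back is TIP:
                    node.back = prev
            prev = kmer
    return graph
```

For example, `build_graph(["CGT", "ACGT"], 3)` first gives `CGT` a tip placeholder and then replaces it with `ACG` once the second read supplies a true predecessor.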

As the core data structure, the accuracy of the de Bruijn graph is very important. GNOVO therefore implements a series of graph error-correction operations, whose main steps include: 1) de Bruijn graph simplification; 2) tip deletion; 3) bubble path merging; 4) removal of low-coverage edges.

1) De Bruijn graph simplification: traverse each kmer node via the hash table. For the current kmer node, extend along its true predecessor and successor; if the complementary node of the current kmer node also exists, extension must proceed from both nodes at the same time. Extension method: extend along the out-edge and in-edge directions, i.e. along the true predecessor and successor. Extension condition in a single direction: the kmer node at the extension point has exactly one true predecessor (including that of its complementary kmer, if it exists) and exactly one true successor. Termination in a single direction: for successor extension, the kmer node reached has two or more true predecessors, or two or more true successors, or no true successor, or is already present in the de Bruijn graph. For predecessor extension, the kmer node reached has two or more true predecessors, or two or more true successors, or no true predecessor, or is already present in the de Bruijn graph.

2) Tips are mainly produced by sequencing errors at the ends of reads. In GNOVO the criteria for identifying an erroneous tip are: a) its length is less than 2K (K is the kmer length); b) an alternative in-edge or out-edge with high coverage must exist at the same position.

3) A bubble path is a graph structure formed by two different paths that share the same start and end points; apart from the start and end points, the structure contains no other crossing node. Bubble paths are mainly produced by heterozygous sites and by sequencing errors in the middle of reads. In GNOVO a bubble path is defined by: a) both paths are shorter than 200 bp; b) the similarity between the paths is greater than 0.8; c) at least one path has a coverage below a specific threshold. The core of the bubble search algorithm is a "Dijkstra-like breadth-first search" (Dijkstra's algorithm is the best-known shortest-path search algorithm, and "breadth-first search" denotes breadth-first traversal).

4) Low-coverage edges are mainly produced by read sequencing errors. The main criteria are: a) the coverage is below a specific threshold; b) the nodes at both ends of the edge each have, besides the current edge, at least one true predecessor and at least one true successor. For the coverage threshold, the default for a haploid genome is 1/2 of the mean or median edge coverage, and for a diploid genome 1/4 of the mean or median edge coverage. The best approach, however, is to choose the threshold from the overall distribution of coverage.
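The four correction steps above can be illustrated with a few small Python helpers. This is a sketch under stated assumptions rather than GNOVO's code: the graph is given as forward-strand successor and predecessor dicts, the "high coverage" cutoff `high_cov` is a hypothetical parameter, path similarity is measured with `difflib.SequenceMatcher` because the text does not define it, and the median is used as the centre of the coverage distribution.

```python
from difflib import SequenceMatcher
from statistics import median


def extend_forward(start, succs, preds, visited):
    """Absorb successor nodes while each has exactly one predecessor and one successor."""
    path = [start]
    visited.add(start)
    cur = start
    while len(succs.get(cur, ())) == 1:
        (nxt,) = succs[cur]
        if nxt in visited:
            break  # node already placed in the simplified graph
        if len(preds.get(nxt, ())) != 1 or len(succs.get(nxt, ())) != 1:
            break  # branch point or dead end terminates the extension
        path.append(nxt)
        visited.add(nxt)
        cur = nxt
    return path


def spell(path):
    """Turn a path of overlapping kmers into its sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


def is_tip(length_bp, k, sibling_coverages, high_cov):
    """Tip test: shorter than 2K and a high-coverage alternative edge exists."""
    return length_bp < 2 * k and any(c >= high_cov for c in sibling_coverages)


def is_bubble(path_a, path_b, cov_a, cov_b, cov_threshold):
    """Bubble test: both paths < 200 bp, similarity > 0.8, one path poorly covered."""
    similarity = SequenceMatcher(None, path_a, path_b, autojunk=False).ratio()
    return (len(path_a) < 200 and len(path_b) < 200
            and similarity > 0.8
            and min(cov_a, cov_b) < cov_threshold)


def low_coverage_threshold(edge_coverages, diploid=False):
    """Default cutoff: 1/2 of the median for haploid, 1/4 for diploid genomes."""
    return median(edge_coverages) / (4 if diploid else 2)
```

The helpers only encode the stated criteria; deciding which edge to delete or which bubble branch to keep is left to the caller.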

A super read is a longer sequence obtained by filling the gap between the two reads of a paired-end, or by joining the two ends of a paired-end through their overlap; the construction principle of super reads is shown in Fig. 5. Because a super read is derived from a paired-end, its expected length is the library fragment size. Because a super read joins the two end reads and the gap between them, it is generally much longer than a single read, which gives a great advantage when super reads are used as the starting point of assembly. Super reads are assembled by path search using depth-first search.
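A depth-first path search of the kind described above might look like the following Python sketch. It is illustrative only: the graph is a dict mapping each kmer to its successor kmers, the anchors `start` and `end` are assumed to be kmers taken from the two mates, `max_nodes` is a hypothetical bound derived from the expected library size, and the first path found is returned without checking whether other paths exist.

```python
def find_path(succs, start, end, max_nodes):
    """Depth-first search for a simple path of kmers from `start` to `end`."""
    stack = [(start, (start,))]
    while stack:
        node, path = stack.pop()
        if node == end:
            return list(path)
        if len(path) >= max_nodes:
            continue  # path would exceed the allowed fragment length
        # reverse-sorted push so that successors are explored in sorted order
        for nxt in sorted(succs.get(node, ()), reverse=True):
            if nxt not in path:
                stack.append((nxt, path + (nxt,)))
    return None


def spell(path):
    """Concatenate overlapping kmers along a path into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


def build_super_read(succs, start, end, max_nodes):
    """Return the super read sequence, or None when no path is found."""
    path = find_path(succs, start, end, max_nodes)
    return spell(path) if path is not None else None
```

A production implementation would also need to handle reverse-complement strands and ambiguous (multiple) paths, which this sketch leaves out.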

Many analyses start from single-copy nodes, mainly for two reasons: 1) assembly analysis that starts from single-copy nodes is easier and less error-prone; 2) once the single-copy nodes are known, they can serve as anchor points for resolving part of the repetitive sequence during later repeat processing.

Suppose there is an edge of length n, and let Xi denote the number of reads whose start site is position i on the edge (note that the actual length of the edge here is n-k+1, because edges are built from kmers, so the maximum value of i is n-k+1). Assume that the Xi are independent random variables following a Poisson distribution with expectation ρ, where ρ is determined by the distribution of coverage over all edges, i.e. the overall distribution.

According to the central limit theorem, the mean of Xi over an edge of length n should follow a normal distribution with mean ρ and standard deviation \sqrt{ρ / (n - k + 1)}, the value implied by the Poisson assumption above. If an edge is a single-copy edge, the difference between the mean of Xi and ρ should not be too large. The following ratio is used as the criterion for edge uniqueness:

F(\bar{X}, n, \rho) = \frac{\log 2}{2} + (n - k + 1) \, \frac{\rho^{2} - \bar{X}^{2} / 2}{2 \rho}

To measure the specificity, or uniqueness, of an edge, GNOVO uses F >= 5 as the criterion. The larger F is (i.e. the smaller the mean of Xi), the more specific the edge. A small mean of Xi may also be caused by sequencing errors, but such errors are generally repaired in the preceding error-correction step.
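The uniqueness criterion can be evaluated directly from the formula. The short Python sketch below assumes that "log" denotes the natural logarithm, which the text does not state; the function names are hypothetical.

```python
import math


def uniqueness_f(mean_x, n, k, rho):
    """F statistic for an edge of length n built from kmers of size k.

    mean_x is the mean number of read starts per site on the edge and
    rho is the expected number of read starts per site genome-wide.
    """
    sites = n - k + 1
    return math.log(2) / 2 + sites * (rho ** 2 - mean_x ** 2 / 2) / (2 * rho)


def is_single_copy(mean_x, n, k, rho, cutoff=5.0):
    """Apply the F >= 5 rule described in the text."""
    return uniqueness_f(mean_x, n, k, rho) >= cutoff
```

With ρ = 10, n = 124 and k = 25, an edge whose mean equals ρ scores far above the cutoff, while an edge with twice the expected mean (as a two-copy repeat would show) scores far below it.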

The main idea of the local assembly algorithm in GNOVO is to reduce the complexity of assembly by localizing the genome, so that a good assembly is obtained in each local region. The results of all local assemblies are then merged to obtain the assembly of the whole genome, which markedly improves the genome assembly (contigs). The local assembly principle is shown in Fig. 6. The main steps are:

1) Align the reads to the primary contigs. From the read alignments, obtain the distance information between primary contigs and the relationship between reads and primary contigs. Load the primary contig and read information into memory.

2) Selection of primary contig seeds. Filter out primary contigs that are multi-copy (copy number >2) or short. For the retained primary contigs, build a scaffold link graph from the distance relationships between them, and select as seeds primary contigs that are long and far apart from one another. Once the seeds are obtained, the primary contigs within a certain range of each seed are selected.

3) Selection of local reads: for the primary contigs of each local region, select from the alignment results the sequenced fragments that have only one end on the primary contigs. Super reads that lie in a gap and whose coverage by sequenced fragments is greater than 0.9 are selected as well.

4) Build a de Bruijn graph locally and perform local assembly.

5) Merge the local assembly results of all local graphs to obtain the global assembly, then perform simplification and graph error correction to obtain the final contigs.
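Steps 2) and 3) above amount to two filters, which the following Python sketch illustrates. The data layout is an assumption made for the example: contigs are given as a dict of `(length, copy_number)` tuples, each read pair as `(pair_id, contig_of_mate1, contig_of_mate2)` with `None` for an unaligned mate, and `min_len` is a hypothetical length cutoff that the text does not specify.

```python
def filter_seed_candidates(contigs, min_len):
    """Drop multi-copy (copy number > 2) or short primary contigs."""
    return [name for name, (length, copies) in contigs.items()
            if copies <= 2 and length >= min_len]


def select_local_pairs(pairs, local_contigs):
    """Keep read pairs with exactly one mate aligned to the local contigs."""
    chosen = []
    for pair_id, contig1, contig2 in pairs:
        on1 = contig1 in local_contigs
        on2 = contig2 in local_contigs
        if on1 != on2:  # one end anchored, the other reaching into the gap
            chosen.append(pair_id)
    return chosen
```

The unanchored mates of the selected pairs are the reads that extend from the known contigs into the unresolved regions, which is why they are the useful input for the local de Bruijn graph.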

During scaffold assembly, the whole scaffold graph is first split into small independent subgraphs. The paired ends that link a subgraph to other contigs fall on boundary contigs (contigs longer than the library size, which normal paired reads cannot span), so each subgraph can be treated as a small whole and scaffolded separately. GNOVO orders the contigs by simulated annealing and takes the ordering with the fewest conflicting edges as the final scaffold. Once assembled, a scaffold can be treated as a single unit and assembled again with other contigs.
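The ordering step can be sketched with a generic simulated annealing loop in Python. The sketch makes several assumptions that are not from the text: a link `(a, b)` is taken to mean "contig a precedes contig b", a conflicting edge is a link violated by the current order, moves are random swaps of two contigs, and the temperature schedule (`t0`, `cooling`) and fixed seed are illustrative defaults. At least two contigs are required.

```python
import math
import random


def conflicts(order, links):
    """Count links (a, b) whose expected order a-before-b is violated."""
    pos = {contig: i for i, contig in enumerate(order)}
    return sum(1 for a, b in links if pos[a] > pos[b])


def anneal_order(contigs, links, steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Search for a contig order with few conflicting links."""
    rng = random.Random(seed)
    cur = list(contigs)
    cur_cost = conflicts(cur, links)
    best, best_cost = cur[:], cur_cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(cur)), 2)
        cur[i], cur[j] = cur[j], cur[i]              # propose a swap
        new_cost = conflicts(cur, links)
        delta = new_cost - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost                      # accept the move
            if cur_cost < best_cost:
                best, best_cost = cur[:], cur_cost
        else:
            cur[i], cur[j] = cur[j], cur[i]          # reject: undo the swap
        temp = max(temp * cooling, 1e-6)
    return best, best_cost
```

Because the best order seen so far is tracked separately, the returned cost is never worse than that of the initial order.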

After the contigs in a subgraph have been ordered by simulated annealing, GNOVO uses a quadratic programming algorithm to estimate the gap sizes between adjacent contigs. The objective function used in the computation is:

f(\chi) = \sum_{i \in E} \frac{\left( \left( C_{i} + \sum_{j \in G_{i}} g_{j} \right) - \mu_{i} \right)^{2}}{\sigma_{i}^{2}}

In the formula, E is the set of edges in the subgraph, C_{i} is the total length of the contigs spanned by edge i, \sum_{j \in G_{i}} g_{j} is the total length of the gaps spanned by edge i, \mu_{i} is the mean library size corresponding to edge i, and \sigma_{i}^{2} is the variance of the library corresponding to edge i.
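To make the objective concrete, the following Python sketch evaluates it for a given set of gap sizes. Only the evaluation is shown, not the quadratic programming solver; the dict keys (`contig_len`, `gaps`, `mu`, `sigma`) are hypothetical names for C_i, G_i, μ_i and σ_i.

```python
def gap_objective(gap_sizes, edges):
    """Sum of squared, variance-scaled deviations between spanned length and library size."""
    total = 0.0
    for edge in edges:
        spanned = edge["contig_len"] + sum(gap_sizes[j] for j in edge["gaps"])
        total += (spanned - edge["mu"]) ** 2 / edge["sigma"] ** 2
    return total
```

For a single link from a 500 bp library (standard deviation 50 bp) spanning 300 bp of contig and one gap, a gap of 200 bp gives an objective of 0, while a gap of 100 bp gives (400 - 500)^2 / 50^2 = 4.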

The genome de novo assembly method is implemented on the Linux operating system in the C, Perl and Fortran programming languages. It can process large genome sequencing data sets, and its computation is parallelizable, uses little memory, and is fast.

The key points of the present invention are:

1) The de Bruijn graph is corrected first, and the corrected de Bruijn graph is then used to correct the high-throughput sequencing data.

2) Based on the corrected de Bruijn graph, paired-ends are assembled by a path search algorithm to obtain super reads with a longer read length, and the super reads are used to build the primary contigs.

3) The primary contigs and reads of a specific local region are retrieved according to the alignment information of paired-ends and mate-pairs and assembled locally; finally all local assembly results are merged to obtain the contigs.

4) Simulated annealing is used to scaffold the scaffold subgraphs obtained by partitioning.

The local assembly strategy assembles each local region first, converting the complexity of the whole system into local simplicity and thereby greatly reducing assembly difficulty.

The contig assembly approach of the genome de novo assembly method based on high-throughput sequencing data of the present invention (named the GNOVO method) differs from general contig assembly algorithms: it builds the de Bruijn graph twice. The first graph is built with a smaller K and is mainly used for error correction and super read construction; the second graph is built from the super read data with a larger K and is mainly used for primary contig construction. Because the primary contigs are built from the longer super reads, the larger K can also handle part of the repetitive sequence better. The use of super reads in repeat assembly is shown in Fig. 2.

In addition, the short reads of high-throughput sequencing pose a huge challenge to genome assembly algorithms. The approach of the present invention steps outside the assembly algorithm itself and focuses on increasing the read length, thereby giving the assembly algorithm a better starting input. Using the distance information between paired-ends, graph-theoretic algorithms fill the gap between the paired-ends to obtain long super reads (paired-ends that overlap are joined directly). Because the reads of high-throughput sequencing data are very short and super reads are much longer, using super reads as the starting point of assembly has clear advantages: 1) longer overlaps can be used to join reads; 2) super reads can span longer repeats (see Fig. 2); 3) with super reads GNOVO can use a larger kmer, which reduces graph complexity and handles heterozygous sequence better (see Fig. 3).

Second, through local assembly, repetitive sequences on the whole genome are turned into locally single-copy sequences, which greatly reduces the difficulty of handling repeats and increases the length of the assembled contigs. The local assembly algorithm in GNOVO retrieves the primary contigs and reads of a specific local region according to the alignment information of paired-ends and mate-pairs, assembles them locally, and finally merges all local assembly results to obtain the contigs. Finally, GNOVO orders the resulting contigs by combining a subgraph partitioning algorithm with simulated annealing to obtain the final scaffolds.

The beneficial effects of the present invention are mainly:

1) Correcting the raw sequencing data on the basis of the corrected de Bruijn graph ensures the accuracy of the sequencing data; the base error rate is generally below 0.0001. This also provides a new sequencing data error-correction method.

2) Assembling paired-ends by path search on the de Bruijn graph yields longer sequences and thus greatly reduces assembly difficulty; for a 180 bp library, sequences of 150 bp to 230 bp can be obtained.

3) Using super reads for genome assembly allows a larger kmer (>95) in the de Bruijn graph assembly algorithm, which reduces graph complexity, increases the length of the primary contigs and safeguards the final assembly quality; for bacterial data the N50 directly exceeds 10 kb and can even reach 30-50 kb.

4) With the local assembly strategy, part of the genome is assembled in each local region, which greatly reduces assembly difficulty, especially for repetitive sequences, and safeguards the length of the final contigs; for bacterial data the N50 directly exceeds 50 kb and can even reach 100-500 kb. This also provides a new local assembly method.

5) Scaffolding the partitioned scaffold subgraphs by simulated annealing produces longer scaffolds, whose N50 usually exceeds 500 kb.

6) The method makes full use of Linux clusters and improves computational efficiency through parallel design and path hashing, overcoming the limits that computer memory places on large data sets; genomes of up to 10G can be assembled.

Brief description of the drawings

Fig. 1 is an overview of the GNOVO assembly flow. A is raw data filtering: reads with a high proportion of N bases (>5%) or a high proportion of low-quality bases (quality value <20; >5%) are filtered out during raw data processing. B is read error correction: based on the corrected de Bruijn graph, paired-end and mate-pair data are corrected with different strategies. C is super read construction: based on the corrected de Bruijn graph, super reads are built by a path search algorithm. D is primary contig assembly: primary contigs are built from the super read data with a large kmer according to de Bruijn graph theory. E is local assembly: the single-copy primary contigs near a seed are retrieved first, the local reads are then retrieved from the single-end alignment data, and finally a de Bruijn graph is built in the small local region and assembled. F is scaffold construction, i.e. building scaffolds from the mate-pair link information.

Fig. 2 shows the solutions for repetitive sequences. A is the large-kmer strategy, i.e. assembling repeats with a kmer longer than the repeat. B and C are the linking strategy, i.e. using the link information of paired-ends and mate-pairs, or the spanning information of super reads, to assemble repeats of moderate length. D is the local assembly strategy: within a local region many repeats are single-copy and easy to assemble. E is the gap-filling strategy, i.e. local assembly of each gap after scaffold assembly is complete.

Fig. 3 shows the handling of heterozygous sequences. A merges simple isolated SNP regions. B uses the spanning information of super reads to identify how adjacent heterozygous sequences are assembled and merges them. C, for larger heterozygous regions that lie close together, uses the link information of paired-ends or mate-pairs to merge the heterozygous parts.

Fig. 4 illustrates read error correction. A builds the de Bruijn graph from the simply filtered raw sequencing data and then corrects it, mainly by tip deletion, erroneous-edge deletion and bubble path merging. B is paired-end error correction: the small grey rectangles are sequencing errors, PE is the original read and PE* is the corrected read. C is mate-pair error correction: MP is a read containing sequencing errors and MP* is the corrected result. The grey rectangles are sequencing errors, labelled E in the figure, which are deleted during error correction. J is the circularization junction introduced during library construction.

Fig. 5 is an overview of the super read construction principle. In A, R1 and R2 are the two end sequences of a paired-end, which are located in the corrected de Bruijn graph by kmer lookup. In B, a graph-theoretic path search algorithm searches for the path between R1 and R2; the dotted line in the figure is the path found. C is the path sequence, i.e. the super read, extracted from the kmer information along the path found.

Fig. 6 is an overview of the local assembly principle. In A, "c1", "c2", "c3", "c4", "c5", "r1", "r2" and "r3" are primary contigs; "c1", "c2", "c3", "c4" and "c5" are all single-copy, "r1", "r2" and "r3" are all multi-copy, and "c2" and "c4" are the seeds screened from all single-copy primary contigs. In B, the grey arcs are the mate-pair link information between different primary contigs. "c1" and "c3" are the primary contigs neighbouring "c2", and "c3" and "c5" are the primary contigs neighbouring "c4". The short grey rectangles are UARs, i.e. reads that are not aligned to the single-copy primary contigs. C builds the de Bruijn graph locally and corrects it. D performs a path search between connected primary contigs on the locally corrected de Bruijn graph to carry out local assembly. E merges all assembly results to obtain the final genome assembly.

Fig. 7 is the genome collinearity plot for Bifidobacterium bifidum PRL2010: the GNOVO assembly plotted against the Bifidobacterium bifidum genome (genome sequence accession number CP001840.1; assembled genome size 2,214,656 bp). The vertical axis is the genome assembled by GNOVO, the horizontal axis is the reference genome, and the black dotted line is the collinear part of the genomes.

Fig. 8 is the genome collinearity plot for Streptomyces roseosporus NRRL 15998: the GNOVO assembly plotted against the Streptomyces roseosporus NRRL 15998 genome (genome sequence accession number NZ_DS999644.1; assembled genome size 7,817,295 bp). The vertical axis is the genome assembled by GNOVO, the horizontal axis is the reference genome, and the dotted line is the collinear part of the genomes.
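The raw-data filtering rule described for panel A of Fig. 1 can be written as a small Python predicate. This is a sketch: the function name and argument layout are hypothetical, qualities are assumed to be already decoded Phred scores, and reads lying exactly at the 5% boundary are kept because the text only excludes reads above it.

```python
def keep_read(seq, quals, max_n_frac=0.05, min_qual=20, max_lowq_frac=0.05):
    """Return True if the read passes the N-content and low-quality filters."""
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_qual for q in quals) / len(quals)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac
```

A 20-base read with two N bases (10%) or two bases below quality 20 (10%) would be discarded, whereas a read with a single N (5%) would pass.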

Embodiments

The following embodiments illustrate the present invention but are not intended to limit its scope.

Embodiment 1: Escherichia coli (E. coli) genome assembly

1) Test data

The test data were downloaded from the SRA (Short Read Archive) database of NCBI (National Center for Biotechnology Information). The SRA database address is www.ncbi.nlm.nih.gov/sra, and the accession number of the data is SRX016044. The details of the test data are as follows:

Upload date: 2009-05-22;

Library size: 180 bp;

Total sequencing volume: 2.1G;

Estimated genome sequencing depth: 456.5x.

2) Assessment method

Seven assemblers in total were tested and compared. The main parameters of each assembler were swept, and the best assembly result of each was chosen for comparison. The detailed parameters giving the best result for each assembler are as follows:

GNOVO (the method of the invention): k1=25, k2=95, m1=5, m2=2, other parameters default. Here k1 is the kmer size used for the first de Bruijn graph construction, and k2 is the kmer size used for the second de Bruijn graph construction, which is based on super reads, during primary contig assembly; m1 is the parameter for low-coverage edge deletion in the first de Bruijn graph construction and defines the low-coverage threshold; m2 is the parameter for low-coverage edge deletion in the second de Bruijn graph construction and likewise defines the low-coverage threshold;

JR-Assembler: all parameters default;

Edena: m=53, other parameters default;

Taipan: k=50, other parameters default;

Velvet: k=45, other parameters default;

ABySS: k=45, other parameters default;

SOAPdenovo: k=53, other parameters default;

In the first step of GNOVO assembly, a de Bruijn graph is built from the high-throughput sequencing data and the sequencing data are corrected on the basis of the corrected de Bruijn graph. The kmer size used to build the de Bruijn graph here is 25 (specified by parameter k1). The original de Bruijn graph is corrected at the same time, with a threshold of 5 for deleting low-coverage edges, i.e. all edges with a depth below 5 are deleted. After graph error correction, the raw sequencing reads are aligned to the de Bruijn graph and corrected.

The main information of the de Bruijn graph before and after error correction is shown in Table 1:

Table 1: Main de Bruijn graph information before and after error correction in the first step of assembling the E. coli genome with the GNOVO method

The comparison of the indices before and after error correction shows that the total number of kmers decreased by about 10%, whereas the number of nodes and the number of distinct kmers decreased about 400-fold and 80-fold respectively. Sequencing errors evidently create a large number of low-depth kmers and extra nodes, which greatly increases the complexity of the de Bruijn graph.

In the second step of GNOVO assembly, the reads are aligned back to the de Bruijn graph and super reads are built by a path search algorithm. During path search the default search parameter is 3, i.e. 3 standard deviations. The number of original reads is 6096923 and the number of reads that successfully formed super reads is 5843968, so the search efficiency is 95.8% (5843968/6096923).

GNOVO is when assembling, and the 3rd step utilizes super read to rebuild de Bruijn, by figure error correction and deconsolidation process, obtains elementary contig.When carrying out structure de Bruijn here, kmer size adopts 95 (adopting parameter k2 to specify), error correction is carried out to original de Bruijn simultaneously, adopt 2 here when deleting low cover degree limit for threshold value, namely the degree of depth is less than the limit of 2 all by deleted.After completing figure error correction, de Bruijn is split from Nodes, obtain elementary contigs.

The main statistics of the de Bruijn graph before and after error correction are as follows:

Table 2 Main de Bruijn graph statistics before and after error correction in the third step of the GNOVO assembly of the E. coli genome

As can be seen from the table above, the graph complexity is now very low, with only 2324 nodes, which indicates that the assembly is highly complete.

The statistics of the primary contigs obtained are: total contig length 4.55 Mb, 169 contigs in total, and a contig N50 of 60284 bp.

In the fourth step of the GNOVO assembly, the primary contigs and reads of a specific local region are retrieved according to the paired-end and mate-pair alignment information and assembled locally. In local assembly the default minimum support is 3, that is, a link between two primary contigs is considered valid only when it is supported by at least 3 connections (set by the parameter "-cutoff"; changing it is generally not recommended). Local assembly uses multiple kmer sizes, by default 19, 57 and 95 (set by the parameters "-k", "-q" and "-n"). The statistics of the resulting contigs are: total contig length 4.55 Mb, 161 contigs in total, and a contig N50 of 63618 bp.
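The minimum-support rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than GNOVO code: `supported_links` is a hypothetical function, its input is assumed to be one `(contig_a, contig_b)` tuple per read pair whose two ends align to different primary contigs, and the contig names are invented.

```python
from collections import Counter


def supported_links(pair_hits, cutoff=3):
    """Return contig links supported by at least `cutoff` read pairs.

    pair_hits: iterable of (contig_a, contig_b) tuples, one per read pair.
    A link is treated as undirected, so (a, b) and (b, a) are pooled.
    """
    counts = Counter(tuple(sorted(pair)) for pair in pair_hits)
    return {link: n for link, n in counts.items() if n >= cutoff}


# Toy input (assumed): c1-c2 is seen 4 times, c2-c3 only twice.
hits = [("c1", "c2")] * 3 + [("c2", "c1")] + [("c2", "c3")] * 2
links = supported_links(hits, cutoff=3)
```

With the default cutoff of 3, only the c1-c2 link survives; the c2-c3 link has too little support to be used.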

In the fifth step of the GNOVO assembly, a scaffold connection graph is built from the mate-pair link information, and the contigs are ordered by a sub-graph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds. During scaffolding the default minimum support is likewise 3, that is, a link between two primary contigs is considered valid only when it is supported by at least 3 connections (set by the parameter "-cutoff"; changing it is generally not recommended). Because no large-insert library is available here, scaffolding brings no improvement, and the result is identical to that of the fourth step.
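The ordering step can be illustrated with a small Python sketch. The claims describe direct (exhaustive) ordering for sub-graphs of at most 8 contigs and simulated annealing for larger ones; everything else here is an assumption made for illustration: the function names, the definition of a conflicting edge as a link `(a, b)` whose expected order "a before b" is violated, the swap move, the exponential acceptance rule and the linear cooling schedule.

```python
import itertools
import math
import random


def conflicts(order, links):
    """Count links (a, b), meaning 'a precedes b', that the order violates."""
    pos = {contig: i for i, contig in enumerate(order)}
    return sum(1 for a, b in links if pos[a] > pos[b])


def direct_order(contigs, links):
    """Exhaustively try every ordering and keep the one with fewest conflicts."""
    best = min(itertools.permutations(contigs),
               key=lambda order: conflicts(order, links))
    return list(best)


def anneal_order(contigs, links, t_start=10.0, t_end=0.01, step=0.01, seed=0):
    """Simulated annealing over orderings (assumed schedule and move set)."""
    rng = random.Random(seed)
    current = list(contigs)
    rng.shuffle(current)                      # random initial ordering
    best = list(current)
    t = t_start
    while t > t_end:
        candidate = list(current)
        i, j = rng.sample(range(len(candidate)), 2)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = conflicts(candidate, links) - conflicts(current, links)
        accept = 1.0 if delta <= 0 else math.exp(-delta / t)
        if rng.random() < accept:             # accept the new ordering
            current = candidate
            if conflicts(current, links) < conflicts(best, links):
                best = list(current)
        t -= step                             # cool by a fixed step
    return best


def order_subgraph(contigs, links, max_direct=8):
    """Pick the ordering method from the sub-graph size."""
    if len(contigs) <= max_direct:
        return direct_order(contigs, links)
    return anneal_order(contigs, links)


small = order_subgraph(["c3", "c1", "c2"], [("c1", "c2"), ("c2", "c3")])
names = [f"c{i}" for i in range(10)]
chain = [(names[i], names[i + 1]) for i in range(9)]
large = order_subgraph(list(reversed(names)), chain)
```

Exhaustive search is exact but grows factorially with the number of contigs, which is why a size threshold such as 8 is a natural place to switch to a heuristic search.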

3) Comparison of results

The assembly results of each assembler are shown in Table 3:

Table 3 Results of each assembler on the E. coli genome

Number of contigs: contigs shorter than 300 bp are not counted.

Total length: the combined length of all contigs.

Maximum contig length: the length of the longest contig in the assembly result.

Average contig length: the mean length of all contigs.

N50: all contigs are sorted from longest to shortest and their lengths are added up in that order; when the running total reaches half of the total contig length, the length of the last contig added is the contig N50.

Number of misassembled contigs: the number of contigs that cannot be aligned to the original reference genome.
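The N50 definition above can be written directly as a short Python function. The function name `n50` and the example lengths are illustrative assumptions; the logic follows the definition as stated.

```python
def n50(lengths):
    """Return the contig N50 for a list of contig lengths.

    Contigs are sorted from longest to shortest and summed in that order;
    the length of the contig at which the running total first reaches half
    of the total length is the N50. An empty list gives 0.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Toy lengths (assumed): total 290, half 145; 80 + 70 = 150 reaches half.
example = n50([80, 70, 50, 40, 30, 20])
```

For these toy lengths the running total passes half of the total at the second contig, so the N50 is 70.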

In this embodiment, the assembly method of the present invention, GNOVO, obtains 161 contigs, followed by JR-Assembler with 192 contigs, far better than the other assemblers. The N50 of GNOVO is 63.618 kb, more than 10 kb longer than that of JR-Assembler (48.673 kb) and Velvet (43.998 kb), which shows that in this example the assembly completeness of GNOVO is well ahead of the other assemblers. The longest contig obtained by GNOVO is 334.908 kb, more than 100 kb longer than those of the other assemblers. The number of misassembled contigs for GNOVO is 0, in line with most of the other assemblers, which reflects its high accuracy. In this embodiment, GNOVO shows a clear advantage over the other assemblers.

Embodiment 2 Assembly of the Streptomyces (S. roseosporus) genome

1) Test data

The test data were downloaded from the SRA database of NCBI (www.ncbi.nlm.nih.gov/sra); the accession numbers of the data are SRX026747 and SRX016085.

A) Details of test data SRX026747:

Upload date: 2010-08-06;

Library size: 180 bp;

Total sequencing data: 10.7 G;

Estimated genome sequencing depth: 1389.6X.

B) Details of test data SRX016085:

Upload date: 2009-09-20;

Library size: 4 kb;

Total sequencing data: 3.5 G;

Estimated genome sequencing depth: 454.5X.

2) Evaluation method

A total of 5 assemblers were tested and compared. The main parameters of each assembler were traversed, and the best assembly result of each was selected for comparison. The detailed parameters giving the best result for each assembler are as follows:

GNOVO: k1=25, k2=95, m1=11, m2=5, other parameters default (for the detailed evaluation procedure, see Embodiment 1);

JR-Assembler: default parameters;

ABySS: k=45, other parameters default;

Velvet: k=49, other parameters default;

SOAPdenovo: k=63, other parameters default.

3) Comparison of results

The assembly results of each assembler are shown in Table 4:

Table 4

In this embodiment, GNOVO has the highest N50, 13.134 kb, followed by Velvet (12.499 kb); its longest contig is 73.115 kb, more than 10 kb longer than that of Velvet (61.423 kb). GNOVO yields 1,242 contigs, more than the minimum of 1,127 obtained by ABySS. In this example GNOVO is slightly better than the other assemblers in maximum contig length, average contig length and N50, and only its contig count is slightly higher than that of ABySS; overall it shows good assembly capability.

It should be noted, however, that the GNOVO assembly is 9.79 Mb in size, clearly larger than the other assemblies. The inventors therefore aligned the raw data against the nt database, and the alignment showed that the raw data contain sequences from two bacteria, from which it was inferred that the raw data come from a mixed culture. The reference genomes of the two bacteria were downloaded from NCBI: Streptomyces roseosporus NRRL 15998 (genome sequence accession number NZ_DS999644.1, assembled genome size 7,817,295 bp) and Bifidobacterium bifidum PRL2010 (genome sequence accession number CP001840.1, assembled genome size 2,214,656 bp). Whole-genome alignment with MUMMER (similarity threshold 99%) showed that the GNOVO assembly aligns well to both genomes; the alignment results are shown in Fig. 7 and Fig. 8. This confirms the inventors' initial inference that the raw data come from a mixed culture, and also indirectly shows that GNOVO significantly increases contig length, has high assembly accuracy, and greatly improves the handling of repetitive sequences through local assembly.

Embodiment 3 Assembly of the Neurospora crassa (N. crassa) genome

1) Test data

The test data were downloaded from the SRA database of NCBI (www.ncbi.nlm.nih.gov/sra); the accession number of the data is SRX030834.

A) Details of test data SRX030834:

Upload date: 2010-11-11;

Library size: 180 bp;

Total sequencing data: 5.5 G;

Estimated genome sequencing depth: 148.3X.

2) Evaluation method

A total of 6 assemblers were tested and compared. The main parameters of each assembler were traversed, and the best assembly result of each was selected for comparison. The detailed parameters giving the best result for each assembler are as follows:

GNOVO: k1=25, k2=95, m1=5, m2=2, other parameters default (for the detailed evaluation procedure, see Embodiment 1);

JR-Assembler: default parameters;

ABySS: k=35, other parameters default;

Velvet: k=37, other parameters default;

SOAPdenovo: k=47, other parameters default;

Edena: m=45, other parameters default.

3) Comparison of results

The assembly results of each assembler are shown in Table 5:

Table 5

In this embodiment, the N50 of GNOVO is 10.473 kb, indicating better assembly completeness than the other assemblers (4 to 6 kb); its maximum and average contig lengths are also better than those of the other assemblers. GNOVO yields 11,300 contigs, more than the 10,187 of Velvet, and ranks second on this measure. Overall, the assembly result of GNOVO in this embodiment is better than that of the other assemblers.

Embodiment 4 Assembly of the Streptococcus intermedius (S. intermedius ATCC 27335) genome

1) Test data

The test data were downloaded from the SRA database of NCBI (www.ncbi.nlm.nih.gov/sra); the accession numbers of the data are SRX297066 and SRX297065.

A) Details of test data SRX297066:

Upload date: 2012-11-18;

Library size: 180 bp;

Total sequencing data: 1.1 G;

Estimated genome sequencing depth: 564.10X.

B) Details of test data SRX297065:

Upload date: 2012-11-19;

Library size: 5 kb;

Total sequencing data: 1.5 G;

Estimated genome sequencing depth: 769.23X.

2) Evaluation method

A total of 5 assemblers were tested and compared. The main parameters of each assembler were traversed, and the best assembly result of each was selected for comparison. The detailed parameters giving the best result for each assembler are as follows:

GNOVO: k1=25, k2=95, m1=11, m2=2, other parameters default (for the detailed evaluation procedure, see Embodiment 1);

Allpaths-lg: default parameters;

SPAdes: K=61,73,95, other parameters default;

MaSuRCA: k=85, other parameters default;

SOAPdenovo: k=77, other parameters default.

3) Comparison of results

The assembly results of each assembler are shown in Table 6:

Table 6

In this embodiment, GNOVO assembles the complete sequence of the bacterium (1 scaffold) with 7 contigs, far better than the other assemblers (more than 10 scaffolds each). This demonstrates its strong assembly capability: GNOVO significantly increases contig length and greatly improves the handling of repetitive sequences through local assembly.

Although the present invention has been described in detail above by way of a general description and specific embodiments, it will be apparent to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Accordingly, such modifications or improvements, made without departing from the spirit of the present invention, all fall within the scope of protection claimed by the present invention.

Claims

1. A genome de novo assembly method based on high-throughput sequencing data, comprising the following steps:

(1) constructing a de Bruijn graph from the high-throughput sequencing data, and error-correcting the high-throughput sequencing data on the basis of the corrected de Bruijn graph;

(2) building super reads using a path search algorithm;

(3) rebuilding a de Bruijn graph from the super reads, and obtaining primary contigs through error correction and splitting of the de Bruijn graph;

(4) retrieving the primary contigs and reads of a specific local region according to the paired-end and mate-pair alignment information, and performing local assembly;

(5) building a scaffold connection graph from the mate-pair link information, and ordering the contigs by a sub-graph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds.

2. The method according to claim 1, characterized in that the kmer length used to build the de Bruijn graph in step (1) varies according to coverage information and lies between 17 and 37.

3. The method according to claim 1, characterized in that the de Bruijn graph of step (1) is stored as a hash data structure and is constructed as follows:

2) reading each read iteratively and numbering it, the numbering starting from 0;

3) extracting all kmers in turn from the 5' end to the 3' end and storing them in the hash table; if a kmer already exists, only its path information, i.e. its predecessor and successor, needs to be stored; if the kmer does not exist, a new kmer node is created and its path information is stored as well;

4) when storing the first kmer of a read, if it is not present in the hash table, its true predecessor kmer node does not exist, so a new terminal sequencing tip node is created to stand in for the true predecessor kmer node as the backtracking predecessor node of this kmer node;

5) when storing a kmer node other than the first, if the kmer is found to exist already and the backtracking predecessor node of this kmer node is a terminal sequencing tip node, that tip node is removed and the backtracking predecessor node of this kmer node is set to the preceding kmer node.

4. The method according to claim 1, characterized in that the de Bruijn graphs of step (1) and step (3) are error-corrected by the following steps: 1) simplification of the de Bruijn graph; 2) deletion of terminal sequencing tips; 3) merging of bubble paths; 4) removal of low-coverage edges.

5. The method according to claim 1, characterized in that the path search algorithm of step (2) is a depth-first path search algorithm.

6. The method according to claim 1, characterized in that step (3) builds the de Bruijn graph with a larger kmer, the kmer size being 75 to 155.

7. The method according to claim 1, characterized in that the local assembly of step (4) comprises the following steps:

1) aligning the reads to the primary contigs; obtaining from the read alignments the distance information between primary contigs and the relation between reads and primary contigs; and reading the primary contig and read information into memory;

2) filtering out multicopy (copy number > 2) or short primary contigs; building a scaffold connection graph from the distance relations between primary contigs; selecting from it, as seeds, primary contigs that are far apart and relatively long; and, once the seeds are obtained, selecting the primary contigs within a certain range of each seed;

3) for the primary contigs of each local region, selecting according to the alignment results the sequenced fragments that have only one end on a primary contig, and also selecting the super reads that lie in a gap and whose sequenced-fragment coverage is greater than 0.9;

4) building a local de Bruijn graph and performing local assembly.

8. The method according to any one of claims 1-6, characterized in that, when the contigs are ordered in step (5) by the sub-graph partitioning algorithm and the simulated annealing algorithm, direct ordering or simulated annealing ordering is chosen according to the size of the sub-graph: sub-graphs of size ≤ 8 use direct ordering, and sub-graphs of size > 8 use simulated annealing ordering.

9. The method according to claim 8, characterized in that the sub-graph partitioning algorithm comprises the following steps:

1) traversing each contig in turn;

2) checking the length and edge connections of each contig;

3) deleting the contigs identified as boundary contigs, thereby splitting the graph;

the direct ordering algorithm comprises the following steps:

1) enumerating all possible orderings exhaustively;

2) selecting the ordering with the fewest conflicting edges as the optimal ordering;

and ordering the contigs by simulated annealing comprises the following steps:

1) randomly generating a contig ordering;

2) calculating the acceptance probability from the current temperature and a randomly perturbed new ordering;

3) generating a random probability and, if this random probability is less than the acceptance probability, taking the new contig ordering as the current ordering;

4) lowering the temperature by a specific step size while iterating steps 2) and 3);

5) selecting, from all current orderings, the ordering with the fewest conflicting edges as the optimal ordering.

10. The method according to any one of claims 1-4, characterized in that the path search used in high-throughput sequencing data error correction and in local assembly is a depth-first search, whereas the path search used in graph error correction additionally requires the edges to be weighted.