CN104239750B

CN104239750B - De novo genome assembly method based on high-throughput sequencing data

Info

Publication number: CN104239750B
Application number: CN201410421844.3A
Authority: CN
Inventors: 郑洪坤; 刘敏
Original assignee: BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Current assignee: BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Priority date: 2014-08-25
Filing date: 2014-08-25
Publication date: 2017-07-28
Anticipated expiration: 2034-08-25
Also published as: CN104239750A

Abstract

The invention provides a de novo genome assembly method based on high-throughput sequencing data, comprising the steps of: 1) building a de Bruijn graph from the high-throughput sequencing data, and performing sequencing-data error correction and super read assembly based on the error-corrected de Bruijn graph; 2) assembling primary contigs from the super reads; 3) retrieving the primary contigs and reads of a specific local region, performing local assembly, and merging all local assembly results; 4) ordering the contigs with a subgraph partitioning algorithm and simulated annealing to obtain the final scaffolds. The invention removes errors introduced by high-throughput sequencing through de Bruijn graph error correction, which improves data accuracy; it increases the effective read length by building super reads, which markedly increases contig length; and it greatly improves the handling of repetitive sequences through local assembly.

Description

De novo genome assembly method based on high-throughput sequencing data

Technical field

The present invention relates to a genome assembly method, and more particularly to a de novo genome assembly method based on short sequence fragments.

Background art

With the rapid development of second-generation sequencing technology, the cost of sequencing has fallen sharply, and de novo genome sequencing is increasingly favored by researchers. However, recovering the original genome from a large amount of short-read data still faces great challenges, and the most critical step is contig assembly. Construction of the de Bruijn graph is the core of graph-theoretic assembly algorithms and of today's mainstream de novo assemblers. The graph is an Euler graph built from k-mer overlap information and is the cornerstone of contig construction; the present invention is therefore also based on the de Bruijn graph.

Current contig assembly algorithms build the de Bruijn graph only once, and the k-mer size used in the graph is relatively fixed. Some multi-k-mer assembly algorithms exist, but they too build each graph only once and then merge the results. Typical assemblers apply only simple filtering and error correction to the short sequences used in assembly and do not process these raw short sequences any further, which largely limits the upper bound on the k-mer size used when building the de Bruijn graph. Consequently, for assembly methods that do not process the short sequences, the k-mer size is small, many branches arise during graph construction, the complexity of the de Bruijn graph increases greatly, and assembly quality suffers.

In addition, a major feature of animal and plant genomes is their high proportion of repetitive sequence. Repeats create many alternative sites and branches during assembly and thus make assembly harder. Two mainstream strategies currently handle some of these cases. One uses large-insert libraries to span the repeat, estimates the size of the repeat region, and then chooses a repeat path of appropriate length. The other first avoids the repeat regions and returns to assemble them after a preliminary assembly is complete. Of the two, the second is more effective for complex genomes, because it turns a global problem into local ones and greatly reduces the difficulty of assembly.

Summary of the invention

In view of the deficiencies of the prior art, an object of the invention is to provide a de novo genome assembly method based on high-throughput sequencing data, named GNOVO. The method first addresses the sequencing errors inherent in high-throughput sequencing through data error correction, and at the same time assembles short reads into super reads with a greater read length through super read assembly, which partly overcomes the problem of short sequencing reads. Second, through local assembly, repeats in the whole genome become locally single-copy sequences, which greatly reduces the difficulty of handling repeats and increases the length of the assembled contigs.

To achieve this object, the de novo genome assembly method based on high-throughput sequencing data of the invention, GNOVO, mainly comprises the following steps:

1) Build a de Bruijn graph (using a smaller k-mer) from the high-throughput sequencing data, perform graph error correction, and correct the sequencing data based on the error-corrected de Bruijn graph; the error correction principle is shown in Fig. 4;

2) Assemble super reads based on the error-corrected de Bruijn graph;

3) Rebuild a de Bruijn graph (using a larger k-mer) from the super reads, perform graph error correction, and split the error-corrected de Bruijn graph to obtain primary contigs;

4) Retrieve the primary contigs of a specific local region according to mate-pair link information and the local reads according to the alignment information of the sequencing data, and perform local assembly; merge all local assembly results, and split the merged graph after error correction to obtain contigs;

5) Build a scaffold connection graph from the mate-pair link information, partition the contigs with a subgraph partitioning algorithm, and order the contigs within each part by simulated annealing to obtain the final scaffolds.

A flow chart of the GNOVO assembly principle is shown in Fig. 1.

Steps 1 to 4 all use the de Bruijn graph as the core structure. In GNOVO, the de Bruijn graph is stored as a hash data structure, constructed as follows:

1) Allocate and initialize the hash table according to the genome size and the k-mer size;

2) Read each read in turn and number it, starting from 0;

3) Extract all k-mers in order from the 5' end to the 3' end and store them in the hash table. If a k-mer already exists, only its path information needs to be stored, that is, its predecessor and successor. If the k-mer does not exist, a new k-mer node is created and its path information is stored as well.

4) When the first k-mer of a read is stored, if it is not yet in the hash table, its true predecessor k-mer node does not exist so far. In that case a new tip (end-of-read) node is created in place of the true predecessor and serves as the backtracking predecessor of the k-mer node.

5) When a k-mer other than the first is stored, if the k-mer already exists and its backtracking predecessor is a tip node, the tip node is removed and the backtracking predecessor is set to the previous k-mer node. Because the k-mer is not the first one in the current read, it must have a true predecessor, namely the previous k-mer; the tip node can therefore be replaced by the true predecessor, which reduces the number of tip nodes and saves memory.
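The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the GNOVO implementation: the function name `build_debruijn` and the node layout are hypothetical, a Python dict stands in for the pre-allocated hash table, reverse complements are ignored, and a k-mer with an empty predecessor set plays the role of the placeholder tip node.

```python
def build_debruijn(reads, k):
    """Return {kmer: {"count", "pred", "succ"}} built from a list of read strings."""
    graph = {}
    for read in reads:                        # reads are processed in order (numbered from 0)
        prev = None
        for i in range(len(read) - k + 1):    # scan k-mers from the 5' end to the 3' end
            kmer = read[i:i + k]
            # create the node if it is new, otherwise reuse the stored one
            node = graph.setdefault(kmer, {"count": 0, "pred": set(), "succ": set()})
            node["count"] += 1
            if prev is not None:              # store path information: predecessor and successor
                node["pred"].add(prev)
                graph[prev]["succ"].add(kmer)
            prev = kmer
    return graph


graph = build_debruijn(["ACGTAC", "CGTACG"], 3)
```

In this toy input the first k-mer of the first read, `ACG`, initially has no predecessor; the second read later supplies `TAC` as its true predecessor, which mirrors the replacement described in step 5.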

Because the de Bruijn graph is the core data structure, its accuracy is essential. GNOVO therefore implements a series of graph error-correction operations, whose main steps are: 1) simplification of the de Bruijn graph; 2) removal of tips; 3) merging of bubbles; 4) removal of low-coverage edges.

1) Simplification of the de Bruijn graph: traverse every k-mer node in the hash table. Each current k-mer node is extended through its true predecessors and successors; if the reverse-complement node of the current k-mer also exists, both nodes are used for the extension. Extension method: extend along both the outgoing and the incoming edges, that is, along the true successors and the true predecessors. Extension condition in a single direction: the k-mer node being extended to (including its reverse-complement k-mer, if present) has exactly one true predecessor and exactly one true successor. Termination in a single direction: for successor extension, extension stops when the node reached has two or more true predecessors, has two or more true successors, has no true successor, or has already been visited in the de Bruijn graph. For predecessor extension, it stops when the node reached has two or more true predecessors, has two or more true successors, has no true predecessor, or has already been visited in the de Bruijn graph.

2) Tips arise mainly from sequencing errors at read ends. In GNOVO a tip is judged to be erroneous if: a) its length is less than 2K (K is the k-mer length); and b) an alternative incoming or outgoing edge with high coverage exists at the same node.

3) A bubble is a graph structure formed by two different paths that share the same start and end points and have no other nodes in common. Bubbles arise mainly from heterozygous sites and from sequencing errors in the interior of reads. In GNOVO a bubble is defined by: a) both paths are shorter than 200 bp; b) the path similarity is greater than 0.8; c) the coverage of at least one path is below a specified threshold. The core of the bubble search is a "Dijkstra-like breadth-first search" (Dijkstra's algorithm being the best-known shortest-path search algorithm).

4) Low-coverage edges arise mainly from sequencing errors in reads. The main criteria are: a) the coverage is below a specified threshold; b) the nodes at both ends of the edge each have at least one true predecessor and at least one true successor other than the current edge. For the coverage threshold, the default for a haploid genome is 1/2 of the mean (or median) edge coverage, and for a diploid genome 1/4 of the mean (or median) edge coverage. The best approach, however, is to choose the threshold from the overall coverage distribution.

A super read is a longer sequence obtained by filling the gap between the two ends of a paired-end read, or by joining the two ends through their overlap; the construction principle is shown in Fig. 5. Because a super read is derived from a paired-end read, its expected length is the library fragment size. Because it joins both end reads and the gap between them, it is generally much longer than a single read, so using super reads as the starting point for assembly is a great advantage. Super reads are assembled by a path search using depth-first search.
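A depth-first path search of this kind might look as follows. This is a sketch under simplifying assumptions, not the GNOVO code: `find_path` and `path_to_sequence` are hypothetical names, the graph is a plain dict from k-mer to successor set, the search returns the first path found rather than checking for ambiguity, and `max_nodes` stands in for the bound that the library fragment size places on the path length.

```python
def find_path(succ, start, goal, max_nodes):
    """Depth-first search for a k-mer path from start to goal with at most max_nodes nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if len(path) >= max_nodes:            # path would exceed the allowed fragment length
            continue
        for nxt in sorted(succ.get(node, ()), reverse=True):
            if nxt not in path:               # do not walk around a cycle
                stack.append((nxt, path + [nxt]))
    return None


def path_to_sequence(path):
    """Spell out the sequence of a k-mer path: first k-mer plus the last base of each later one."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


succ = {"ACG": {"CGT"}, "CGT": {"GTA", "GTC"}, "GTA": {"TAC"}, "GTC": set(), "TAC": set()}
super_read = path_to_sequence(find_path(succ, "ACG", "TAC", 10))
```

Here `ACG` represents the last k-mer of one end read and `TAC` the first k-mer of its mate; the path found between them fills the gap.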

Many analyses start from single-copy nodes, mainly because: 1) starting from single-copy nodes makes the assembly analysis easier and less error-prone; 2) once single-copy nodes are known, they can serve as anchor points for resolving part of the repeat assembly when repeats are processed later.

Suppose an edge has length n, and let Xi denote the number of reads whose start position is site i on the edge (note that the effective length of the edge is n-k+1, because edges are based on k-mers, so the maximum value of i is n-k+1). Assume the Xi are independent random variables that follow a Poisson distribution with expectation ρ, where ρ is determined by the distribution of edge coverage (that is, the overall coverage distribution over all edges).

According to the central limit theorem, the average of the Xi over an edge of length n should follow a normal distribution with mean ρ and a standard deviation that depends on ρ and n. If an edge is single-copy, the average of its Xi should not differ much from ρ. The following ratio is used as the criterion for edge uniqueness:

To measure the specificity, or uniqueness, of an edge, GNOVO uses F >= 5 as the criterion. The larger F is (that is, the smaller the average of Xi), the more specific the edge. A small average of Xi may also be caused by sequencing errors, but such errors are generally repaired in the preceding error-correction step.
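As a worked check of the normal approximation used here: if the Xi on an edge are treated as n independent Poisson variables with expectation ρ (the assumption stated in the text), the mean and variance of their average follow directly. This is only a sketch under that assumption; if the number of start sites is taken as n-k+1 rather than n, that value replaces n below. The expression for F itself is not derived here.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i] = \rho, \qquad
\operatorname{Var}(\bar{X}) = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{\rho}{n},
\qquad\text{so}\qquad
\bar{X} \;\approx\; \mathcal{N}\!\left(\rho,\ \frac{\rho}{n}\right),
\quad \operatorname{sd}(\bar{X}) = \sqrt{\rho/n}.
```

The spread shrinks as the edge gets longer, which is why the average start-site count on a long edge is a sharper indicator of copy number than on a short one.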

The main idea of the local assembly algorithm in GNOVO is to assemble the genome locally, which reduces assembly complexity and gives better local assembly results. The results of the local assemblies are then merged to obtain the assembly of the whole genome, which markedly improves the assembly (contigs). The local assembly principle is shown in Fig. 6. The main steps are:

1) Align the reads to the primary contigs. From the alignment results, obtain the distance information between primary contigs and the relation between reads and primary contigs. Read the primary contig and read information into memory.

2) Select seed primary contigs. Filter out primary contigs that are multi-copy (copy number > 2) or short. From the retained primary contigs, build a scaffold connection graph according to the distances between primary contigs, and select as seeds primary contigs that are long and far apart from one another. Once a seed is obtained, select the primary contigs within a certain range around it.

3) Select local reads: for each local set of primary contigs, first select, according to the alignment results, the sequenced fragments with only one end aligned to a primary contig. Also select the super reads that lie in gaps and whose fragment coverage is greater than 0.9.

4) Build a local de Bruijn graph and perform local assembly.

5) Merge the local assembly results from each local graph to obtain the global assembly result, then perform simplification and graph error correction to obtain the final contigs.
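The filtering part of step 2 can be illustrated with a short Python sketch. The names `seed_candidates` and `min_length` and the record layout are hypothetical; the text does not give a numeric length cutoff, and the later choice of seeds that are far apart in the scaffold connection graph is not modelled here.

```python
def seed_candidates(contigs, min_length):
    """Drop multi-copy (copy number > 2) or short primary contigs; longest candidates first."""
    kept = [c for c in contigs if c["copy_number"] <= 2 and c["length"] >= min_length]
    return sorted(kept, key=lambda c: c["length"], reverse=True)


contigs = [
    {"name": "c1", "length": 5000, "copy_number": 1},
    {"name": "r1", "length": 800, "copy_number": 6},    # multi-copy repeat, filtered out
    {"name": "c2", "length": 12000, "copy_number": 1},
    {"name": "c3", "length": 300, "copy_number": 1},    # too short, filtered out
]
candidates = [c["name"] for c in seed_candidates(contigs, min_length=1000)]
```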

During scaffold assembly, the overall scaffold graph is first split into small independent subgraphs. The paired ends that link a subgraph to other contigs all fall on boundary contigs (contigs longer than the library size, which a normal read pair cannot span), so each subgraph can be treated as a small unit and scaffolded independently. GNOVO orders the contigs by simulated annealing and takes the ordering with the fewest conflicting edges as the final scaffold. Once assembled, a scaffold can be treated as a single unit and assembled again with other contigs.
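A simulated-annealing ordering of this kind can be sketched as below. This is an illustration under stated assumptions, not the GNOVO implementation: each link `(a, b)` is taken to mean "contig a should precede contig b", a conflict is a link violated by the current order, and the swap move, temperature schedule and all parameter names are hypothetical choices.

```python
import math
import random


def count_conflicts(order, links):
    """Number of links (a, b) for which a does not precede b in the given order."""
    pos = {c: i for i, c in enumerate(order)}
    return sum(1 for a, b in links if pos[a] > pos[b])


def anneal_order(contigs, links, steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Search for an order with few conflicting links by simulated annealing."""
    rng = random.Random(seed)
    order = list(contigs)
    cost = count_conflicts(order, links)
    best, best_cost = order[:], cost
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)      # propose swapping two contigs
        order[i], order[j] = order[j], order[i]
        new_cost = count_conflicts(order, links)
        # always accept improvements; accept worse orders with a temperature-dependent probability
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        t *= cooling
    return best, best_cost


start = ["c3", "c1", "c2"]
links = [("c1", "c2"), ("c2", "c3")]
best, best_cost = anneal_order(start, links)
```

The function keeps the best order seen, so the returned conflict count is never worse than that of the starting order.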

After the contigs in a subgraph have been ordered by simulated annealing, GNOVO estimates the sizes of the gaps between adjacent contigs with a quadratic programming algorithm. The objective function used in the calculation is:

In the formula, E is the set of edges in the subgraph; C_i is the total length of the contigs spanned by edge i; a second term gives the total length of the gaps spanned by edge i; μ_i is the mean insert size of the library corresponding to edge i; and a final term gives the variance of the library corresponding to edge i.
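One plausible reading of these quantities is a variance-weighted least-squares fit: for every edge, the spanned contig length plus the spanned gaps should be close to the library mean. The sketch below assumes that form, which is not confirmed by the text; the symbols g (gap sizes) and sigma, the function name `estimate_gaps`, and the absence of any constraints are all assumptions, so this is ordinary least squares solved through the normal equations and not a full quadratic program.

```python
def estimate_gaps(n_gaps, edges):
    """Minimise sum_i (C_i + sum of spanned gaps - mu_i)^2 / sigma_i^2 over the gap sizes.

    edges: list of (spanned_gap_indices, C, mu, sigma). Every gap must be spanned by an edge.
    """
    A = [[0.0] * n_gaps for _ in range(n_gaps)]
    b = [0.0] * n_gaps
    for gaps, C, mu, sigma in edges:          # accumulate the normal equations A g = b
        w = 1.0 / sigma ** 2
        for p in gaps:
            b[p] += w * (mu - C)
            for q in gaps:
                A[p][q] += w
    for col in range(n_gaps):                 # Gaussian elimination with partial pivoting
        pivot = max(range(col, n_gaps), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n_gaps):
            f = A[r][col] / A[col][col]
            for c in range(col, n_gaps):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    g = [0.0] * n_gaps
    for r in reversed(range(n_gaps)):         # back substitution
        s = sum(A[r][c] * g[c] for c in range(r + 1, n_gaps))
        g[r] = (b[r] - s) / A[r][r]
    return g


# Two gaps; the third edge spans both of them.
gaps = estimate_gaps(2, [([0], 1000, 1100, 10), ([1], 500, 700, 10), ([0, 1], 1500, 1800, 10)])
```

A production version would normally add constraints (for example a lower bound on each gap), which is where a quadratic programming solver becomes necessary.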

The de novo genome assembly method is implemented on the Linux operating system in C, Perl and Fortran. It can process sequencing data from large genomes, and the computation is parallelizable, uses relatively little memory and runs fast.

The key points of the present invention are:

1) The de Bruijn graph is error-corrected first, and the error-corrected de Bruijn graph is then used to correct the high-throughput sequencing data.

2) Based on the error-corrected de Bruijn graph, paired-end reads are assembled with a path search algorithm into longer super reads, and the super reads are used to build primary contigs.

3) The primary contigs and reads of a specific local region are retrieved according to paired-end and mate-pair alignment information, local assembly is performed, and all local assembly results are finally merged to obtain contigs.

4) The partitioned scaffold subgraphs are scaffolded by simulated annealing.

By assembling each local region first, the local assembly strategy converts the complexity of the whole system into local uniqueness, which greatly reduces the difficulty of assembly.

The contig assembly approach in the de novo genome assembly method based on high-throughput sequencing data of the invention (named the GNOVO method) differs from that of typical contig assembly algorithms: it builds the de Bruijn graph twice. The first graph is built with a smaller K and is mainly used for error correction and super read construction. The second graph is built from the super read data with a larger K and is mainly used to build primary contigs. Primary contigs are thus built from super reads of greater length, and the larger K also handles some repeats better. The use of super reads in repeat assembly is shown in Fig. 2.

In addition, the short reads of high-throughput sequencing pose a great challenge to genome assembly algorithms. The research approach of the invention steps outside the assembly algorithm itself and focuses on increasing read length, which gives the assembly algorithm a better starting point. Based on the distance between the two ends of a paired-end read, a graph-theoretic algorithm fills the gap between them to produce a longer super read (paired ends that overlap are joined directly). Because reads from high-throughput sequencing are very short and super reads are much longer, using super reads as the starting point for assembly has clear advantages, for example: 1) longer overlaps can be used to join reads; 2) super reads can span longer repeats (see Fig. 2); 3) based on super reads, GNOVO can use a larger k-mer, which reduces graph complexity and handles heterozygous sequences better (see Fig. 3).

Second, local assembly turns repeats in the whole genome into locally single-copy sequences, which greatly reduces the difficulty of handling repeats and increases the length of the assembled contigs. The local assembly algorithm in GNOVO retrieves the primary contigs and reads of a specific local region according to paired-end and mate-pair alignment information, performs local assembly, and finally merges all local assembly results to obtain contigs. Finally, GNOVO orders the resulting contigs with a subgraph partitioning algorithm and simulated annealing to obtain the final scaffolds.

The beneficial effects of the present invention are mainly as follows:

1) Correcting the raw sequencing data against the error-corrected de Bruijn graph ensures the accuracy of the sequencing data; the base error rate is generally below 0.0001. This also provides a new method for sequencing-data error correction.

2) Assembling paired-end reads on the de Bruijn graph with a path search algorithm yields longer sequences, which greatly reduces the difficulty of assembly; for a 180 bp library, sequences of 150 bp to 230 bp can be obtained.

3) Assembling the genome from super reads allows a larger k-mer (> 95) in the de Bruijn assembly algorithm, which reduces graph complexity, increases primary contig length and safeguards the final assembly quality; for bacterial data, the N50 can directly exceed 10 kb and can even reach 30-50 kb.

4) The local assembly strategy assembles parts of the genome locally, which greatly reduces assembly difficulty, especially for repeats, and safeguards the final contig length; for bacterial data, the N50 can directly exceed 50 kb and can even reach 100-500 kb. This also provides a new local assembly method.

5) Scaffolding the partitioned scaffold subgraphs by simulated annealing produces longer scaffolds, with an N50 that generally exceeds 500 kb.

6) The method makes full use of Linux clusters: parallel computation and the path hash design improve computational efficiency and overcome the memory limits on large data sets, so that genomes of up to 10 G can be assembled.

Brief description of the drawings

Fig. 1 is an overview of the GNOVO assembly workflow. A is raw data filtering: a read is filtered out at the raw-data processing stage if its proportion of N bases is high (> 5%) or its proportion of low-quality bases (quality value < 20) is high (> 5%). B is read error correction: based on the error-corrected de Bruijn graph, paired-end and mate-pair data are corrected with different strategies. C is super read construction: super reads are built on the error-corrected de Bruijn graph with a path search algorithm. D is primary contig assembly: primary contigs are built from the super read data with a large k-mer according to de Bruijn graph theory. E is local assembly: the single-copy primary contigs near a seed are retrieved first, the corresponding reads are then retrieved from the single-end alignment data, and finally a small local de Bruijn graph is built and assembled. F is scaffold construction: scaffolds are built from mate-pair link information.
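The filtering rule in panel A is simple enough to state as code. The sketch below is illustrative: the function name `passes_filter` and its keyword names are hypothetical, and quality values are assumed to be already decoded to integers.

```python
def passes_filter(seq, quals, max_n_frac=0.05, min_q=20, max_lowq_frac=0.05):
    """Keep a read unless more than 5% of bases are N or more than 5% have quality < 20."""
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


good = passes_filter("ACGT" * 5, [30] * 20)
bad = passes_filter("ACGTNN" + "A" * 14, [30] * 20)   # 2 of 20 bases are N (10%)
```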

Fig. 2 shows the strategies for resolving repeats. A is the large k-mer strategy: repeats are assembled with a k-mer longer than the repeat. B and C are the link strategy: repeats of moderate length are assembled using paired-end and mate-pair link information, or the spanning information of super reads. D is the local assembly strategy: locally, many repeats are single-copy and easy to assemble. E is the gap-filling strategy: after scaffold assembly is complete, local assembly is performed for each gap.

Fig. 3 shows the handling of heterozygous sequences. A: simple isolated SNP regions are merged. B: the arrangement of adjacent heterozygous sequences is identified from the spanning information of super reads, and the sequences are merged. C: for larger heterozygous regions that lie close together, the heterozygous parts are merged using paired-end or mate-pair link information.

Fig. 4 shows read error correction. A: a de Bruijn graph is built from the raw sequencing data after simple filtering and then error-corrected, mainly by tip removal, erroneous-edge removal and bubble merging. B: paired-end error correction; the small grey rectangle is a sequencing error, PE is the original read, and PE* is the read after correction. C: mate-pair error correction; MP is a read containing a sequencing error, and MP* is the result after correction. The grey rectangle is the sequencing error (E in the figure), which is deleted during correction. J is the circularization junction introduced during library construction.

Fig. 5 is an overview of the super read construction principle. A: R1 and R2 are the two end sequences of a paired-end read, located in the error-corrected de Bruijn graph by k-mer lookup. B: a graph-theoretic path search algorithm searches for a path between R1 and R2; the dotted line in the figure is the path found. C: the complete path sequence, that is, the super read, is extracted from the k-mer information along the path.

Fig. 6 is an overview of the local assembly principle. In A, "c1", "c2", "c3", "c4", "c5", "r1", "r2" and "r3" are primary contigs; "c1" to "c5" are all single-copy and "r1", "r2" and "r3" are multi-copy; "c2" and "c4" are the seeds selected from all single-copy primary contigs. In B, the grey arcs are the mate-pair link information between different primary contigs; "c1" and "c3" are the neighboring primary contigs of "c2", and "c3" and "c5" are the neighboring primary contigs of "c4". The short grey rectangles are UARs, that is, reads that do not align to a single-copy primary contig. C: a local de Bruijn graph is built and error-corrected. D: based on the locally error-corrected de Bruijn graph, a path search is run between linked primary contigs to perform local assembly. E: all assembly results are merged to obtain the final genome assembly.

Fig. 7 is a genome synteny plot for Bifidobacterium bifidum PRL2010: the GNOVO assembly against the Bifidobacterium bifidum reference (genome sequence accession number CP001840.1, assembled genome size 2,214,656 bp). The vertical axis is the GNOVO assembly, the horizontal axis is the reference genome, and the black dotted line marks the syntenic regions.

Fig. 8 is a genome synteny plot for Streptomyces roseosporus NRRL 15998: the GNOVO assembly against the Streptomyces reference (Streptomyces roseosporus NRRL15998, genome sequence accession number NZ_DS999644.1, assembled genome size 7,817,295 bp). The vertical axis is the GNOVO assembly, the horizontal axis is the reference genome, and the dotted line marks the syntenic regions.

Embodiment

The following examples illustrate the present invention but do not limit its scope.

Embodiment 1: Assembly of the Escherichia coli (E. coli) genome

1) Test data

The test data were downloaded from the SRA (Short Read Archive) database of NCBI (National Center for Biotechnology Information, the U.S. national center for biotechnology information). The SRA database address is www.ncbi.nlm.nih.gov/sra, and the accession number of the data is SRX016044. The details of the test data are as follows:

Upload date: 2009-05-22;

Library size: 180 bp;

Total sequencing volume: 2.1 G;

Estimated genome sequencing depth: 456.5x.

2) Evaluation method

Seven assemblers were tested and compared. The main parameters of each assembler were swept, and the best assembly result of each was selected for comparison. The detailed parameters that gave the best result for each assembler are as follows:

GNOVO (the method of the invention): k1=25, k2=95, m1=5, m2=2, other parameters default. Here k1 is the k-mer size for the first de Bruijn graph, and k2 is the k-mer size for the second de Bruijn graph, built from super reads for primary contig assembly. m1 is the threshold that defines low coverage for low-coverage edge removal on the first de Bruijn graph, and m2 is the corresponding threshold for the second de Bruijn graph;

JR-Assembler: all parameters default;

Edena: M=53, other parameters default;

Taipan: K=50, other parameters default;

Velvet: K=45, other parameters default;

ABySS: K=45, other parameters default;

SOAPdenovo: K=53, other parameters default.

In the first step of the GNOVO assembly, a de Bruijn graph is built from the high-throughput sequencing data and the sequencing data are corrected against the error-corrected graph. Here the de Bruijn graph is built with a k-mer size of 25 (set with parameter k1), and the raw de Bruijn graph is error-corrected; low-coverage edges are removed with a threshold of 5, so every edge with depth below 5 is deleted. After graph error correction, the raw sequencing reads are aligned to the de Bruijn graph and corrected.

The main statistics of the de Bruijn graph before and after error correction are shown in Table 1:

Table 1. Main statistics of the de Bruijn graph before and after error correction in the first step of the GNOVO assembly of the E. coli genome

Comparing the statistics before and after error correction, the total number of k-mers fell by only about 10%, whereas the number of nodes and the number of distinct k-mers fell about 400-fold and 80-fold respectively. Sequencing errors therefore generate a large number of low-depth k-mers and extra nodes, which greatly increases the complexity of the de Bruijn graph.

In the second step, the reads are aligned back to the de Bruijn graph and super reads are built with a path search algorithm. During the path search, the default search tolerance is 3, that is, three standard deviations. There were 6096923 original reads, of which 5843968 were successfully built into super reads, a search efficiency of 95.8% (5843968/6096923).

In the third step, the de Bruijn graph is rebuilt from the super reads, and primary contigs are obtained through graph error correction and splitting. Here the de Bruijn graph is built with a k-mer size of 95 (set with parameter k2), and the raw graph is error-corrected; low-coverage edges are removed with a threshold of 2, so every edge with depth below 2 is deleted. After graph error correction, the de Bruijn graph is split at its nodes to obtain the primary contigs.

The main statistics of the de Bruijn graph before and after error correction are as follows:

Table 2. Main de Bruijn graph statistics before and after third-step error correction in the GNOVO assembly of the E. coli genome

As the table shows, the complexity of the graph is now very low, with only 2324 nodes, which indicates that the assembly is highly complete.

The statistics of the first set of contigs obtained are: total contig length 4.55 Mb, 169 contigs, and a contig N50 length of 60284 bp.

In the fourth step of assembly, GNOVO retrieves the primary contigs and reads of a specific local region according to pair-end and mate-pair alignment information and performs local assembly. In local assembly, the default minimum support is 3, i.e., a connection between two primary contigs is valid only when the number of links is greater than or equal to 3 (set with parameter -cutoff; modifying it is generally not recommended). Local assembly uses multiple kmer sizes, by default 19, 57 and 95 (set with parameters "-k", "-q" and "-n"). The statistics of the contigs obtained are: total contig length 4.55 Mb, 161 contigs, and a contig N50 length of 63618 bp.
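The minimum-support rule can be sketched as follows. This is an illustrative Python example under stated assumptions, not GNOVO code: `supported_links` and the input format (one pair of contig names per read pair whose two ends align to those contigs) are hypothetical, and only the default cutoff of 3 comes from the text.

```python
from collections import Counter

def supported_links(pair_hits, cutoff=3):
    """Return contig pairs joined by at least `cutoff` read pairs."""
    counts = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:  # pairs inside one contig link nothing
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= cutoff}

# Toy data (assumption): three read pairs join c1 and c2, one joins c2 and c3.
hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c3")]
links = supported_links(hits)
```

With these toy hits only the c1/c2 connection reaches the cutoff, so the single read pair between c2 and c3 is treated as unsupported.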

In the fifth step of assembly, GNOVO builds a scaffold connection graph from the mate-pair link information and orders the contigs with the subgraph partitioning algorithm and the simulated annealing algorithm to obtain the final scaffolds. In scaffold assembly, the default minimum support is likewise 3, i.e., a connection between two primary contigs is valid only when the number of links is greater than or equal to 3 (set with parameter -cutoff; modifying it is generally not recommended). Because no large-insert library is available here, scaffold assembly brings no improvement, and the result is the same as that of the fourth step.

3) Comparison of results

The assembly results of each assembler are shown in Table 3:

Table 3. Results of each assembler on the E. coli genome

Number of contigs: contigs shorter than 300 bp are not counted.

Total length: the total length of all contigs.

Maximum contig length: the length of the longest contig in the assembly result.

Average contig length: the average length of all contigs.

N50: all contigs are sorted from longest to shortest and their lengths are added up in that order; when the running sum reaches half of the total contig length, the length of the last contig added is the contig N50.

Number of misassembled contigs: the number of contigs that cannot be aligned to the original reference genome.
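The contig count and N50 definitions above can be expressed as a short Python sketch. The function names are hypothetical; whether the 300 bp filter is also applied before computing N50 is not stated in the text, so here it is applied only to the contig count.

```python
def count_contigs(lengths, min_length=300):
    """Count contigs, ignoring those shorter than min_length (300 bp in the text)."""
    return sum(1 for n in lengths if n >= min_length)

def n50(lengths):
    """Length of the contig at which the running sum, longest first, reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input

# Toy data (assumption): total length 280, so the half-way point is 140.
toy_n50 = n50([100, 80, 50, 30, 20])
```

For the toy lengths the running sum is 100 after the first contig and 180 after the second, so the N50 is the length of the second contig.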

In this embodiment, the assembly method GNOVO of the present invention obtained 161 contigs, followed by JR-Assembler with 192 contigs, far better than the other assemblers. GNOVO's N50 length is 63.618 kb, more than 10 kb higher than that of JR-Assembler (48.673 kb) and Velvet (43.998 kb), which shows that GNOVO's assembly completeness is far better than that of the other assemblers in this example. The longest contig obtained by GNOVO is 334.908 kb, more than 100 kb longer than those of the other software. GNOVO has 0 misassembled contigs, consistent with most of the other software, which shows its high accuracy. In this embodiment, GNOVO shows a considerable advantage over the other assemblers.

Embodiment 2: Assembly of the Streptomyces (S. roseosporus) genome

1) Test data

The test data were downloaded from the NCBI SRA database (www.ncbi.nlm.nih.gov/sra). The accession numbers of the data are SRX026747 and SRX016085.

A) Details of test data SRX026747:

Upload date: 2010-08-06;

Library size: 180 bp;

Total sequencing data: 10.7 G;

Estimated genome sequencing depth: 1389.6X.

B) Details of test data SRX016085:

Upload date: 2009-09-20;

Library size: 4 kb;

Total sequencing data: 3.5 G;

Estimated genome sequencing depth: 454.5X.

2) Evaluation method

A total of 5 assemblers were tested and compared. The main parameters of each assembler were traversed, and the best assembly result of each was chosen for comparison. The detailed parameters that gave the best result for each assembler are as follows:

GNOVO: k1=25, k2=95, m1=11, m2=5, other parameters default (for details of the evaluation, see Embodiment 1);

JR-Assembler: default parameters;

ABySS: k=45, other parameters default;

Velvet: k=49, other parameters default;

SOAPdenovo: k=63, other parameters default;

3) Comparison of results

The assembly results of each assembler are shown in Table 4:

Table 4

In this embodiment, GNOVO's N50 is the highest at 13.134 kb, followed by Velvet (12.499 kb). Its longest contig is 73.115 kb, more than 10 kb longer than that of Velvet (61.423 kb). Its contig count is 1,242, more than the minimum of 1,127 obtained by ABySS. In this example, GNOVO is slightly better than the other assemblers in maximum contig length, average contig length and N50 length, and only its contig count is slightly higher than that of ABySS; overall it shows good assembly ability.

It should be noted that the GNOVO assembly is 9.79 Mb in size, clearly larger than the other assemblies. The inventors therefore aligned the raw data against the nt database, and the alignment showed that the raw data contain data from two bacteria, suggesting that the raw data come from a mixed culture. The reference genomes of the corresponding bacteria were downloaded from NCBI, namely Streptomyces (Streptomyces roseosporus NRRL 15998, genome sequence accession number NZ_DS999644.1, assembled genome size 7,817,295 bp) and Bifidobacterium bifidum (Bifidobacterium bifidum PRL2010, genome sequence accession number CP001840.1, assembled genome size 2,214,656 bp). Whole-genome alignment with MUMMER (similarity requirement 99%) showed that the GNOVO assembly aligns well to both genomes; the alignment results are shown in Fig. 7 and Fig. 8. This confirms the inventors' initial conjecture that the raw data come from a mixed culture. It also indirectly shows that GNOVO markedly increases contig length, has high assembly accuracy, and greatly improves the handling of repetitive sequences through local assembly.

Embodiment 3: Assembly of the Neurospora crassa (N. crassa) genome

1) Test data

The test data were downloaded from the NCBI SRA database (www.ncbi.nlm.nih.gov/sra). The accession number of the data is SRX030834.

A) Details of test data SRX030834:

Upload date: 2010-11-11;

Library size: 180 bp;

Total sequencing data: 5.5 G;

Estimated genome sequencing depth: 148.3X.

2) Evaluation method

A total of 6 assemblers were tested and compared. The main parameters of each assembler were traversed, and the best assembly result of each was chosen for comparison. The detailed parameters that gave the best result for each assembler are as follows:

GNOVO: k1=25, k2=95, m1=5, m2=2, other parameters default (for details of the evaluation, see Embodiment 1);

JR-Assembler: default parameters;

ABySS: k=35, other parameters default;

Velvet: k=37, other parameters default;

SOAPdenovo: k=47, other parameters default;

Edena: m=45, other parameters default;

3) Comparison of results

The assembly results of each assembler are shown in Table 5:

Table 5

In this embodiment, GNOVO's N50 is 10.473 kb, indicating better assembly completeness than the other assemblers (4 to 6 kb). The maximum and average contig lengths of its assembly are also better than those of the other assemblers. GNOVO's contig count is 11,300, more than Velvet's 10,187, which places it second. In this embodiment, GNOVO's overall assembly quality is better than that of the other assemblers.

Embodiment 4: Assembly of the Streptococcus intermedius (S. intermedius ATCC 27335) genome

1) Test data

The test data were downloaded from the NCBI SRA database (www.ncbi.nlm.nih.gov/sra). The accession numbers of the data are SRX297066 and SRX297065.

A) Details of test data SRX297066:

Upload date: 2012-11-18;

Library size: 180 bp;

Total sequencing data: 1.1 G;

Estimated genome sequencing depth: 564.10X.

B) Details of test data SRX297065:

Upload date: 2012-11-19;

Library size: 5 kb;

Total sequencing data: 1.5 G;

Estimated genome sequencing depth: 769.23X.

2) Evaluation method

A total of 5 assemblers were tested and compared. The main parameters of each assembler were traversed, and the best assembly result of each was chosen for comparison. The detailed parameters that gave the best result for each assembler are as follows:

GNOVO: k1=25, k2=95, m1=11, m2=2, other parameters default (for details of the evaluation, see Embodiment 1);

Allpaths-lg: default parameters;

SPAdes: k=61,73,95, other parameters default;

MaSuRCA: k=85, other parameters default;

SOAPdenovo: k=77, other parameters default;

3) Comparison of results

The assembly results of each assembler are shown in Table 6:

Table 6

In this embodiment, GNOVO assembled the complete sequence of the bacterium (1 scaffold) with 7 contigs, far better than the other assemblers (more than 10 scaffolds each). This shows its strong assembly capability: GNOVO markedly increases contig length and, through local assembly, greatly improves the handling of repetitive sequences.

Although the present invention has been described in detail above by means of a general description and specific embodiments, it will be apparent to those skilled in the art that modifications or improvements can be made on the basis of the invention. Such modifications or improvements, made without departing from the spirit of the invention, therefore fall within the scope of protection claimed by the present invention.

Claims

1. A de novo genome assembly method based on high-throughput sequencing data, comprising the following steps:

(1) building a de Bruijn graph from the high-throughput sequencing data, and correcting the high-throughput sequencing data based on the error-corrected de Bruijn graph;

(2) building super reads using a path search algorithm;

(3) rebuilding a de Bruijn graph from the super reads, and obtaining primary contigs through error correction and splitting of the de Bruijn graph;

(4) retrieving the primary contigs and reads of a specific local region according to pair-end and mate-pair alignment information, performing local assembly, merging all local assembly results, and splitting after error correction to obtain contigs;

(5) building a scaffold connection graph from the mate-pair link information, and ordering the contigs with a subgraph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds;

wherein the de Bruijn graph in step (1) is stored as a hash data structure, and its construction algorithm is:

2) reading every read iteratively and numbering the reads, starting from 0;

3) extracting all kmers in order from the 5' end to the 3' end and storing them in the hash table; if a kmer already exists, only its path information, i.e., its predecessor and successor, needs to be stored; if the kmer does not exist, a new kmer node is created and its path information is stored as well;

4) when storing the first kmer of a read, if it is not present in the hash table, its true predecessor kmer node does not exist, so a new end-sequencing projection node is created in place of the true predecessor kmer node and serves as the backtracking predecessor node of this kmer node;

5) when storing a non-first kmer node, if the kmer already exists and the backtracking predecessor node of the kmer node is an end-sequencing projection node, the end-sequencing projection node is removed and the backtracking predecessor node of the kmer node is set to the previous kmer node.
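As a reading aid, the hash-based construction in sub-steps 2) to 5) can be sketched in Python roughly as follows. This is one possible interpretation, not the claimed implementation: the class `Node`, the sentinel `END_PROJECTION`, the field names and the toy reads are all hypothetical, a Python dict stands in for the hash table, and reverse complements are ignored.

```python
END_PROJECTION = object()  # hypothetical stand-in for an end-sequencing projection node

class Node:
    """One kmer node: path information plus its backtracking predecessor."""
    def __init__(self, first_read):
        self.first_read = first_read  # number of the read that created the node
        self.preds = set()            # predecessor kmers
        self.succs = set()            # successor kmers
        self.back_pred = None         # backtracking predecessor

def build_graph(reads, k):
    table = {}                                    # dict used as the hash table
    for read_id, read in enumerate(reads):        # reads are numbered from 0
        prev = None
        for i in range(len(read) - k + 1):        # kmers from the 5' to the 3' end
            kmer = read[i:i + k]
            node = table.get(kmer)
            if node is None:                      # new kmer: create the node
                node = table[kmer] = Node(read_id)
                # A new first kmer has no true predecessor, so it gets the projection.
                node.back_pred = END_PROJECTION if prev is None else prev
            elif prev is not None and node.back_pred is END_PROJECTION:
                node.back_pred = prev             # replace the projection node
            if prev is not None:                  # store path information
                node.preds.add(prev)
                table[prev].succs.add(kmer)
            prev = kmer
    return table

# Toy data (assumption): the second read extends the first one at its 5' end.
graph = build_graph(["CGTA", "ACGTA"], k=3)
```

In the toy input, CGT is first stored as the start of a read and receives the projection node; when the second read reaches CGT as a non-first kmer, the projection is replaced by the previous kmer ACG.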

2. A de novo genome assembly method based on high-throughput sequencing data, comprising the following steps:

(2) building super reads using a path search algorithm;

(4) retrieving the primary contigs and reads of a specific local region according to pair-end and mate-pair alignment information, and performing local assembly;

wherein the local assembly of step (4) comprises the following steps:

1) aligning the reads to the primary contigs, obtaining from the read alignment results the distance information between primary contigs and the relationship between reads and primary contigs, and reading the primary contig and read information into memory;

2) filtering out multi-copy primary contigs with a copy number > 2 and short primary contigs, building a scaffold connection graph from the distance relationships between primary contigs, selecting primary contigs that are far apart and long as seeds, and, once the seeds are obtained, selecting the primary contigs within a certain range around each seed;

3) for each local set of primary contigs, selecting from the alignment results the sequenced fragments that have only one end on a primary contig, and also selecting the super reads that lie in gaps and whose sequenced-fragment coverage is greater than 0.9;

4) building a de Bruijn graph locally and performing local assembly;

5) merging the local assembly results from each local graph to obtain the global assembly result, and then performing simplification and graph error correction to obtain the final contigs.

3. A de novo genome assembly method based on high-throughput sequencing data, comprising the following steps:

(2) building super reads using a path search algorithm;

(5) building a scaffold connection graph from the mate-pair link information, and ordering the contigs with a subgraph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds; when the contigs are ordered with the subgraph partitioning algorithm and the simulated annealing algorithm, direct ordering or simulated annealing ordering is selected according to the size of the subgraph: subgraphs of size ≤ 8 use direct ordering, and subgraphs of size > 8 use simulated annealing ordering;

wherein the subgraph partitioning algorithm comprises the following steps:

1) traversing each contig in turn;

2) examining the length and edge connections of each contig;

3) deleting the contigs identified as boundary contigs, i.e., splitting the graph;

the direct ordering comprises the following steps:

1) exhaustively enumerating all possible orderings;

2) choosing the ordering with the fewest conflicting edges as the optimal ordering;

and ordering the contigs by simulated annealing comprises the following steps:

1) randomly generating a contig ordering;

2) calculating the acceptance probability from the current temperature and a new ordering generated by a random change;

3) generating a random probability, and taking the new contig ordering as the current ordering if the random probability is less than the acceptance probability;

4) cooling with a specific step size while iterating steps 2 and 3;

5) choosing, from all current orderings, the ordering with the fewest conflicting edges as the optimal ordering.

4. The method according to any one of claims 1-3, characterized in that the kmer length used to build the de Bruijn graph in step (1) floats according to coverage information and is between 17 and 37.

5. The method according to any one of claims 1-3, characterized in that the path search algorithm of step (2) is a depth-first path search algorithm.

6. The method according to any one of claims 1-3, characterized in that step (3) builds the de Bruijn graph using a larger kmer, with a kmer size of 75-155.

7. The method according to any one of claims 1-2, characterized in that when step (5) orders the contigs with the subgraph partitioning algorithm and the simulated annealing algorithm, direct ordering or simulated annealing ordering is selected according to the size of the subgraph: subgraphs of size ≤ 8 use direct ordering, and subgraphs of size > 8 use simulated annealing ordering.

8. The method according to any one of claims 1-3, characterized in that the path search used in high-throughput sequencing data error correction and in local assembly is depth-first search, and the path search used in graph error correction additionally requires the edges to be weighted.