CN104239750B - De novo genome assembly method based on high-throughput sequencing data - Google Patents
- Publication number
- CN104239750B CN201410421844.3A
- Authority
- CN
- China
- Prior art keywords
- contigs
- kmer
- bruijns
- assembling
- error correction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides a de novo genome assembly method based on high-throughput sequencing data, comprising the steps of: 1) building a de Bruijn graph from the high-throughput sequencing data, and performing sequencing-data error correction and super read assembly based on the error-corrected de Bruijn graph; 2) assembling primary contigs from the super reads; 3) retrieving the primary contigs and reads of a specific local region, assembling them locally, and merging all local assembly results; 4) ordering the contigs with a subgraph-partitioning algorithm and simulated annealing to obtain the final scaffolds. The invention removes errors introduced by high-throughput sequencing through de Bruijn graph error correction, which improves data accuracy; it increases the effective read length by building super reads, which markedly increases contig length; and it greatly improves the handling of repetitive sequences through local assembly.
Description
Technical field
The present invention relates to a genome assembly method, and more particularly to a de novo genome assembly method based on short sequence fragments.
Background
With the rapid development of second-generation sequencing technologies and the sharp drop in sequencing cost, de novo genome sequencing has become increasingly popular among researchers. However, reconstructing a genome from a large number of short reads remains a major challenge, and the most critical step is contig assembly. The de Bruijn graph is the core structure of graph-based assembly algorithms and of today's mainstream de novo assemblers. It builds an Eulerian graph from the overlap information between kmers and is the foundation of contig construction, so the present invention is also based on the de Bruijn graph.
Current contig assembly algorithms build the de Bruijn graph only once, and the kmer size used in the graph is relatively fixed. Some multi-kmer assembly algorithms exist, but they also build each graph only once and then merge the results. Typical assemblers apply only simple filtering and error correction to the short reads used for assembly and do not process these raw reads any further, which largely limits the upper bound on the kmer size that can be used when building the de Bruijn graph. Genome assembly methods that do not process the short reads therefore use small kmers, which produce more branches during graph construction, greatly increase the complexity of the de Bruijn graph, and degrade the assembly.
In addition, a major characteristic of animal and plant genomes is a high proportion of repetitive sequence. Repeats create many alternative sites and branches during assembly and thus make assembly harder. Two mainstream strategies currently handle part of this problem. The first uses large-insert libraries to span a repeat, estimates the size of the repeat region, and then chooses a repeat path of suitable length. The second first avoids the repeat regions and, once a preliminary assembly is complete, returns to assemble them. Strategically, the second approach is more effective for complex genomes because it turns a global problem into local ones and greatly reduces assembly difficulty.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a de novo genome assembly method based on high-throughput sequencing data, named GNOVO. The technique first addresses the sequencing errors inherent in high-throughput sequencing through data error correction, and at the same time assembles shorter reads into super reads of greater read length, which partly overcomes the problem of short sequencing reads. Second, through local assembly, repetitive sequences at whole-genome scale become locally single-copy sequences, which greatly reduces the difficulty of handling repeats and increases the length of the assembled contigs.
To achieve this object, the de novo genome assembly method based on high-throughput sequencing data of the present invention, GNOVO, mainly comprises the following steps:
1) Build a de Bruijn graph (using a smaller kmer) from the high-throughput sequencing data, perform graph error correction, and correct the sequencing data based on the error-corrected de Bruijn graph; the error-correction principle is shown in Fig. 4;
2) Assemble super reads based on the error-corrected de Bruijn graph;
3) Rebuild the de Bruijn graph (using a larger kmer) from the super reads, perform graph error correction, and split the error-corrected de Bruijn graph to obtain primary contigs;
4) Retrieve the primary contigs of a specific local region according to mate-pair link information, retrieve the local reads according to the alignment information of the sequencing data, and perform local assembly; merge all local assembly results, then error-correct and split the merged graph to obtain contigs;
5) Build a scaffold link graph from the mate-pair link information, partition the contigs with a subgraph-partitioning algorithm, and order the contigs within each local subgraph by simulated annealing to obtain the final scaffolds.
The GNOVO assembly workflow is shown in Fig. 1.
Steps 1 to 4 all use the de Bruijn graph as the core structure. In GNOVO the de Bruijn graph is stored as a hash table, and it is built as follows:
1) Allocate and initialize the hash table according to the genome size and the kmer size;
2) Read the reads one by one and number them, starting from 0;
3) Extract all kmers of each read in order from the 5' end to the 3' end and store them in the hash table. If a kmer already exists, only its path information needs to be stored, which records its predecessor and successor. If the kmer does not exist, a new kmer node must be created and its path information stored as well;
4) When the first kmer of a read is stored and it is not yet in the hash table, its true predecessor kmer node does not exist, at least not so far. A new end-sequencing tip node is therefore created to stand in for the true predecessor kmer node and serves as the backtracking predecessor of this kmer node;
5) When a kmer that is not the first in its read is stored, if the kmer already exists and its backtracking predecessor is an end-sequencing tip node, that tip node is removed and the backtracking predecessor of the kmer node is set to the previous kmer node. Because the kmer is not the first kmer of the current read, it necessarily has a true predecessor, namely the previous kmer. The end-sequencing tip node can therefore be replaced by the true predecessor kmer node, which reduces the number of end-sequencing tips and saves memory.
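The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GNOVO implementation: the name `build_graph`, the `TIP` placeholder and the dictionary layout are hypothetical, reverse-complement kmers are ignored, and a Python dict stands in for the pre-allocated hash table.

```python
TIP = "<end-tip>"  # hypothetical placeholder for an end-sequencing tip node


def build_graph(reads, k):
    """Build a hash-table de Bruijn graph: kmer -> predecessor/successor counts."""
    graph = {}
    for read in reads:  # reads are implicitly numbered from 0 by position
        prev = None
        for i in range(len(read) - k + 1):  # walk from the 5' end to the 3' end
            kmer = read[i:i + k]
            is_new = kmer not in graph
            if is_new:
                graph[kmer] = {"pred": {}, "succ": {}}
            node = graph[kmer]
            if prev is None:
                # First kmer of the read: a new node has no true predecessor
                # yet, so a tip placeholder stands in for it (step 4).
                if is_new:
                    node["pred"][TIP] = 1
            else:
                # Non-first kmer: a true predecessor exists, so any tip
                # placeholder is dropped and the real link is stored (step 5).
                node["pred"].pop(TIP, None)
                node["pred"][prev] = node["pred"].get(prev, 0) + 1
                succ = graph[prev]["succ"]
                succ[kmer] = succ.get(kmer, 0) + 1
            prev = kmer
    return graph
```

For example, `build_graph(["ACGTA", "TACGT"], 3)` first gives the kmer `ACG` a tip placeholder and then replaces it with the true predecessor `TAC` when the second read is processed.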
As the core data structure, the de Bruijn graph must be accurate, so GNOVO includes a series of graph error-correction procedures. The main steps are: 1) simplification of the de Bruijn graph; 2) removal of end-sequencing tips; 3) merging of bubble paths; 4) removal of low-coverage edges.
1) Simplification of the de Bruijn graph: traverse every kmer node in the hash table. The current kmer node is extended along its true predecessor and successor; if the complementary node of the current kmer also exists, the extension must take both nodes into account. Extension method: extend simultaneously along the outgoing and incoming edges, that is, along the true successor and the true predecessor. Extension condition in a single direction: the kmer node at the extension point (including its complementary kmer, if present) has exactly one true predecessor and exactly one true successor. Termination in a single direction: successor extension stops when the kmer node reached has two or more true predecessors, has two or more true successors, has no true successor, or already exists in the de Bruijn graph. Predecessor extension stops when the kmer node reached has two or more true predecessors, has two or more true successors, has no true predecessor, or already exists in the de Bruijn graph.
2) End-sequencing tips arise mainly from sequencing errors at read ends. In GNOVO an end-sequencing tip is judged erroneous if: a) its length is less than 2K (K is the kmer length); b) there is a high-coverage alternative (allelic) incoming or outgoing edge.
3) A bubble path is a graph structure formed by two different paths that share the same start and end nodes; apart from the start and end, the two paths have no node in common. Bubbles arise mainly from heterozygous sites and from sequencing errors in the interior of reads. GNOVO defines a bubble path by: a) each path is shorter than 200 bp; b) the path similarity is greater than 0.8; c) at least one path has coverage below a specific threshold. The core of the bubble search is a "Dijkstra-like breadth-first search" (Dijkstra's algorithm is the best-known shortest-path search algorithm; breadth-first search is a level-by-level traversal).
4) Low-coverage edges are mainly caused by sequencing errors in reads. The criteria are: a) the coverage is below a specific threshold; b) each node at the two ends of the edge has, apart from the current edge, at least one true predecessor and at least one true successor. For the coverage threshold, the default for a haploid genome is 1/2 of the mean or median edge coverage, and for a diploid genome 1/4 of the mean or median edge coverage. The best approach, however, is to choose the threshold from the overall coverage distribution.
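The low-coverage edge rule in item 4 can be illustrated with a short Python sketch. The function names, the `edges` dictionary layout and the use of the median (rather than the mean) are assumptions made for the example, and the degree test is one reading of criterion b).

```python
from statistics import median


def coverage_threshold(edge_coverages, ploidy=1):
    """Default cutoff: 1/2 of the median edge coverage (haploid), 1/4 (diploid)."""
    fraction = 0.5 if ploidy == 1 else 0.25
    return median(edge_coverages) * fraction


def low_coverage_edges(edges, threshold):
    """Return the edges judged erroneous; edges maps (u, v) -> coverage."""
    in_deg, out_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    flagged = []
    for (u, v), cov in edges.items():
        # Apart from this edge, both end nodes must keep at least one
        # predecessor and one successor, so deleting it creates no dead end.
        u_ok = in_deg.get(u, 0) >= 1 and out_deg[u] >= 2
        v_ok = in_deg[v] >= 2 and out_deg.get(v, 0) >= 1
        if cov < threshold and u_ok and v_ok:
            flagged.append((u, v))
    return flagged
```

With five edges at coverage 30 and one cross-link at coverage 2, the haploid cutoff is 15, and only the cross-link is flagged when both of its end nodes have other connections.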
A super read is a longer sequence obtained by filling the gap between the two reads of a paired end, or by joining the two reads directly through their overlap. The construction principle is shown in Fig. 5. Because a super read is derived from a paired end, its expected length is the library fragment size. Because it joins the two end reads and the gap between them, it is usually much longer than a single read, so using super reads as the starting point of assembly is a major advantage. Super reads are assembled by a depth-first path search.
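A depth-first path search of this kind can be sketched as follows. This is a simplified illustration, not the GNOVO code: `find_path`, `super_read` and the `max_steps` bound (which would be derived from the library size) are hypothetical names, the graph is a plain dict from kmer to successor kmers, and R2 is assumed to be already oriented on the same strand as R1.

```python
def find_path(graph, start, goal, max_steps):
    """Depth-first search for a path of kmers from start to goal."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if len(path) > max_steps:  # path too long for the library size
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:  # do not loop through the same kmer twice
                stack.append((nxt, path + [nxt]))
    return None


def super_read(graph, r1, r2, k, max_steps):
    """Join R1 and R2 through the graph path between their boundary kmers."""
    path = find_path(graph, r1[-k:], r2[:k], max_steps)
    if path is None:
        return None
    # Spell the path: the first kmer plus the last base of each later kmer.
    bridge = path[0] + "".join(kmer[-1] for kmer in path[1:])
    return r1[:-k] + bridge + r2[k:]
```

The search starts from the last kmer of R1 and stops at the first kmer of R2, so the returned sequence covers both reads and the gap between them.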
Many analyses start from single-copy nodes, mainly because: 1) starting from single-copy nodes makes the assembly analysis easier and less error-prone; 2) once the single-copy nodes are known, they can serve as anchors for resolving part of the repeats in later repeat processing.
Assume an edge of length n, and let Xi denote the number of reads that start at site i on the edge (note that the effective length of the edge is n-k+1, because edges are built from kmers, so the maximum value of i is n-k+1). Assume the Xi are independent random variables following a Poisson distribution with expectation ρ, where ρ is determined by the distribution of edge coverage (that is, the overall coverage distribution over all edges).
By the central limit theorem, the mean of the Xi over an edge of length n should follow a normal distribution with mean ρ and standard deviation [formula not reproduced]. If an edge is single-copy, the mean of its Xi should not differ much from ρ. The following ratio is used as the criterion for edge uniqueness: [formula not reproduced]
To measure the specificity, or uniqueness, of an edge, GNOVO uses F >= 5 as the criterion. The larger F is (that is, the smaller the mean of the Xi), the more specific the edge. A small mean of the Xi can also result from sequencing errors, but such errors are generally repaired in the preceding error-correction step.
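Under the assumptions stated above (independent Poisson counts with expectation ρ), the spread of the per-edge average follows from elementary properties of the Poisson distribution. The short derivation below is a sketch, not the patent's own formula: the symbol m is introduced here for the number of read start sites on the edge (n or n-k+1, depending on the length convention), and the exact form of F is not reconstructed.

```latex
\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad
\mathbb{E}[\bar{X}] = \rho, \qquad
\operatorname{Var}(\bar{X}) = \frac{1}{m^{2}}\sum_{i=1}^{m}\operatorname{Var}(X_i) = \frac{\rho}{m}
\quad\Longrightarrow\quad
\bar{X} \;\approx\; \mathcal{N}\!\left(\rho,\ \frac{\rho}{m}\right),
\qquad \operatorname{sd}(\bar{X}) = \sqrt{\rho/m}.
```

Longer edges therefore give a tighter estimate of the copy number, which is why a deviation of the observed mean from ρ is more informative on long edges than on short ones.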
The main idea of the local assembly algorithm in GNOVO is to assemble the genome locally, which reduces assembly complexity and gives better local results. The results of the local assemblies are then merged to give the assembly of the whole genome, which markedly improves the contigs. The local assembly principle is shown in Fig. 6. The main steps are:
1) Align the reads to the primary contigs. From the alignments, obtain the distances between primary contigs and the relationships between reads and primary contigs. Load the primary contig and read information into memory.
2) Select primary contig seeds. Filter out primary contigs that are multi-copy (copy number > 2) or short. From the distances between the retained primary contigs, build a scaffold link graph and select primary contigs that are long and far apart from one another as seeds. Once the seeds are chosen, select the primary contigs within a certain range around each seed.
3) Select local reads. For each local primary contig, use the alignments to select the sequenced fragments that have only one end aligned to the primary contig. Also select the super reads that lie in gaps and whose sequenced-fragment coverage is greater than 0.9.
4) Build a de Bruijn graph locally and perform local assembly.
5) Merge the local assembly results of all local graphs to obtain the global assembly, then simplify and error-correct the graph to obtain the final contigs.
During scaffold assembly, the overall scaffold graph is first split into small independent subgraphs. All paired ends linking a subgraph to other contigs fall on border contigs (contigs longer than the library size, which normal paired reads cannot span), so each subgraph can be treated as a small unit and scaffolded independently. GNOVO orders the contigs by simulated annealing and takes the ordering with the fewest conflicting edges found during the ordering process as the final scaffold. After scaffold assembly, the scaffold can be treated as a single unit and assembled again with other contigs.
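Ordering contigs by simulated annealing can be sketched as below. The cost function is a deliberate simplification chosen for illustration: each link `(a, b)` only says that contig `a` should precede contig `b`, and a conflicting edge is a link violated by the ordering, whereas real mate-pair links also carry orientation and distance. The names, the swap move and the cooling schedule are all assumptions, and at least two contigs are expected.

```python
import math
import random


def conflicts(order, links):
    """Count links (a, b) whose expected order a-before-b is violated."""
    pos = {c: i for i, c in enumerate(order)}
    return sum(1 for a, b in links if pos[a] > pos[b])


def anneal_order(contigs, links, steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Search for the contig order with the fewest conflicting links."""
    rng = random.Random(seed)
    order = list(contigs)
    cost = conflicts(order, links)
    best, best_cost = order[:], cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]  # propose a swap
        new_cost = conflicts(order, links)
        # Always accept improvements; accept worse orders with a probability
        # that shrinks as the temperature falls.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
        temp *= cooling
    return best, best_cost
```

Keeping the best ordering seen during the run mirrors the rule of choosing the ordering with the fewest conflicting edges, so the result is never worse than the starting order.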
After the contigs in a subgraph have been ordered by simulated annealing, GNOVO estimates the gap sizes between adjacent contigs with a quadratic programming algorithm. The objective function is: [formula not reproduced]
In the formula, E is the set of edges in the subgraph, Ci is the total length of the contigs spanned by edge i, [symbol not reproduced] is the total length of the gaps spanned by edge i, μi is the mean library size corresponding to edge i, and [symbol not reproduced] is the variance of the library corresponding to edge i.
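One common way to write a gap-size objective with these ingredients is a variance-weighted least-squares problem. The form below is an assumption offered for orientation only, not the patent's formula: the symbols G_i (total gap length spanned by edge i), g_j (the unknown size of gap j) and σ_i² (the library variance for edge i) are labels introduced here.

```latex
\min_{g}\ \sum_{i \in E} \frac{\left(C_i + G_i - \mu_i\right)^{2}}{\sigma_i^{2}},
\qquad
G_i = \sum_{j \,\in\, \text{gaps spanned by edge } i} g_j .
```

Each term penalizes the difference between the distance implied by the contig order (spanned contigs plus spanned gaps) and the mean library size, and libraries with a larger variance contribute less to the fit. The objective is quadratic in the gap sizes, which is consistent with solving it by quadratic programming.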
In the described de novo genome assembly method, the method is implemented on the Linux operating system in C, Perl and Fortran. It can process sequencing data from large genomes and has the advantages of parallelizable computation, low memory use and high speed.
The key points of the present invention are:
1) The de Bruijn graph is error-corrected first, and the error-corrected de Bruijn graph is then used to correct the high-throughput sequencing data.
2) Based on the error-corrected de Bruijn graph, paired ends are assembled by a path search algorithm to obtain super reads of longer read length, and the super reads are used to build primary contigs.
3) The primary contigs and reads of a specific local region are retrieved according to paired-end and mate-pair alignment information and assembled locally; all local assembly results are finally merged to obtain contigs.
4) Scaffolds are assembled from the partitioned scaffold subgraphs by simulated annealing.
The local assembly strategy assembles each local region first, converting the complexity of the whole system into local uniqueness and thereby greatly reducing assembly difficulty.
The contig assembly approach of the de novo genome assembly method of the present invention (named the GNOVO method) differs from general contig assembly algorithms in that it builds the de Bruijn graph twice. The first graph is built with a smaller K and is used mainly for error correction and super read construction. The second graph is built from the super read data with a larger K and is used mainly for primary contig construction. Primary contigs are thus built from longer super reads, and the larger K also handles part of the repeats better. The use of super reads in repeat assembly is shown in Fig. 2. Moreover, the short reads produced by high-throughput sequencing pose a huge challenge for genome assembly algorithms. The present invention steps outside the assembly algorithm itself and focuses on increasing read length, which gives the assembly algorithm a better starting input. Using the distance information between paired ends, the gap between the two ends is filled with a graph algorithm to obtain longer super reads (paired ends that overlap are joined directly). Because high-throughput sequencing reads are very short and super reads are much longer, using super reads as the starting point of assembly has clear advantages: 1) longer overlaps can be used to join reads; 2) super reads can span longer repeats (see Fig. 2); 3) with super reads GNOVO can use a larger kmer, which reduces graph complexity and handles heterozygous sequence better (see Fig. 3). Second, through local assembly, repeats at whole-genome scale become locally single-copy sequences, which greatly reduces the difficulty of handling repeats and increases contig length. The local assembly algorithm in GNOVO retrieves the primary contigs and reads of a specific local region according to paired-end and mate-pair alignment information, assembles them locally, and finally merges all local assembly results to obtain contigs. Finally, GNOVO orders the resulting contigs with a subgraph-partitioning algorithm and simulated annealing to obtain the final scaffolds.
The beneficial effects of the present invention are mainly the following:
1) Correcting the raw sequencing data against the error-corrected de Bruijn graph ensures the accuracy of the sequencing data; the base error rate is generally below 0.0001. This also provides a new sequencing-data error-correction method.
2) Assembling paired ends on the de Bruijn graph with a path search algorithm yields longer sequences and greatly reduces assembly difficulty; for a 180 bp library, sequences of 150 bp to 230 bp are obtained.
3) Assembling the genome from super reads allows a larger kmer (> 95) in the de Bruijn assembly algorithm, which reduces graph complexity, increases primary contig length and secures the final assembly; for bacterial data the N50 can directly exceed 10 kb and can even reach 30-50 kb.
4) The local assembly strategy assembles part of the genome locally, which greatly reduces assembly difficulty, especially for repeats, and secures the final contig length; for bacterial data the N50 can directly exceed 50 kb and can even reach 100-500 kb. This also provides a new local assembly method.
5) Scaffolds are assembled from the partitioned scaffold subgraphs by simulated annealing; the resulting scaffolds are longer, with an N50 generally above 500 kb.
6) The method takes full advantage of Linux clusters: parallel computing design and path hashing improve computational efficiency and overcome the memory limits on large data sets, so genomes of up to 10 G can be assembled.
Brief description of the drawings
Fig. 1 is an overview of the GNOVO assembly workflow. A is raw data filtering: a read is filtered out at the raw data processing stage if its proportion of N bases is high (> 5%) or its proportion of low-quality bases (quality value < 20) is high (> 5%). B is read error correction: based on the error-corrected de Bruijn graph, paired-end and mate-pair data are corrected with different strategies. C is super read construction: super reads are built by a path search algorithm on the error-corrected de Bruijn graph. D is primary contig assembly: primary contigs are built from the super read data with a large kmer according to de Bruijn graph theory. E is local assembly: the single-copy primary contigs near a seed are retrieved first, the local reads are then retrieved from the single-end alignment data, and a de Bruijn graph is finally built and assembled within the small local region. F is scaffold construction: scaffolds are built from the mate-pair link information.
Fig. 2 shows the strategies for resolving repeats. A is the large-kmer strategy: repeats are assembled with a kmer longer than the repeat. B and C are the link strategy: repeats of moderate length are assembled with paired-end and mate-pair link information, or with the spanning link information of super reads. D is the local assembly strategy: locally, many repeats are single-copy and easy to assemble. E is the gap-filling strategy: after scaffold assembly is complete, each gap is assembled locally.
Fig. 3 shows the handling of heterozygous sequence. A merges simple, isolated SNP regions. B uses the spanning link information of super reads to identify how adjacent heterozygous sequences are assembled and merges them. C handles larger heterozygous regions that lie close together, merging the heterozygous parts with paired-end or mate-pair link information.
Fig. 4 shows read error correction. A: a de Bruijn graph is built from the raw sequencing data after simple filtering and then error-corrected, mainly by removing end-sequencing tips, deleting erroneous edges and merging bubble paths. B: paired-end error correction, where the small grey rectangles are sequencing errors, PE is the original read and PE* is the read after correction. C: mate-pair error correction, where MP is a read containing sequencing errors and MP* is the result after correction. The grey rectangles are sequencing errors (E in the figure), which are deleted during correction. J is the circularization junction introduced during library construction.
Fig. 5 is an overview of super read construction. A: R1 and R2 are the two end sequences of a paired end and are located in the error-corrected de Bruijn graph by kmer lookup. B: the path between R1 and R2 is searched with a graph path search algorithm; the dotted line in the figure is the path found. C: the complete path sequence, that is, the super read, is extracted from the kmer information along the search path.
Fig. 6 is an overview of local assembly. In A, "c1", "c2", "c3", "c4", "c5", "r1", "r2" and "r3" are primary contigs; "c1", "c2", "c3", "c4" and "c5" are all single-copy, "r1", "r2" and "r3" are not single-copy, and "c2" and "c4" are the seeds selected from all single-copy primary contigs. In B, the grey arcs are mate-pair links between different primary contigs. "c1" and "c3" are the primary contigs neighbouring "c2"; "c3" and "c5" are those neighbouring "c4". The short grey rectangles are UARs, that is, reads that did not align to the single-copy primary contigs. C: a de Bruijn graph is built locally and error-corrected. D: based on the locally error-corrected de Bruijn graph, a path search is performed between linked primary contigs to carry out local assembly. E: all assembly results are merged to obtain the final genome assembly.
Fig. 7 is the genome synteny plot for Bifidobacterium bifidum PRL2010. It shows the synteny between the GNOVO assembly and the Bifidobacterium bifidum reference (genome sequence accession number CP001840.1, assembled genome size 2,214,656 bp). The vertical axis is the GNOVO assembly, the horizontal axis is the reference genome, and the black dotted line marks the syntenic part of the genome.
Fig. 8 is the genome synteny plot for Streptomyces roseosporus NRRL 15998. It shows the synteny between the GNOVO assembly and the Streptomyces roseosporus NRRL 15998 reference (genome sequence accession number NZ_DS999644.1, assembled genome size 7,817,295 bp). The vertical axis is the GNOVO assembly, the horizontal axis is the reference genome, and the dotted line marks the syntenic part of the genome.
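The raw-read filter described for panel A of Fig. 1 amounts to two ratio checks per read. The sketch below assumes that base qualities are available as a list of integer Phred scores of the same length as the read; the function name `keep_read` and its parameter names are hypothetical.

```python
def keep_read(seq, quals, max_n_frac=0.05, min_q=20, max_lowq_frac=0.05):
    """Return True if a read passes the N-content and low-quality filters."""
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_q for q in quals) / len(quals)
    # Discard the read if either proportion exceeds its limit (5% by default).
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac
```

A 100-base read with 6 N bases, or with 6 bases below quality 20, is rejected under the default limits.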
Embodiments
The following examples illustrate the present invention but do not limit its scope.
Embodiment 1: Assembly of the Escherichia coli (E. coli) genome
1) Test data
The test data were downloaded from the SRA (Short Read Archive) database of NCBI (National Center for Biotechnology Information, USA). The SRA database address is www.ncbi.nlm.nih.gov/sra, and the accession number of the data is SRX016044. The details of the test data are as follows:
Upload date: 2009-05-22;
Library size: 180 bp;
Total sequenced bases: 2.1 G;
Estimated genome sequencing depth: 456.5x.
2) Evaluation method
Seven assemblers were tested and compared. The main parameters of each assembler were swept, and the best assembly of each was chosen for comparison. The detailed parameters giving each assembler's best result are as follows:
GNOVO (the method of the invention): k1=25, k2=95, m1=5, m2=2, other parameters default. Here k1 is the kmer size for the first de Bruijn graph; k2 is the kmer size for the second de Bruijn graph, which is built from super reads for primary contig assembly; m1 is the low-coverage threshold used for low-coverage edge deletion when the first de Bruijn graph is error-corrected; m2 is the corresponding threshold for the second de Bruijn graph;
JR-Assembler: all parameters default;
Edena: M=53, other parameters default;
Taipan: K=50, other parameters default;
Velvet: K=45, other parameters default;
ABySS: K=45, other parameters default;
SOAPdenovo: K=53, other parameters default.
In the first step of the GNOVO assembly, a de Bruijn graph is built from the high-throughput sequencing data and the sequencing data are corrected against the error-corrected graph. Here the kmer size for graph construction is 25 (specified with parameter k1). The raw de Bruijn graph is then error-corrected with a low-coverage edge threshold of 5, so every edge with depth below 5 is deleted. After graph error correction, the raw sequencing reads are aligned back to the de Bruijn graph for correction.
The main statistics of the de Bruijn graph before and after error correction are shown in Table 1:
Table 1. Main de Bruijn graph statistics before and after error correction in step 1 of the GNOVO assembly of the E. coli genome
Comparing the statistics before and after error correction, the total kmer count falls by about 10%, whereas the number of nodes and the number of distinct kmers fall by factors of about 400 and 80, respectively. Sequencing errors therefore generate a large number of low-depth kmers and extra nodes, which greatly increase the complexity of the de Bruijn graph.
In the second step of the GNOVO assembly, the reads are aligned back to the de Bruijn graph and super reads are built with the path search algorithm. During the path search, the default search tolerance is 3, that is, 3 standard deviations. There were 6096923 original reads, of which 5843968 were successfully built into super reads, a search efficiency of 95.85% (5843968/6096923).
In the third step of GNOVO assembly, the de Bruijn graph is rebuilt from the super reads, and primary contigs are obtained through graph error correction and splitting. Here the graph is built with a kmer size of 95 (specified with parameter k2). The original de Bruijn graph is error-corrected at the same time: low-coverage edges are deleted using a threshold of 2, i.e. every edge with a depth of less than 2 is deleted. Once graph error correction is complete, the de Bruijn graph is split at its nodes to obtain the primary contigs.
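Splitting the graph to obtain primary contigs can be sketched as follows. This is an illustrative reading, not the patented procedure: the graph is assumed to be given as a set of (k+1)-mer edges, it is cut at every node that does not have exactly one predecessor and one successor, and the function name `primary_contigs` is hypothetical. Isolated cycles, which have no such cut point, are ignored in this sketch.

```python
from collections import defaultdict

def primary_contigs(edges):
    """Split a de Bruijn graph at branching/terminal nodes and spell each unbranched path."""
    succ, pred = defaultdict(list), defaultdict(list)
    for edge in edges:                    # each edge is a (k+1)-mer
        succ[edge[:-1]].append(edge[1:])
        pred[edge[1:]].append(edge[:-1])
    nodes = set(succ) | set(pred)

    def is_inner(node):                   # exactly one way in and one way out
        return len(pred[node]) == 1 and len(succ[node]) == 1

    contigs = []
    for node in nodes:
        if is_inner(node):
            continue                      # paths start only at cut points
        for nxt in succ[node]:
            path = node + nxt[-1]
            while is_inner(nxt):          # extend until the next cut point
                nxt = succ[nxt][0]
                path += nxt[-1]
            contigs.append(path)
    return sorted(contigs)

# Toy graph (assumed): two sequences, GTAACG and GTAACT, that share a prefix.
toy_edges = {"GTAA", "TAAC", "AACG", "AACT"}
toy_contigs = primary_contigs(toy_edges)
```

The shared prefix is emitted once and the two branches become separate short contigs, which is why a simpler graph (fewer branching nodes) yields longer contigs.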
The main information on the de Bruijn graph before and after error correction is as follows:
Table 2. Main de Bruijn graph information before and after error correction in the third step of GNOVO assembly of the E. coli genome
As the table shows, the graph complexity is now very low, with only 2324 nodes, which means that the completeness of the assembly is very good.
The statistics of the contigs obtained from this initial assembly are: total contig length 4.55 Mb, 169 contigs, and a contig N50 length of 60,284 bp.
In the fourth step of GNOVO assembly, the primary contigs and reads of specific local regions are retrieved according to the pair-end and mate-pair alignment information, and local assembly is performed. In local assembly the default minimum support is 3, i.e. a link between two primary contigs is valid only when the number of connections between them is greater than or equal to 3 (set with the parameter -cutoff; changing it is generally not recommended). Local assembly uses multiple kmer sizes, by default 19, 57 and 95 (set with the parameters "-k", "-q" and "-n"). The statistics of the contigs obtained are: total contig length 4.55 Mb, 161 contigs, and a contig N50 length of 63,618 bp.
In the fifth step of GNOVO assembly, a scaffold connection graph is built from the mate-pair link information, and the contigs are ordered by the subgraph partitioning algorithm and the simulated annealing algorithm to obtain the final scaffolds. During scaffold assembly the default minimum support is again 3, i.e. a link between two primary contigs is valid only when the number of connections between them is greater than or equal to 3 (set with the parameter -cutoff; changing it is generally not recommended). Because no large-insert library is available here, scaffolding brings no improvement, and the result is the same as that of the fourth step.
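The minimum-support rule used in the fourth and fifth steps can be sketched in Python as below. The data layout is an assumption made for illustration: each read pair is reduced to the pair of contig names its two ends align to, and `supported_links` is a hypothetical helper, not a GNOVO function.

```python
from collections import Counter

def supported_links(pair_hits, cutoff=3):
    """Count read pairs bridging each pair of contigs; keep links with count >= cutoff."""
    counts = Counter(
        tuple(sorted(pair))               # treat (a, b) and (b, a) as the same link
        for pair in pair_hits
        if pair[0] != pair[1]             # pairs inside one contig are not links
    )
    return {link: n for link, n in counts.items() if n >= cutoff}

# Toy alignments (assumed): contig names hit by the two ends of each read pair.
pair_hits = (
    [("c1", "c2")] * 3 + [("c2", "c1")]   # 4 pairs support c1-c2
    + [("c2", "c3")] * 2                  # only 2 pairs support c2-c3
    + [("c1", "c1")]                      # both ends in the same contig
)
links = supported_links(pair_hits, cutoff=3)
```

With the default cutoff of 3, only the c1-c2 link survives; the c2-c3 link has two supporting pairs and is discarded.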
3) Comparison of results
The assembly results of each assembler are shown in Table 3:
Table 3. Results of each assembler on the E. coli genome
Number of contigs: contigs shorter than 300 bp are not counted.
Total length: the total length of all contigs.
Maximum contig length: the length of the longest contig in the assembly result.
Average contig length: the mean of all contig lengths.
N50: all contigs are sorted from longest to shortest and their lengths are added up in that order; when the running sum reaches half of the total contig length, the length of the last contig added is the contig N50.
Misassembled contigs: the number of contigs that cannot be aligned to the original reference genome.
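The N50 definition above translates directly into a few lines of Python. This is a minimal sketch; the function name `n50` and the example lengths are illustrative and are not taken from the tables.

```python
def n50(lengths):
    """Return the contig N50 as defined above."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:                   # reached half of the total length
            return length
    return 0                                  # empty input

# Illustrative lengths (assumed): total 290, half 145; 80 + 70 = 150 reaches it.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Because the statistic depends on the total length, contigs excluded from the count (here, those shorter than 300 bp) should be filtered out before calling the function if the same convention is wanted.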
In this embodiment, the assembly method of the invention, GNOVO, obtained 161 contigs, followed by JR-Assembler with 192 contigs; both are far better than the other assemblers. GNOVO's N50 length is 63.618 kb, more than 10 kb higher than JR-Assembler (48.673 kb) and Velvet (43.998 kb), which shows that GNOVO's assembly completeness in this example is far better than that of the other assemblers. The longest contig obtained by GNOVO is 334.908 kb, more than 100 kb longer than that of any other software. The number of misassembled contigs for GNOVO is 0, the same as for most of the other software, which shows its high accuracy. In this embodiment, GNOVO shows a clear advantage over the other assemblers.
Embodiment 2: Assembly of the Streptomyces (S. roseosporus) genome
1) Test data
The test data were downloaded from the NCBI SRA database (www.ncbi.nlm.nih.gov/sra); the accession numbers are SRX026747 and SRX016085.
A) Details of test data SRX026747:
Upload date: 2010-08-06;
Library size: 180 bp;
Total sequencing data: 10.7 G;
Estimated genome sequencing depth: 1389.6X.
B) Details of test data SRX016085:
Upload date: 2009-09-20;
Library size: 4 kb;
Total sequencing data: 3.5 G;
Estimated genome sequencing depth: 454.5X.
2) Evaluation method
A total of 5 assemblers were tested and compared. The main parameters of each assembler were swept, and the best assembly result of each was chosen for comparison. The detailed parameters of the best result for each software are as follows:
GNOVO: k1=25, k2=95, m1=11, m2=5, other parameters at their defaults (see Embodiment 1 for evaluation details);
JR-Assembler: all parameters at their defaults;
ABySS: k=45, other parameters at their defaults;
Velvet: k=49, other parameters at their defaults;
SOAPdenovo: k=63, other parameters at their defaults.
3) Comparison of results
The assembly results of each assembler are shown in Table 4:
Table 4
In this embodiment, GNOVO's N50 is the highest at 13.134 kb, followed by Velvet (12.499 kb); its longest contig is 73.115 kb, more than 10 kb longer than Velvet's (61.423 kb). GNOVO's contig count is 1,242, higher than the minimum of 1,127 obtained by ABySS. In this example GNOVO is slightly better than the other assemblers in maximum contig length, average contig length and N50 length, and only its contig count is slightly higher than that of ABySS; overall it shows good assembly capability.
It should be noted that the size of GNOVO's assembly is 9.79 Mb, clearly larger than the other assemblies. The inventors therefore aligned the raw data against the nt database. The alignment showed that the raw data contain data from two bacteria, so the raw data were presumed to come from a mixed culture. The reference genomes of the corresponding bacteria were downloaded from NCBI: Streptomyces (Streptomyces roseosporus NRRL 15998, genome sequence accession number NZ_DS999644.1, assembled genome size 7,817,295 bp) and Bifidobacterium bifidum (Bifidobacterium bifidum PRL2010, genome sequence accession number CP001840.1, assembled genome size 2,214,656 bp). Whole-genome alignment with MUMmer (similarity requirement 99%) showed that GNOVO's assembly aligns well to both genomes; the alignments are shown in Fig. 7 and Fig. 8. This confirms the inventors' initial supposition that the raw data come from a mixed culture. It also indirectly demonstrates that GNOVO markedly improves contig length, has high assembly accuracy, and greatly improves the handling of repetitive sequences through local assembly.
Embodiment 3: Assembly of the Neurospora crassa (N. crassa) genome
1) Test data
The test data were downloaded from the NCBI SRA database (www.ncbi.nlm.nih.gov/sra); the accession number is SRX030834.
A) Details of test data SRX030834:
Upload date: 2010-11-11;
Library size: 180 bp;
Total sequencing data: 5.5 G;
Estimated genome sequencing depth: 148.3X.
2) Evaluation method
A total of 6 assemblers were tested and compared. The main parameters of each assembler were swept, and the best assembly result of each was chosen for comparison. The detailed parameters of the best result for each software are as follows:
GNOVO: k1=25, k2=95, m1=5, m2=2, other parameters at their defaults (see Embodiment 1 for evaluation details);
JR-Assembler: all parameters at their defaults;
ABySS: k=35, other parameters at their defaults;
Velvet: k=37, other parameters at their defaults;
SOAPdenovo: k=47, other parameters at their defaults;
Edena: m=45, other parameters at their defaults.
3) Comparison of results
The assembly results of each assembler are shown in Table 5:
Table 5
In this embodiment, GNOVO's N50 is 10.473 kb, showing better assembly completeness than the other assemblers (4 to 6 kb); the maximum and average lengths of its assembled contigs are also better than those of the other assemblers. GNOVO's contig count is 11,300, higher than Velvet's 10,187, which places it second. In this embodiment, GNOVO's overall assembly performance is better than that of the other assemblers.
Embodiment 4: Assembly of the Staphylococcus intermedius (S. intermedius ATCC 27335) genome
1) Test data
The test data were downloaded from the NCBI SRA database (www.ncbi.nlm.nih.gov/sra); the accession numbers are SRX297066 and SRX297065.
A) Details of test data SRX297066:
Upload date: 2012-11-18;
Library size: 180 bp;
Total sequencing data: 1.1 G;
Estimated genome sequencing depth: 564.10X.
B) Details of test data SRX297065:
Upload date: 2012-11-19;
Library size: 5 kb;
Total sequencing data: 1.5 G;
Estimated genome sequencing depth: 769.23X.
2) appraisal procedure
Test and comparison is carried out to 5 composite softwares altogether, the major parameter of each composite software is traveled through, then chosen
The best result of assembling result is compared assessment, and the detailed assembly parameter that each software preferably assembles result is as follows:
GNOVO assembly parameters are:K1=25, k2=95, m1=11, m2=2, other parameters are that default parameters is (detailed
Assessment details refer to embodiment 1);
Allpaths-lg:It is default parameters
SPAdes:K=61,73,95, other parameters are default parameters;
MaSuRCA:K=85, other parameters are default parameters;
SOAPdenovo:K=77, other parameters are default parameters;
3) Comparison of results
The assembly results of each assembler are shown in Table 6:
Table 6
In this embodiment, GNOVO assembled the complete sequence of the bacterium (1 scaffold) with 7 contigs, far better than the other assemblers (more than 10 scaffolds). This shows its strong assembly capability: GNOVO markedly improves contig length and greatly improves the handling of repetitive sequences through local assembly.
Although the invention has been described in detail above by means of a general description and specific embodiments, modifications or improvements can be made on the basis of the invention, as will be apparent to those skilled in the art. Such modifications or improvements, made without departing from the spirit of the invention, therefore fall within the scope of protection claimed.
Claims (8)
1. A de novo genome assembly method based on high-throughput sequencing data, comprising the following steps:
(1) building a de Bruijn graph from the high-throughput sequencing data, and correcting the high-throughput sequencing data based on the error-corrected de Bruijn graph;
(2) building super reads with a path search algorithm;
(3) rebuilding the de Bruijn graph from the super reads, and obtaining primary contigs through error correction and splitting of the de Bruijn graph;
(4) retrieving the primary contigs and reads of specific local regions according to the pair-end and mate-pair alignment information, performing local assembly, merging all local assembly results, and splitting after error correction to obtain contigs;
(5) building a scaffold connection graph from the mate-pair link information, and ordering the contigs by a subgraph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds;
wherein the de Bruijn graph of step (1) is stored as a hash data structure, and its construction algorithm is:
1) allocating and initializing space for the hash table according to the genome size and the kmer size;
2) iteratively reading each read and numbering it, with numbering starting from 0;
3) extracting all kmers in turn from the 5' end to the 3' end and storing them in the hash table; if a kmer already exists, only its path information needs to be stored, i.e. its predecessor and successor; if the kmer does not exist, a new kmer node must be created and its path information stored as well;
4) when storing the first kmer of a read, if it does not exist in the hash table, its true predecessor kmer node does not exist, and a new end-of-sequencing placeholder node must be created to stand in for the true predecessor kmer node as the backtracking predecessor node of that kmer node;
5) when storing a kmer node that is not the first in a read, if the kmer is found to exist already and the backtracking predecessor node of that kmer node is an end-of-sequencing placeholder node, the placeholder node must be removed and the backtracking predecessor node of that kmer node set to the previous kmer node.
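The construction steps in claim 1 can be illustrated with a short Python sketch, offered under several assumptions rather than as the claimed implementation: a Python `dict` stands in for the pre-allocated hash table, the sentinel `END` and the field names (`pred`, `succ`, `back`, `first_read`) are hypothetical, and the backtracking predecessor of a newly created non-first kmer is assumed to be the previous kmer in the read.

```python
END = "<read-end>"  # hypothetical placeholder for an unknown true predecessor

def build_graph(reads, k):
    """Build a hash-based de Bruijn graph following steps 2) to 5) above."""
    graph = {}
    for read_id, read in enumerate(reads):           # reads are numbered from 0
        prev = None
        for i in range(len(read) - k + 1):           # 5' to 3'
            kmer = read[i:i + k]
            node = graph.get(kmer)
            if node is None:
                # New k-mer node; at a read start its predecessor is unknown,
                # so the placeholder becomes its backtracking predecessor.
                node = graph[kmer] = {
                    "pred": set(), "succ": set(),
                    "back": END if prev is None else prev,
                    "first_read": read_id,
                }
            elif prev is not None and node["back"] == END:
                # Existing node reached from a real k-mer: drop the placeholder.
                node["back"] = prev
            if prev is not None:                     # store path information
                node["pred"].add(prev)
                graph[prev]["succ"].add(kmer)
            prev = kmer
    return graph

# Toy reads (assumed): the second read supplies the true predecessor of CGT.
toy_graph = build_graph(["CGTAC", "ACGTA"], k=3)
```

After the first read, `CGT` points back to the placeholder; the second read reaches `CGT` from `ACG`, so the placeholder is replaced, while `ACG` itself keeps the placeholder because no read extends to its left.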
2. A de novo genome assembly method based on high-throughput sequencing data, comprising the following steps:
(1) building a de Bruijn graph from the high-throughput sequencing data, and correcting the high-throughput sequencing data based on the error-corrected de Bruijn graph;
(2) building super reads with a path search algorithm;
(3) rebuilding the de Bruijn graph from the super reads, and obtaining primary contigs through error correction and splitting of the de Bruijn graph;
(4) retrieving the primary contigs and reads of specific local regions according to the pair-end and mate-pair alignment information, and performing local assembly;
(5) building a scaffold connection graph from the mate-pair link information, and ordering the contigs by a subgraph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds;
wherein the local assembly of step (4) comprises:
1) aligning the reads to the primary contigs; obtaining from the read alignments the distance information between primary contigs and the relationship between reads and primary contigs; and reading the primary contig and read information into memory;
2) filtering out primary contigs that are multi-copy (copy number > 2) or short; building a scaffold connection graph from the distance relationships between primary contigs; selecting from it primary contigs that are far apart and long as seeds; and, once a seed is obtained, selecting the primary contigs within a certain range around the seed;
3) for each local set of primary contigs, selecting according to the alignment results the sequenced fragments that have only one end on a primary contig, and also selecting super reads that lie in a gap and whose sequenced-fragment coverage is greater than 0.9;
4) building a de Bruijn graph locally and performing local assembly;
5) merging the local assembly results of all local graphs to obtain the global assembly result, and then simplifying and error-correcting the graph to obtain the final contigs.
3. A de novo genome assembly method based on high-throughput sequencing data, comprising the following steps:
(1) building a de Bruijn graph from the high-throughput sequencing data, and correcting the high-throughput sequencing data based on the error-corrected de Bruijn graph;
(2) building super reads with a path search algorithm;
(3) rebuilding the de Bruijn graph from the super reads, and obtaining primary contigs through error correction and splitting of the de Bruijn graph;
(4) retrieving the primary contigs and reads of specific local regions according to the pair-end and mate-pair alignment information, performing local assembly, merging all local assembly results, and splitting after error correction to obtain contigs;
(5) building a scaffold connection graph from the mate-pair link information, and ordering the contigs by a subgraph partitioning algorithm and a simulated annealing algorithm to obtain the final scaffolds; when ordering the contigs, direct ordering or simulated annealing ordering is chosen according to the size of the subgraph: subgraphs of size ≤ 8 are ordered directly, and subgraphs of size > 8 are ordered by simulated annealing;
the subgraph partitioning algorithm comprises the following steps:
1) traversing each contig in turn;
2) examining the length and edge connections of each contig;
3) deleting the contigs identified as boundary contigs, i.e. splitting the graph;
the direct ordering comprises the following steps:
1) exhaustively enumerating all possible orderings;
2) choosing the ordering with the fewest conflicting edges as the optimal ordering;
the simulated annealing ordering of the contigs comprises the following steps:
1) randomly generating a contig ordering;
2) computing the acceptance probability from the current temperature and the new ordering generated by a random change;
3) generating a random probability, and taking the new contig ordering as the current ordering if the random probability is less than the acceptance probability;
4) cooling with a specific step size while iterating steps 2) and 3);
5) choosing from all current orderings the one with the fewest conflicting edges as the optimal ordering.
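The two ordering strategies in claim 3 can be sketched together in Python. Several details are assumptions made only to keep the example runnable: a link `(a, b)` is taken to mean that contig `a` should precede `b`, a conflicting edge is a link violated by the ordering, the random change is a swap of two contigs, the acceptance probability is the usual `exp(-delta / temperature)`, and cooling is geometric. None of these choices is stated in the source, and all function names are hypothetical.

```python
import itertools
import math
import random

def conflicts(order, links):
    """Number of links (a, b) whose expected order a-before-b is violated."""
    pos = {contig: i for i, contig in enumerate(order)}
    return sum(1 for a, b in links if pos[a] > pos[b])

def direct_order(contigs, links):
    """Exhaustively enumerate all orderings and keep the one with fewest conflicts."""
    return list(min(itertools.permutations(contigs),
                    key=lambda order: conflicts(order, links)))

def anneal_order(contigs, links, temp=10.0, cooling=0.95, min_temp=1e-3, seed=0):
    """Simulated annealing over orderings; returns the best ordering visited."""
    rng = random.Random(seed)
    current = list(contigs)
    rng.shuffle(current)                              # 1) random initial ordering
    best = list(current)
    while temp > min_temp:
        i, j = rng.sample(range(len(current)), 2)     # random change: swap two contigs
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = conflicts(candidate, links) - conflicts(current, links)
        accept = 1.0 if delta <= 0 else math.exp(-delta / temp)   # 2)
        if rng.random() < accept:                     # 3)
            current = candidate
            if conflicts(current, links) < conflicts(best, links):
                best = list(current)
        temp *= cooling                               # 4) cool and iterate
    return best                                       # 5) fewest conflicts seen

def order_subgraph(contigs, links, limit=8):
    """Direct ordering for subgraphs of size <= limit, simulated annealing otherwise."""
    if len(contigs) <= limit:
        return direct_order(contigs, links)
    return anneal_order(contigs, links)

small = order_subgraph(["c3", "c1", "c2"], [("c1", "c2"), ("c2", "c3")])
```

The size limit of 8 keeps exhaustive enumeration bounded at 8! = 40,320 orderings; beyond that the factorial growth makes a stochastic search the practical choice.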
4. The method according to any one of claims 1-3, characterized in that the kmer length used to build the de Bruijn graph in step (1) varies according to the coverage information and lies between 17 and 37.
5. The method according to any one of claims 1-3, characterized in that the path search algorithm of step (2) is a depth-first path search algorithm.
6. The method according to any one of claims 1-3, characterized in that step (3) builds the de Bruijn graph with a larger kmer, the kmer size being 75-155.
7. The method according to any one of claims 1-2, characterized in that, when step (5) orders the contigs by the subgraph partitioning algorithm and the simulated annealing algorithm, direct ordering or simulated annealing ordering is chosen according to the size of the subgraph: subgraphs of size ≤ 8 are ordered directly, and subgraphs of size > 8 are ordered by simulated annealing.
8. The method according to any one of claims 1-3, characterized in that the path search used in high-throughput sequencing data error correction and in local assembly is depth-first search, and the path search used in graph error correction additionally requires the edges to be weighted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410421844.3A CN104239750B (en) | 2014-08-25 | 2014-08-25 | Genome based on high-flux sequence data from the beginning assemble method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104239750A CN104239750A (en) | 2014-12-24 |
CN104239750B true CN104239750B (en) | 2017-07-28 |
Family
ID=52227798
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410421844.3A Active CN104239750B (en) | 2014-08-25 | 2014-08-25 | Genome based on high-flux sequence data from the beginning assemble method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104239750B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||