CN103714263B

CN103714263B - Method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph

Info

Publication number: CN103714263B
Application number: CN201310672170.XA
Authority: CN
Inventors: 孟金涛; 张慧琳; 彭丰斌; 魏彦杰; 冯圣中
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2013-12-10
Filing date: 2013-12-10
Publication date: 2017-06-13
Anticipated expiration: 2033-12-10
Also published as: CN103714263A

Abstract

The present invention discloses a method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph, comprising the steps of: S1, reading the sequencing data source file and constructing the bidirectional multistep De Bruijn graph; S2, computing the weighted average coverage W over all bidirectional edges of the whole graph; S3, traversing all edges of the whole graph, identifying every bidirectional edge whose coverage is less than 0.25 times the average coverage W as an erroneous bidirectional edge, and removing it. The invention can remove part of the errors commonly present in a De Bruijn graph, including erroneous links, tip-type errors, and bubble-type errors, so that contigs can continue to extend and the De Bruijn graph can continue to shrink. It finds erroneous bidirectional edges effectively while avoiding damage to correct bidirectional edges, which can increase contig length to some extent and improve contig quality.

Description

Method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph

Technical field

The present invention relates to the field of gene sequencing, and in particular to a method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph.

Background art

Gene sequencing, with algorithms and mathematical models at its core, covers: storage and retrieval of gene data, sequence alignment, sequencing and assembly, gene prediction, biological evolution and phylogenetic analysis, protein structure prediction, RNA structure prediction, molecular design and drug design, metabolic network analysis, gene chips, DNA computing, and so on. The close combination of biotechnology and computer information processing technology speeds up the processing of biological data, allows biological data to be annotated accurately in the shortest possible time, and accelerates the development of bioinformatics.

Gene sequencing analyzes massive amounts of gene sequence data in order to extract and mine new biological information and knowledge. It draws on the following areas of computer technology: machine learning, pattern recognition, text analysis and mining, combinatorial mathematics, stochastic models, string and pattern algorithms, distributed computing, high-performance computing, parallel computing, and so on.

Genes are the most basic genetic code of human beings and carry each person's life information. Gene sequences show subtle differences at genetic loci, and these polymorphisms of the genetic code are closely related to human health, pathogenesis, and medical treatment.

Since Sanger sequencing appeared in 1977, DNA sequencing technology has advanced rapidly over more than 30 years. Second-generation sequencing, characterized by high throughput and short reads, has gradually come to dominate the market, and third-generation sequencing, characterized by single-molecule sequencing, has also emerged; each has its own advantages in sequencing characteristics. After nearly 10 years of research and development, the data extraction and analysis software for traditional sequencing methods is now fairly mature. However, the development of sequencing technology has changed the sequencing data, so that existing data processing software can no longer meet the needs of current biomedical research.

The new generation of high-throughput sequencing methods can determine whole-genome data in a short time. Their rapid progress also challenges the methods used to analyze and process the resulting gene data. There is currently an urgent need to develop a broad bioinformatics platform that can handle the massive data produced by high-throughput sequencing. In view of personal genome projects and the prospect of personalized medicine, efficient, low-cost sequencing is an inevitable trend. At the same time, streamlined and efficient complete sequencing solutions, such as a one-stop bioinformatics data analysis platform, are an especially important and indispensable direction of development.

However, although the new generation of high-throughput sequencing methods offers high throughput, it introduces sequencing errors. In addition, gene mutations, SNPs, and uneven sequencing in the sample itself introduce erroneous vertices and erroneous bidirectional edges into the bidirectional multistep De Bruijn graph constructed during genome assembly. These erroneous vertices and erroneous bidirectional edges readily introduce branches into the whole De Bruijn graph and hinder the graph contraction process.

The short gene fragments produced by the new generation of high-throughput sequencing methods contain a large number of sequencing errors, which increases the computational load of assembly algorithms. These errors raise the assembly error rate and seriously affect the assembly result.

Current assembly algorithms follow two strategies: those based on Overlap-Layout-Consensus (OLC) and those based on the De Bruijn graph. Software developed on the OLC strategy, such as SSAKE, VCAKE, and SHARCGS, has the advantage when assembling long sequences but is not fully suited to new-generation short-read assembly. Unlike OLC algorithms, De Bruijn graph algorithms no longer organize data in units of reads but assemble data in units of k-mers. Their main advantages are as follows. First, assembling in units of k-mers does not affect node quality and reduces the amount of redundant data. Second, a repeat region appears only once in the graph, which makes it easy to identify, avoids misassembly, and reduces the error rate. Finally, overlapping regions are mapped onto the same arc, which simplifies path search. Many short-read assembly algorithms currently use this framework, such as Velvet, IDBA, SOAPdenovo, and ABySS.
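As a minimal illustration of organizing data by k-mers rather than by reads, the following sketch splits reads into k-mers and links consecutive k-mers by an edge. The function names `kmers` and `de_bruijn_edges` and the sample reads are illustrative assumptions; they are not taken from any of the assemblers named above.

```python
from collections import Counter
from typing import List


def kmers(read: str, k: int) -> List[str]:
    """Return all length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads: List[str], k: int) -> Counter:
    """Count edges between consecutive k-mers over all reads.

    Each edge is a (k-mer, next k-mer) pair whose members overlap by
    k-1 bases. The count of an edge is the number of times the reads
    support it, i.e. its coverage.
    """
    edges: Counter = Counter()
    for read in reads:
        parts = kmers(read, k)
        for a, b in zip(parts, parts[1:]):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    # The two reads overlap, so the edge CGT -> GTA is stored once
    # and simply accumulates coverage from both reads.
    for edge, count in sorted(de_bruijn_edges(["ACGTA", "CGTAC"], 3).items()):
        print(edge, count)
```

Because a shared region collapses onto the same edge, the edge count doubles as a coverage value, which is the quantity that coverage-based error removal relies on.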

Velvet makes effective use of the De Bruijn graph to achieve efficient short-read assembly. Velvet builds the De Bruijn graph with the k-mer as its basic unit, uses the structure of the graph together with the corresponding sequence features to simplify the graph, and finally finds an optimal path to complete the assembly. Velvet focuses on three structures produced by erroneous data: tips, bubbles, and erroneous connections. Following the length principle and the minority principle, it removes tips shorter than 2k; it merges bubbles using the Dijkstra-like breadth-first search of the Tour Bus algorithm; and finally it eliminates erroneous connections with a coverage threshold. The method also makes full use of paired-end information to further resolve repeats and improve the assembly. By exploiting the structural properties of the graph, Velvet reduces data redundancy and is much faster than earlier algorithms. Although it does not correct sequence errors in a preprocessing stage, its error-handling mechanism largely compensates for this shortcoming, which makes it better suited to assembling large genome sequences.

IDBA is likewise based on the De Bruijn graph and achieves simple and efficient short-read assembly. IDBA also uses the k-mer as its basic unit, but obtains k-mer lengths from a range of k values (Kmin to Kmax) instead of a fixed k. Because the genome is assembled in units of k-mers, many overlapping elements usually form, so assembly faces misplaced assembly, missing vertices, and low coverage. Choosing the right k is therefore a key factor in assembly. Erroneous reads also generate a large number of branches. The smaller k is, the more serious the branching problem; the larger k is, the fewer repeat regions appear, which directly affects assembly quality. By assembling with a varying k, IDBA handles the branching problem well and improves assembly quality. In addition, by deleting low-coverage erroneous k-mers, IDBA greatly reduces its memory footprint and increases its processing speed.

SOAPdenovo can assemble hundreds of millions of reads efficiently and with high quality. It inherits the advantages of both the OLC and the De Bruijn graph algorithms, which greatly improves its assembly quality. SOAPdenovo uses a preset k-mer threshold to filter and correct reads, reducing the number of erroneous sequences. It also borrows the method of Velvet to handle bubbles, which raises its average coverage. In addition, SOAPdenovo uses paired-end information to match overlapping regions, merges reads into contigs, and builds a contig-based graph structure, which greatly reduces the complexity of the contig graph.

ABySS introduces parallel computing: it builds a distributed De Bruijn graph on a Linux cluster, with the data stored in distributed fashion across the nodes, and uses MPI for communication between nodes. From graph construction and error correction through the subsequent vertex merging to the final reconstruction of the whole genome sequence, it has a large advantage in running time and memory consumption, and its error rate is very low. In terms of performance, it especially improves the memory use of each machine in the cluster, and it is increasingly widely used.

However, the error correction and error removal strategies above all rest on the assumption of uniform abundance. In actual sequencing, some vertices are mistaken for erroneous vertices because their sequencing abundance is relatively low, while SNPs or mutations located in repeat sequences may be regarded as correct vertices because their abundance is too high. Moreover, the strategies above all classify and remove errors at the level of De Bruijn graph vertices; classifying and removing errors at the level of edges, in particular the edges of a bidirectional multistep De Bruijn graph, has not been addressed.

Summary of the invention

The object of the present invention is to solve the above problems of the prior art by providing a method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph.

The technical solution of the present invention is a method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph, comprising the steps of:

S1, reading the sequencing data source file and constructing the bidirectional multistep De Bruijn graph;

S2, computing the weighted average coverage W over all bidirectional edges of the whole bidirectional multistep De Bruijn graph;

S3, traversing all edges of the whole bidirectional multistep De Bruijn graph, identifying every bidirectional edge whose coverage is less than 0.25 times the average coverage W as an erroneous bidirectional edge, and removing it.

Preferably, the De Bruijn graph is constructed by the following steps:

S11, reading a sequence s;

S12, cutting the sequence s into multiple fragments t with a sliding window; for a selected fragment t, denoting its canonical number as cur and the canonical numbers of its preceding and following fragments as pre and lat, respectively;

S13, if the encoding of t is less than the encoding of its complementary fragment, swapping the values of pre and lat;

S14, setting the corresponding bit in the forward bitmap of cur to 1 to represent the edge pointing to pre;

S15, setting the corresponding bit in the reverse bitmap of cur to 1 to represent the edge pointing to lat;

S16, repeating steps S12-S15 for the other fragments t of the sequence s until all fragments t of the sequence s have been processed, then performing step S17;

S17, reading a new sequence s and repeating steps S12-S16 until all sequences have been processed, then performing step S18;

S18, completing the construction of the bidirectional multistep De Bruijn graph.
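Steps S11-S18 can be sketched roughly as follows. This is a simplified, single-step illustration under several assumptions that the text does not state: the canonical form of a fragment is taken to be the larger of the fragment and its reverse complement (so that the comparison in S13 detects a fragment read on the opposite strand), each neighbor is identified by the single base that extends the fragment rather than by its canonical number, and bit i of a bitmap stands for base code i. The names `build_graph`, `CODE`, and `COMP` are hypothetical.

```python
from collections import defaultdict

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit base codes


def revcomp(s):
    """Reverse complement of a DNA string."""
    return "".join(COMP[b] for b in reversed(s))


def canonical(t):
    # Assumption: the canonical form is the larger of the two strands.
    return max(t, revcomp(t))


def build_graph(reads, k):
    """Map each canonical k-mer to [forward bitmap, reverse bitmap]."""
    graph = defaultdict(lambda: [0, 0])
    for s in reads:                                   # S11 / S17
        for j in range(len(s) - k + 1):               # S12: sliding window
            t = s[j:j + k]
            cur = canonical(t)
            pre = s[j - 1] if j > 0 else None         # base before t
            lat = s[j + k] if j + k < len(s) else None  # base after t
            if t < revcomp(t):                        # S13: swap pre and lat
                # On the opposite strand the neighbors trade places and
                # their bases are complemented.
                pre, lat = (COMP[lat] if lat else None,
                            COMP[pre] if pre else None)
            if pre is not None:
                graph[cur][0] |= 1 << CODE[pre]       # S14: forward bitmap
            if lat is not None:
                graph[cur][1] |= 1 << CODE[lat]       # S15: reverse bitmap
    return dict(graph)                                # S18


if __name__ == "__main__":
    print(build_graph(["AAC"], 2))
```

With four possible bases in each direction, every vertex in this sketch has at most eight edge slots, which matches the bound `i<8` used when the edges of a vertex are traversed.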

Preferably, step S2 comprises:

S21, initializing the weighted average coverage W to 0, the coverage sum Sum over all edges to 0, and the total length Len of all edges to 0;

S22, traversing the whole bidirectional multistep De Bruijn graph, visiting each vertex V in the graph, and performing step S23 for each edge of the vertex V;

S23, when processing the i-th edge of vertex V, multiplying the coverage of the i-th edge, V.multiplicity[i], by the weight of the edge, V.arcs[i].length(), and adding the product to the coverage sum Sum; at the same time adding the length of the edge, V.arcs[i].length(), to the total edge length Len; continuing until all edges of vertex V have been accumulated;

S24, computing the weighted average coverage W of the whole bidirectional multistep De Bruijn graph as:

W = Sum / Len.
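Steps S21-S24 can be sketched as below. The field names `multiplicity` and `arcs` follow the text; the `Vertex` class itself, the eight-slot layout, and the guard against an empty graph are assumptions made so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vertex:
    # Assumed layout: up to 8 edge slots per vertex; an empty string
    # marks an unused slot.
    multiplicity: List[int] = field(default_factory=lambda: [0] * 8)
    arcs: List[str] = field(default_factory=lambda: [""] * 8)


def weighted_mean_coverage(vertices: List[Vertex]) -> float:
    """Length-weighted average edge coverage, W = Sum / Len."""
    total = 0    # S21: Sum
    length = 0   # S21: Len
    for v in vertices:                       # S22: visit every vertex
        for i in range(len(v.arcs)):         # S23: every edge of V
            total += v.multiplicity[i] * len(v.arcs[i])
            length += len(v.arcs[i])
    # S24; returning 0.0 for a graph without edges is an added guard.
    return total / length if length else 0.0


if __name__ == "__main__":
    a, b = Vertex(), Vertex()
    a.multiplicity[0], a.arcs[0] = 10, "ACGTA"
    a.multiplicity[1], a.arcs[1] = 8, "ACG"
    b.multiplicity[0], b.arcs[0] = 1, "AC"
    print(weighted_mean_coverage([a, b]))
```

Weighting by edge length means that a long, well-covered edge influences W more than a short one, so a few short low-coverage edges do not pull the average down much.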

Preferably, step S3 comprises:

S31, traversing the whole bidirectional multistep De Bruijn graph, visiting each vertex V in the graph, and performing step S32 for each edge of the vertex V;

S32, when processing the i-th edge of vertex V, taking the coverage of the i-th edge, V.multiplicity[i], as the weighted coverage of the i-th edge: wi = V.multiplicity[i];

S33, if wi < W*0.25, deleting the i-th edge, i.e. setting the coverage V.multiplicity[i] = 0 and the weight value of the edge V.arcs[i] = string("");

S34, if i < 8, continuing with step S32; otherwise proceeding to step S35;

S35, checking whether all edges have been processed; if so, ending; otherwise returning to step S31 to continue processing.
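The removal pass of steps S31-S35 can be sketched as below. The `Vertex` class, the `factor` parameter, and the returned count are assumptions added for illustration; the field names and the 0.25 coefficient come from the text. Unused slots (empty `arcs` entries) are skipped here so that only real edges are counted as removed.

```python
from typing import List


class Vertex:
    """Assumed vertex layout: 8 edge slots, empty string = no edge."""

    def __init__(self) -> None:
        self.multiplicity: List[int] = [0] * 8
        self.arcs: List[str] = [""] * 8


def remove_low_coverage_edges(vertices: List[Vertex], w: float,
                              factor: float = 0.25) -> int:
    """Delete every edge whose coverage is below factor * w.

    Returns the number of edges removed.
    """
    removed = 0
    for v in vertices:                          # S31: visit every vertex
        for i in range(len(v.arcs)):            # S34: i runs over 8 slots
            wi = v.multiplicity[i]              # S32
            if v.arcs[i] and wi < w * factor:   # S33: below threshold
                v.multiplicity[i] = 0
                v.arcs[i] = ""
                removed += 1
    return removed                              # S35: all edges processed


if __name__ == "__main__":
    a, b = Vertex(), Vertex()
    a.multiplicity[0], a.arcs[0] = 10, "ACGTA"
    a.multiplicity[1], a.arcs[1] = 8, "ACG"
    b.multiplicity[0], b.arcs[0] = 1, "AC"
    # With W = 7.6 the threshold is 1.9, so only the edge on b is removed.
    print(remove_low_coverage_edges([a, b], w=7.6))
```

Keeping the coefficient as a parameter makes it easy to try values other than 0.25 on a given data set.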

The beneficial effects of the present invention include the following. The invention can remove part of the errors commonly present in a De Bruijn graph, including erroneous links, tip-type errors, and bubble-type errors, so that contigs can continue to extend and the De Bruijn graph can continue to shrink. It finds erroneous bidirectional edges effectively while avoiding damage to correct bidirectional edges, which can increase contig length to some extent and improve contig quality.

Brief description of the drawings

Fig. 1 is a flow chart of the identification of erroneous bidirectional edges in one embodiment of the invention.

Fig. 2 is a flow chart of the removal of erroneous bidirectional edges in one embodiment of the invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

The embodiments of the present invention address the erroneous sequencing reads that sequencing instrument inaccuracy produces in high-throughput gene sequencing data. These errors introduce erroneous bidirectional edges into the bidirectional multistep De Bruijn graph constructed by the subsequent sequence assembly algorithm, and these erroneous bidirectional edges produce the following in the target graph: (1) erroneous links, (2) tip-type errors, and (3) bubble-type errors. Tangled together with the repeat sequences and gene mutation sites of the source genome itself, these errors prevent contigs from extending and the De Bruijn graph from shrinking further, so that subsequent gene sequencing cannot proceed effectively.

A method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph is provided, comprising the steps of:

S3, traversing all edges of the whole bidirectional multistep De Bruijn graph, identifying every bidirectional edge whose coverage is less than 0.25 times the average coverage W as an erroneous bidirectional edge, and removing it. The coefficient may also take other feasible values and is not limited to 0.25.

The embodiments of the present invention can remove part of the errors commonly present in a De Bruijn graph, including erroneous links, tip-type errors, and bubble-type errors, so that contigs can continue to extend and the De Bruijn graph can continue to shrink. They find erroneous bidirectional edges effectively while avoiding damage to correct bidirectional edges, which can increase contig length to some extent and improve contig quality.

Preferably, the De Bruijn graph is constructed by the following steps:

S11, reading a sequence s;

S18, completing the construction of the bidirectional multistep De Bruijn graph.

As shown in Fig. 1, step S2 comprises:

W = Sum / Len.
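Written out with the quantities accumulated in step S23, the weighted average and the removal threshold can be expressed as follows. The symbols m and ℓ and the three sample edges in the worked example are illustrative assumptions and are not part of the original text.

```latex
% m_{V,i} = V.multiplicity[i], \ell_{V,i} = V.arcs[i].length()
W = \frac{\mathrm{Sum}}{\mathrm{Len}}
  = \frac{\sum_{V}\sum_{i} m_{V,i}\,\ell_{V,i}}{\sum_{V}\sum_{i} \ell_{V,i}}

% Worked example: three edges with (m, \ell) = (10, 5), (8, 3), (1, 2)
\mathrm{Sum} = 10\cdot 5 + 8\cdot 3 + 1\cdot 2 = 76, \qquad
\mathrm{Len} = 5 + 3 + 2 = 10, \qquad
W = \frac{76}{10} = 7.6, \qquad
0.25\,W = 1.9
```

In this example only the third edge, with coverage 1 < 1.9, would be treated as an erroneous bidirectional edge under the 0.25 coefficient.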

As shown in Fig. 2, step S3 comprises:

S34, if i < 8, continuing with step S32; otherwise proceeding to step S35;

The specific embodiments of the present invention described above do not limit the scope of protection of the invention. Any other corresponding changes and modifications made according to the technical concept of the invention shall fall within the scope of protection of the claims of the invention.

Claims

1. A method for identifying and removing erroneous bidirectional edges in a bidirectional multistep De Bruijn graph, characterized by comprising the steps of:

S2, computing the weighted average coverage W over all bidirectional edges of the whole bidirectional multistep De Bruijn graph, wherein step S2 comprises:

S21, initializing the weighted average coverage W to 0, the coverage sum Sum over all edges to 0, and the total length Len of all edges to 0;

S22, traversing the whole bidirectional multistep De Bruijn graph, visiting each vertex V in the graph, and performing step S23 for each edge of the vertex V;

S23, when processing the i-th edge of vertex V, multiplying the coverage of the i-th edge, V.multiplicity[i], by the weight of the edge, V.arcs[i].length(), and adding the product to the coverage sum Sum; at the same time adding the length of the edge, V.arcs[i].length(), to the total edge length Len; continuing until all edges of vertex V have been accumulated;

W = Sum / Len;

S3, traversing all edges of the whole bidirectional multistep De Bruijn graph, identifying every bidirectional edge whose coverage is less than 0.25 times the average coverage W as an erroneous bidirectional edge, and removing it.

2. The method for identifying and removing erroneous bidirectional edges according to claim 1, characterized in that the De Bruijn graph is constructed by the following steps:

S11, reading a sequence s;

S12, cutting the sequence s into multiple fragments t with a sliding window; for a selected fragment t, denoting its canonical number as cur and the canonical numbers of its preceding and following fragments as pre and lat, respectively;

S16, repeating steps S12-S15 for the other fragments t of the sequence s until all fragments t of the sequence s have been processed, then performing step S17;

S17, reading a new sequence s and repeating steps S12-S16 until all sequences have been processed, then performing step S18;

S18, completing the construction of the bidirectional multistep De Bruijn graph.

3. The method for identifying and removing erroneous bidirectional edges according to claim 1, characterized in that step S3 comprises:

S31, traversing the whole bidirectional multistep De Bruijn graph, visiting each vertex V in the graph, and performing step S32 for each edge of the vertex V;

S32, when processing the i-th edge of vertex V, taking the coverage of the i-th edge, V.multiplicity[i], as the weighted coverage of the i-th edge: wi = V.multiplicity[i];

S33, if wi < W*0.25, deleting the i-th edge, i.e. setting the coverage V.multiplicity[i] = 0 and the weight value of the edge V.arcs[i] = string("");

S34, if i < 8, continuing with step S32; otherwise proceeding to step S35;