CN104951672B

CN104951672B - Assembly method and system combining second- and third-generation genome sequencing data

Info

Publication number: CN104951672B
Application number: CN201510346970.1A
Authority: CN
Inventors: 卜东波; 张仁玉; 陈挺; 李帅成; 孙世伟; 刘兴武; 许情; 郑全刚; 王超
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2015-06-19
Filing date: 2015-06-19
Publication date: 2017-08-29
Anticipated expiration: 2035-06-19
Also published as: CN104951672A

Abstract

The present invention relates to the fields of bioinformatics and computational biology, and in particular to an assembly method and system that combine second- and third-generation genome sequencing data. The method comprises: obtaining second-generation genome sequencing data, preprocessing the data using the quality information of the partial base sequences (reads) it contains, and building a de Bruijn graph; performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph; and obtaining third-generation genome sequencing data, mapping the third-generation data onto the single-molecule graph (gapped fragments) of the second-generation data, resolving the compressed de Bruijn graph using the optimal arrangement, and filling the gaps within the optimal arrangement, thereby completing the assembly of the genome sequencing data.

Description

Assembly method and system combining second- and third-generation genome sequencing data

Technical field

The present invention relates to the fields of bioinformatics and computational biology, and in particular to an assembly method and system that combine second- and third-generation genome sequencing data.

Background art

A genome is the complete set of hereditary information carried in an organism's DNA (or RNA, for some viruses). DNA is a complementary double strand built from four bases, A, C, T and G. According to the central dogma of biology, the DNA base sequence directs the transcription of RNA and, in turn, the translation and synthesis of proteins. Knowing the DNA base sequence is therefore an important basis for understanding the laws of biology. Sequencing technologies yield partial base sequences (reads) of the DNA, which are assembled into a complete genome sequence for further analysis and research.

DNA sequencing technology has gone through three stages: first-, second- and third-generation sequencing. First-generation sequencing is the dideoxy chain-termination method invented by Sanger in 1977; using an improved Sanger method, researchers completed almost all of the sequencing for the Human Genome Project (HGP, 1995~2003). Second-generation sequencing emerged at the beginning of the 21st century. Its representative instruments are the next-generation sequencers (i.e. second-generation sequencers) released in succession by 454, Illumina and ABI. These instruments run a large number of sequencing reactions in parallel, which greatly reduces sequencing time and cost. Compared with traditional sequencing methods, the notable advantage of second-generation sequencing is high throughput; for example, a SOLiD 3 sequencer can produce 20 GB of sequencing data in a single run. Its drawback is that the reads are much shorter than those of Sanger sequencing: Sanger reads can reach 900 bp, whereas 454 reads are 250-400 bp and Solexa reads are 50-75 bp. Short reads make it hard for assembly algorithms to resolve repetitive regions, so assemblies become fragmented. In addition, the error rate of second-generation sequencing is higher. Third-generation sequencing began in 2008 and is characterized by a "single-molecule sequencing" strategy. The main technologies are HeliScope single-molecule sequencing from Helicos BioSciences, single-molecule real-time sequencing from Pacific Biosciences, and nanopore single-molecule sequencing from Oxford Nanopore Technologies Ltd. A notable feature of single-molecule sequencing is that the sample is no longer amplified, which ensures as far as possible that the sequencing data (reads) cover the genome uniformly. Single-molecule reads are 3 kb to 20 kb long. Their potential advantage is the ability to resolve long repeats; their drawback is a high read error rate (about 5% to 15%).

Whether first-generation Sanger sequencing or second- or third-generation sequencing is used, each run can only "read" short fragments of the DNA; none can read an entire genome from end to end in a single run. The short fragments must therefore be assembled into a complete genome, a process called de novo sequence assembly.

Common assembly strategies for third-generation sequencing data include:

The hybrid assembly strategy of the AHA assembler: third-generation reads are first aligned to the contigs assembled from second-generation data, and the third-generation reads are then used as links to build a scaffold graph. Combining sequence data from Illumina, Roche 454 and PacBio, the tool performs scaffolding, overlap-layout-consensus and error handling, and finally produces a complete genome. Its drawback is that third-generation reads align fairly accurately to a complete genome, but alignment accuracy drops on relatively short contigs.

The hybrid assembly strategy of the SSPACE-LongRead assembler: it iteratively assembles the contigs that have already been produced, and performs scaffolding in a fast and reliable way. As with AHA, its drawback is that third-generation reads align fairly accurately to a complete genome, but alignment accuracy drops on relatively short contigs.

The hybrid assembly strategy of the PBcR assembler: to exploit the potential of long reads for de novo assembly, one approach is to correct the long single-molecule sequences with short, high-accuracy sequences. PBcR (PacBio corrected Reads), part of the Celera Assembler, maps short reads onto each long read and builds a high-accuracy consensus from the short reads to trim and correct the long read. The corrected hybrid reads are then assembled de novo on their own or together with other data. Its drawback is that error correction consumes a large amount of computing resources.

The assembly strategy of the HGAP (Hierarchical Genome-Assembly Process) assembler: it uses a single long-insert shotgun DNA library together with single-molecule real-time (SMRT) DNA sequencing for high-quality de novo assembly of microbial genomes. HGAP uses the longest reads as seeds to recruit all other reads, pre-assembles the reads through a consensus procedure based on a directed acyclic graph, and then assembles them with an off-the-shelf long-read assembler. Unlike hybrid strategies, HGAP does not need high-accuracy reads for error correction. Its drawback is that a high-quality assembly requires very high sequencing depth, which raises the sequencing cost.

Correcting third-generation sequencing data with second-generation data consumes substantial computing resources, because both data sets are large. When contigs built from second-generation data are iteratively resolved with PacBio data, long repeats remain mixed in and are hard to resolve.

On the other hand, assembling directly from third-generation data requires a large amount of time for self-correction. A sufficiently high sequencing depth is also needed to guarantee a good assembly, which greatly increases the cost of the experiment.

It is generally believed that, unless the sequencing depth is very high, CLR (continuous long reads) cannot be used for high-quality assembly. Chin et al. proposed a new non-hybrid approach, HGAP, which assembles complete bacterial genomes from CLR alone. However, the sequencing depth must reach 50× for error correction, still higher depth is needed to span repetitive regions, and manual intervention is also required for error correction. In terms of sequencing cost, assembling a single genome this way is relatively expensive, particularly for eukaryotes.

At present, hybrid assembly methods attempt to correct errors in CLR; in principle this is feasible with PacBio CCS reads or short NGS reads (or a mixture of both). Several methods that use second- and third-generation data to increase assembly length have been proposed, and assemblers such as Celera, MIRA and ALLPATHS-LG have added hybrid assembly strategies. Although they achieve good results, error correction with second-generation data requires longer reads (> 75 bp), higher sequencing depth and more computing resources. The PacBioToCA error-correction pipeline also supports non-hybrid PacBio assembly.

For scaffolding, the AHA strategy is the most commonly used. In this strategy, CLR are used only to scaffold the contigs assembled from second-generation data; it usually produces an incomplete assembly and is not suitable for large genomes. More recently, Cerulean was released as a new hybrid assembly tool; it builds scaffolds from the contig graph produced by ABySS and from uncorrected CLR. Although it produces good results, Cerulean requires the contigs produced by ABySS, whereas other assemblers may produce better assemblies. Finally, some software has been developed to fill the gaps in scaffolds with PacBio reads, PBJelly among them. Owing to the limited read length of second-generation data and the error rate of third-generation data, completely assembling prokaryotic and eukaryotic genomes remains difficult.

Summary of the invention

In view of the shortcomings of the prior art, the present invention proposes an assembly method and system that combine second- and third-generation genome sequencing data.

The present invention proposes an assembly method combining second- and third-generation genome sequencing data, comprising:

Step 1: obtaining second-generation genome sequencing data, preprocessing the data using the quality information of the partial base sequences (reads) it contains, and building a de Bruijn graph;

Step 2: performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;

Step 3: obtaining third-generation genome sequencing data, mapping the third-generation data onto the single-molecule graph (gapped fragments) of the second-generation genome sequencing data, resolving the compressed de Bruijn graph using the optimal arrangement, and filling the gaps within the optimal arrangement, thereby completing the assembly of the genome sequencing data.

In the above method, the preprocessing of the second-generation genome sequencing data in step 1 comprises deleting low-quality reads to produce a new set of reads, and breaking the new reads into k-mers of equal length.

In the above method, step 1 further comprises generating k-mers according to the k-mer length k supplied by the user, storing them in a hash table, and recording the number of occurrences of each k-mer.

In the above method, in step 2 the edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.

In the above method, step 3 further comprises generating distance estimates between the gapped fragments of the single-molecule graph and solving a linear program to obtain the globally optimal arrangement.

In the above method, step 3 further comprises deleting erroneous connections from the single-molecule graph, marking the repeats of the single-molecule graph, and linearizing the graph.

The present invention also proposes an assembly system combining second- and third-generation genome sequencing data, comprising:

a de Bruijn graph construction module, for obtaining second-generation genome sequencing data, preprocessing the data using the quality information of the partial base sequences (reads) it contains, and building a de Bruijn graph;

a compressed de Bruijn graph generation module, for performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;

an assembly module, for obtaining third-generation genome sequencing data, mapping the third-generation data onto the single-molecule graph (gapped fragments) of the second-generation genome sequencing data, resolving the compressed de Bruijn graph using the optimal arrangement, and filling the gaps within the optimal arrangement, thereby completing the assembly of the genome sequencing data.

In the above system, the preprocessing of the second-generation genome sequencing data in the de Bruijn graph construction module comprises deleting low-quality reads to produce a new set of reads, and breaking the new reads into k-mers of equal length.

In the above system, the de Bruijn graph construction module further generates k-mers according to the k-mer length k supplied by the user, stores them in a hash table, and records the number of occurrences of each k-mer.

In the above system, in the compressed de Bruijn graph generation module the edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.

The overall technical effect of the present invention is as follows:

The quality of an assembly is judged by assembly length and assembly error rate. Assembly length is usually measured by the longest contig length, contig N50 and contig N90. Four kinds of assembly error were compared using GAGE (Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms (vol 22, pg 557, 2012). Genome Research, 2012. 22(6): p. 1196-1196): indels, inversions, translocations and relocations. We compare the performance of ARCS23 with that of SSPACE-LongRead.
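The N50 and N90 statistics named above can be computed as in the following minimal sketch. It uses the common definition (the contig length at which the longest contigs together reach the given fraction of the total assembled length); the function name `nxx` and the example lengths are illustrative and do not come from the patent.

```python
def nxx(contig_lengths, fraction):
    """Return the Nxx statistic: the length L such that contigs of
    length >= L together cover at least `fraction` of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("contig_lengths must be non-empty")


# Illustrative contig lengths (not data from the patent).
lengths = [100, 80, 50, 30, 20, 20]
n50 = nxx(lengths, 0.5)
n90 = nxx(lengths, 0.9)
```

A larger N50 at a comparable error count indicates a more contiguous assembly, which is why both length and error statistics are reported together.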

Table 4.1: Comparison of the ARCS23 and SSPACE-LongRead assembly results on E. coli

The results show that both ARCS23 and SSPACE-LongRead achieve relatively good assemblies. The ARCS23 assembly is longer and comes close to completing the whole genome.

We ran the GAGE evaluation software and counted the following kinds of error: indels, crossings between different scaffolds; inversions, where a scaffold changes DNA strand within a chromosome; translocations, where a scaffold maps to different chromosomes; and relocations, insertions or deletions of contigs longer than 200 bp.

Table 4.2: Error counts reported by GAGE

Software	#indels	#inversions	#translocations	#relocations
SSPACE-LongRead	0	33	0	1
ARCS23	2	12	0	0

Brief description of the drawings

Fig. 1 illustrates the removal of tips, bubbles and long-range connections;

Fig. 2 illustrates the formation of the CDB(reads, k) graph;

Fig. 3 shows the fitting of the convex cost function;

Fig. 4 shows the gapped fragment graph;

Fig. 5 illustrates the construction of the linear program;

Fig. 6 illustrates the search for long-range connections;

Fig. 7 illustrates the search for repeats;

Fig. 8 illustrates gap filling.

Embodiments

The specific steps of the present invention are as follows:

Step 1: Build a de Bruijn graph from the second-generation sequencing data. Second-generation sequencing data generally include read quality information. The data are first preprocessed using this quality information to remove low-quality fragments. Each read is then broken into k-mers of equal length (a k-mer is a nucleotide sequence of length k obtained by cutting a read consecutively, sliding one base at a time), and the de Bruijn graph is built. While reading in the reads, ARCS23 generates k-mers according to the k-mer length k supplied by the user, stores them in a hash table, and records how many times each k-mer occurs. The SOAPdenovo2 implementation stores each k-mer count in one byte, which caps the count at 255. In second-generation assembly the sequencing depth is usually high, so a k-mer can occur more than 255 times. The present invention needs exact k-mer counts, both for error removal and for estimating sequence multiplicity.
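Step 1 can be sketched as follows. This is a minimal illustration, not the ARCS23 implementation: the mean-quality filter, its threshold of 20, and the function name `count_kmers` are assumptions, since the text says only that low-quality fragments are removed. Python integers are unbounded, so the counts are not capped at 255 as a one-byte counter would be.

```python
from collections import Counter


def count_kmers(reads, k, min_quality=20):
    """Count k-mers over reads given as (sequence, per-base quality list).

    Assumed filter: a read is dropped when its mean base quality is below
    `min_quality`. The hash table is a Counter keyed by the k-mer string.
    """
    counts = Counter()
    for seq, quals in reads:
        if not quals or sum(quals) / len(quals) < min_quality:
            continue  # low-quality read, skipped entirely
        # Slide a window of width k along the read, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# 300 copies of a good read plus one low-quality read (illustrative data).
reads = [("ATCGGTCGC", [30] * 9)] * 300 + [("ATCGATCG", [5] * 8)]
counts = count_kmers(reads, 4)
```

In this example `counts["ATCG"]` is 300, a value that a one-byte counter could not store exactly.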

Step 2: Handle sequencing errors in the de Bruijn graph. The preprocessed reads still contain many sequencing errors. ARCS23 uses both k-mer coverage and the topology of the de Bruijn graph to identify sequences caused by sequencing errors. The error handling in this step removes obviously erroneous k-mers and reduces the size of the de Bruijn graph. For each h-path in DB(reads, k), the average k-mer count is computed; if the average is below a threshold, the edges representing all k-mers on that h-path are deleted from DB(reads, k). This removes most of the tips and bubbles caused by sequencing errors.

In Fig. 1, (A) contains a bubble in which the edges in the black box are caused by sequencing errors; k-mers produced by random sequencing errors generally have low counts. In (B), the edges in the black box are tips caused by sequencing errors at the 5' or 3' end of reads. (C) shows a long-range connection caused by chimeric reads or random sequencing errors.

Random sequencing errors give the k-mers on an h-path of the de Bruijn graph low counts. A tip or bubble is often caused by one or a few nearby sequencing errors, so all the k-mers on it have low counts, and the k-mers on the tip or bubble are deleted together. Working on whole h-paths also avoids deleting individual k-mers whose counts are low merely because of low sequencing depth.
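The h-path filter described above can be sketched as follows. The representation of an h-path as a list of k-mers, the function name, and the threshold value are assumptions made for illustration; the patent does not state how the threshold is chosen.

```python
def prune_low_coverage_hpaths(hpaths, kmer_counts, threshold):
    """Split h-paths into (kept, removed) by their average k-mer count.

    hpaths: list of non-empty lists of k-mers (one list per h-path).
    kmer_counts: mapping from k-mer to its number of occurrences.
    """
    kept, removed = [], []
    for path in hpaths:
        mean_count = sum(kmer_counts.get(kmer, 0) for kmer in path) / len(path)
        # The whole path is kept or dropped together, never k-mer by k-mer.
        (kept if mean_count >= threshold else removed).append(path)
    return kept, removed


# Illustrative data: one well-covered path and one likely error path.
kmer_counts = {"ATCG": 30, "TCGG": 28, "ATCA": 1, "TCAG": 2}
hpaths = [["ATCG", "TCGG"], ["ATCA", "TCAG"]]
kept, removed = prune_low_coverage_hpaths(hpaths, kmer_counts, threshold=3)
```

Deciding per path rather than per k-mer is what lets a single low-count k-mer survive when its neighbours on the same path are well covered.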

Step 3: Form the compressed de Bruijn graph. Edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, called a compressed edge; the resulting graph is called the compressed de Bruijn graph. It is much smaller than the original graph while preserving the connectivity between sequences.

The hub nodes of DB(reads, k) become the nodes of CDB(reads, k). If there is an h-path between nodes u and v, an edge is added between u and v, and the sequence of the h-path is stored on it. Note that two nodes may be joined by multiple edges. In Fig. 2, A) the sequence is ATCGGTCGC; B) a k-mer length of 4 is chosen and the de Bruijn graph is formed; C) the compressed de Bruijn graph is formed.

Forming CDB(reads, k) greatly reduces the size of the graph and removes a large amount of redundant sequence information: each compressed edge stores only its sequence, with no need to store every k-mer.
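The compression step can be sketched on the Fig. 2 example (sequence ATCGGTCGC, k = 4). The sketch assumes that nodes are (k-1)-mers, edges are distinct k-mers, and a hub is any node that does not have exactly one incoming and one outgoing edge; the function names are illustrative. Isolated cycles that contain no hub are not handled.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return the distinct k-mer edges as (prefix, suffix) pairs of (k-1)-mers."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


def compress(edges):
    """Merge non-branching paths; return a list of (hub_u, hub_v, sequence)."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for u, v in edges:
        out_edges[u].append(v)
        in_degree[v] += 1
    nodes = {node for edge in edges for node in edge}

    def is_hub(node):
        return not (in_degree[node] == 1 and len(out_edges[node]) == 1)

    compressed = []
    for u in sorted(nodes):
        if not is_hub(u):
            continue
        for v in sorted(out_edges[u]):
            seq = u + v[-1]
            # Follow the path through 1-in/1-out nodes until the next hub.
            while not is_hub(v):
                v = out_edges[v][0]
                seq += v[-1]
            compressed.append((u, v, seq))
    return compressed


edges = build_de_bruijn(["ATCGGTCGC"], 4)
compressed = compress(edges)
```

For this input the six k-mer edges collapse into three compressed edges with sequences ATCG, TCGGTCG and TCGC; the node TCG is a hub because it has two incoming and two outgoing edges.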

Step 4: Estimate the sequence multiplicity of the compressed edges. A compressed edge can be treated as a unit when estimating multiplicity, because all k-mers on a compressed edge occur the same number of times in the genome. Using the coverage of the k-mers on each compressed edge, the present invention designs a cost function for the edge and solves the multiplicity estimation problem with a minimum-cost flow algorithm.

The present invention estimates the multiplicity of the compressed edges with a maximum-likelihood model, as follows:

Let D denote the circular genome, of length N(D); let d_i denote the multiplicity of the i-th compressed edge; and let X_ij denote the number of occurrences of the k-mer at the j-th position of the i-th compressed edge. In probabilistic terms, the n k-mers are the outcomes of n independent trials. In each trial a position is sampled from D with equal probability, and the outcome is the position of that k-mer on a compressed edge. Given i and j, the probability that a single trial yields this position is d_i/N(D). Considered individually, each random variable follows a binomial distribution; considered jointly, they follow a multinomial distribution:

Here g is the number of compressed edges, and n_i is the number of k-mer positions on the i-th compressed edge.

In the assembly problem D is unknown, but the outcomes of the n trials are known. Given the outcomes X_ij of the n trials, the present invention therefore considers the distribution of the d_i, which is called the likelihood of the global k-mer counts:

Under this strategy, the present invention assembles the genome that maximizes the likelihood of the global k-mer counts or, equivalently, minimizes the negative logarithm of this likelihood, -log L. Because the problem is solved as a flow with convex costs on a directed graph, -log L must be a separable convex function of the d_i; that is, convex functions c_i must be found such that -log L = ∑ c_i(d_i). Because the multinomial distribution involves N(D) = ∑ d_i, which couples all the d_i, no such functions can be found.

As the number of trials tends to infinity, the random variables X_ij tend toward independence. Because the total number of trials is generally large, the multinomial distribution can be approximated by a product of individual binomial distributions. In the binomial approximation, the genome length N(D) is a constant independent of each d_i and is written N. The value of N can be approximated by biological experiment or by an EM strategy; the present invention calculates it by a formula that is not reproduced in this text.

The approximation of L then becomes:

The present invention can now write -log L = K + ∑ c_i(d_i), where K is a constant independent of all d_i, and:

As shown in Fig. 3, the function is convex. The present invention fits it with two straight lines, whose slopes represent the cost per unit of flow.
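The displayed formulas of this derivation are not reproduced in the text. The block below is a reconstruction that follows from the definitions above, under the assumption that a single trial yields position j of edge i with probability d_i/N(D). It is offered as a sketch and may differ from the notation of the original formulas.

```latex
% Joint (multinomial) distribution of the k-mer counts
P\bigl(X_{ij} = x_{ij}\ \forall i,j\bigr)
  = \frac{n!}{\prod_{i=1}^{g}\prod_{j=1}^{n_i} x_{ij}!}
    \prod_{i=1}^{g}\prod_{j=1}^{n_i}
    \left(\frac{d_i}{N(D)}\right)^{x_{ij}}

% Binomial approximation, with N(D) replaced by the constant N
L \approx \prod_{i=1}^{g}\prod_{j=1}^{n_i}
    \binom{n}{x_{ij}}
    \left(\frac{d_i}{N}\right)^{x_{ij}}
    \left(1-\frac{d_i}{N}\right)^{n-x_{ij}}

% Separable form: a constant plus one convex term per compressed edge
-\log L = K + \sum_{i=1}^{g} c_i(d_i), \qquad
K = -\sum_{i=1}^{g}\sum_{j=1}^{n_i} \log\binom{n}{x_{ij}}

c_i(d_i) = -\sum_{j=1}^{n_i}\left[
    x_{ij}\log\frac{d_i}{N}
    + (n-x_{ij})\log\left(1-\frac{d_i}{N}\right)\right]
```

Each c_i is convex on 0 < d_i < N because both -log(d_i/N) and -log(1 - d_i/N) are convex there, which is the property the two-line fit relies on.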

In CDB(reads, k), each compressed edge is turned into two arcs. The cost of one arc is the slope of the left fitted line, which is negative, and its flow range is from 0 to the minimum point of the function; the cost of the other arc is the slope of the right fitted line, and its flow range is from 0 to infinity. A source s and a sink t are added to the graph. An arc is drawn from the source s to every node, with cost 0 and flow range 0 to infinity, and an arc is drawn from every node to the sink t, with cost 0 and flow range 0 to infinity.
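The network construction just described can be sketched as a plain list of arcs. The fitted slopes and the minimum point are taken as inputs (the hypothetical `fits` mapping), because the text does not say how the two lines are fitted; the node names "s" and "t" and the tuple layout are likewise assumptions. A minimum-cost flow solver, not shown here, would then run on these arcs.

```python
import math


def build_flow_arcs(compressed_edges, fits):
    """Build arcs (tail, head, cost_per_unit, capacity) for the flow network.

    compressed_edges: list of (edge_id, u, v).
    fits: edge_id -> (left_slope, d_min, right_slope), the two-line fit of
          the convex cost of that edge (hypothetical input).
    """
    arcs = []
    nodes = set()
    for edge_id, u, v in compressed_edges:
        left_slope, d_min, right_slope = fits[edge_id]
        arcs.append((u, v, left_slope, d_min))      # flow up to the minimum point
        arcs.append((u, v, right_slope, math.inf))  # any flow beyond it
        nodes.update((u, v))
    for node in sorted(nodes):
        arcs.append(("s", node, 0.0, math.inf))  # source to every node
        arcs.append((node, "t", 0.0, math.inf))  # every node to sink
    return arcs


# Illustrative two-edge graph with made-up fits.
compressed_edges = [("e1", "A", "B"), ("e2", "B", "A")]
fits = {"e1": (-4.0, 1, 6.0), "e2": (-3.0, 2, 5.0)}
arcs = build_flow_arcs(compressed_edges, fits)
```

Because the left arc has negative cost and bounded capacity, a minimum-cost solution tends to saturate it first, so the total flow on an edge settles near the minimum of its fitted cost.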

Step 5: Map the third-generation sequencing data onto the single-molecule graph (gapped fragments) formed from the second-generation data, form distance estimates between the gapped fragments, and solve a linear program to obtain the globally optimal arrangement.

When the gapped fragment graph is built, the compressed edges of CDB(reads, k) whose multiplicity is 1 are chosen as its nodes. In general, the distances between multiplicity-1 compressed edges, and their relative order, are unique in the genome.

Once these multiplicity-1 compressed edges have been selected, a fairly simple graph can be built by mapping pair-kmers. Ignoring the effects of sequencing errors and repeats, this gapped fragment graph should be a directed acyclic graph (DAG). If compressed edges with a multiplicity other than 1 were added, the relative distances between all compressed edges could be preserved as far as possible, but the gapped fragment graph would become extremely complex.

In Fig. 4, (a) the genome has five sequence segments in total; sequence B has multiplicity 2 and the others have multiplicity 1. (b) The compressed de Bruijn graph is formed, with four compressed edges in total, and pair-kmers are mapped onto these four edges. (c) If all four compressed edges are kept in the gapped fragment graph, the resulting graph is extremely complex. (d) If the gapped fragment graph keeps only the multiplicity-1 compressed edges, the graph is relatively simple.

Choosing only multiplicity-1 compressed edges as nodes of the gapped fragment graph retains enough information to extend sequences. Repeats shorter than the insert size are flanked on both sides by multiplicity-1 compressed edges, and pair-kmers can link these flanking edges. Long repeats whose length exceeds the insert size are difficult to assemble.

For any two nodes u and v, an edge u → v is added if there is a pair-kmer whose two ends lie on u and v respectively. ARCS23 reads the paired-end reads again, forms pair-kmers, maps them onto the compressed edges, and stores the number of mapped pair-kmers and their distance information.

Consider two compressed edges C₁ and C₂ of lengths l₁ and l₂, and let d denote the distance between them:

Here R denotes the length of the two ends of a read; a total of n pair-kmers map to C₁ and C₂; the i-th pair-kmer maps at positions d_1i and d_2i; and f denotes the empirical distribution of the insert size. The most likely distance d can be obtained with the EM algorithm.
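The distance formula is not reproduced in the text, so the sketch below swaps in a much simpler estimator in place of the EM procedure: it subtracts the average span that the pair-kmers cover on the two edges from the mean insert size. The layout (C₁, then a gap of length d, then C₂), the offset convention, and all names are assumptions made for illustration.

```python
from statistics import mean


def estimate_gap(l1, pairs, mean_insert):
    """Estimate the gap d between compressed edges C1 and C2.

    l1: length of C1.
    pairs: list of (p1, p2); p1 is the offset of the left end from the start
           of C1, p2 the offset of the right end from the start of C2.
    mean_insert: mean distance between the two ends of a pair-kmer.

    With C1, gap d, C2 laid out in a row, one pair spans
    (l1 - p1) + d + p2, so d is the insert size minus the rest.
    """
    span_without_gap = [(l1 - p1) + p2 for p1, p2 in pairs]
    return mean_insert - mean(span_without_gap)


# Illustrative numbers: both pairs imply 200 bases outside the gap.
gap = estimate_gap(l1=1000, pairs=[(900, 100), (800, 0)], mean_insert=500)
```

An EM-based estimate, as the text describes, would weight each candidate distance by the empirical insert-size distribution f instead of using only its mean.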

Let x_i denote the relative position of compressed edge C_i. The problem is formalized as a linear program:

min ∑ |e_ij|

s.t. x_j - x_i + e_ij = d_ij

|e_ij| ≤ E_ij

Here d_ij denotes the estimated distance between compressed edges C_i and C_j, and e_ij denotes the deviation between d_ij and x_j - x_i. The optimization objective is to minimize the sum of the deviations.
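To make the formulation concrete, the sketch below evaluates a candidate arrangement against the constraints above rather than solving the linear program. It assumes that the objective is the sum of absolute deviations |e_ij|; the function name, the dictionaries and the example values are illustrative only.

```python
def evaluate_arrangement(x, distances, max_dev):
    """Score a candidate arrangement of compressed edges.

    x: edge -> position x_i.
    distances: (i, j) -> estimated distance d_ij.
    max_dev: (i, j) -> bound E_ij on the deviation.
    Returns (sum of |e_ij|, list of pairs violating |e_ij| <= E_ij),
    where e_ij = d_ij - (x_j - x_i).
    """
    total = 0.0
    violated = []
    for (i, j), d_ij in distances.items():
        e_ij = d_ij - (x[j] - x[i])
        total += abs(e_ij)
        if abs(e_ij) > max_dev[(i, j)]:
            violated.append((i, j))
    return total, violated


# Illustrative arrangement of three compressed edges.
x = {"A": 0, "B": 100, "C": 250}
distances = {("A", "B"): 100, ("B", "C"): 140, ("A", "C"): 255}
max_dev = {pair: 20 for pair in distances}
objective, violated = evaluate_arrangement(x, distances, max_dev)
```

A linear programming solver would search over x for the arrangement with the smallest such objective; pairs that still show large deviations are the position conflicts that the next step examines.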

Step 6: In this optimal arrangement, conflicts may still exist between the positions of some compressed edges. Starting from these position conflicts, the method searches for and removes erroneous connections in the single-molecule graph (the gapped fragment graph), marks the repeats of the gapped fragment graph, and linearizes it.

The pair-kmer graph contains many edges caused by chimeric reads or sequencing errors. Suppose edges supported by few pair-kmers are deleted using a fixed threshold. A larger threshold deletes more erroneous edges, but also deletes some edges whose pair-kmer count is low only because the compressed edges at both ends are short. A smaller threshold leaves many erroneous edges undeleted.

ARCS23 therefore uses a variable threshold to delete erroneous edges from the pair-kmer graph. Consider two compressed edges C₁ and C₂ of lengths l₁ and l₂, separated by a distance d. Relative to the total number of pair-kmers, few pair-kmers have their left and right ends on C₁ and C₂. The number of such pair-kmers is represented by a random variable X₁₂, which follows a Poisson distribution whose parameter λ can be estimated as follows:

where e is the average number of k-mers per site. The score of each edge can be expressed as a likelihood; for the edge C₁ → C₂, its quality is

The quality of every edge in the pair-kmer graph is computed in this way, and edges of relatively low quality are deleted.
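The formulas for λ and for the edge quality are not reproduced in the text. The sketch below therefore takes the expected count λ of each edge as an input and, as an assumption, uses the Poisson log-probability of the observed pair-kmer count as the quality score. The cutoff value and all names are illustrative.

```python
import math


def poisson_log_pmf(k, lam):
    """Return log P(X = k) for X ~ Poisson(lam), with lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def low_quality_edges(observed, expected, cutoff):
    """List edges whose observed pair-kmer count is unlikely under Poisson.

    observed: edge -> number of pair-kmers supporting the edge.
    expected: edge -> Poisson parameter lambda for that edge (input here;
              the patent estimates it from edge lengths and distance).
    """
    return [edge for edge in observed
            if poisson_log_pmf(observed[edge], expected[edge]) < cutoff]


# Illustrative data: both edges are expected to carry about 50 pair-kmers.
observed = {("C1", "C2"): 48, ("C1", "C9"): 2}
expected = {("C1", "C2"): 50.0, ("C1", "C9"): 50.0}
suspect = low_quality_edges(observed, expected, cutoff=-10.0)
```

Because λ depends on the lengths of the two compressed edges and their distance, the same observed count can be acceptable for a short pair of edges and suspicious for a long one, which is the point of a variable threshold.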

Because multiplicity estimation may be wrong for some compressed edges, some compressed edges that are repeats may be placed in the pair-kmer graph. Sequencing errors or chimeric reads can also connect compressed edges that are far apart on the genome, or even on complementary strands. Both cause conflicts among the positions that the linear program assigns to compressed edges, as in Fig. 5.

In Fig. 5, such an edge appears only on paths between the two conflicting compressed edges. ARCS23 finds the lowest-quality edge on the paths between the conflicting compressed edges; if the quality of this edge is significantly lower than that of all surrounding edges, the edge is deleted.

A compressed edge that represents a repeat occupies different positions on the genome, which causes the positions that the linear program assigns to compressed edges to conflict, as in Fig. 6.

In Fig. 7, compressed edges 4 and 9 conflict because compressed edge 2, a repeat, was added. All edges in the figure are correct, so their quality is high. In this case ARCS23 finds the common ancestors or common descendants of the conflicting compressed edges; these nodes are likely to be repeats and are deleted.
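The search for common ancestors can be sketched as follows, assuming the pair-kmer graph is stored as a mapping from each node to its successors. The function names are illustrative, and a real implementation would probably narrow the candidates further (for example to the nearest common ancestors); common descendants can be found the same way on the reversed graph.

```python
def ancestors(graph, node):
    """Return every node that has a directed path to `node`.

    graph: node -> list of successor nodes.
    """
    predecessors = {}
    for u, successors in graph.items():
        for v in successors:
            predecessors.setdefault(v, set()).add(u)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for pred in predecessors.get(current, ()):
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return seen


def common_ancestors(graph, a, b):
    """Nodes that can reach both conflicting nodes a and b."""
    return ancestors(graph, a) & ancestors(graph, b)


# Illustrative graph mirroring the example: node 2 leads to both 4 and 9.
graph = {2: [4, 9], 4: [5], 9: [10]}
candidates = common_ancestors(graph, 4, 9)
```

Here node 2 is the only common ancestor of the conflicting nodes 4 and 9, so it is the candidate repeat.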

Step 7：Compression de Bruijns are disassembled using optimal arrangement, the space between optimal arrangement is filled.This when The present invention obtained relatively complicated de Bruijns DB (reads, k) and across the wider pair-kmer of scope Figure, can form contig under the auxiliary of pair-kmer figures.The weak connectedness branch of each pair-kmer figure is exactly one contig.The relative ranks of each connected component's internal pressure cissing and distance between any two in known pair-kmer figures, are utilized (reads k) gets up the gap filling between them DB above.

Most of these gaps either consist of compressed edges produced by repetitive sequences, or arise because the sequencing depth is so low that no reads cover the region. For any edge u → v in the pair-kmer graph, DB(reads, k) is searched for paths from compressed edge u to compressed edge v whose length is close to the estimated distance:

1) If there is only one such path, the sequence between u and v is fairly simple, and the gap from u to v is filled directly with the sequence on this path, as in B of Fig. 8.

2) If there is no such path, the sequencing depth of the segment from u to v is relatively low, and the gap is filled with 'N', as in A of Fig. 8.

3) If there are multiple such paths, a scoring function is designed to score them, and the highest-scoring path is selected to fill the gap, as in C of Fig. 8.
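The three cases above can be sketched as follows. This is an illustration rather than the scoring scheme of the invention: the graph layout (each node maps to a list of `(next_node, label)` pairs, where `label` is the non-empty sequence added by following that edge), the tolerance `tol`, the function names, and the default scoring function (closeness of path length to the estimated distance) are all assumptions. The text does not specify the scoring function.

```python
def candidate_paths(graph, u, v, est_dist, tol):
    """Sequences spelled by paths from u to v whose length is within tol of
    est_dist. Labels must be non-empty so the search terminates on cycles."""
    results = []
    max_len = est_dist + tol

    def dfs(node, seq):
        if len(seq) > max_len:
            return  # too long to match the estimated distance
        if node == v and abs(len(seq) - est_dist) <= tol:
            results.append(seq)
        for nxt, label in graph.get(node, []):
            dfs(nxt, seq + label)

    dfs(u, "")
    return results


def fill_gap(graph, u, v, est_dist, tol, score=None):
    """Return the filler sequence for the gap between u and v."""
    paths = candidate_paths(graph, u, v, est_dist, tol)
    if not paths:
        return "N" * est_dist  # case 2: no path, fill with 'N'
    if len(paths) == 1:
        return paths[0]  # case 1: unique path
    if score is None:
        # Placeholder score (assumption): prefer lengths near the estimate.
        score = lambda seq: -abs(len(seq) - est_dist)
    return max(paths, key=score)  # case 3: highest-scoring path
```

The exhaustive search is exponential in the worst case, so a practical implementation would likely bound the number of paths explored; that bound is omitted here for brevity.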

In the de Bruijn graph construction module, the step of preprocessing the second-generation genome sequencing data includes deleting low-quality partial base sequence reads, generating new partial base sequence reads, and breaking the new partial base sequence reads into kmers of equal length.

The de Bruijn graph construction module also generates kmers according to the kmer length k input by the user, stores them in a hash table, and records the number of occurrences of each kmer.
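The preprocessing and kmer counting described above can be sketched as follows. The sketch assumes reads are given as `(sequence, qualities)` pairs with numeric Phred scores; the mean-quality filter and its threshold of 20 are illustrative assumptions, since the text does not state how low quality is decided. The function names are hypothetical.

```python
from collections import Counter


def filter_reads(reads, min_mean_quality=20):
    """Drop reads whose mean quality is below the (assumed) threshold."""
    kept = []
    for seq, quals in reads:
        if quals and sum(quals) / len(quals) >= min_mean_quality:
            kept.append(seq)
    return kept


def count_kmers(sequences, k):
    """Break each sequence into overlapping kmers of length k and count
    occurrences in a hash table (a Counter is a dict subclass)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

For example, `count_kmers(["ACGACG"], 3)` counts `ACG` twice and `CGA` and `GAC` once each; the counts can later serve as edge weights when sequencing errors are handled.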

In the compressed de Bruijn graph generation module, edges in the de Bruijn graph that do not have multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
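The merging of non-branching edges can be illustrated with the sketch below. It assumes the common convention that each kmer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix; the function name is hypothetical, reverse complements are ignored, and isolated cycles with no branching node are not reported, which a full implementation would need to handle.

```python
def compress_de_bruijn(kmers):
    """Merge runs of edges without multiple exits or entrances into
    compressed edges, returned as their spelled-out sequences."""
    succ = {}    # (k-1)-mer node -> list of successor nodes
    in_deg = {}  # node -> number of incoming edges
    for kmer in sorted(set(kmers)):
        u, v = kmer[:-1], kmer[1:]
        succ.setdefault(u, []).append(v)
        in_deg[v] = in_deg.get(v, 0) + 1
    nodes = set(succ) | set(in_deg)

    def passes_through(node):
        # Exactly one entrance and one exit: the node lies inside a run.
        return in_deg.get(node, 0) == 1 and len(succ.get(node, [])) == 1

    compressed = []
    for node in sorted(nodes):
        if passes_through(node):
            continue  # interior nodes are reached from a run's start node
        for nxt in succ.get(node, []):
            seq = node + nxt[-1]
            while passes_through(nxt):
                nxt = succ[nxt][0]
                seq += nxt[-1]
            compressed.append(seq)
    return compressed
```

With k = 3, the kmers `ACG`, `CGT`, `GTT` form a single unbranched chain and are merged into the one compressed edge `ACGTT`.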

Claims

1. An assembly method combining second-generation and third-generation genome sequencing data, characterized by comprising:

Step 1: obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data according to the quality information of partial base sequence reads in the second-generation genome sequencing data, and building a de Bruijn graph;

Step 2: performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;

Step 3: obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the gapped fragments of the single-molecule graph of the second-generation genome sequencing data, disentangling the compressed de Bruijn graph by means of an optimal arrangement, and filling the gaps within the optimal arrangement, so as to complete the assembly of the genome sequencing data.

2. The assembly method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that the step of preprocessing the second-generation genome sequencing data in step 1 includes deleting low-quality partial base sequence reads, generating new partial base sequence reads, and breaking the new partial base sequence reads into kmers of equal length.

3. The assembly method combining second-generation and third-generation genome sequencing data according to claim 2, characterized in that step 1 further includes generating kmers according to the kmer length k input by the user, storing them in a hash table, and recording the number of occurrences of each kmer.

4. The assembly method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that, in step 2, edges in the de Bruijn graph that do not have multiple exits or entrances are merged into a single edge, which serves as a compressed edge.

5. The assembly method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that step 3 further includes generating distance estimates between the gapped fragments of the single-molecule graph, and solving a linear program to obtain a globally optimal arrangement.

6. The assembly method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that step 3 further includes deleting incorrect connections in the single-molecule graph and marking the repetitive sequences of the single-molecule graph, so as to perform linearization.

7. An assembly system combining second-generation and third-generation genome sequencing data, characterized by comprising:

a de Bruijn graph construction module, configured to obtain second-generation genome sequencing data, preprocess the second-generation genome sequencing data according to the quality information of partial base sequence reads in the second-generation genome sequencing data, and build a de Bruijn graph;

a compressed de Bruijn graph generation module, configured to perform sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compress the new de Bruijn graph to generate a compressed de Bruijn graph, and obtain the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;

an assembly module, configured to obtain third-generation genome sequencing data, map the third-generation genome sequencing data back onto the gapped fragments of the single-molecule graph of the second-generation genome sequencing data, disentangle the compressed de Bruijn graph by means of an optimal arrangement, and fill the gaps within the optimal arrangement, so as to complete the assembly of the genome sequencing data.

8. The assembly system combining second-generation and third-generation genome sequencing data according to claim 7, characterized in that the step of preprocessing the second-generation genome sequencing data in the de Bruijn graph construction module includes deleting low-quality partial base sequence reads, generating new partial base sequence reads, and breaking the new partial base sequence reads into kmers of equal length.

9. The assembly system combining second-generation and third-generation genome sequencing data according to claim 8, characterized in that the de Bruijn graph construction module further generates kmers according to the kmer length k input by the user, stores them in a hash table, and records the number of occurrences of each kmer.

10. The assembly system combining second-generation and third-generation genome sequencing data according to claim 7, characterized in that, in the compressed de Bruijn graph generation module, edges in the de Bruijn graph that do not have multiple exits or entrances are merged into a single edge, which serves as a compressed edge.