CN104951672B - Assembly method and system combining second- and third-generation genome sequencing data - Google Patents


Publication number
CN104951672B
CN104951672B CN201510346970.1A CN201510346970A CN104951672B CN 104951672 B CN104951672 B CN 104951672B CN 201510346970 A CN201510346970 A CN 201510346970A CN 104951672 B CN104951672 B CN 104951672B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510346970.1A
Other languages
Chinese (zh)
Other versions
CN104951672A (en)
Inventor
卜东波
张仁玉
陈挺
李帅成
孙世伟
刘兴武
许情
郑全刚
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201510346970.1A
Publication of CN104951672A
Application granted
Publication of CN104951672B
Expired - Fee Related
Anticipated expiration

Abstract

The present invention relates to the fields of bioinformatics and computational biology, and more particularly to an assembly method and system combining second- and third-generation genome sequencing data. The method comprises: obtaining second-generation genome sequencing data, preprocessing the data using the quality information of the partial base sequences (reads) in it, and building a de Bruijn graph; performing sequencing-error processing on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of each compressed edge in the compressed de Bruijn graph; and obtaining third-generation genome sequencing data, mapping them onto the single-molecule graph (gapped fragments) of the second-generation genome sequencing data, resolving the compressed de Bruijn graph using an optimal arrangement, and filling the gaps within the optimal arrangement, thereby completing the assembly of the genome sequencing data.

Description

Assembly method and system combining second- and third-generation genome sequencing data
Technical field
The present invention relates to the fields of bioinformatics and computational biology, and more particularly to an assembly method and system combining second- and third-generation genome sequencing data.
Background art
The genome is the complete hereditary information carried in an organism's DNA (RNA for some viruses). DNA is a complementary double strand composed of four bases, A, C, T and G. According to the "central dogma" of biology, the DNA base sequence directs RNA transcription and, in turn, protein translation. Knowing the DNA base sequence is therefore an important foundation for understanding the laws of biology. Sequencing technology yields partial base sequences (reads) of the DNA, which are assembled into a complete genome sequence for further analysis and research.
DNA sequencing technology has gone through three main stages: first-generation, second-generation and third-generation sequencing. First-generation sequencing is the dideoxy chain-termination method invented by Sanger in 1977; using improved Sanger sequencing, researchers completed almost all of the sequencing for the Human Genome Project (HGP, 1995~2003). Second-generation sequencing emerged at the beginning of the 21st century, represented by the new-generation sequencers (i.e. second-generation sequencers) released in succession by 454, Illumina and ABI. These sequencers run a large number of sequencing reactions in parallel, greatly reducing sequencing time and cost. Compared with traditional sequencing methods, the notable advantage of second-generation sequencing is high throughput; for example, a single run of the SOLiD3 sequencer yields 20 GB of sequencing data. Its drawback is that the reads are much shorter than those of Sanger sequencing: Sanger reads can reach 900 bp, whereas 454 reads are 250-400 bp and Solexa reads are 50-75 bp. Short reads make it hard for assembly algorithms to resolve repeat regions, which fragments the assembly; in addition, the error rate of second-generation sequencing is higher. Third-generation sequencing began in 2008 and is characterized by a "single-molecule sequencing" strategy. It mainly comprises the HeliScope single-molecule sequencing technology of BioScience, the single-molecule real-time sequencing technology of Pacific Biosciences, and the nanopore single-molecule sequencing technology of Oxford Nanopore Technology Ltd. The notable feature of single-molecule sequencing is that the sample is no longer amplified, which ensures as far as possible that the sequencing data (reads) cover the genome uniformly. Single-molecule reads are 3 kb~20 kb long; their potential advantage is the ability to resolve long repeats, and their drawback is a high read error rate (about 5%~15%).
Whether first-generation Sanger sequencing or second- or third-generation sequencing is used, only a short fragment of the DNA can be "read" at a time; the genome cannot be read completely from beginning to end in a single run. The short fragments therefore have to be assembled into a complete genome, a process called de novo sequence assembly (De Novo assembly).
Common assembly strategies for third-generation sequencing data include:
The hybrid assembly strategy of the AHA assembler: third-generation data are first aligned to the contigs assembled from second-generation data, and these third-generation reads are then used as links to produce a scaffold graph. Combining sequence data from Illumina, Roche 454 and PacBio, it performs scaffolding, overlap-layout-consensus and error processing, and finally produces a complete genome. Its drawback is that aligning third-generation data to a complete genome is relatively accurate, but alignment accuracy drops on relatively short contigs.
The hybrid assembly strategy of the SSPACE-LongRead assembler: it iteratively assembles the contigs already produced, but performs scaffolding in a fast and reliable way. As with AHA, its drawback is that aligning third-generation data to a complete genome is relatively accurate, but alignment accuracy drops on relatively short contigs.
The hybrid assembly strategy of the PBcR assembler: to exploit the potential of de novo assembly, one approach is to use short, high-accuracy sequences to correct long single-molecule sequences. PBcR (PacBio corrected Reads), a part of the Celera assembler, maps short reads onto each single long read and builds a high-accuracy consensus from the short reads to trim and correct that long read. The corrected hybrid reads are then assembled de novo on their own or together with other data. Its drawback is that error correction requires a large amount of computing resources.
The assembly strategy of the HGAP (Hierarchical Genome-assembly Process) assembler: it uses a single long-insert shotgun DNA library together with single-molecule real-time (SMRT) DNA sequencing to perform high-quality de novo assembly of microbial genome sequences. HGAP uses the longest reads as seeds to recruit all other reads, pre-assembles the reads through a consensus procedure based on a directed acyclic graph, and then assembles them with an off-the-shelf long-read assembler. Unlike hybrid strategies, HGAP needs no high-accuracy reads for error correction. Its drawback is that high-quality results require very high sequencing depth, which raises sequencing cost.
Correcting third-generation sequencing data with second-generation data consumes very large computing resources, because both data sets are very large. When contigs formed from second-generation data are iteratively resolved with PacBio data, long repeats still remain embedded in them and are difficult to resolve.
On the other hand, assembling directly from third-generation data requires a great deal of time for self-correction; moreover, good assembly quality requires a sufficiently high sequencing depth, which significantly increases the cost of the experiment.
It is generally held that, when sequencing depth is not very high, CLR (continuous long reads) cannot be used for high-quality assembly. Chin et al. proposed a new non-hybrid approach, HGAP, which completes bacterial genome assembly with CLR alone. However, a sequencing depth of 50× is needed for error correction, a higher depth is needed to span repeat regions, and manual intervention is also needed for error correction. In terms of sequencing cost, assembling a single genome this way is relatively expensive, particularly for eukaryotes.
At present, joint assembly methods attempt to correct CLR. In principle this is feasible with PacBio CCS or short NGS reads (or a mix of both). Several methods that use second- and third-generation data to increase assembly length have been proposed, and these further incorporate hybrid assembly strategies, such as Celera, MIRA and ALLPATHS-LG. Although they achieve good results, error correction with second-generation data requires longer reads (> 75 bp), higher sequencing depth and more computing resources. The PacBioToCA error-correction pipeline also supports non-hybrid PacBio assembly.
In scaffolding, the AHA strategy is the most commonly used. In this strategy CLR is used only to scaffold the contigs assembled from second-generation data; it usually produces incomplete assemblies and is not suitable for large genomes. Recently, Cerulean was released as a new hybrid assembly tool; it produces scaffolds from the contig graph produced by ABySS and from uncorrected CLR. Although it gives good results, Cerulean requires contigs produced by ABySS, whereas other assemblers may produce better assemblies. Finally, some software has been developed for filling gaps in scaffolds with PacBio reads, PBJelly among them. Owing to the limited read length of second-generation data and the error rate of third-generation data, completely assembling prokaryotic and eukaryotic genomes remains difficult.
Summary of the invention
In view of the shortcomings of the prior art, the present invention proposes an assembly method and system combining second- and third-generation genome sequencing data.
The present invention proposes an assembly method combining second- and third-generation genome sequencing data, comprising:
Step 1: obtaining second-generation genome sequencing data, preprocessing the data using the quality information of the partial base sequences (reads) in it, and building a de Bruijn graph;
Step 2: performing sequencing-error processing on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of each compressed edge in the compressed de Bruijn graph;
Step 3: obtaining third-generation genome sequencing data, mapping them onto the single-molecule graph (gapped fragments) of the second-generation genome sequencing data, resolving the compressed de Bruijn graph using an optimal arrangement, and filling the gaps within the optimal arrangement, thereby completing the assembly of the genome sequencing data.
In the assembly method combining second- and third-generation genome sequencing data, the preprocessing of the second-generation genome sequencing data in step 1 comprises deleting low-quality reads to generate new reads, and breaking the new reads into k-mers of equal length.
In the assembly method combining second- and third-generation genome sequencing data, step 1 further comprises generating k-mers according to the k-mer length k input by the user, storing them in a hash table, and recording the number of occurrences of each k-mer.
In the assembly method combining second- and third-generation genome sequencing data, in step 2 the edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
In the assembly method combining second- and third-generation genome sequencing data, step 3 further comprises generating distance estimates between the gapped fragments of the single-molecule graph, and solving a linear program to obtain the globally optimal arrangement.
In the assembly method combining second- and third-generation genome sequencing data, step 3 further comprises deleting the erroneous links of the single-molecule graph, marking the repeats of the single-molecule graph, and linearizing it.
The present invention also proposes an assembly system combining second- and third-generation genome sequencing data, comprising:
A de Bruijn graph construction module, configured to obtain second-generation genome sequencing data, preprocess the data using the quality information of the partial base sequences (reads) in it, and build a de Bruijn graph;
A compressed de Bruijn graph generation module, configured to perform sequencing-error processing on the de Bruijn graph to generate a new de Bruijn graph, compress the new de Bruijn graph to generate a compressed de Bruijn graph, and obtain the sequence multiplicity of each compressed edge in the compressed de Bruijn graph;
An assembly module, configured to obtain third-generation genome sequencing data, map them onto the single-molecule graph (gapped fragments) of the second-generation genome sequencing data, resolve the compressed de Bruijn graph using an optimal arrangement, and fill the gaps within the optimal arrangement, thereby completing the assembly of the genome sequencing data.
In the assembly system combining second- and third-generation genome sequencing data, the preprocessing of the second-generation genome sequencing data in the de Bruijn graph construction module comprises deleting low-quality reads to generate new reads, and breaking the new reads into k-mers of equal length.
In the assembly system combining second- and third-generation genome sequencing data, the de Bruijn graph construction module further generates k-mers according to the k-mer length k input by the user, stores them in a hash table, and records the number of occurrences of each k-mer.
In the assembly system combining second- and third-generation genome sequencing data, the compressed de Bruijn graph generation module merges the edges of the de Bruijn graph that have no multiple exits or entrances into a single edge, which serves as a compressed edge.
The overall technical effects of the present invention are as follows:
The criteria for assessing assembly quality are assembly length and assembly error rate. Assembly length is usually measured by the longest contig length, contig N50 and contig N90. Using GAGE (Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms (vol 22, pg 557, 2012). Genome Research, 2012. 22(6): p. 1196-1196), four kinds of assembly errors are compared: indels, inversions, translocations and relocations. We compare the performance of ARCS23 with that of SSPACE-LongRead.
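The contig N50 and N90 metrics named above can be computed as in the following minimal sketch. The function name `nxx` and the example contig lengths are hypothetical and are not taken from the patent or from GAGE; the definition used is the usual one (the contig length at which the running sum of lengths, sorted from longest to shortest, first reaches the given percentage of the total).

```python
def nxx(contig_lengths, percent):
    """Return the Nxx value (e.g. percent=50 for N50).

    Contigs are sorted from longest to shortest; the result is the length
    of the contig at which the running total first reaches `percent` % of
    the summed contig length. Integer arithmetic avoids rounding issues.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length
    return 0  # empty input


# Hypothetical contig lengths in bp, for illustration only.
lengths = [100, 80, 60, 40, 20]
n50 = nxx(lengths, 50)
n90 = nxx(lengths, 90)
```

A longer N50 means that half of the assembled bases lie in contigs at least that long, which is why it is used here as a measure of assembly length.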
Table 4.1: Comparison of the ARCS23 and SSPACE-LongRead assembly results on E. coli
The results show that both ARCS23 and SSPACE-LongRead achieve relatively good assemblies; the ARCS23 assembly is longer and nearly completes the whole genome.
We ran the GAGE evaluation software and counted four kinds of errors: indels, i.e. crossings between different scaffolds; inversions, in which a scaffold reverses the orientation of DNA within a chromosome; translocations, in which a scaffold maps to different chromosomes; and relocations, i.e. insertions or deletions in contigs longer than 200 bp.
Table 4.2: Counts of the four error types reported by GAGE
Assembler          #indels   #inversions   #translocations   #relocations
SSPACE-LongRead    0         33            0                 1
ARCS23             2         12            0                 0
Brief description of the drawings
Fig. 1 shows the removal of tips, bubbles and long-range links;
Fig. 2 shows the formation of the CDB(reads, k) graph;
Fig. 3 shows the fitting of the convex function curve;
Fig. 4 shows the gapped fragment graph;
Fig. 5 shows the construction of the linear program;
Fig. 6 shows the search for long-range links;
Fig. 7 shows the search for repeats;
Fig. 8 shows gap filling.
Detailed description of the embodiments
The specific steps of the present invention are as follows:
Step 1: Build a de Bruijn graph from the second-generation sequencing data. Second-generation sequencing data generally carry read quality information. The data are first preprocessed using this quality information to remove low-quality fragments, and the reads are then broken into k-mers of equal length (a k-mer is a nucleotide sequence of length k obtained by cutting a read consecutively, sliding one base at a time) to build the de Bruijn graph. While reading the reads, ARCS23 generates k-mers according to the k-mer length k input by the user, stores them in a hash table, and records the number of times each k-mer occurs. In the SOAPdenovo2 implementation, one byte is used to store a k-mer's occurrence count, so the stored count cannot exceed 255. In second-generation assembly the sequencing depth is generally high, and a k-mer may occur more than 255 times. The present invention needs exact k-mer occurrence counts for error removal and for estimating sequence multiplicity;
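The k-mer counting described in Step 1 can be sketched as follows. This is only an illustration, not ARCS23's code: the function names, the representation of a read as a `(sequence, Phred scores)` pair, and the mean-quality cutoff of 20 are assumptions made for the example, since the text does not state how low-quality fragments are identified. Python integers are unbounded, so the counts are exact even above 255.

```python
from collections import Counter


def mean_quality(qualities):
    """Average per-base quality score of one read."""
    return sum(qualities) / len(qualities)


def count_kmers(reads, k, min_mean_quality=20):
    """Count k-mers over all reads that pass a simple quality filter.

    reads: iterable of (sequence, list of per-base quality scores).
    The filter (mean quality >= min_mean_quality) is an assumed stand-in
    for the preprocessing step; the hash table is a Counter.
    """
    counts = Counter()
    for seq, quals in reads:
        # Skip reads that are too short or of low quality.
        if len(seq) < k or mean_quality(quals) < min_mean_quality:
            continue
        # Slide a window of length k one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical input: 300 copies of one good read plus one low-quality read.
reads = [("ATCGGTCGC", [30] * 9)] * 300 + [("ATCGG", [5] * 5)]
kmer_counts = count_kmers(reads, k=4)
```

Here every k-mer of the good read is counted 300 times, a value that a one-byte counter could not hold.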
Step 2: Sequencing-error processing on the de Bruijn graph. The preprocessed reads still contain many sequencing errors. ARCS23 uses not only k-mer coverage but also the topological properties of the de Bruijn graph to identify sequences caused by sequencing errors. This error-processing step removes obviously erroneous k-mers and reduces the size of the de Bruijn graph. For each h-path in DB(reads, k), the average k-mer occurrence count is computed; if the average for an h-path is below a threshold, the edges representing all k-mers on that h-path are deleted from DB(reads, k). This removes most of the tips and bubbles caused by sequencing errors.
As shown in Fig. 1, (A) contains a bubble in which the edges in the black box are caused by sequencing errors; k-mers produced by random sequencing errors generally have low occurrence counts. In (B), the edges in the black box are tips caused by sequencing errors at the 5' or 3' end of reads. (C) shows long-range links caused by chimeric reads or random sequencing errors.
Random sequencing errors give the k-mers on some h-path of the de Bruijn graph low occurrence counts. In a de Bruijn graph, a tip or bubble is usually caused by one or a few adjacent sequencing errors, so the k-mers on that tip or bubble occur only a few times, and all k-mers of the tip or bubble are deleted directly. Working on whole h-paths in this way also avoids deleting k-mers whose low occurrence counts merely result from low sequencing depth.
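The h-path filter can be illustrated with a minimal sketch. It assumes the h-paths have already been extracted as lists of k-mers; the function name, the data layout and the threshold value are hypothetical. The point it shows is that the decision is taken per h-path on the average count, so a single low-count k-mer inside a well-covered path is not removed.

```python
def filter_h_paths(h_paths, kmer_counts, threshold):
    """Keep only h-paths whose average k-mer occurrence count >= threshold.

    h_paths: list of non-empty lists of k-mers.
    kmer_counts: dict mapping k-mer -> occurrence count.
    """
    kept = []
    for path in h_paths:
        average = sum(kmer_counts.get(kmer, 0) for kmer in path) / len(path)
        if average >= threshold:
            kept.append(path)
    return kept


# Hypothetical counts: the first path is genuine but contains one poorly
# covered k-mer; the second path is a tip produced by a sequencing error.
counts = {"ATCG": 30, "TCGG": 3, "CGGT": 28, "TCGA": 1, "CGAT": 2}
paths = [["ATCG", "TCGG", "CGGT"], ["TCGA", "CGAT"]]
surviving = filter_h_paths(paths, counts, threshold=5)
```

A per-k-mer cutoff of 5 would have deleted `TCGG` and broken the genuine path, whereas the path-level average (about 20.3) keeps it intact.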
Step 3: Form the compressed de Bruijn graph. Edges in the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, called a compressed edge, and the resulting graph is called the compressed de Bruijn graph. The compressed de Bruijn graph greatly reduces the size of the graph while preserving the connectivity between sequences.
The hub nodes of DB(reads, k) become the nodes of CDB(reads, k). If there is an h-path between nodes u and v, an edge is added between u and v and the sequence of the h-path is stored on it. Note that multiple edges may exist between two nodes. In Fig. 2, (A) the sequence is ATCGGTCGC; (B) a k-mer length of 4 is chosen and the de Bruijn graph is formed; (C) the compressed de Bruijn graph is formed.
Forming CDB(reads, k) greatly reduces the size of the graph and removes a large amount of redundant sequence information: each compressed edge stores only the sequence it spells, with no need to store all of its k-mers.
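The compression step can be sketched on the example sequence ATCGGTCGC with k = 4. The code below is an illustrative sketch, not the patent's implementation: it assumes that k-mers are edges between (k-1)-mer nodes, treats every node whose in-degree or out-degree differs from 1 as a hub, ignores reverse complements and k-mer counts, and skips isolated cycles that contain no hub. The function names are hypothetical.

```python
def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer is an edge prefix -> suffix."""
    out_edges, in_degree = {}, {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            out_edges.setdefault(v, set())  # make sure v exists as a node
            if v not in out_edges.setdefault(u, set()):
                out_edges[u].add(v)
                in_degree[v] = in_degree.get(v, 0) + 1
    return out_edges, in_degree


def compress(out_edges, in_degree):
    """Merge every non-branching path between hubs into one compressed edge.

    Returns a list of (start hub, end hub, spelled sequence).
    """
    def is_hub(node):
        return in_degree.get(node, 0) != 1 or len(out_edges[node]) != 1

    compressed = []
    for u in out_edges:
        if not is_hub(u):
            continue
        for v in out_edges[u]:
            sequence = u + v[-1]
            # Follow the unique successor until the next hub is reached.
            while not is_hub(v):
                (successor,) = out_edges[v]
                sequence += successor[-1]
                v = successor
            compressed.append((u, v, sequence))
    return compressed


graph, in_degree = build_de_bruijn(["ATCGGTCGC"], k=4)
compressed_edges = sorted(seq for _, _, seq in compress(graph, in_degree))
```

For this input the node TCG has two entrances and two exits, so it is a hub, and the six k-mer edges collapse into three compressed edges spelling ATCG, TCGGTCG and TCGC.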
Step 4: Estimate the sequence multiplicity of the compressed edges. A compressed edge can be treated as a whole when estimating sequence multiplicity, i.e. all k-mers on a compressed edge occur the same number of times in the genome. The present invention uses the coverage of the k-mers on each compressed edge to design a cost function for the edge, and solves the multiplicity estimation problem for the compressed edges with a minimum-cost flow algorithm.
The present invention estimates the sequence multiplicity of the compressed edges with a maximum-likelihood model, as follows:
Let D denote the circular genome, with length N(D); let d_i denote the sequence multiplicity of the i-th compressed edge, and X_ij the number of occurrences of the k-mer at the j-th position on the i-th compressed edge. In probabilistic terms, the n k-mers are the outcomes of n independent trials. In each trial a position is sampled from D with equal probability, and the outcome is the position of that k-mer on a compressed edge. For given i and j, the probability that a single trial yields this position is [formula not reproduced in the source text]. Taken individually, each random variable follows a binomial distribution; taken jointly, they follow a multinomial distribution:
where g is the number of compressed edges and n_i is the number of k-mer positions on the i-th compressed edge.
In the assembly problem D is unknown, but the outcomes of the n trials are known. Given the n trial outcomes X_ij, the present invention therefore considers the distribution of the d_i, called the likelihood of the global k-mer counts:
Under this strategy, the present invention assembles the genome that maximizes the likelihood of the global k-mer occurrence counts, i.e. minimizes the negative logarithm of this likelihood, -log L. The present invention solves this with a convex-cost flow on a directed graph, which requires -log L to be a separable convex function of the d_i; that is, convex functions c_i must be found such that -log L = ∑ c_i(d_i). Because the multinomial distribution involves the constraint N(D) = ∑ d_i, no such functions can be found.
As the number of trials tends to infinity, the random variables X_ij tend to independence. Because the total number of trials is generally large, the multinomial distribution can be approximated by a product of individual binomial distributions. In the binomial approximation, the genome length N(D) is a constant independent of each d_i and is denoted N. The value of N can be approximated by biological experiment or by an EM strategy; the present invention computes it by [formula not reproduced in the source text].
The approximation of L then becomes:
-log L can now be written as -log L = K ∑ c_i(d_i), where K is a constant independent of all d_i, and:
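The formulas of this derivation are not reproduced in the text. The following is a plausible reconstruction under the standard multinomial sampling model, offered only as a reading aid; it is an assumption and not the patent's own expressions. In particular, the per-position probability d_i / N(D) follows from sampling positions of D uniformly, and the constant is kept additive here, whereas the text writes it as a multiplicative K.

```latex
% Assumed reconstruction, not taken from the patent text.
% Probability that one trial yields position j on compressed edge i:
P(i,j) = \frac{d_i}{N(D)}

% Multinomial likelihood of the observed counts X_{ij}:
L = \frac{n!}{\prod_{i=1}^{g}\prod_{j=1}^{n_i} X_{ij}!}
    \prod_{i=1}^{g}\prod_{j=1}^{n_i} \left(\frac{d_i}{N(D)}\right)^{X_{ij}}

% Binomial approximation with N(D) replaced by the constant N:
L \approx \prod_{i=1}^{g}\prod_{j=1}^{n_i} \binom{n}{X_{ij}}
    \left(\frac{d_i}{N}\right)^{X_{ij}}
    \left(1-\frac{d_i}{N}\right)^{n-X_{ij}}

% Separable convex cost per compressed edge (terms independent of d_i dropped):
c_i(d_i) = -\sum_{j=1}^{n_i}\left[ X_{ij}\log\frac{d_i}{N}
    + \left(n-X_{ij}\right)\log\left(1-\frac{d_i}{N}\right)\right]
```

Under this reading each c_i depends on d_i alone and is convex in d_i, which is the property the convex-cost flow formulation requires.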
As shown in Fig. 3, the function is convex. The present invention fits it with two straight lines, whose slopes represent the cost per unit of flow.
In CDB(reads, k), each compressed edge is split into two edges. The cost of the first is the slope of the left line, which is negative, and its flow range is from 0 to the minimum point of the function; the cost of the second is the slope of the right line, and its flow range is from 0 to infinity. A source s and a sink t are added to the graph. An edge is drawn from the source s to every node, with cost 0 and flow range 0 to infinity; an edge is drawn from every node to the sink t, with cost 0 and flow range 0 to infinity.
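The two-line fit of the convex cost can be sketched as follows. Everything here is an assumption made for illustration: the cost is taken to be a binomial negative log-likelihood (the patent's own cost formula is not reproduced in the text), the two lines are simply the chords from d = 1 to the minimum and from the minimum to a chosen upper bound `d_max`, and all names and numbers are hypothetical.

```python
import math


def edge_cost(d, counts, n, genome_length):
    """Assumed cost of giving a compressed edge multiplicity d.

    counts: observed occurrence counts of the k-mers on the edge.
    n: total number of sampled k-mers; genome_length: the constant N.
    """
    p = d / genome_length
    return -sum(x * math.log(p) + (n - x) * math.log(1 - p) for x in counts)


def two_line_fit(counts, n, genome_length, d_max):
    """Return (d_min, left_slope, right_slope) for the convex cost.

    The left line runs from d = 1 to the minimum, the right line from the
    minimum to d_max; their slopes play the role of per-unit flow costs.
    """
    costs = {d: edge_cost(d, counts, n, genome_length)
             for d in range(1, d_max + 1)}
    d_min = min(costs, key=costs.get)
    left = (costs[d_min] - costs[1]) / (d_min - 1) if d_min > 1 else 0.0
    right = ((costs[d_max] - costs[d_min]) / (d_max - d_min)
             if d_max > d_min else 0.0)
    return d_min, left, right


# Hypothetical edge: coverage is about 30x (n / N), counts are about 60,
# so the most likely multiplicity is 2.
d_min, left_slope, right_slope = two_line_fit(
    counts=[58, 61, 63, 60], n=30000, genome_length=1000, d_max=5)
```

With these numbers the cost is lowest at multiplicity 2, the left slope is negative and the right slope is positive, so a minimum-cost flow is rewarded for pushing flow up to the minimum point and penalized beyond it.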
Step 5: Map the third-generation sequencing data onto the single-molecule graph (gapped fragments) formed from the second-generation data, produce distance estimates between the gapped fragments, and solve a linear program to obtain the globally optimal arrangement.
When building the gapped fragment graph, the compressed edges of CDB(reads, k) whose sequence multiplicity is 1 are chosen as the nodes of the gapped fragment graph. In general, the distances between, and the relative order of, the multiplicity-1 compressed edges in the genome are unique.
Once the compressed edges of multiplicity 1 have been selected, a fairly simple graph can be constructed by mapping pair-kmers: ignoring the effects of sequencing errors and repeats, this gapped fragment graph should be a directed acyclic graph (DAG). If compressed edges whose multiplicity is not 1 were added, the relative distances between all compressed edges could be preserved as far as possible, but the gapped fragment graph would become extremely complex.
As shown in Fig. 4, (a) the genome has 5 sequence segments in total; sequence B has multiplicity 2 and the others have multiplicity 1. (b) The compressed de Bruijn graph is formed, with 4 compressed edges in total, and the pair-kmers are mapped onto these four edges. (c) If all 4 compressed edges are kept in the gapped fragment graph, the resulting graph is extremely complex. (d) If the gapped fragment graph keeps only the compressed edges of multiplicity 1, the graph is relatively simple.
Choosing only the compressed edges of multiplicity 1 as nodes of the gapped fragment graph retains enough information to extend sequences. Repeats in the genome that are shorter than the insert size are flanked by compressed edges of multiplicity 1, and pair-kmers can link the compressed edges before and after them. Long repeats whose length exceeds the insert size are difficult to assemble.
For any two nodes u and v, if there is a pair-kmer whose two ends lie on u and v respectively, there is an edge u → v. ARCS23 reads in the pair-ends again, forms pair-kmers, maps them onto the compressed edges, and stores the number of mapped pair-kmers and their distance information.
For two compressed edges C_1 and C_2 of lengths l_1 and l_2, let d denote the distance between the two compressed edges:
Here r denotes the length of the read at each end, and a total of n pair-kmers map to C_1 and C_2. The i-th pair-kmer maps to positions d_1i and d_2i, and f denotes the empirical distribution of the insert size. The most likely distance d can be obtained with the EM algorithm.
The present invention lets x_i denote the relative position of compressed edge C_i and formalizes the problem as a linear program:
min ∑ |e_ij|
s.t. x_j - x_i + e_ij = d_ij
|e_ij| ≤ E_ij
Here d_ij denotes the estimated distance between compressed edges C_i and C_j, and e_ij denotes the deviation between d_ij and x_j - x_i. The objective is to minimize the sum of the deviations.
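Because the objective involves absolute deviations, the problem is usually handed to a linear-programming solver in a linearized form. The following is the standard textbook transformation, shown as a sketch; the auxiliary variables p_ij and q_ij are introduced here for illustration and do not appear in the patent.

```latex
% Standard linearization of the absolute value, with e_{ij} = p_{ij} - q_{ij}.
\begin{aligned}
\min_{x,\,p,\,q} \quad & \sum_{(i,j)} \left( p_{ij} + q_{ij} \right) \\
\text{s.t.} \quad & x_j - x_i + p_{ij} - q_{ij} = d_{ij}, \\
                  & p_{ij} + q_{ij} \le E_{ij}, \\
                  & p_{ij} \ge 0, \quad q_{ij} \ge 0 .
\end{aligned}
```

At an optimum at most one of p_ij and q_ij is non-zero, so p_ij + q_ij equals |e_ij| and the two formulations have the same optimal value.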
Step 6: In this optimal arrangement, conflicts between the positions of some compressed edges may still exist. Starting from these position conflicts, search for and remove the erroneous links of the single-molecule graph, remove the erroneous links of the gapped fragment graph, mark the repeats of the gapped fragment graph, and linearize it.
The pair-kmer graph contains many edges caused by chimeric reads or sequencing errors. If a fixed threshold is used to delete the edges supported by few pair-kmers, a larger threshold deletes more erroneous edges but also deletes some edges that have few pair-kmers only because the compressed edges at their two ends are short; a smaller threshold leaves many erroneous edges undeleted.
In ARCS23, the present invention uses a variable threshold to delete the erroneous edges in the pair-kmer graph. For two compressed edges C_1 and C_2 of lengths l_1 and l_2, let d denote the distance between them. Relative to the total number of pair-kmers, the number whose left and right ends fall on C_1 and C_2 is small. This number is represented by the random variable X_12, which follows a Poisson distribution whose parameter λ can be estimated as follows:
where e is the average number of k-mers per site. The score of each edge can be expressed as a likelihood; for the edge C_1 → C_2, its quality is:
The quality of every edge in the pair-kmer graph is computed in this way, and the edges of relatively low quality are deleted.
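The idea of scoring each edge by a Poisson likelihood, rather than by a fixed count threshold, can be sketched as follows. The patent's formulas for λ and for the edge quality are not reproduced in the text, so this sketch assumes that an expected count λ is already available for each edge and uses the Poisson log-probability of the observed count as the quality; the function names, the cutoff and the numbers are hypothetical.

```python
import math


def poisson_log_pmf(x, lam):
    """Log-probability of observing x events under Poisson(lam), lam > 0."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def filter_edges(edges, min_log_quality):
    """Keep edges whose assumed quality is at least min_log_quality.

    edges: dict mapping (u, v) -> (observed pair-kmer count, expected count).
    """
    return {edge: (x, lam) for edge, (x, lam) in edges.items()
            if poisson_log_pmf(x, lam) >= min_log_quality}


# Hypothetical edges: long edges where about 20 pair-kmers are expected,
# and one pair of short edges where only about 3 are expected.
edges = {
    ("C1", "C2"): (18, 20.0),  # well supported
    ("C1", "C7"): (1, 20.0),   # far fewer than expected: likely erroneous
    ("C3", "C4"): (2, 3.0),    # few pair-kmers, but few were expected
}
kept = filter_edges(edges, min_log_quality=-10.0)
```

The edge between the short compressed edges C3 and C4 survives although it has only 2 supporting pair-kmers, while the edge C1 → C7 is dropped; a fixed count threshold could not separate these two cases.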
Because multiplicity estimation may be wrong for some compressed edges, the present invention may place the compressed edges of some repeat sequences on the pair-kmer graph. Some sequencing errors or chimeric reads connect compressed edges that are far apart on the genome, or even on complementary strands. This also causes conflicts in the positions of the compressed edges obtained from the linear programming solution, as shown in Fig. 5.
In Fig. 5, such an edge appears only on the path between the two conflicting compressed edges. ARCS23 finds the edge with the lowest quality on the path between the conflicting compressed edges. If the quality of this edge is significantly lower than the quality of all surrounding edges, the edge is deleted.
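A minimal Python sketch of this rule is given below. It assumes that quality scores are positive numbers with higher meaning better, and it interprets "significantly lower" as "less than a fixed fraction of the weakest surrounding edge". Both are assumptions of this sketch, and the function name and the `ratio` parameter are hypothetical.

```python
def weakest_edge_to_delete(path_edges, quality, neighbor_edges, ratio=0.1):
    """Return the lowest-quality edge on the path if it is far weaker than nearby edges.

    path_edges: edges on the path between the two conflicting compressed edges.
    quality: dict mapping edge -> positive quality score (assumed scale).
    neighbor_edges: edges surrounding the path, used as the comparison set.
    Returns None when no edge stands out as clearly weaker.
    """
    weakest = min(path_edges, key=lambda e: quality[e])
    others = [quality[e] for e in neighbor_edges if e != weakest]
    if others and quality[weakest] < ratio * min(others):
        return weakest
    return None

quality = {("4", "5"): 30.0, ("5", "6"): 1.0, ("6", "9"): 25.0}
path = list(quality)
to_delete = weakest_edge_to_delete(path, quality, path)
```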
A compressed edge that represents a repeat sequence corresponds to several different positions on the genome. This causes the positions of the compressed edges obtained from the linear programming solution to conflict with one another, as shown in Fig. 6.
In Fig. 7, compressed edge 2 is the compressed edge of an added repeat sequence and is the reason that compressed edges 4 and 9 conflict. All edges in the figure are correct, so their quality is high. In this case, ARCS23 finds the common ancestors or common descendants of the conflicting compressed edges. These nodes are likely to be repeat sequences, and they are deleted.
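The search for common ancestors and common descendants can be illustrated with the Python sketch below. It assumes the pair-kmer graph is given as a successor map over compressed-edge identifiers, and it returns every common ancestor and descendant rather than only the nearest ones. The toy graph is made up for illustration and is not taken from the figures.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following the adjacency lists in `graph`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def repeat_candidates(successors, a, b):
    """Common ancestors and common descendants of two conflicting nodes a and b."""
    # Reverse the graph so that ancestors can be found with the same traversal.
    predecessors = {}
    for src, dsts in successors.items():
        for dst in dsts:
            predecessors.setdefault(dst, []).append(src)
    common_descendants = reachable(successors, a) & reachable(successors, b)
    common_ancestors = reachable(predecessors, a) & reachable(predecessors, b)
    return common_ancestors | common_descendants

# Hypothetical graph: node "2" links to both "4" and "9".
successors = {"2": ["4", "9"], "4": ["5"], "9": ["10"]}
candidates = repeat_candidates(successors, "4", "9")
```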
Step 7: The compressed de Bruijn graph is disassembled using the optimal arrangement, and the gaps in the optimal arrangement are filled. At this point, the present invention has obtained a relatively complex de Bruijn graph DB(reads, k) and a pair-kmer graph that spans a wider range, and contigs can be formed with the aid of the pair-kmer graph. Each weakly connected component of the pair-kmer graph is one contig. Given the relative order of the compressed edges within each connected component of the pair-kmer graph and the pairwise distances between them, the gaps between them are filled using the DB(reads, k) above.
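Grouping compressed edges into contigs by weakly connected component can be sketched in Python with a union-find structure. The node and edge names are hypothetical, and the sketch only groups the nodes; ordering them within a contig would use the positions from the optimal arrangement, which is not shown.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes of a directed graph into components, ignoring edge direction."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

nodes = ["c1", "c2", "c3", "c4", "c5"]
edges = [("c1", "c2"), ("c3", "c2"), ("c4", "c5")]
contigs = weakly_connected_components(nodes, edges)
```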
Most of these gaps either consist of compressed edges produced by repeat sequences, or arise because low sequencing depth leaves the region uncovered by any reads. For any edge u → v in the pair-kmer graph, DB(reads, k) is searched for paths from compressed edge u to compressed edge v whose length is close to the estimated distance:
1) If there is only one such path, the sequence between u and v is relatively simple, and the gap between u and v is filled directly with the sequence on this path, as in B of Fig. 8.
2) If there is no such path, the sequencing depth of the sequence between u and v is low, and the gap is filled with 'N', as in A of Fig. 8.
3) If there are multiple paths, a scoring function is designed to score these paths, and the highest-scoring path is selected to fill the gap, as in C of Fig. 8.
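The three cases above can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than details from the patent: compressed edges are treated as graph nodes, a path may not visit the same compressed edge twice, sequences are concatenated without removing the k-1 overlap, the number of 'N' characters equals the estimated distance, and the scoring function is supplied by the caller. All names are hypothetical.

```python
def paths_near_distance(graph, edge_len, u, v, target, tolerance):
    """Enumerate intermediate paths from u to v whose total length is near `target`."""
    results = []

    def dfs(node, path, length):
        if length > target + tolerance:
            return  # already too long; lengths are assumed non-negative
        for nxt in graph.get(node, ()):
            if nxt == v:
                if abs(length - target) <= tolerance:
                    results.append(path[:])
            elif nxt != u and nxt not in path:
                path.append(nxt)
                dfs(nxt, path, length + edge_len[nxt])
                path.pop()

    dfs(u, [], 0)
    return results

def fill_gap(graph, edge_len, edge_seq, u, v, target, tolerance, score):
    """Return the filler sequence for the gap between compressed edges u and v."""
    paths = paths_near_distance(graph, edge_len, u, v, target, tolerance)
    if not paths:
        return "N" * target                      # case 2: no supporting path
    best = paths[0] if len(paths) == 1 else max(paths, key=score)  # cases 1 and 3
    return "".join(edge_seq[e] for e in best)

graph = {"u": ["a", "b"], "a": ["v"], "b": ["v"]}
edge_len = {"a": 4, "b": 4}
edge_seq = {"a": "ACGT", "b": "TTTT"}
prefer_b = lambda path: 1 if path == ["b"] else 0
filled = fill_gap(graph, edge_len, edge_seq, "u", "v", 4, 1, prefer_b)
```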
The present invention also proposes a splicing system that combines second-generation and third-generation genome sequencing data, including:
A de Bruijn graph construction module, for obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data using the quality information of the reads (partial base sequences) in the second-generation genome sequencing data, and constructing a de Bruijn graph;
A compressed de Bruijn graph generation module, for performing sequencing error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;
A splicing module, for obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the single-molecule graph gapped fragments of the second-generation genome sequencing data, disassembling the compressed de Bruijn graph using the optimal arrangement, and filling the gaps in the optimal arrangement, so as to complete the splicing of the genome sequencing data.
In the de Bruijn graph construction module, the step of preprocessing the second-generation genome sequencing data includes deleting low-quality reads to generate new reads, and breaking the new reads into kmers of equal length.
The de Bruijn graph construction module also generates kmers according to the kmer length k input by the user, stores them in a hash table, and records the number of occurrences of each kmer.
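The preprocessing and kmer counting can be illustrated with a short Python sketch. The text only states that low-quality reads are deleted, so the mean-quality cutoff used here is an assumption, as are the function name and the input layout (a sequence paired with per-base quality scores). Python's `Counter` plays the role of the hash table.

```python
from collections import Counter

def count_kmers(reads, k, min_mean_quality=20):
    """Count kmers of length k in reads that pass a mean-quality filter.

    reads: iterable of (sequence, list of per-base quality scores).
    min_mean_quality: hypothetical cutoff; the actual criterion is not specified.
    """
    counts = Counter()
    for seq, quals in reads:
        if not quals or sum(quals) / len(quals) < min_mean_quality:
            continue  # drop the whole low-quality read
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

reads = [("ACGACG", [30] * 6), ("GGGGGG", [5] * 6)]
kmer_counts = count_kmers(reads, k=3)
```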
In the compressed de Bruijn graph generation module, edges in the de Bruijn graph that do not have multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
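Merging non-branching edges into compressed edges can be sketched in Python as below. The sketch assumes that nodes are (k-1)-mers and edges are kmers, ignores reverse complements, and skips isolated cycles in which every node is non-branching. These are simplifications of this illustration, and the function name is hypothetical.

```python
from collections import defaultdict

def compress(kmers):
    """Merge kmers along non-branching paths and return the compressed-edge sequences."""
    out_edges = defaultdict(list)   # (k-1)-mer prefix -> kmers leaving that node
    in_deg = defaultdict(int)       # (k-1)-mer suffix -> number of kmers entering
    for kmer in sorted(set(kmers)):
        out_edges[kmer[:-1]].append(kmer)
        in_deg[kmer[1:]] += 1

    def is_simple(node):
        # Exactly one entrance and one exit: the node can be merged through.
        return in_deg[node] == 1 and len(out_edges[node]) == 1

    compressed = []
    for node in list(out_edges):
        if is_simple(node):
            continue  # compressed edges start only at branching or boundary nodes
        for kmer in out_edges[node]:
            seq, nxt = kmer, kmer[1:]
            while is_simple(nxt):
                step = out_edges[nxt][0]
                seq += step[-1]
                nxt = step[1:]
            compressed.append(seq)
    return sorted(compressed)

# kmers (k = 3) from two toy sequences that share the prefix "ACGT"
compressed_edges = compress(["ACG", "CGT", "GTT", "GTA"])
```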

Claims (10)

1. A splicing method combining second-generation and third-generation genome sequencing data, characterized by comprising:
Step 1, obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data using the quality information of the reads (partial base sequences) in the second-generation genome sequencing data, and constructing a de Bruijn graph;
Step 2, performing sequencing error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;
Step 3, obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the single-molecule graph gapped fragments of the second-generation genome sequencing data, disassembling the compressed de Bruijn graph using the optimal arrangement, and filling the gaps in the optimal arrangement, so as to complete the splicing of the genome sequencing data.
2. The splicing method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that the step of preprocessing the second-generation genome sequencing data in step 1 includes deleting low-quality reads to generate new reads, and breaking the new reads into kmers of equal length.
3. The splicing method combining second-generation and third-generation genome sequencing data according to claim 2, characterized in that step 1 further includes generating kmers according to the kmer length k input by the user, storing them in a hash table, and recording the number of occurrences of each kmer.
4. The splicing method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that, in step 2, edges in the de Bruijn graph that do not have multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
5. The splicing method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that step 3 further includes generating distance estimates between the single-molecule graph gapped fragments, and solving a linear program to obtain the globally optimal arrangement.
6. The splicing method combining second-generation and third-generation genome sequencing data according to claim 1, characterized in that step 3 further includes deleting incorrect links of the single-molecule graph, marking the repeat sequences of the single-molecule graph, and performing linearization.
7. A splicing system combining second-generation and third-generation genome sequencing data, characterized by comprising:
A de Bruijn graph construction module, for obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data using the quality information of the reads (partial base sequences) in the second-generation genome sequencing data, and constructing a de Bruijn graph;
A compressed de Bruijn graph generation module, for performing sequencing error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;
A splicing module, for obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the single-molecule graph gapped fragments of the second-generation genome sequencing data, disassembling the compressed de Bruijn graph using the optimal arrangement, and filling the gaps in the optimal arrangement, so as to complete the splicing of the genome sequencing data.
8. The splicing system combining second-generation and third-generation genome sequencing data according to claim 7, characterized in that the step of preprocessing the second-generation genome sequencing data in the de Bruijn graph construction module includes deleting low-quality reads to generate new reads, and breaking the new reads into kmers of equal length.
9. The splicing system combining second-generation and third-generation genome sequencing data according to claim 8, characterized in that the de Bruijn graph construction module further generates kmers according to the kmer length k input by the user, stores them in a hash table, and records the number of occurrences of each kmer.
10. The splicing system combining second-generation and third-generation genome sequencing data according to claim 7, characterized in that, in the compressed de Bruijn graph generation module, edges in the de Bruijn graph that do not have multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
CN201510346970.1A 2015-06-19 2015-06-19 Joining method and system associated with a kind of second generation, three generations's gene order-checking data Expired - Fee Related CN104951672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510346970.1A CN104951672B (en) 2015-06-19 2015-06-19 Joining method and system associated with a kind of second generation, three generations's gene order-checking data


Publications (2)

Publication Number Publication Date
CN104951672A CN104951672A (en) 2015-09-30
CN104951672B true CN104951672B (en) 2017-08-29

Family

ID=54166325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510346970.1A Expired - Fee Related CN104951672B (en) 2015-06-19 2015-06-19 Joining method and system associated with a kind of second generation, three generations's gene order-checking data

Country Status (1)

Country Link
CN (1) CN104951672B (en)


Also Published As

Publication number Publication date
CN104951672A (en) 2015-09-30


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170829

Termination date: 20210619
