CN104951672B - Combined assembly method and system for second- and third-generation genome sequencing data - Google Patents
- Publication number: CN104951672B
- Application number: CN201510346970.1A
- Authority: CN (China)
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention relates to bioinformatics and computational biology, and in particular to a combined assembly method and system for second- and third-generation genome sequencing data. The method comprises: obtaining second-generation genome sequencing data, preprocessing the data using the quality information of its partial base sequences (reads), and building a de Bruijn graph; performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new graph into a compressed de Bruijn graph, and obtaining the sequence multiplicity of each compressed edge in the compressed de Bruijn graph; and obtaining third-generation genome sequencing data, mapping it onto the single-molecule gapped-fragment graph of the second-generation data, resolving the compressed de Bruijn graph by means of an optimal arrangement, and filling the gaps in that arrangement, thereby completing the assembly of the genome sequencing data.
Description
Technical field
The invention relates to bioinformatics and computational biology, and in particular to a combined assembly method and system for second- and third-generation genome sequencing data.
Background
A genome is the complete set of genetic information carried in an organism's DNA (or RNA, for some viruses). DNA is a complementary double strand built from four bases, A, C, T and G. According to the central dogma of biology, the DNA base sequence directs the transcription of RNA and, in turn, the translation and synthesis of proteins. Knowing the DNA base sequence is therefore an essential basis for understanding the laws of biology. Sequencing technologies yield partial base sequences (reads) of the DNA, which are assembled into a complete genome sequence for further analysis and research.
DNA sequencing technology has developed through three stages: first-, second- and third-generation sequencing. First-generation sequencing is the dideoxy chain-termination method invented by Sanger in 1977; using an improved version of Sanger sequencing, researchers completed almost all of the sequencing for the Human Genome Project (HGP, 1990-2003). Second-generation sequencing emerged at the beginning of the 21st century, represented by the new-generation sequencers (i.e. second-generation sequencers) released in succession by 454, Illumina and ABI. These instruments run a large number of sequencing reactions in parallel, which greatly reduces sequencing time and cost. Compared with traditional methods, the main advantage of second-generation sequencing is its high throughput: a single run of a SOLiD 3 sequencer, for example, yields 20 GB of sequencing data. Its drawback is that the reads are much shorter than those of Sanger sequencing: Sanger reads can reach 900 bp, whereas 454 reads are 250-400 bp and Solexa reads are 50-75 bp. With such short reads, assembly algorithms have difficulty resolving repetitive regions, so the assembly becomes fragmented; the error rate of second-generation sequencing is also higher. Third-generation sequencing began in 2008 and is characterized by a "single-molecule sequencing" strategy. The main technologies are HeliScope single-molecule sequencing from Helicos BioSciences, single-molecule real-time sequencing from Pacific Biosciences, and nanopore single-molecule sequencing from Oxford Nanopore Technologies Ltd. A notable feature of single-molecule sequencing is that the sample is no longer amplified, which ensures, as far as possible, uniform coverage of the genome by the sequencing data (reads). Single-molecule reads are 3 kb to 20 kb long; their potential advantage is the ability to resolve long repeats, and their drawback is a high read error rate (about 5% to 15%).
Whether first-generation Sanger sequencing or second- or third-generation sequencing is used, each read covers only a short fragment of the DNA; no single run can read a genome completely from end to end. The short fragments must therefore be assembled into a complete genome, a process called de novo sequence assembly.
Common assembly strategies that use third-generation sequencing data include:
The hybrid assembly strategy of the AHA assembler: third-generation data are first aligned to the contigs produced by assembling second-generation data; the third-generation reads are then used as links to build a scaffold graph; and, combining sequence data from Illumina, Roche 454 and PacBio, scaffolding, overlap-layout-consensus and error handling are carried out to produce the complete genome. Its drawback is that third-generation data align relatively accurately to a complete genome, but alignment accuracy drops on relatively short contigs.
The hybrid assembly strategy of the SSPACE-LongRead assembler: it repeatedly and iteratively assembles the contigs already produced, but performs scaffolding in a fast and reliable way. As with AHA, its drawback is that third-generation data align relatively accurately to a complete genome, but alignment accuracy drops on relatively short contigs.
The hybrid assembly strategy of the PBcR assembler: to exploit the potential of long reads for de novo assembly, one approach is to correct the long single-molecule sequences with short, high-accuracy sequences. PBcR (PacBio corrected Reads), part of the Celera Assembler, maps short reads onto each long read and builds a high-accuracy consensus from them to trim and correct the long read. The corrected hybrid reads can be assembled de novo on their own or together with other data. Its drawback is that error correction requires substantial computing resources.
The assembly strategy of HGAP (Hierarchical Genome-Assembly Process): it uses a single long-insert shotgun DNA library together with single-molecule real-time (SMRT) DNA sequencing to produce high-quality de novo assemblies of microbial genomes. HGAP takes the longest reads as seeds, recruits all other reads to them, and preassembles the reads through a consensus procedure based on directed acyclic graphs; the preassembled long reads are then assembled with an off-the-shelf long-read assembler. Unlike the hybrid strategies, HGAP does not need high-accuracy reads for error correction. Its drawback is that a high-quality assembly requires very high sequencing depth, which increases sequencing cost.
Using second-generation data to correct third-generation data consumes substantial computing resources, because both data sets are large. When the contigs built from second-generation data are resolved iteratively with PacBio data, long repeats still remain mixed in and are hard to resolve.
On the other hand, assembling directly from third-generation data requires a great deal of time for self-correction, and a sufficiently high sequencing depth is needed to guarantee a good assembly, which greatly increases the cost of the experiment.
It is generally accepted that, unless sequencing depth is very high, CLRs (continuous long reads) cannot be used for high-quality assembly. Chin et al. proposed HGAP, a new non-hybrid approach that assembles complete bacterial genomes from CLRs alone. However, a depth of about 50× is needed for error correction, still higher depth is needed to span repetitive regions, and manual intervention is required for error correction. From the standpoint of sequencing cost, assembling a single genome in this way is relatively expensive, especially for eukaryotes.
At present, hybrid assembly methods attempt to correct errors in CLRs. In principle this is feasible with PacBio CCS reads, short NGS reads, or a mixture of both. Several methods that use second- and third-generation data together to increase assembly length have been proposed, and hybrid strategies have been added to assemblers such as Celera, MIRA and ALLPATHS-LG. Although these achieve good results, error correction with second-generation data requires longer reads (>75 bp), higher sequencing depth and more computing resources. The PacBioToCA error-correction pipeline also supports non-hybrid PacBio assembly.
For scaffolding, the AHA strategy is the most commonly used. In this strategy, CLRs are used only to scaffold the contigs produced by assembling second-generation data; it generally produces incomplete assemblies and is not suitable for large genomes. More recently, Cerulean was released as a new hybrid assembly tool. It produces scaffolds from the contig graph generated by ABySS and from uncorrected CLRs. Although its results are good, Cerulean requires contigs produced by ABySS, whereas other assemblers may give better assemblies. Finally, some software has been developed for filling gaps in scaffolds with PacBio reads, PBJelly among them. Because of the limited read length of second-generation data and the error rate of third-generation data, assembling prokaryotic and eukaryotic genomes completely remains difficult.
Summary of the invention
In view of the deficiencies of the prior art, the invention provides a combined assembly method and system for second- and third-generation genome sequencing data.
The invention provides a combined assembly method for second- and third-generation genome sequencing data, comprising:
Step 1: obtain second-generation genome sequencing data, preprocess the data using the quality information of its partial base sequences (reads), and build a de Bruijn graph;
Step 2: perform sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compress the new de Bruijn graph into a compressed de Bruijn graph, and obtain the sequence multiplicity of each compressed edge in the compressed de Bruijn graph;
Step 3: obtain third-generation genome sequencing data, map the third-generation data onto the single-molecule gapped-fragment graph of the second-generation data, resolve the compressed de Bruijn graph using the optimal arrangement, and fill the gaps in the optimal arrangement, thereby completing the assembly of the genome sequencing data.
In the combined assembly method, the preprocessing of the second-generation genome sequencing data in step 1 comprises deleting low-quality reads to produce new reads, and breaking the new reads into k-mers of equal length.
In the combined assembly method, step 1 further comprises generating k-mers according to the k-mer length k supplied by the user, storing them in a hash table, and recording the number of occurrences of each k-mer.
In the combined assembly method, in step 2 the edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
In the combined assembly method, step 3 further comprises generating distance estimates between the gapped fragments of the single-molecule graph and solving a linear program to obtain the globally optimal arrangement.
In the combined assembly method, step 3 further comprises deleting erroneous connections from the single-molecule graph, marking the repeats in the single-molecule graph, and linearizing the graph.
The invention also provides a combined assembly system for second- and third-generation genome sequencing data, comprising:
A de Bruijn graph construction module, for obtaining second-generation genome sequencing data, preprocessing the data using the quality information of its partial base sequences (reads), and building a de Bruijn graph;
A compressed de Bruijn graph generation module, for performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph into a compressed de Bruijn graph, and obtaining the sequence multiplicity of each compressed edge in the compressed de Bruijn graph;
An assembly module, for obtaining third-generation genome sequencing data, mapping the third-generation data onto the single-molecule gapped-fragment graph of the second-generation data, resolving the compressed de Bruijn graph using the optimal arrangement, and filling the gaps in the optimal arrangement, thereby completing the assembly of the genome sequencing data.
In the combined assembly system, the preprocessing of the second-generation genome sequencing data in the de Bruijn graph construction module comprises deleting low-quality reads to produce new reads, and breaking the new reads into k-mers of equal length.
In the combined assembly system, the de Bruijn graph construction module further generates k-mers according to the k-mer length k supplied by the user, stores them in a hash table, and records the number of occurrences of each k-mer.
In the combined assembly system, the compressed de Bruijn graph generation module merges the edges of the de Bruijn graph that have no multiple exits or entrances into a single edge, which serves as a compressed edge.
The overall technical effect of the invention is as follows:
Assembly quality is evaluated by assembly length and assembly error rate. Assembly length is usually measured by the length of the longest contig, contig N50 and contig N90. Four types of assembly error are compared using GAGE (Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms (vol 22, pg 557, 2012). Genome Research, 2012. 22(6): p. 1196-1196): indels, inversions, translocations and relocations. We compare the performance of ARCS23 with that of SSPACE-LongRead.
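The contig N50 and N90 statistics mentioned above can be computed as in the following minimal sketch. The function name `contig_nxx` and the example lengths are illustrative assumptions, not part of ARCS23 or GAGE.

```python
def contig_nxx(lengths, fraction=0.5):
    """Return the Nxx statistic: the length L of the contig at which the
    running total, taken from the longest contig downwards, first reaches
    `fraction` of the total assembly length."""
    if not lengths:
        raise ValueError("no contigs")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length

# Hypothetical contig lengths, for illustration only.
lengths = [100, 80, 60, 40, 20]
n50 = contig_nxx(lengths, 0.5)
n90 = contig_nxx(lengths, 0.9)
```

A longer N50 at the same total length indicates that more of the assembly sits in long contigs.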
Table 4.1: Comparison of the ARCS23 and SSPACE-LongRead assembly results on E. coli
The results show that both ARCS23 and SSPACE-LongRead produce good assemblies; the ARCS23 assembly is longer and nearly completes the whole genome.
We ran the GAGE evaluation software and counted four types of error: indels, crossings between different scaffolds; inversions, in which a scaffold switches DNA strand within a chromosome; translocations, in which a scaffold maps to different chromosomes; and relocations, insertions or deletions of contigs longer than 200 bp.
Table 4.2: Error counts for the four error types reported by GAGE
| Assembler | #indels | #inversions | #translocations | #relocations |
| SSPACE-LongRead | 0 | 33 | 0 | 1 |
| ARCS23 | 2 | 12 | 0 | 0 |
Brief description of the drawings
Fig. 1 shows the removal of tips, bubbles and long-range connections;
Fig. 2 shows the formation of the CDB(reads, k) graph;
Fig. 3 shows the fitting of the convex function curve;
Fig. 4 shows the gapped-fragment graph;
Fig. 5 shows the construction of the linear program;
Fig. 6 shows the search for long-range connections;
Fig. 7 shows the search for repeats;
Fig. 8 shows gap filling.
Detailed description
The specific steps of the invention are as follows:
Step 1: Build a de Bruijn graph from the second-generation sequencing data. Second-generation data usually include read quality information. The data are first preprocessed using this quality information to remove low-quality fragments; the reads are then broken into k-mers of equal length (a k-mer is a nucleotide sequence of length k obtained by sliding along a read one base at a time), and a de Bruijn graph is built. While reading in the reads, ARCS23 generates k-mers of the user-specified length k, stores them in a hash table, and records the number of times each k-mer occurs. The SOAPdenovo2 code uses 1 byte for a k-mer's occurrence count, so it can store a count of at most 255. In second-generation assembly the sequencing depth is generally high and k-mer counts can exceed 255; the invention needs exact k-mer counts for error removal and for estimating sequence multiplicity.
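The k-mer counting in Step 1 can be sketched as follows. This is a minimal illustration, not the ARCS23 code: it assumes reads are plain strings, ignores reverse complements and quality filtering, and relies on Python integers being unbounded, so counts do not saturate at 255.

```python
from collections import Counter

def count_kmers(reads, k):
    """Slide a window of width k along each read, one base at a time,
    and count every k-mer in a hash table (Counter is a dict subclass)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical input: 300 identical reads, i.e. deep coverage.
reads = ["ATCGGTCGC"] * 300
counts = count_kmers(reads, 4)
```

Here `counts["ATCG"]` is 300, a value that a single-byte counter could not hold.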
Step 2: Handle sequencing errors in the de Bruijn graph. The preprocessed reads still contain many sequencing errors. ARCS23 uses both k-mer coverage and the topology of the de Bruijn graph to identify sequences caused by sequencing errors. This step removes some clearly erroneous k-mers and reduces the size of the de Bruijn graph. For each h-path in DB(reads, k), the average k-mer occurrence count is computed; if it is below a threshold, the edges representing all k-mers on that h-path are deleted from DB(reads, k). This removes most of the tips and bubbles caused by sequencing errors.
As shown in Fig. 1: (A) contains a bubble, in which the edge in the black box is caused by a sequencing error; k-mers produced by random sequencing errors generally occur few times. (B) The edge in the black box is a tip caused by sequencing errors at the 5' or 3' end of reads. (C) A long-range connection caused by chimeric reads or random sequencing errors.
Random sequencing errors cause low k-mer counts on some h-paths in the de Bruijn graph. A tip or bubble in the graph is often caused by one or a few nearby sequencing errors; the k-mers on such a tip or bubble occur few times, and all k-mers on it are deleted directly. In addition, doing so avoids deleting k-mers whose counts are low merely because of low sequencing depth.
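The coverage test on h-paths described in Step 2 can be sketched as follows. The data layout (a dict from a path id to the k-mers on that path) and the threshold value are assumptions made for illustration; the topological checks for tips and bubbles are not shown.

```python
def remove_low_coverage_paths(hpaths, counts, threshold):
    """Keep only h-paths whose mean k-mer count reaches `threshold`.
    `hpaths` maps a path id to the list of k-mers on that path."""
    kept = {}
    for path_id, kmers in hpaths.items():
        mean_count = sum(counts[kmer] for kmer in kmers) / len(kmers)
        if mean_count >= threshold:
            kept[path_id] = kmers
    return kept

# Hypothetical counts: a well-covered path and a tip left by a sequencing error.
counts = {"ATCG": 30, "TCGG": 28, "TCGA": 1, "CGAT": 2}
hpaths = {"main": ["ATCG", "TCGG"], "tip": ["TCGA", "CGAT"]}
kept = remove_low_coverage_paths(hpaths, counts, threshold=5)
```

Averaging over the whole h-path, rather than thresholding each k-mer separately, means a single low-count k-mer on an otherwise well-covered path does not cause a deletion.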
Step 3: Form the compressed de Bruijn graph. Edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, called a compressed edge, and the resulting graph is called the compressed de Bruijn graph. Compression greatly reduces the size of the graph while preserving the connectivity information between sequences.
The hub nodes of DB(reads, k) become the nodes of CDB(reads, k). If there is an h-path between nodes u and v, an edge is added between u and v and the sequence of the h-path is stored on it; note that there may be multiple edges between two nodes. In Fig. 2: (A) the sequence is ATCGGTCGC; (B) a k-mer length of 4 is chosen and the de Bruijn graph is formed; (C) the compressed de Bruijn graph is formed.
Forming CDB(reads, k) greatly reduces the size of the graph and removes a large amount of redundant sequence information: each compressed edge stores only the sequence it spells, without storing every k-mer.
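The compression of non-branching paths can be sketched on the example sequence ATCGGTCGC with k = 4. This is a minimal sketch under stated assumptions: nodes are taken to be (k-1)-mers and each k-mer an edge (one common de Bruijn convention, which may differ from the patent's), only one strand is considered, and isolated cycles without a hub node are not handled.

```python
from collections import defaultdict

def compress_de_bruijn(kmers):
    """Merge every non-branching path into one compressed edge.
    Nodes are (k-1)-mers; each k-mer is an edge prefix -> suffix."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in sorted(set(kmers)):
        out_edges[kmer[:-1]].append(kmer[1:])
        in_degree[kmer[1:]] += 1

    def is_hub(node):
        # A hub is any node that is not a simple pass-through (1 in, 1 out).
        return not (in_degree[node] == 1 and len(out_edges[node]) == 1)

    compressed = []
    for node in list(out_edges):
        if not is_hub(node):
            continue
        for nxt in out_edges[node]:
            sequence = node + nxt[-1]
            while not is_hub(nxt):          # walk along the h-path
                nxt = out_edges[nxt][0]
                sequence += nxt[-1]
            compressed.append((node, nxt, sequence))
    return compressed

seq, k = "ATCGGTCGC", 4
edges = compress_de_bruijn([seq[i:i + k] for i in range(len(seq) - k + 1)])
spelled = sorted(edge[2] for edge in edges)
```

Six k-mer edges collapse into three compressed edges, each storing only the sequence it spells.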
Step 4: Estimate the sequence multiplicity of the compressed edges. A compressed edge can be treated as a unit for multiplicity estimation, i.e. all k-mers on a compressed edge occur the same number of times in the genome. Using the coverage of the k-mers on each compressed edge, the invention designs a cost function for the edge and solves the multiplicity estimation problem with a minimum-cost flow algorithm.
The invention estimates the sequence multiplicity of the compressed edges with a maximum-likelihood model, as follows:
Let D denote the circular genome, of length N(D); let $d_i$ denote the sequence multiplicity of the i-th compressed edge, and $X_{ij}$ the number of occurrences of the k-mer at the j-th position on the i-th compressed edge. In probabilistic terms, the n k-mers are the outcomes of n independent trials. In each trial a position is sampled from D with equal probability, and the outcome is the position of the k-mer on a compressed edge. Given i and j, the probability that a single trial yields this position is [formula not reproduced in the source]. Considered individually, each random variable follows a binomial distribution; considered jointly, they follow a multinomial distribution:
[formula not reproduced in the source]
where g is the number of compressed edges and $n_i$ is the number of k-mer positions on the i-th compressed edge.
In the assembly problem D is unknown, but the outcomes of the n trials are known. Given the outcomes $X_{ij}$ of the n trials, the invention therefore considers the distribution of the $d_i$, called the global k-mer count likelihood:
[formula not reproduced in the source]
Under this strategy, the invention seeks the genome that maximizes the global k-mer count likelihood, that is, minimizes the negative logarithm of the likelihood, $-\log L$. The invention solves this with a convex-cost flow, which requires $-\log L$ to be a separable convex function of the $d_i$; that is, convex functions $c_i$ must be found such that $-\log L = \sum_i c_i(d_i)$. Because the multinomial distribution carries the constraint $N(D) = \sum_i d_i$, no such functions can be found.
As the number of trials tends to infinity, the random variables $X_{ij}$ tend towards independence. Because the total number of trials is usually large, the invention approximates the multinomial distribution by a product of individual binomial distributions. In the binomial approximation, the genome length N(D) is a constant independent of each $d_i$ and can be written N. The value of N can be approximated by biological experiment or by an EM strategy; the invention computes it by [formula not reproduced in the source].
The approximation of L then becomes:
[formula not reproduced in the source]
The invention can now write $-\log L = K + \sum_i c_i(d_i)$, where K is a constant independent of all $d_i$, and:
[formula not reproduced in the source]
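The formulas referred to in this derivation are not reproduced in the text. One plausible reading, assuming the standard multinomial sampling model that the surrounding definitions describe (each of the N(D) genome positions is sampled with equal probability, so a given position on edge i is drawn with probability $d_i/N(D)$), is sketched below. These expressions are an assumption offered for orientation, not the patent's verbatim formulas.

```latex
% Multinomial model for the counts X_{ij} (assumed form)
P(X = x) = \frac{n!}{\prod_{i=1}^{g}\prod_{j=1}^{n_i} x_{ij}!}
           \prod_{i=1}^{g}\prod_{j=1}^{n_i}\left(\frac{d_i}{N(D)}\right)^{x_{ij}}

% Product-of-binomials approximation with constant N (assumed form)
L \approx \prod_{i=1}^{g}\prod_{j=1}^{n_i}\binom{n}{x_{ij}}
          \left(\frac{d_i}{N}\right)^{x_{ij}}\left(1-\frac{d_i}{N}\right)^{n-x_{ij}}

% Separable cost per compressed edge (assumed form)
c_i(d_i) = -\sum_{j=1}^{n_i}\left[x_{ij}\log\frac{d_i}{N}
           + (n-x_{ij})\log\left(1-\frac{d_i}{N}\right)\right]
```

Under this reading, $K = -\sum_{i,j}\log\binom{n}{x_{ij}}$ collects the terms that do not depend on any $d_i$, and each $c_i$ is convex in $d_i$ on $0 < d_i < N$ because it is a sum of negative logarithms of affine functions.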
As shown in Fig. 3, the function is convex. The invention fits it with two straight lines; the slope of each line represents the cost per unit of flow.
In CDB(reads, k), each compressed edge is split into two edges. The cost of one is the slope of the left line, which is negative, and its flow range is from 0 to the minimum point of the function; the cost of the other is the slope of the right line, and its flow range is from 0 to infinity. A source s and a sink t are added to the graph. An edge is drawn from the source s to every node, with cost 0 and a flow range from 0 to infinity; an edge is drawn from every node to the sink t, with cost 0 and a flow range from 0 to infinity.
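The two-line fit of the convex cost can be sketched as follows. The cost function used here is the binomial negative log-likelihood, which is an assumed form because the formula is not reproduced in the text; the choice of chord endpoints (half and twice the minimum point) is likewise an illustrative assumption, as are all function and parameter names.

```python
import math

def edge_cost(d, x_sum, n_pos, n_trials, genome_len):
    """Assumed cost c_i(d): negative log-likelihood of multiplicity d under
    the binomial approximation; x_sum is the total k-mer count over the
    n_pos positions of the compressed edge."""
    p = d / genome_len
    return -(x_sum * math.log(p)
             + (n_pos * n_trials - x_sum) * math.log(1.0 - p))

def two_line_fit(x_sum, n_pos, n_trials, genome_len):
    """Approximate c_i by two chords that meet at its minimum d_star and
    return them as two parallel flow edges (unit cost, capacity)."""
    d_star = genome_len * x_sum / (n_pos * n_trials)  # where c_i'(d) = 0

    def cost(d):
        return edge_cost(d, x_sum, n_pos, n_trials, genome_len)

    left = (cost(d_star) - cost(d_star / 2)) / (d_star / 2)   # negative slope
    right = (cost(2 * d_star) - cost(d_star)) / d_star        # positive slope
    return d_star, [(left, d_star), (right, math.inf)]

# Hypothetical numbers: 3 positions, 1000 trials, genome length 100.
d_star, flow_edges = two_line_fit(x_sum=60, n_pos=3, n_trials=1000,
                                  genome_len=100)
```

Because the first edge has negative unit cost and capacity equal to the minimum point, a minimum-cost flow is pushed towards the most likely multiplicity and pays the positive slope only for flow beyond it.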
Step 5: Map the third-generation sequencing data onto the single-molecule gapped-fragment graph formed from the second-generation data, form distance estimates between the gapped fragments, and solve a linear program to obtain the globally optimal arrangement.
When building the gapped-fragment graph, the compressed edges of CDB(reads, k) with sequence multiplicity 1 are chosen as its nodes. In general, the distances between, and the relative order of, the multiplicity-1 compressed edges in a genome are unique.
Once the multiplicity-1 compressed edges have been selected, a fairly simple graph can be built by mapping pair-kmers. If the effects of sequencing errors and repeats are ignored, this gapped-fragment graph should be a directed acyclic graph (DAG). If compressed edges with multiplicity other than 1 were added, the relative distances between all compressed edges could be preserved as far as possible, but the gapped-fragment graph would become extremely complex.
As shown in Fig. 4: (a) the genome has 5 sequence segments in total; sequence B has multiplicity 2 and the others have multiplicity 1. (b) The compressed de Bruijn graph is formed, with 4 compressed edges in total, and pair-kmers are mapped onto these four edges. (c) If all 4 compressed edges are kept in the gapped-fragment graph, the resulting graph is extremely complex. (d) If only the multiplicity-1 compressed edges are kept, the graph is relatively simple.
Selecting only the multiplicity-1 compressed edges as the nodes of the gapped-fragment graph preserves enough information to extend sequences. For a repeat shorter than the insert size, the compressed edges immediately before and after it have multiplicity 1, and pair-kmers can link those flanking edges. Repeats longer than the insert size are difficult to assemble.
For any two nodes u and v, if there is a pair-kmer whose two ends lie on u and v respectively, an edge u → v is added. ARCS23 reads in the paired-end reads again, forms pair-kmers, maps them onto the compressed edges, and stores the number of mapped pair-kmers and their distance information.
For two compressed edges $C_1$ and $C_2$ of lengths $l_1$ and $l_2$, let d denote the distance between them:
[formula not reproduced in the source]
Here r is the length of the two ends of a read, a total of n pair-kmers map to $C_1$ and $C_2$, the mapped positions of the i-th pair-kmer are $d_{1i}$ and $d_{2i}$, and f is the empirical distribution of the insert size. The most likely distance d can be obtained with the EM algorithm.
Let $x_i$ denote the relative position of compressed edge $C_i$. The problem is formalized as a linear program:
$\min \sum_{(i,j)} |e_{ij}|$
$\text{s.t.}\quad x_j - x_i + e_{ij} = d_{ij}$
$|e_{ij}| \le E_{ij}$
Here $d_{ij}$ is the estimated distance between compressed edges $C_i$ and $C_j$, and $e_{ij}$ is the deviation between $d_{ij}$ and $x_j - x_i$. The objective is to minimize the sum of the deviations.
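The linear program above can be put into the standard matrix form that LP solvers accept, as in the following sketch. It builds the problem but does not solve it. The split of each deviation into two non-negative parts, $e_{ij} = p_{ij} - q_{ij}$, is a standard device for expressing an absolute value linearly and is an assumption of this sketch, as are the function name and the example distances.

```python
def build_arrangement_lp(num_edges, distances, max_dev):
    """Build: minimize sum(p + q) subject to x_j - x_i + p_ij - q_ij = d_ij,
    with 0 <= p_ij, q_ij <= E_ij, so that e_ij = p_ij - q_ij and
    |e_ij| <= E_ij. `distances` maps (i, j) -> d_ij and `max_dev` maps
    (i, j) -> E_ij. Variable order: x_0..x_{m-1}, then p, q per pair."""
    pairs = sorted(distances)
    num_vars = num_edges + 2 * len(pairs)
    c = [0.0] * num_edges + [1.0] * (2 * len(pairs))
    a_eq, b_eq = [], []
    bounds = [(None, None)] * num_edges          # positions are unbounded
    for row, (i, j) in enumerate(pairs):
        coeffs = [0.0] * num_vars
        coeffs[j], coeffs[i] = 1.0, -1.0
        coeffs[num_edges + 2 * row] = 1.0        # p_ij
        coeffs[num_edges + 2 * row + 1] = -1.0   # q_ij
        a_eq.append(coeffs)
        b_eq.append(float(distances[(i, j)]))
        bounds += [(0.0, max_dev[(i, j)])] * 2
    return c, a_eq, b_eq, bounds

# Hypothetical distance estimates between three compressed edges.
distances = {(0, 1): 100, (1, 2): 50, (0, 2): 160}
max_dev = {pair: 20.0 for pair in distances}
c, a_eq, b_eq, bounds = build_arrangement_lp(3, distances, max_dev)
```

At an optimum at most one of $p_{ij}$ and $q_{ij}$ is non-zero, so $p_{ij} + q_{ij}$ equals $|e_{ij}|$ and the objective matches the sum of absolute deviations.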
Step 6: In this optimal arrangement, conflicts between the positions of some compressed edges may remain. Starting from these position conflicts, erroneous connections in the gapped-fragment graph are searched for and removed, the repeats in the gapped-fragment graph are marked, and the graph is linearized.
The pair-kmer graph contains many edges caused by chimeric reads or sequencing errors. If edges with few pair-kmers are deleted using a fixed threshold, a larger threshold removes more erroneous edges but also removes some correct edges whose pair-kmer count is low only because the compressed edges at the two ends are short; a smaller threshold leaves many erroneous edges in place.
ARCS23 therefore deletes erroneous edges in the pair-kmer graph using a variable threshold. For two compressed edges $C_1$ and $C_2$ of lengths $l_1$ and $l_2$, let d denote the distance between them. Relative to the total number of pair-kmers, few pair-kmers have their left and right ends on $C_1$ and $C_2$. The number of such pair-kmers is represented by the random variable $X_{12}$, which follows a Poisson distribution whose parameter λ can be estimated as follows:
[formula not reproduced in the source]
where e is the average number of k-mers per site. The score of each edge can be expressed as a likelihood; for the edge $C_1 \to C_2$, its quality is
[formula not reproduced in the source]
The quality of every edge in the pair-kmer graph is computed in this way, and edges of relatively low quality are deleted.
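A variable threshold of this kind can be sketched with a Poisson likelihood. The patent's own estimate of λ and its quality formula are not reproduced in the text, so this sketch assumes that an expected count λ is already available for each edge and uses the Poisson log-probability of the observed count as the quality; both choices, and all names and numbers, are assumptions.

```python
import math

def poisson_log_pmf(count, lam):
    """log P(X = count) for X ~ Poisson(lam)."""
    return count * math.log(lam) - lam - math.lgamma(count + 1)

def filter_edges(observed, expected, min_log_quality):
    """Keep edges whose observed pair-kmer count is plausible under the
    Poisson model; both dicts are keyed by (C1, C2)."""
    return {edge: count for edge, count in observed.items()
            if poisson_log_pmf(count, expected[edge]) >= min_log_quality}

# Hypothetical edges: the same low count is acceptable between short edges
# (small expected value) but suspicious between long ones.
observed = {("C1", "C2"): 18, ("C1", "C7"): 2, ("C3", "C4"): 2}
expected = {("C1", "C2"): 20.0, ("C1", "C7"): 20.0, ("C3", "C4"): 3.0}
kept = filter_edges(observed, expected, min_log_quality=-5.0)
```

Two edges with the same observed count of 2 are treated differently because their expected counts differ, which a fixed threshold on the raw count cannot do.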
Because multiplicity estimation may be wrong for some compressed edges, some compressed edges that belong to repeats may be placed in the pair-kmer graph. Some sequencing errors or chimeric reads also connect compressed edges that are far apart on the genome, or even on the complementary strand. Both cause conflicts between the positions of the compressed edges in the solution of the linear program, as in Fig. 5.
In Fig. 5, such an edge appears only on a path between two conflicting compressed edges. ARCS23 finds the lowest-quality edge on the path between the conflicting compressed edges; if its quality is significantly lower than that of all surrounding edges, the edge is deleted.
A compressed edge that represents a repeat occupies different positions on the genome, which makes the positions of the compressed edges in the linear-program solution conflict with each other, as in Fig. 6.
In Fig. 7, compressed edges 4 and 9 conflict because compressed edge 2, which belongs to a repeat, has been added. All edges in the figure are correct, so their quality is high. In this case ARCS23 finds the common ancestors or common descendants of the conflicting compressed edges; these nodes are likely to be repeats and are deleted.
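The search for common ancestors and common descendants of two conflicting nodes can be sketched with two reachability searches. The adjacency-dict representation and the small example graph are assumptions for illustration and do not reproduce Fig. 7.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by depth-first search."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def repeat_candidates(graph, a, b):
    """Common ancestors and common descendants of conflicting nodes a and b."""
    reverse = {}
    for u, targets in graph.items():
        for v in targets:
            reverse.setdefault(v, []).append(u)
    ancestors = reachable(reverse, a) & reachable(reverse, b)
    descendants = reachable(graph, a) & reachable(graph, b)
    return ancestors, descendants

# Hypothetical pair-kmer graph: node 2 precedes both conflicting nodes 4 and 9.
graph = {2: [4, 9], 4: [6], 9: [7]}
ancestors, descendants = repeat_candidates(graph, 4, 9)
```

In this example node 2 is the only common ancestor and would be the candidate repeat to remove.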
Step 7: The compressed de Bruijn graph is resolved using the optimal arrangement, and the gaps between the elements of the optimal arrangement are filled. At this point the invention has obtained a relatively complex de Bruijn graph DB(reads, k) and a pair-kmer graph that spans a wider range, and contigs can be formed with the aid of the pair-kmer graph. Each weakly connected component of the pair-kmer graph is one contig. Given the relative order of the compressed edges inside each connected component of the pair-kmer graph and the distances between each pair of them, the DB(reads, k) above is used to fill the gaps between them.
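As an illustration of treating each weakly connected component as one contig, the following Python sketch groups nodes with a union-find structure that ignores edge direction. The node labels and the function name are hypothetical.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes of a directed graph into weakly connected components."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # direction of the edge is deliberately ignored

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical pair-kmer graph over five compressed edges.
components = weakly_connected_components(
    ["c1", "c2", "c3", "c4", "c5"],
    [("c1", "c2"), ("c3", "c2"), ("c4", "c5")],
)
```

Each returned group would correspond to the set of compressed edges that end up in the same contig.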
Most of these gaps either consist of compressed edges produced by repetitive sequences, or arise because a relatively low sequencing depth left the region without any covering reads. For any edge u → v in the pair-kmer graph, DB(reads, k) is searched for paths from compressed edge u to compressed edge v whose length is close to the estimated distance:
1) If there is only one such path, the sequence between u and v is relatively simple, and the gap between u and v is filled directly with the sequence on this path, as in B of Fig. 8.
2) If there is no such path, the sequencing depth of the sequence from u to v is relatively low, and the gap is filled with 'N', as in A of Fig. 8.
3) If there are multiple paths, a scoring function is designed to score them, and the highest-scoring path is selected to fill the gap, as in C of Fig. 8.
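The three cases can be sketched in Python as follows. This is a simplified illustration under several assumptions: nodes stand for compressed edges, `seqs` maps each node to its sequence, the k-1 overlap between adjacent sequences is ignored, `tol` is a hypothetical tolerance on the distance, and the default scoring function (closeness to the estimated distance) is only a placeholder because the text does not specify one.

```python
def find_paths(graph, seqs, u, v, est, tol, max_paths=100):
    """Simple paths u -> v whose intermediate length is within `tol` of `est`."""
    paths = []

    def dfs(node, inner, length, visited):
        if len(paths) >= max_paths or length > est + tol:
            return  # too many candidates, or already longer than allowed
        for nxt in graph.get(node, ()):
            if nxt == v:
                if abs(length - est) <= tol:
                    paths.append(list(inner))
            elif nxt not in visited:
                visited.add(nxt)
                inner.append(nxt)
                dfs(nxt, inner, length + len(seqs[nxt]), visited)
                inner.pop()
                visited.remove(nxt)

    dfs(u, [], 0, {u})
    return paths

def fill_gap(graph, seqs, u, v, est, tol, score=None):
    """Return the sequence used to fill the gap between u and v."""
    paths = find_paths(graph, seqs, u, v, est, tol)
    if not paths:
        return "N" * est                      # case 2: no path
    if len(paths) == 1:
        best = paths[0]                       # case 1: unique path
    else:                                     # case 3: pick the best-scoring path
        score = score or (lambda p: -abs(sum(len(seqs[n]) for n in p) - est))
        best = max(paths, key=score)
    return "".join(seqs[n] for n in best)

# Hypothetical graph with two alternative routes from u to v.
g = {"u": ["x", "y"], "x": ["v"], "y": ["v"]}
s = {"u": "AAAA", "x": "ACG", "y": "ACGTT", "v": "TTTT"}
unique_fill = fill_gap(g, s, "u", "v", est=3, tol=1)
```

A real scoring function would presumably also use read support along each path; that is outside what this sketch assumes.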
The present invention also proposes an assembly system combining second-generation and third-generation genome sequencing data, including:
A de Bruijn graph building module, for obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data using the quality information of partial base-sequence reads in the second-generation genome sequencing data, and building a de Bruijn graph;
A compressed de Bruijn graph generating module, for performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;
An assembly module, for obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the gapped fragments of the single-molecule graph of the second-generation genome sequencing data, resolving the compressed de Bruijn graph by means of the optimal arrangement, and filling the gaps between the elements of the optimal arrangement, so as to complete the assembly of the genome sequencing data.
In the de Bruijn graph building module, the step of preprocessing the second-generation genome sequencing data includes deleting low-quality partial base-sequence reads, generating new partial base-sequence reads, and breaking the new partial base-sequence reads into kmers of equal length.
The de Bruijn graph building module also generates kmers according to the kmer length k entered by the user, saves them in a hash table, and records the number of occurrences of each kmer.
In the compressed de Bruijn graph generating module, edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
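The merging of unbranched edges can be sketched in Python as follows. The sketch assumes a de Bruijn graph whose nodes are (k-1)-mers and whose edges are kmers, ignores reverse complements, and skips isolated cycles that contain no branching node; the function name is hypothetical.

```python
from collections import defaultdict

def compress_edges(kmers):
    """Merge chains of kmers passing through nodes with exactly one
    incoming and one outgoing edge into single compressed edges."""
    out_edges = defaultdict(list)
    in_deg = defaultdict(int)
    for kmer in sorted(set(kmers)):
        out_edges[kmer[:-1]].append(kmer)  # edge leaves its (k-1)-prefix
        in_deg[kmer[1:]] += 1              # and enters its (k-1)-suffix
    nodes = set(out_edges) | set(in_deg)

    def simple(n):
        return in_deg.get(n, 0) == 1 and len(out_edges.get(n, ())) == 1

    compressed = []
    for node in sorted(nodes):
        if simple(node):
            continue  # compressed edges start only at branching or end nodes
        for kmer in out_edges.get(node, ()):
            seq, nxt = kmer, kmer[1:]
            while simple(nxt):
                kmer = out_edges[nxt][0]
                seq += kmer[-1]            # extend by the one new base
                nxt = kmer[1:]
            compressed.append(seq)
    return sorted(compressed)

# Hypothetical input: the three kmers of "ACGTC" form one unbranched chain.
chain = compress_edges(["ACG", "CGT", "GTC"])
```

When a node has two outgoing edges, the chain stops there and each branch becomes its own compressed edge.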
Claims (10)
1. An assembly method combining second-generation and third-generation genome sequencing data, characterized by including:
Step 1, obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data using the quality information of partial base-sequence reads in the second-generation genome sequencing data, and building a de Bruijn graph;
Step 2, performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;
Step 3, obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the gapped fragments of the single-molecule graph of the second-generation genome sequencing data, resolving the compressed de Bruijn graph by means of the optimal arrangement, and filling the gaps between the elements of the optimal arrangement, so as to complete the assembly of the genome sequencing data.
2. The assembly method combining second-generation and third-generation genome sequencing data as claimed in claim 1, characterized in that the step of preprocessing the second-generation genome sequencing data in step 1 includes deleting low-quality partial base-sequence reads, generating new partial base-sequence reads, and breaking the new partial base-sequence reads into kmers of equal length.
3. The assembly method combining second-generation and third-generation genome sequencing data as claimed in claim 2, characterized in that step 1 also includes generating kmers according to the kmer length k entered by the user, saving them in a hash table, and recording the number of occurrences of each kmer.
4. The assembly method combining second-generation and third-generation genome sequencing data as claimed in claim 1, characterized in that in step 2 edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
5. The assembly method combining second-generation and third-generation genome sequencing data as claimed in claim 1, characterized in that step 3 also includes generating distance estimates between the gapped fragments of the single-molecule graph, and solving a linear program to obtain the globally optimal arrangement.
6. The assembly method combining second-generation and third-generation genome sequencing data as claimed in claim 1, characterized in that step 3 also includes deleting incorrect links of the single-molecule graph, marking the repetitive sequences of the single-molecule graph, and performing linearization.
7. An assembly system combining second-generation and third-generation genome sequencing data, characterized by including:
A de Bruijn graph building module, for obtaining second-generation genome sequencing data, preprocessing the second-generation genome sequencing data using the quality information of partial base-sequence reads in the second-generation genome sequencing data, and building a de Bruijn graph;
A compressed de Bruijn graph generating module, for performing sequencing-error handling on the de Bruijn graph to generate a new de Bruijn graph, compressing the new de Bruijn graph to generate a compressed de Bruijn graph, and obtaining the sequence multiplicity of the compressed edges in the compressed de Bruijn graph;
An assembly module, for obtaining third-generation genome sequencing data, mapping the third-generation genome sequencing data back onto the gapped fragments of the single-molecule graph of the second-generation genome sequencing data, resolving the compressed de Bruijn graph by means of the optimal arrangement, and filling the gaps between the elements of the optimal arrangement, so as to complete the assembly of the genome sequencing data.
8. The assembly system combining second-generation and third-generation genome sequencing data as claimed in claim 7, characterized in that the step of preprocessing the second-generation genome sequencing data in the de Bruijn graph building module includes deleting low-quality partial base-sequence reads, generating new partial base-sequence reads, and breaking the new partial base-sequence reads into kmers of equal length.
9. The assembly system combining second-generation and third-generation genome sequencing data as claimed in claim 8, characterized in that the de Bruijn graph building module also generates kmers according to the kmer length k entered by the user, saves them in a hash table, and records the number of occurrences of each kmer.
10. The assembly system combining second-generation and third-generation genome sequencing data as claimed in claim 7, characterized in that in the compressed de Bruijn graph generating module edges of the de Bruijn graph that have no multiple exits or entrances are merged into a single edge, which serves as a compressed edge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510346970.1A CN104951672B (en) | 2015-06-19 | 2015-06-19 | Assembly method and system combining second-generation and third-generation genome sequencing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104951672A CN104951672A (en) | 2015-09-30 |
CN104951672B true CN104951672B (en) | 2017-08-29 |
Family ID=54166325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510346970.1A Expired - Fee Related CN104951672B (en) | 2015-06-19 | 2015-06-19 | Assembly method and system combining second-generation and third-generation genome sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104951672B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106022002B (en) * | 2016-05-17 | 2019-03-29 | 杭州和壹基因科技有限公司 | A kind of filling-up hole method based on three generations's PacBio sequencing data |
CN106021997B (en) * | 2016-05-17 | 2019-03-29 | 杭州和壹基因科技有限公司 | A kind of comparison method of three generations PacBio sequencing data |
CN106022003B (en) * | 2016-05-17 | 2019-03-29 | 杭州和壹基因科技有限公司 | A kind of scaffold construction method based on three generations's PacBio sequencing data |
CN106021985B (en) * | 2016-05-17 | 2019-03-29 | 杭州和壹基因科技有限公司 | A kind of genomic data compression method |
WO2018000174A1 (en) * | 2016-06-28 | 2018-01-04 | 深圳大学 | Rapid and parallelstorage-oriented dna sequence matching method and system thereof |
CN107784198B (en) * | 2016-08-26 | 2021-06-15 | 深圳华大基因科技服务有限公司 | Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence |
CN107841542A (en) * | 2016-09-19 | 2018-03-27 | 深圳华大基因科技服务有限公司 | A kind of generation sequence assemble method of genome contig two and system |
CN108460245B (en) * | 2017-02-21 | 2020-11-06 | 深圳华大基因科技服务有限公司 | Method and apparatus for optimizing second generation assembly results using third generation sequences |
CN108573127B (en) * | 2017-03-14 | 2021-04-27 | 深圳华大基因科技服务有限公司 | Processing method and application of original data of third-generation nucleic acid sequencing |
CN108629156B (en) * | 2017-03-21 | 2020-08-28 | 深圳华大基因科技服务有限公司 | Method, device and computer readable storage medium for correcting error of third generation sequencing data |
CN110313033A (en) * | 2017-04-01 | 2019-10-08 | 深圳华大基因科技服务有限公司 | Two generation sequences of one kind and the united assemble method of three generations's sequence gene group and system |
CN107229842A (en) * | 2017-06-02 | 2017-10-03 | 肖传乐 | A kind of three generations's sequencing sequence bearing calibration based on Local map |
CN107256335A (en) * | 2017-06-02 | 2017-10-17 | 肖传乐 | A kind of preferred three generations's sequencing sequence comparison method of being given a mark based on global seed |
CN107944221B (en) * | 2017-11-21 | 2020-12-29 | 南京溯远基因科技有限公司 | Splicing algorithm for parallel separation of nucleic acid fragments and application thereof |
CN108897986B (en) * | 2018-05-29 | 2020-11-27 | 中南大学 | Genome sequence splicing method based on protein information |
CN108830047A (en) * | 2018-06-21 | 2018-11-16 | 河南理工大学 | A kind of scaffolding method based on long reading and contig classification |
CN109192246B (en) * | 2018-06-22 | 2020-10-16 | 深圳市达仁基因科技有限公司 | Method, apparatus and storage medium for detecting chromosomal copy number abnormalities |
CN109658985B (en) * | 2018-12-25 | 2020-07-17 | 人和未来生物科技(长沙)有限公司 | Redundancy removal optimization method and system for gene reference sequence |
CN110016498B (en) * | 2019-04-24 | 2020-05-08 | 北京诺赛基因组研究中心有限公司 | Method for determining single nucleotide polymorphism in Sanger method sequencing |
CN110379462B (en) * | 2019-06-21 | 2021-11-26 | 中南民族大学 | Method for assembling Chinese Jinyao chloroplast genome sequence based on Illumina technology |
US11515011B2 (en) | 2019-08-09 | 2022-11-29 | International Business Machines Corporation | K-mer based genomic reference data compression |
CN112802554B (en) * | 2021-01-28 | 2023-09-22 | 中国科学院成都生物研究所 | Animal mitochondrial genome assembly method based on second-generation data |
CN115620810B (en) * | 2022-12-19 | 2023-03-28 | 北京诺禾致源科技股份有限公司 | Method and device for detecting exogenous insertion information based on third-generation gene sequencing data |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103258145B (en) * | 2012-12-22 | 2016-06-29 | 中国科学院深圳先进技术研究院 | A kind of parallel gene-splicing method based on De Bruijn |
CN103093121B (en) * | 2012-12-28 | 2016-01-27 | 深圳先进技术研究院 | The compression storage of two-way multistep deBruijn figure and building method |
CN103699813B (en) * | 2013-12-10 | 2017-05-10 | 深圳先进技术研究院 | Method for identifying and removing repeated bidirectional edges of bidirectional multistep De Bruijn graph |
CN104200133B (en) * | 2014-09-19 | 2017-03-29 | 中南大学 | A kind of genome De novo sequence assembly methods based on reading and range distribution |
Also Published As
Publication number | Publication date |
---|---|
CN104951672A (en) | 2015-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104951672B (en) | Assembly method and system combining second-generation and third-generation genome sequencing data | |
Ma et al. | Reconstructing contiguous regions of an ancestral genome | |
Sundquist et al. | Whole-genome sequencing and assembly with high-throughput, short-read technologies | |
Batzoglou et al. | ARACHNE: a whole-genome shotgun assembler | |
Löytynoja | Phylogeny-aware alignment with PRANK | |
Deonier et al. | Physical Mapping of DNA | |
CN108121897B (en) | Genome variation detection method and detection device | |
Coombe et al. | Assembly of the complete Sitka spruce chloroplast genome using 10X Genomics’ GemCode sequencing data | |
Haghshenas et al. | HASLR: fast hybrid assembly of long reads | |
CN102206704B (en) | Method and device for assembling genome sequence | |
Sahraeian et al. | PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences | |
KR20160073406A (en) | Systems and methods for using paired-end data in directed acyclic structure | |
Hossain et al. | Crystallizing short-read assemblies around seeds | |
WO2002026934A2 (en) | System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map | |
CN104200133A (en) | Read and distance distribution based genome De novo sequence splicing method | |
KR101930253B1 (en) | Apparatus and method constructing consensus reference genome map | |
CN107798216A (en) | The comparison method of high similitude sequence is carried out using divide and conquer | |
Löytynoja | Phylogeny-aware alignment with PRANK and PAGAN | |
Zhang et al. | An Eulerian path approach to global multiple alignment for DNA sequences | |
CN106355000B (en) | The scaffolding methods of insert size statistical natures are read based on both-end | |
Goltsman et al. | Meraculous-2D: Haplotype-sensitive assembly of highly heterozygous genomes | |
US20150142328A1 (en) | Calculation method for interchromosomal translocation position | |
Penner et al. | An algebro-topological description of protein domain structure | |
CN105069325B (en) | It is a kind of that matched method is carried out to nucleic acid sequence information | |
KR20160039386A (en) | Apparatus and method for detection of internal tandem duplication |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20170829; Termination date: 20210619 |