CN108830047A - A scaffolding method based on long reads and contig classification
- Publication number
- CN108830047A (CN201810642753.0A)
- Authority
- CN
- China
- Prior art keywords
- contig
- scaffold
- duplicate
- sco
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a scaffolding method based on long reads and contig classification. The method first aligns the long reads to the contig set and generates a set of local scaffolds from the alignment results; a local scaffold consists of the contigs that align to the same long read. Based on the positions at which each contig occurs in the local scaffolds, all contigs are divided into two classes: repeat contigs and non-repeat contigs. The scaffold graph is built from the non-repeat contigs only, each node of the graph representing one non-repeat contig. Orientation and order conflicts in the scaffold graph are then eliminated by linear programming, and the graph is reduced so that it contains only simple paths, each of which corresponds to one scaffold. Finally, the repeat contigs are inserted into the scaffolds to form the final scaffolding result. The invention is easy to use, gives good scaffolding results on different real data sets, and achieves higher accuracy and contiguity than other scaffolding methods.
Description
Technical field
The present invention relates to the field of sequence assembly in bioinformatics, and in particular to a scaffolding method based on long reads and contig classification.
Background art
A genome generally refers to the entire coding and non-coding deoxyribonucleic acid (DNA) sequence of an organism. It is composed of four bases: adenine (A), thymine (T), cytosine (C) and guanine (G). A genome sequence is therefore a character string over the four characters A, T, G and C. Real genome sequences also contain a further character, N, which indicates that the base at that position could not be determined. The genomic DNA sequence carries hereditary and regulatory information that guides the development and the vital functions of an organism. In basic biological research and in many applied fields, such as diagnostics, biotechnology, forensic biology and systematic biology, a complete and correct genomic DNA sequence has become indispensable. Genome sequencing yields a large number of base-sequence fragments (reads) of the genome sequence. Sequence assembly is the process of reconstructing the whole genomic DNA sequence from these fragments. Because of repeat regions, sequencing errors and uneven sequencing coverage, assembly methods usually first produce relatively independent and scattered sequence fragments, called contigs. These contigs may come from any region of the genomic DNA sequence and, because DNA is double-stranded, from either of the two strands. Scaffolding determines the orientation and order of these contigs and thereby produces longer scaffolds. Scaffolding makes assembly results more contiguous and complete, which helps downstream work such as gene identification, genome alignment and structural-variation detection, and it is one of the active topics in sequence-assembly research.
At present, second-generation sequencing technologies, represented by Illumina/Solexa and ABI/SOLiD, have greatly reduced sequencing cost while producing, in a single run, a massive number of reads with a low error rate. Second-generation sequencing is therefore widely used at home and abroad. The paired-end short reads (paired reads) obtained by second-generation sequencing are two sequence fragments taken from the two ends of a longer fragment of the original genome sequence. The distance between the two reads of a pair (insert size) can reach several kilobases, so paired-end short reads can span a longer region and overcome some of the repeat-region problems in sequence assembly. Scaffolding methods based on paired-end short reads have therefore received wide attention from researchers. These methods usually first generate contigs with an existing assembly tool, then align the paired-end short reads to the contigs, build a scaffold graph (also called a bidirected graph) from the alignment information, and finally infer the orientation and order of the contigs.
With the rapid development of sequencing technology, third-generation sequencing, which is faster and has higher throughput, is gradually maturing. The main third-generation technologies are the single-molecule real-time sequencing of Pacific Biosciences and the nanopore single-molecule sequencing of Oxford Nanopore Technologies. The long reads produced by third-generation sequencing can reach tens of thousands of bases, so they can span most of the repeat regions in a genome and help researchers obtain a complete genome sequence. The sequencing error rate of long reads is high, however, commonly around 15%.
Second-generation sequencing is comparatively mature and offers high accuracy, low cost and high throughput, which is why it is so widely used. Although the reads it produces are short, the insert size of paired-end short reads can reach several kilobases, which overcomes some of the problems caused by repeat regions. Existing scaffolding methods based on paired-end short reads generally comprise two steps: (1) Building the scaffold graph. In the scaffold graph a node usually represents a contig, and an edge indicates that the two corresponding contigs are adjacent in the genome sequence. (2) Inferring the orientation and order of the contigs. Based on the topology of the scaffold graph, different scaffolding methods use different strategies to extract paths, and each path corresponds to one scaffold.
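The graph-building step (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any tool named in this document: the input is assumed to be a list of contig-name pairs, one pair per paired-end read whose two mates align to contigs, and the function name `build_link_graph` and the parameter `min_links` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def build_link_graph(pairs: Iterable[Tuple[str, str]],
                     min_links: int = 1) -> Dict[Tuple[str, str], int]:
    """Count read pairs linking two different contigs and keep the well-supported links.

    pairs: one (contig_of_mate1, contig_of_mate2) tuple per aligned read pair (assumed input).
    Returns {(contig_a, contig_b): number_of_linking_pairs} for links seen more than
    min_links times; the count serves as the edge weight.
    """
    counts: Counter = Counter()
    for a, b in pairs:
        if a != b:  # both mates on the same contig say nothing about contig adjacency
            counts[tuple(sorted((a, b)))] += 1
    return {edge: n for edge, n in counts.items() if n > min_links}

links = build_link_graph([("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")])
print(links)
```

Orientation and distance information, which real scaffolders also attach to each edge, is left out of this sketch.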
SCARPA first retains only the paired-end short reads that align uniquely. When it builds the scaffold graph, each contig corresponds to two nodes that represent the 3' end and the 5' end of the contig. SCARPA converts the contig-orientation problem into a minimum odd cycle transversal problem and solves it with a fixed-parameter algorithm, deleting some edges so that the contig graph contains no odd cycle. It then assigns a start coordinate to each contig by linear programming and deletes the edges that violate the distance constraints. Finally, it determines the order of the contigs heuristically on the scaffold graph. SSPACE builds the scaffold graph with contigs as nodes and adds an edge between two contigs when the number of paired-end short reads linking them exceeds a threshold. Following the topology of the scaffold graph, SSPACE extends paths starting from the longest contig; when several edges are available for an extension, a greedy strategy based on the edge weights chooses among them. MIP partitions the scaffold graph by connectivity and then performs scaffolding on each subgraph. For every edge of the scaffold graph it defines a combined constraint that merges the orientation and order relationships, and the weight of each edge is again the number of paired-end short reads linking the two contigs. SOPRA derives orientation constraints and distance constraints between contigs from the edges of the scaffold graph. It then uses a greedy heuristic that deletes a minimum-weight edge set in order to find an orientation assignment for the contigs that satisfies all remaining orientation constraints. When determining the order of the contigs it likewise deletes a minimum-weight edge set in order to find a position assignment that satisfies the remaining distance constraints. BESST uses the paired-end short reads linking two contigs to compute the difference between the theoretical and the observed standard deviation of the distance between the contigs, as well as the difference between the position distributions of the reads on the two contigs, and infers from these whether an edge should be added between the two contigs. In the methods above, the edge weight is usually the number of paired-end short reads linking the two nodes. On the scaffold graph it has built, BESST selects the paths supported by the most paired-end short reads as the final result. ScaffMatch converts the whole orientation and order inference problem into a maximum-weight acyclic 2-matching problem on the graph and iteratively deletes edges until the graph contains no cycle. In the resulting acyclic graph the contig nodes are linearized. GRASS provides an integer programming formulation that unifies the inference of orientation and order into a single optimization objective. SLIQ designs a set of linear inequalities to constrain the orientation and order of the contigs. SILP2 uses a maximum-likelihood model to remove alignment information that may be wrong and to find regions of abnormal read coverage, and solves the problem by integer linear programming. Briot et al. determine the orientation and order of the contigs by iteratively breaking cycles on the scaffold graph. WiseScaffolder refines scaffolding with an approach that requires manual intervention. Weller et al. solve the orientation and order of the contigs with a tree decomposition method. ScaffoldScaffolder converts the contig-orientation problem into the problem of finding a solution that contains at least k directed edges in a bidirected graph and proves that this problem is NP-hard. It then designs a new greedy algorithm based on the maximum spanning tree algorithm to solve the problem and compares its performance with several other heuristic algorithms. Bambus converts the contig-orientation inference problem into a graph colouring problem. Bambus2 deletes minimum-weight edges so that no inconsistent orientation remains in the graph, then solves the orientation and order of the contigs with an optimal linear arrangement model, and analyses the output for sites where genome variation may have occurred.
The long reads produced by third-generation sequencing can reach tens of thousands of bases and can therefore span most repeat regions, but their sequencing error rate is high. Scaffolding methods based on long reads mainly comprise two steps: (1) Aligning the contigs and the long reads. Because long reads contain many substitution, insertion and deletion errors, obtaining accurate alignments is the key to this step. (2) Inferring the orientation and order of the contigs. Based on the alignment results, different methods linearize the contigs in different ways and thereby determine their orientation and order.
SSPACE-LongRead is a scaffolding tool designed specifically for long reads. It takes contigs (which may be produced by any assembly tool) and long reads as input. It aligns the long reads to the contigs with BLASR, uses the alignment positions of the contigs on the long reads to determine which contigs align to the same long read, and detects contigs that align to multiple long reads. The contigs are then ordered based on the alignment results. LINKS first extracts paired k-mers (sequence fragments of length k) from the long reads; the two k-mers of a pair lie a fixed distance apart on the long read. The paired k-mers are then aligned to the contigs, which determines the positional relationship between the contigs and the long reads. From this alignment information, a heuristic determines the left and right neighbours of every contig and infers the orientation. OPERA-LG takes an indirect approach to scaffolding with long reads. It first aligns the long reads to the contigs with BLASR, then converts the long reads that satisfy its alignment criteria into paired-end short reads, and finally performs scaffolding on those paired-end short reads to infer the orientation and order of the contigs. DBG2OLC first aligns the contigs and the long reads and converts each long read, together with the contigs aligned to it, into a compressed read. It then determines the overlaps between compressed reads, derives a consensus sequence, and finally infers the orientation and order of the contigs. AHA first identifies the long reads that align to multiple contigs, uses these long reads to determine the connections between contigs and to build a scaffold graph, and then optimizes the scaffold graph, linearizes its nodes and outputs the result.
Scaffolding methods based on second-generation paired-end short reads or on third-generation long reads have already achieved good results. The following problems, however, have not been adequately solved and need further study:
(1) When the alignment information between paired-end short reads and contigs is used, the reads are short and therefore easily align to multiple positions, especially in repeat regions. This increases the connectivity of the scaffold graph and in turn reduces the accuracy of scaffolding.
(2) When the alignment information between long reads and contigs is used, the relatively high sequencing error rate of long reads introduces considerable noise into the alignments. Achieving precise alignment between long reads and contigs is a difficult point.
(3) Existing scaffolding methods usually assume that each contig occurs only once in the scaffolds. Some contigs, however, come from repeat regions, and such contigs need to occur several times, in several scaffolds.
These problems prevent existing scaffolding methods from obtaining more satisfactory results.
Summary of the invention
The technical problem to be solved by the invention is, in view of the above shortcomings of the prior art, to provide a scaffolding method based on long reads and contig classification that is easy to use and highly accurate.
To solve this technical problem, the invention adopts the following technical solution:
A scaffolding method based on long reads and contig classification, comprising the following steps:
1) Align the long reads to the contig set and generate local scaffolds.
1.1) Using the alignment tool BWA, align the long-read set to the contig set to produce the alignment results. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
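The length filter of step 1.1) can be illustrated with a minimal Python sketch. The dictionary-of-sequences input and the function name `filter_by_length` are assumptions made for illustration; only the thresholds L_r = 500 and L_c = 3000 come from the text.

```python
from typing import Dict

L_R = 500    # long reads must be longer than this (L_r in the text)
L_C = 3000   # contigs must be longer than this (L_c in the text)

def filter_by_length(seqs: Dict[str, str], min_len: int) -> Dict[str, str]:
    """Keep only the sequences whose length is strictly greater than min_len."""
    return {name: seq for name, seq in seqs.items() if len(seq) > min_len}

reads = {"r1": "A" * 500, "r2": "A" * 501}
contigs = {"c1": "C" * 3000, "c2": "C" * 4000}
print(sorted(filter_by_length(reads, L_R)), sorted(filter_by_length(contigs, L_C)))
```

Because the text says "greater than", a read of exactly 500 bases or a contig of exactly 3000 bases is excluded in this sketch.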
1.2) For each long read, extract the set of all contigs that align to it and compute the positions of the alignment intervals. If no contig, or only one contig, aligns to the long read, that read is not processed further. If two or more contigs align to it, the order and orientation of those contigs are determined from the alignment positions and directions between the long read and the contigs, and one local scaffold is generated. After all long reads have been processed, a set of local scaffolds has been generated.
2) Classify the contigs.
If a contig appears in a middle position of two or more local scaffolds (that is, it is neither the first nor the last contig of those local scaffolds), and the contigs adjacent to its 5' end (or 3' end) in the different local scaffolds are not all the same, the contig is a repeat contig. All remaining contigs are non-repeat contigs. After all contigs have been processed, they are thus divided into two classes: repeat contigs and non-repeat contigs.
3) Build and optimize the scaffold graph.
3.1) First create one node for each non-repeat contig. For each pair of non-repeat contigs, check whether they appear together in the same local scaffold; if so, determine the orientation and order between the two contigs from the alignment information and compute the distance between them. Then add an edge between them and determine the weight of the edge. Once all pairs of nodes have been processed, the initial scaffold graph is complete.
3.2) Each edge of the scaffold graph constrains the orientation, order and distance between the two nodes it connects. A linear programming model is therefore built from all edges of the scaffold graph to detect and remove the edges that cause orientation or order conflicts, so that no such conflict remains in the scaffold graph.
3.3) After the conflicts have been eliminated, if several nodes are connected to the 5' end (or 3' end) of the same node in the scaffold graph, only the edge with the largest weight is kept and the other edges are removed. After these steps the scaffold graph contains only simple paths.
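The pruning rule of step 3.3) can be sketched as follows. This is a hedged illustration rather than the patented implementation: the edge representation `(contig_i, end_i, contig_j, end_j, weight)` and the function name `prune_edges` are assumptions chosen to keep the example self-contained, and the linear programming step 3.2) is assumed to have run already.

```python
from collections import defaultdict
from typing import List, Tuple

Edge = Tuple[str, str, str, str, int]  # (contig_i, end_i, contig_j, end_j, weight), hypothetical layout

def prune_edges(edges: List[Edge]) -> List[Edge]:
    """Leave at most one edge on every contig end, keeping the heaviest edge there."""
    alive = set(range(len(edges)))
    by_end = defaultdict(list)
    for idx, (ci, end_i, cj, end_j, _) in enumerate(edges):
        by_end[(ci, end_i)].append(idx)
        by_end[(cj, end_j)].append(idx)
    for idxs in by_end.values():
        live = [i for i in idxs if i in alive]  # edges not yet removed at another end
        if len(live) > 1:
            best = max(live, key=lambda i: edges[i][4])
            alive -= {i for i in live if i != best}
    return [edges[i] for i in sorted(alive)]

edges = [
    ("a", "3'", "b", "5'", 5000),
    ("a", "3'", "c", "5'", 3000),  # competes with the edge above at the 3' end of a
    ("b", "3'", "d", "5'", 4000),
    ("c", "3'", "d", "5'", 1000),  # competes with the edge above at the 5' end of d
]
print(prune_edges(edges))
```

In the example the two lighter edges are dropped and the path a, b, d remains. When two edges at the same end have equal weight, this sketch keeps the one listed first; the text does not say how such ties are handled.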
4) Generate the scaffold set.
A simple path in the scaffold graph carries the order and orientation of its nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and the paths together give a scaffold set. For two adjacent non-repeat contigs in a scaffold, if a local scaffold contains both of them and everything between them in that local scaffold is a repeat contig, then those repeat contigs, with their orientation and order, form one insertion candidate. If the two non-repeat contigs appear together in several local scaffolds, each such local scaffold gives one insertion candidate. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs in the scaffold. If several insertion candidates share the highest frequency, nothing is inserted between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
Step 1.2) specifically comprises the following steps:
1.2.1) From the alignment results produced by the alignment tool BWA, if a long read and a contig align, the positions of the alignment interval and the alignment direction are available. Suppose the j-th long read (lr_j) and the i-th contig (c_i) align, with alignment interval [SPR(c_i, lr_j), EPR(c_i, lr_j)] on lr_j and alignment interval [SPC(c_i, lr_j), EPC(c_i, lr_j)] on c_i. Because the sequencing error rate of long reads is relatively high, the interval positions reported by the alignment tool often deviate somewhat. The method corrects the alignment intervals as follows.
If SPR(c_i, lr_j) < SPC(c_i, lr_j), then SPC'(c_i, lr_j) = SPC(c_i, lr_j) - SPR(c_i, lr_j) and SPR'(c_i, lr_j) = 0.
If SPR(c_i, lr_j) >= SPC(c_i, lr_j), then SPR'(c_i, lr_j) = SPR(c_i, lr_j) - SPC(c_i, lr_j) and SPC'(c_i, lr_j) = 0.
If LEN(lr_j) - EPR(c_i, lr_j) > LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = LEN(c_i) - 1 and EPR'(c_i, lr_j) = EPR(c_i, lr_j) + LEN(c_i) - EPC(c_i, lr_j).
If LEN(lr_j) - EPR(c_i, lr_j) <= LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = EPC(c_i, lr_j) + LEN(lr_j) - EPR(c_i, lr_j) and EPR'(c_i, lr_j) = LEN(lr_j) - 1.
If |SPR(c_i, lr_j) - SPR'(c_i, lr_j)| < α, or |SPC(c_i, lr_j) - SPC'(c_i, lr_j)| < α, or |EPC(c_i, lr_j) - EPC'(c_i, lr_j)| < α, or |EPR(c_i, lr_j) - EPR'(c_i, lr_j)| < α, then lr_j and c_i are regarded as not aligned and the method ignores the alignment, where α = 500.
After this correction, the method has the alignment intervals between the long reads and the contigs, together with the alignment directions.
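The four correction rules above can be written down directly. The following Python sketch follows the rules exactly as stated; the record type `Aln` and the function names `correct` and `shifts` are hypothetical names introduced for the illustration. The sketch only reports how far each coordinate moved and leaves the comparison against α = 500 to the caller.

```python
from typing import NamedTuple, Tuple

class Aln(NamedTuple):
    spr: int  # start of the alignment interval on the long read
    epr: int  # end of the alignment interval on the long read
    spc: int  # start of the alignment interval on the contig
    epc: int  # end of the alignment interval on the contig

def correct(aln: Aln, read_len: int, contig_len: int) -> Aln:
    """Extend an alignment interval to the nearer sequence end on both sides."""
    if aln.spr < aln.spc:
        spc, spr = aln.spc - aln.spr, 0
    else:
        spr, spc = aln.spr - aln.spc, 0
    if read_len - aln.epr > contig_len - aln.epc:
        epc = contig_len - 1
        epr = aln.epr + contig_len - aln.epc
    else:
        epc = aln.epc + read_len - aln.epr
        epr = read_len - 1
    return Aln(spr, epr, spc, epc)

def shifts(before: Aln, after: Aln) -> Tuple[int, ...]:
    """Absolute change of each coordinate, in the order (spr, epr, spc, epc)."""
    return tuple(abs(b - a) for b, a in zip(before, after))

raw = Aln(spr=100, epr=900, spc=300, epc=1100)
fixed = correct(raw, read_len=5000, contig_len=1200)
print(fixed, shifts(raw, fixed))
```

In this example the read start is closer to its end than the contig start, so the read coordinate is set to 0 and the contig start moves to 200; on the right the contig ends first, so the contig coordinate becomes LEN(c) - 1 = 1199 and the read end moves to 1000.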
1.2.2) If only one contig aligns to a long read, that long read is not processed further. If two or more contigs align to the long read, they are sorted in ascending order of the start position of their alignment intervals on the long read. If several contigs align to the 5' end (or the 3' end) of the long read at the same time, only the contig with the longest alignment interval is kept and the others are removed. A long read that aligns to several contigs thus corresponds to one local scaffold, which can be written as a sequence of nodes, each node being a four-tuple. The i-th local scaffold, built from long read lr_i, is written (s_i1, s_i2, s_i3, ..., s_im), where m is the number of contigs in this local scaffold. s_ij is a four-tuple (sc_ij, sco_ij, scg_ij, scl_ij), where sc_ij is the corresponding contig; sco_ij is the alignment direction between this contig and lr_i, with value 0 for a reverse alignment and 1 for a forward alignment; scg_ij is the distance on lr_i between this contig and the next contig, i.e. SPR(sc_i(j+1), lr_i) - EPR(sc_ij, lr_i), with scg_im set to 0; and scl_ij is the size of the alignment interval on this contig, EPC(sc_ij, lr_i) - SPC(sc_ij, lr_i).
In this local scaffold, for two adjacent contigs sc_ij and sc_i(j+1): if sco_ij = 1 and sco_i(j+1) = 1, the 3' end of sc_ij is connected to the 5' end of sc_i(j+1); if sco_ij = 1 and sco_i(j+1) = 0, the 3' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 0, the 5' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 1, the 5' end of sc_ij is connected to the 5' end of sc_i(j+1).
After the alignment information between all long reads and contigs has been processed, a set of local scaffolds has been generated. Each local scaffold carries the orientation, order and distance information among a subset of the contigs.
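The construction of one local scaffold from the alignments on a single long read can be sketched as below. This is an illustrative reading of step 1.2.2), not the reference implementation. The record type `Hit` and both function names are hypothetical, and the sketch assumes that the alignment intervals have already been corrected, so that a contig reaching the 5' end of the read has start 0 on the read and a contig reaching the 3' end has end LEN(lr) - 1.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

class Hit(NamedTuple):
    contig: str
    forward: int  # 1 = forward alignment, 0 = reverse alignment
    spr: int      # corrected interval on the long read
    epr: int
    spc: int      # corrected interval on the contig
    epc: int

Node = Tuple[str, int, int, int]  # (sc, sco, scg, scl)

def keep_longest_at_ends(hits: List[Hit], read_len: int) -> List[Hit]:
    """Where several contigs reach the same read end, keep only the longest alignment."""
    def prune(hs: List[Hit], at_end: Callable[[Hit], bool]) -> List[Hit]:
        tied = [h for h in hs if at_end(h)]
        if len(tied) < 2:
            return hs
        best = max(tied, key=lambda h: h.epr - h.spr)
        return [h for h in hs if not at_end(h) or h is best]
    hits = prune(hits, lambda h: h.spr == 0)
    return prune(hits, lambda h: h.epr == read_len - 1)

def local_scaffold(hits: List[Hit], read_len: int) -> Optional[List[Node]]:
    """Return the node sequence for one long read, or None if fewer than two contigs remain."""
    hits = sorted(keep_longest_at_ends(hits, read_len), key=lambda h: h.spr)
    if len(hits) < 2:
        return None
    nodes = []
    for k, h in enumerate(hits):
        gap = hits[k + 1].spr - h.epr if k + 1 < len(hits) else 0  # scg of the last node is 0
        nodes.append((h.contig, h.forward, gap, h.epc - h.spc))
    return nodes

hits = [
    Hit("c3", 1, 6000, 8000, 0, 2000),
    Hit("c1", 1, 0, 3000, 500, 3500),
    Hit("c2", 0, 0, 1000, 200, 1200),  # also at the 5' end of the read, but shorter than c1
]
print(local_scaffold(hits, read_len=10000))
```

Here c2 is discarded because c1 covers a longer interval at the same read end, and the resulting local scaffold is c1 followed by c3 with a gap of 3000 bases on the read.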
Step 2) is specifically as follows:
In the local scaffold set generated by the steps above, find the contigs that satisfy either of the following two conditions: (1) the contig appears in a middle position of at least two local scaffolds (that is, in such a local scaffold it is neither the first nor the last contig), and across all the local scaffolds in which it occurs, the contigs adjacent to its 5' end (or 3' end) are not all the same; or (2) the length of the contig is less than MIN, where MIN = 2000. These contigs are classified as repeat contigs, and the remaining contigs are classified as non-repeat contigs.
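The classification rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions: local scaffolds are lists of four-tuples (sc, sco, scg, scl) as defined earlier, neighbours are compared by contig name only, and the orientation sco is used to decide which neighbour sits at the 5' end and which at the 3' end. The function name `classify` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[str, int, int, int]  # (sc, sco, scg, scl)
MIN_LEN = 2000                    # MIN in the text

def classify(local_scaffolds: List[List[Node]], contig_len: Dict[str, int]) -> Set[str]:
    """Return the names of the repeat contigs; every other contig is non-repeat."""
    five = defaultdict(set)    # neighbours seen at the 5' end of each contig
    three = defaultdict(set)   # neighbours seen at the 3' end of each contig
    middle = defaultdict(int)  # number of middle-position occurrences
    for ls in local_scaffolds:
        for k in range(1, len(ls) - 1):
            name, forward = ls[k][0], ls[k][1]
            prev_c, next_c = ls[k - 1][0], ls[k + 1][0]
            if not forward:    # on a reverse alignment the previous node is at the 3' end
                prev_c, next_c = next_c, prev_c
            five[name].add(prev_c)
            three[name].add(next_c)
            middle[name] += 1
    repeats = set()
    for name, length in contig_len.items():
        conflicting = middle[name] >= 2 and (len(five[name]) > 1 or len(three[name]) > 1)
        if conflicting or length < MIN_LEN:
            repeats.add(name)
    return repeats

scaffolds = [
    [("a", 1, 0, 0), ("r", 1, 0, 0), ("b", 1, 0, 0)],
    [("c", 1, 0, 0), ("r", 1, 0, 0), ("d", 1, 0, 0)],  # r has different neighbours here
    [("a", 1, 0, 0), ("u", 1, 0, 0), ("b", 1, 0, 0)],
    [("b", 0, 0, 0), ("u", 0, 0, 0), ("a", 0, 0, 0)],  # same locus read from the other strand
]
lengths = {"a": 5000, "b": 5000, "c": 5000, "d": 5000, "r": 4000, "u": 4000, "s": 1500}
print(sorted(classify(scaffolds, lengths)))
```

Contig r is classified as a repeat because its neighbours differ between local scaffolds, s because it is shorter than MIN, while u stays non-repeat since both reads agree on its neighbours once strand is taken into account.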
Step 3.1) specifically comprises the following steps:
3.1.1) A scaffold graph consists of a node set and an edge set. Each node corresponds to one contig. An edge e_ij is represented by a five-tuple (c_i, c_j, o_ij, g_ij, w_ij), where c_i and c_j are the two contigs connected by the edge, o_ij is the relative direction between the two contigs, g_ij is the distance between them, and w_ij is the weight of the edge. First, one node is created for each non-repeat contig.
3.1.2) In the local scaffold set, find pairs of non-repeat contigs that are adjacent in some local scaffold, or that are separated in it only by repeat contigs.
Consider two such non-repeat contigs c_a and c_b, and suppose the i-th local scaffold ls_i contains both. ls_i is written (s_i1, s_i2, s_i3, ..., s_im); once the repeat contigs are ignored, c_a and c_b are adjacent in ls_i, and they correspond to sc_ip and sc_is respectively. If repeat contigs lie between the two contigs in ls_i, the distance between them is computed with a formula [formula not reproduced in this text], in which GD(sc_ip, sc_is, lr_i) denotes the distance between sc_ip and sc_is on the corresponding long read and LEN(sc_ij) is the length of the contig sc_ij.
If no repeat contig lies between them, i.e. s = p + 1, the distance between them is scg_ip. A weight is obtained at the same time, namely the minimum of scl_ip and scl_is. The relative direction between them is obtained as well: if sco_ip = 1 and sco_is = 1, the relative direction is set to 1; if sco_ip = 1 and sco_is = 0, the relative direction is set to 2; if sco_ip = 0 and sco_is = 0, the relative direction is set to 3; if sco_ip = 0 and sco_is = 1, the relative direction is set to 4.
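The four relative-direction codes line up with the four end-to-end connections listed for adjacent contigs in a local scaffold, and a small lookup table makes the correspondence explicit. This is an illustrative sketch; the names `RELATIVE`, `relative_direction` and `joined_ends` are hypothetical.

```python
from typing import Dict, Tuple

# (sco of first contig, sco of second contig) -> (code, end of first contig, end of second contig)
RELATIVE: Dict[Tuple[int, int], Tuple[int, str, str]] = {
    (1, 1): (1, "3'", "5'"),
    (1, 0): (2, "3'", "3'"),
    (0, 0): (3, "5'", "3'"),
    (0, 1): (4, "5'", "5'"),
}

def relative_direction(sco_first: int, sco_second: int) -> int:
    """Relative-direction code (1 to 4) for two contigs in local-scaffold order."""
    return RELATIVE[(sco_first, sco_second)][0]

def joined_ends(sco_first: int, sco_second: int) -> Tuple[str, str]:
    """Which end of the first contig is connected to which end of the second."""
    return RELATIVE[(sco_first, sco_second)][1:]

print(relative_direction(1, 0), joined_ends(1, 0))
```

The code depends on the order in which the two contigs are listed: a read from the opposite strand shows the same pair in reverse order with both orientations flipped, so forward-forward (code 1, 3' end to 5' end) appears there as reverse-reverse (code 3, 5' end to 3' end), which describes the same physical junction.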
Then all local scaffolds that contain both c_a and c_b are found, and the relative direction, distance and weight between the two contigs are computed from each of them. Because the orientation and order between two non-repeat contigs should be unique, if the relative directions obtained from the local scaffolds differ, only the local scaffolds that give the most frequent relative direction are retained and the information from the remaining local scaffolds is discarded. The distance between the two contigs is then computed with a formula [formula not reproduced in this text], in which ls_i is a retained local scaffold and n is the number of retained local scaffolds. If several relative directions share the maximum frequency, no further processing is done, i.e. no edge is added between the two contigs.
Since each retained local scaffold yields a weight, the method takes the maximum of these weights as the edge weight. Finally, in the scaffold graph, an edge is added between the nodes corresponding to c_a and c_b, with the relative direction, distance and weight obtained by the steps above.
After all pairs of non-repeat contigs have been processed, the initial scaffold graph is complete.
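The way the per-scaffold observations for one contig pair are combined into a single edge can be sketched as follows. The majority rule, the tie rule and the maximum weight follow the text. The distance formula is not reproduced in this text, so the arithmetic mean over the retained local scaffolds used below is only a stand-in assumption, and the function name `aggregate_edge` is hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import List, Optional, Tuple

Observation = Tuple[int, float, int]  # (relative direction, distance, weight) from one local scaffold

def aggregate_edge(observations: List[Observation]) -> Optional[Tuple[int, float, int]]:
    """Combine observations for one contig pair; None means no edge is added."""
    if not observations:
        return None
    ranked = Counter(d for d, _, _ in observations).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # several directions share the maximum frequency
    direction = ranked[0][0]
    kept = [(g, w) for d, g, w in observations if d == direction]
    distance = mean(g for g, _ in kept)  # ASSUMPTION: mean stands in for the missing formula
    weight = max(w for _, w in kept)     # maximum weight, as stated in the text
    return direction, distance, weight

obs = [(1, 100, 2000), (1, 140, 3500), (2, 90, 5000)]
print(aggregate_edge(obs))
```

In the example the observation with direction 2 is discarded even though it has the largest weight, because direction 1 is supported by more local scaffolds.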
Step 4) specifically comprises the following steps:
The edges of a simple path in the scaffold graph carry the order and orientation of the nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and all simple paths together form a scaffold set. At this point the scaffold set contains only non-repeat contigs. For two adjacent non-repeat contigs in a scaffold, the method searches the local scaffold set for all local scaffolds that contain both of them. If a local scaffold contains both, and everything between them in that local scaffold is a repeat contig, then those repeat contigs, with their orientation and order, form one insertion candidate. All insertion candidates are collected in this way from the local scaffolds that contain the two contigs. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs in the scaffold. If several insertion candidates share the highest frequency, nothing is inserted between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
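The selection of the insertion candidate can be sketched as follows. This is an illustrative reading of step 4) with several assumptions that are not in the text: local scaffolds are lists of four-tuples (sc, sco, scg, scl), a local scaffold that shows the two contigs in the opposite order is treated as a read from the other strand (the candidate is reversed and its orientations flipped), and local scaffolds in which the two contigs are directly adjacent contribute no candidate. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Node = Tuple[str, int, int, int]        # (sc, sco, scg, scl)
Candidate = Tuple[Tuple[str, int], ...]  # ordered (repeat contig, orientation) pairs

def insertion_candidate(ls: List[Node], ca: str, cb: str,
                        repeats: Set[str]) -> Optional[Candidate]:
    """Repeat contigs lying between ca and cb in one local scaffold, or None."""
    names = [n[0] for n in ls]
    if ca not in names or cb not in names:
        return None
    i, j = names.index(ca), names.index(cb)
    between = ls[min(i, j) + 1:max(i, j)]
    if not between or any(n[0] not in repeats for n in between):
        return None
    cand = [(n[0], n[1]) for n in between]
    if i > j:  # cb precedes ca: assume the read comes from the opposite strand
        cand = [(name, 1 - o) for name, o in reversed(cand)]
    return tuple(cand)

def choose_insertion(local_scaffolds: List[List[Node]], ca: str, cb: str,
                     repeats: Set[str]) -> Optional[Candidate]:
    """Most frequent candidate between ca and cb; None if there is none or a tie."""
    counts: Counter = Counter()
    for ls in local_scaffolds:
        cand = insertion_candidate(ls, ca, cb, repeats)
        if cand is not None:
            counts[cand] += 1
    ranked = counts.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]

ls1 = [("a", 1, 0, 0), ("r1", 1, 0, 0), ("b", 1, 0, 0)]
ls2 = [("b", 0, 0, 0), ("r1", 0, 0, 0), ("a", 0, 0, 0)]  # same locus, other strand
ls3 = [("a", 1, 0, 0), ("r2", 1, 0, 0), ("b", 1, 0, 0)]
print(choose_insertion([ls1, ls2, ls3], "a", "b", {"r1", "r2"}))
```

In the example r1 is supported by two local scaffolds and r2 by one, so r1 is inserted between a and b. With only ls1 and ls3 the two candidates tie and nothing is inserted, as the text requires.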
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention discloses a scaffolding method based on long reads and contig classification. The method first aligns the long reads to the contig set and generates a set of local scaffolds from the alignment results; a local scaffold consists of the contigs that align to the same long read. Based on the positions at which each contig occurs in the local scaffolds, all contigs are divided into two classes: repeat contigs and non-repeat contigs. A scaffold graph containing only non-repeat contigs is built, in which each node represents one non-repeat contig and an edge indicates that the two corresponding nodes appear together in the same local scaffold. Orientation and order conflicts in the scaffold graph are then eliminated by linear programming, and the graph is reduced so that it contains only simple paths, each of which corresponds to one scaffold. Finally, the repeat contigs are inserted into the scaffolds to form the final scaffolding result.
The invention is easy to use, gives good scaffolding results on different real data sets, and achieves higher accuracy and contiguity than other scaffolding methods.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 shows the correction of the alignment intervals between a long read and contigs in one embodiment of the invention;
Fig. 3 shows, for one embodiment of the invention, how only the contig with the longest alignment interval is retained.
Detailed description of embodiments
As shown in Fig. 1, the invention is implemented as follows:
1. Generating the local scaffold set
1.1 The method takes a contig file and a long-read file as input. The long reads are first aligned to the contigs with the alignment tool BWA to obtain the alignment results. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
1.2 If a long read and a contig align, the positions of the alignment interval and the alignment direction are available. Suppose the j-th long read (lr_j) and the i-th contig (c_i) align, meaning that an interval of lr_j aligns to an interval of c_i. The method uses SPR(c_i, lr_j) for the start position of the alignment interval on lr_j, EPR(c_i, lr_j) for its end position on lr_j, SPC(c_i, lr_j) for the start position of the alignment interval on c_i, and EPC(c_i, lr_j) for its end position on c_i. The alignment interval on lr_j is then [SPR(c_i, lr_j), EPR(c_i, lr_j)] and the alignment interval on c_i is [SPC(c_i, lr_j), EPC(c_i, lr_j)]. Because the sequencing error rate of long reads is relatively high, the interval positions reported by the alignment tool often deviate somewhat. The method corrects the alignment intervals as follows.
If SPR(c_i, lr_j) < SPC(c_i, lr_j), then SPC'(c_i, lr_j) = SPC(c_i, lr_j) - SPR(c_i, lr_j) and SPR'(c_i, lr_j) = 0.
If SPR(c_i, lr_j) >= SPC(c_i, lr_j), then SPR'(c_i, lr_j) = SPR(c_i, lr_j) - SPC(c_i, lr_j) and SPC'(c_i, lr_j) = 0.
If LEN(lr_j) - EPR(c_i, lr_j) > LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = LEN(c_i) - 1 and EPR'(c_i, lr_j) = EPR(c_i, lr_j) + LEN(c_i) - EPC(c_i, lr_j).
If LEN(lr_j) - EPR(c_i, lr_j) <= LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = EPC(c_i, lr_j) + LEN(lr_j) - EPR(c_i, lr_j) and EPR'(c_i, lr_j) = LEN(lr_j) - 1.
Fig. 2 is a schematic diagram of the correction of the aligned intervals between a long read and contigs. In Fig. 2(a), two contigs, c_1 and c_2, align to the long read lr_1: the interval [s1, e1] on c_1 corresponds to the interval [s2, e2] on lr_1, and the interval [s3, e3] on c_2 corresponds to the interval [s4, e4] on lr_1. In Fig. 2(b), the start and end positions of the aligned intervals are corrected. Since s2 < s1, s1' = s1 - s2 and s2' = 0; since LEN(c_1) - e1 < LEN(lr_1) - e2, e1' = LEN(c_1) - 1 and e2' = e2 + LEN(c_1) - e1; since s3 < s4, s3' = 0 and s4' = s4 - s3; since LEN(lr_1) - e4 < LEN(c_2) - e3, e3' = e3 + LEN(lr_1) - e4 and e4' = LEN(lr_1) - 1.
If |SPR(c_i, lr_j) - SPR'(c_i, lr_j)| < α, or |SPC(c_i, lr_j) - SPC'(c_i, lr_j)| < α, or |EPC(c_i, lr_j) - EPC'(c_i, lr_j)| < α, or |EPR(c_i, lr_j) - EPR'(c_i, lr_j)| < α, then lr_j and c_i are considered not to align, and the method ignores this alignment, where α = 500.
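The four correction rules above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the `Alignment` tuple and the function name `correct_alignment` are hypothetical, and the α filter is deliberately left out.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    spr: int  # SPR: start of the aligned interval on the long read
    epr: int  # EPR: end of the aligned interval on the long read
    spc: int  # SPC: start of the aligned interval on the contig
    epc: int  # EPC: end of the aligned interval on the contig


def correct_alignment(a: Alignment, read_len: int, contig_len: int) -> Alignment:
    """Extend an aligned interval to the nearest sequence end on each side."""
    # Start side: the sequence with the shorter unaligned prefix is extended
    # to position 0; the other start moves back by the same amount.
    if a.spr < a.spc:
        spc, spr = a.spc - a.spr, 0
    else:
        spr, spc = a.spr - a.spc, 0
    # End side: the sequence with the shorter unaligned suffix is extended
    # to its last position; the other end moves forward by the same amount.
    read_tail = read_len - a.epr
    contig_tail = contig_len - a.epc
    if read_tail > contig_tail:
        epc, epr = contig_len - 1, a.epr + contig_tail
    else:
        epc, epr = a.epc + read_tail, read_len - 1
    return Alignment(spr, epr, spc, epc)
```

For a contig of length 4000 aligned at [300, 3900] to the interval [100, 3700] of a read of length 10000, the corrected intervals are [0, 3800] on the read and [200, 3999] on the contig, matching the c_1 case of Fig. 2.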
After the above correction, the method obtains the aligned intervals between the long read and the contig, [SPR'(c_i, lr_j), EPR'(c_i, lr_j)] and [SPC'(c_i, lr_j), EPC'(c_i, lr_j)], together with the alignment direction. The alignment direction takes the value 0 or 1, where 0 denotes a reverse alignment and 1 denotes a forward alignment.
If only one contig aligns to a long read, that long read is not processed further. If two or more contigs align to the long read, they are sorted in ascending order of the start positions of their aligned intervals on the long read. If several contigs align to the 5' end (or the 3' end) of the long read at the same time, only the contig with the longest aligned interval on the long read is retained, and the remaining contigs are removed. Fig. 3 is a schematic diagram of retaining only the contig with the longest aligned interval, in which six contigs, c_1, c_2, c_3, c_4, c_5 and c_6, align to the long read lr_1. Since c_1 and c_2 both align to the 5' end of lr_1, only c_1, which has the longest aligned interval, is retained and the alignment information of c_2 is removed. Since c_5 and c_6 both align to the 3' end of lr_1, only c_5, which has the longest aligned interval, is retained and the alignment information of c_6 is removed.
When a long read aligns to several contigs, a local scaffold is generated. The local scaffold consists of the contigs aligned to that long read, whose order and directions are thereby determined. A local scaffold can be represented as a sequence of nodes, each of which is a four-tuple. The i-th local scaffold can be written as (s_i1, s_i2, s_i3, ..., s_im), where m is the number of contigs the scaffold contains. s_ij is a four-tuple (sc_ij, sco_ij, scg_ij, scl_ij), where sc_ij denotes the corresponding contig; sco_ij denotes the alignment direction between that contig and lr_i, the long read from which the local scaffold was generated; scg_ij denotes the distance on lr_i between that contig and the next contig, i.e. SPR(sc_i(j+1), lr_i) - EPR(sc_ij, lr_i); and scl_ij denotes the length of the aligned interval on lr_i, i.e. EPR(sc_ij, lr_i) - SPR(sc_ij, lr_i).
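The construction of one local scaffold from the corrected alignments of a single long read might look like the sketch below. The `Hit` record and `build_local_scaffold` are hypothetical names. The sketch assumes that "aligns to the 5' end" means the corrected start on the read is 0 and "aligns to the 3' end" means the corrected end is the last read position, and it sets the gap of the last node to 0.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    contig: str
    start: int        # SPR': corrected start on the long read
    end: int          # EPR': corrected end on the long read
    orientation: int  # 1 = forward alignment, 0 = reverse alignment


def build_local_scaffold(hits: List[Hit], read_len: int) -> List[Tuple[str, int, int, int]]:
    """Return the node four-tuples (contig, direction, gap, aligned length)."""
    def longest(group):
        return max(group, key=lambda h: h.end - h.start)

    # Assumption: hits touching the read start / read end compete with each other.
    five = [h for h in hits if h.start == 0]
    three = [h for h in hits if h.end == read_len - 1]
    drop = set()
    if len(five) > 1:
        drop |= set(five) - {longest(five)}
    if len(three) > 1:
        drop |= set(three) - {longest(three)}
    kept = sorted((h for h in hits if h not in drop), key=lambda h: h.start)
    if len(kept) < 2:
        return []  # fewer than two contigs: the read yields no local scaffold
    nodes = []
    for k, h in enumerate(kept):
        # Gap to the next contig on the read; the last node gets 0.
        gap = kept[k + 1].start - h.end if k + 1 < len(kept) else 0
        nodes.append((h.contig, h.orientation, gap, h.end - h.start))
    return nodes
```

Sorting happens after the end-competition filter, so the surviving contigs keep the ascending order of their start positions on the read.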
In this scaffold, for two adjacent contigs sc_ij and sc_i(j+1): if sco_ij = 1 and sco_i(j+1) = 1, the 3' end of sc_ij is connected to the 5' end of sc_i(j+1); if sco_ij = 1 and sco_i(j+1) = 0, the 3' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 0, the 5' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 1, the 5' end of sc_ij is connected to the 5' end of sc_i(j+1).
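The four cases above reduce to a simple rule: a forward contig exposes its 3' end to its successor and its 5' end to its predecessor, and a reverse contig does the opposite. A minimal sketch, with the hypothetical helper name `joined_ends`:

```python
from typing import Tuple


def joined_ends(sco_left: int, sco_right: int) -> Tuple[str, str]:
    """Return which end of the left contig meets which end of the right contig."""
    # The left contig faces its successor with its 3' end when aligned forward.
    left_end = "3'" if sco_left == 1 else "5'"
    # The right contig faces its predecessor with its 5' end when aligned forward.
    right_end = "5'" if sco_right == 1 else "3'"
    return left_end, right_end
```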
After the alignment information between all long reads and contigs has been processed, a local scaffold set is obtained. Each local scaffold contains the direction, order and distance information between some of the contigs.
II. Contig classification
Based on the local scaffold set generated in the steps above, the method searches for contigs that satisfy either of the following two conditions: (1) the contig appears in a middle position of at least two local scaffolds, i.e. in a local scaffold it is neither the first nor the last contig, and the contigs adjacent to its 5' end (or 3' end) are not all identical across the local scaffolds in which it appears; or (2) the length of the contig is less than MIN, where MIN = 2000. These contigs are classified as repeat contigs, and the remaining contigs are classified as non-repeat contigs.
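The classification rule can be sketched as follows. The function name `classify_repeats` and the input layout (each local scaffold as a list of (contig, direction) pairs) are assumptions made for illustration. Neighbours are compared by contig name only, and the neighbour on the 5' side is taken to be the predecessor for a forward contig and the successor for a reverse contig.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

MIN_LEN = 2000  # MIN in the text


def classify_repeats(local_scaffolds: List[List[Tuple[str, int]]],
                     contig_len: Dict[str, int]) -> Set[str]:
    """Return the set of contigs classified as repeat contigs."""
    left = defaultdict(set)    # contigs seen next to the 5' end
    right = defaultdict(set)   # contigs seen next to the 3' end
    middle = defaultdict(int)  # number of middle-position occurrences
    for scaffold in local_scaffolds:
        for k in range(1, len(scaffold) - 1):
            name, orientation = scaffold[k]
            prev_name, next_name = scaffold[k - 1][0], scaffold[k + 1][0]
            if orientation == 0:
                # Reverse alignment: the predecessor sits at the 3' end.
                prev_name, next_name = next_name, prev_name
            left[name].add(prev_name)
            right[name].add(next_name)
            middle[name] += 1
    # Condition (2): short contigs are repeats regardless of context.
    repeats = {c for c, n in contig_len.items() if n < MIN_LEN}
    # Condition (1): inconsistent neighbours across at least two scaffolds.
    for name, count in middle.items():
        if count >= 2 and (len(left[name]) > 1 or len(right[name]) > 1):
            repeats.add(name)
    return repeats
```

Every contig not returned by this function is treated as a non-repeat contig in the later steps.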
III. Constructing and optimizing the scaffold graph
3.1 Constructing the initial scaffold graph
3.1.1) A scaffold graph consists of a node set and an edge set. Each node corresponds to one contig. An edge e_ij is represented by a five-tuple (c_i, c_j, o_ij, g_ij, w_ij), where c_i and c_j are the two contigs connected by the edge; o_ij is the relative direction between the two contigs, i.e. it indicates which end of c_i (5' end or 3' end) is adjacent to which end of c_j (5' end or 3' end); g_ij is the distance between the two contigs; and w_ij is the weight of the edge. A node is first created for each non-repeat contig, forming the node set of the initial scaffold graph.
3.1.2) The local scaffold set is searched for pairs of non-repeat contigs that satisfy the following condition: they are adjacent in some local scaffold, or all the contigs lying between them are repeat contigs.
Consider two non-repeat contigs c_a and c_b that satisfy this condition. Suppose the i-th local scaffold ls_i contains c_a and c_b, ls_i is written as (s_i1, s_i2, s_i3, ..., s_im), and c_a and c_b correspond to sc_ip and sc_is respectively. If repeat contigs lie between the two contigs in ls_i, the distance between them is calculated with the following formula.
[Formula not reproduced in this text.]
Here GD(sc_ip, sc_is, lr_i) denotes the distance between sc_ip and sc_is on the corresponding long read, and LEN(sc_ij) is the length of the contig sc_ij.
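The formula referred to above is missing from this text. One reconstruction that is consistent with the surrounding definitions, offered as an assumption rather than as the original formula, sums the gaps scg between consecutive nodes from sc_ip to sc_is and adds the lengths of the repeat contigs lying between them:

```latex
GD(sc_{ip}, sc_{is}, lr_i) = \sum_{j=p}^{s-1} scg_{ij} + \sum_{j=p+1}^{s-1} LEN(sc_{ij})
```

With s = p+1 the second sum is empty and the expression reduces to scg_ip, which agrees with the value the text gives for two contigs that have no repeat contig between them.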
If no repeat contig lies between them, i.e. s = p+1, the distance between them is scg_ip. A weight is obtained at the same time, equal to the minimum of scl_ip and scl_is. The relative direction between them is also obtained: if sco_ip = 1 and sco_is = 1, the relative direction is set to 1, i.e. the 3' end of c_a is adjacent to the 5' end of c_b; if sco_ip = 1 and sco_is = 0, the relative direction is set to 2, i.e. the 3' end of c_a is adjacent to the 3' end of c_b; if sco_ip = 0 and sco_is = 0, the relative direction is set to 3, i.e. the 5' end of c_a is adjacent to the 3' end of c_b; if sco_ip = 0 and sco_is = 1, the relative direction is set to 4, i.e. the 5' end of c_a is adjacent to the 5' end of c_b. When sco_ip and sco_is are equal, c_a and c_b are in the same direction; otherwise they are not.
Next, all local scaffolds containing c_a and c_b are found, and for each of them the relative direction, distance and weight between the two contigs are calculated by the method above. Because the direction and order between two non-repeat contigs should be unique, if the relative directions obtained from these local scaffolds are not all identical, only the local scaffolds supporting the most frequent relative direction are retained, and the alignment information of the remaining local scaffolds is disregarded. The distance between the two contigs is then calculated with the following formula:
[Formula not reproduced in this text.]
Here ls_i is a retained local scaffold and n is the number of retained local scaffolds. If several relative directions share the highest frequency, no further processing is done, i.e. no edge is added between the two contigs.
Each retained local scaffold yields a weight, and the method takes the maximum of these as the edge weight. Finally, an edge is added in the scaffold graph between the nodes corresponding to c_a and c_b, with the relative direction, distance and weight obtained by the steps above.
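The edge construction for one pair (c_a, c_b) can be sketched as below. The name `build_edge` and the observation layout are hypothetical. Because the distance formula is not reproduced in this text, the sketch assumes the edge distance is the arithmetic mean of the distances from the retained local scaffolds; that averaging is an assumption, not a statement of the patent's formula.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (sco_ip, sco_is) -> relative direction code used in the text
RELATIVE_DIRECTION = {(1, 1): 1, (1, 0): 2, (0, 0): 3, (0, 1): 4}


def build_edge(observations: List[Tuple[int, int, float, int]]
               ) -> Optional[Tuple[int, float, int]]:
    """Each observation is (sco_ip, sco_is, distance, weight) from one local scaffold."""
    coded = [(RELATIVE_DIRECTION[(op, os_)], d, w) for op, os_, d, w in observations]
    counts = Counter(code for code, _, _ in coded).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # several relative directions tie for the highest frequency
    best = counts[0][0]
    kept = [(d, w) for code, d, w in coded if code == best]
    # Assumption: the edge distance is the mean over the retained scaffolds.
    distance = sum(d for d, _ in kept) / len(kept)
    # The edge weight is the maximum weight among the retained scaffolds.
    weight = max(w for _, w in kept)
    return best, distance, weight
```

Returning `None` for a tie mirrors the rule that no edge is added when the relative direction is ambiguous.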
After all pairs of non-repeat contigs have been processed, the construction of the initial scaffold graph is complete.
3.2 Scaffold graph optimization
3.2.1) Eliminating direction conflicts
Let O_i ∈ {0, 1} represent the direction of c_i, where 0 denotes forward and 1 denotes reverse. In the scaffold graph, if the relative direction of an edge is 1 or 3, the edge constrains its two contigs to have the same direction; otherwise, the edge constrains the two contigs to have opposite directions. Once the direction of one node is fixed, the directions of the other nodes connected to it can be determined by following paths in the scaffold graph. However, scaffold graphs often contain direction conflicts, i.e. different paths imply different directions for the same node. The method uses integer linear programming to find direction conflicts and removes them from the scaffold graph by deleting some edges.
Each edge constrains the directions of its two nodes. If c_i and c_j are in different directions, the method sets the constraint:
[Formula not reproduced in this text.]
If c_i and c_j are in the same direction, the constraint is:
[Formula not reproduced in this text.]
Here η_ij is a slack variable, η_ij ∈ {0, 1}.
The objective function is:
MAX(∑ w_ij·η_ij)
where w_ij is the weight of the edge between c_i and c_j, and MAX(∑ w_ij·η_ij) means finding the values of η_ij that maximize the function.
After the optimal solution is obtained, each node is assigned a direction. If η_ij is not equal to 1, the corresponding edge is considered to cause a direction conflict and is deleted.
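The effect of the direction optimization can be illustrated on small graphs without an ILP solver. The sketch below replaces integer linear programming with exhaustive search over all direction assignments, which optimizes the same objective (total weight of satisfied edges) but is only feasible for a handful of nodes. The function name `resolve_orientations` and the edge layout are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int, float]  # (c_i, c_j, relative direction 1-4, weight)


def resolve_orientations(nodes: List[str], edges: List[Edge]
                         ) -> Tuple[Dict[str, int], List[Edge]]:
    """Brute-force stand-in for the ILP: maximize the weight of satisfied edges."""
    best_score, best_assign = -1.0, {}
    for bits in product((0, 1), repeat=len(nodes)):
        assign = dict(zip(nodes, bits))
        score = 0.0
        for ci, cj, rel, w in edges:
            same = assign[ci] == assign[cj]
            # Relative directions 1 and 3 require equal directions,
            # 2 and 4 require opposite directions.
            if same == (rel in (1, 3)):
                score += w
        if score > best_score:
            best_score, best_assign = score, assign
    # Edges violated by the best assignment play the role of eta_ij = 0.
    kept = [e for e in edges
            if (best_assign[e[0]] == best_assign[e[1]]) == (e[2] in (1, 3))]
    return best_assign, kept
```

The search is exponential in the number of nodes, so a real implementation would hand the same objective and constraints to an ILP solver.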
3.2.2) Eliminating position conflicts
In the scaffold graph, each edge also specifies the distance between its two contigs. The method assigns a start position to each contig such that the assigned positions satisfy the distances specified by the edges as far as possible.
Let X_i ∈ [0, C] represent the start position coordinate of c_i. Because the direction of each contig has already been computed when resolving direction conflicts in the previous step, the start position of a contig with direction 0 is the position of its 5' end, and the start position of a contig with direction 1 is the position of its 3' end. X_i is an integer. C is twice the sum of the lengths of all nodes.
For the edge between c_i and c_j, the order constraint is established as:
[Formula not reproduced in this text.]
The objective function is:
MAX(∑ w_ij·Φ_ij)
where X_i, X_j ∈ [0, C] are the integer start position coordinates assigned to nodes c_i and c_j, and Φ_ij ∈ [0, 1] is a slack variable reflecting the difference between the distance between c_i and c_j specified by the edge and the distance obtained from the assigned position coordinates. The smaller this difference, the closer Φ_ij is to 1.
After the optimal solution is obtained, an edge e_ij is considered to cause a conflict and is deleted if ||X_i - X_j| - G_ij|/1000 > β, where β = 3.
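Once positions have been assigned, the deletion test at the end of this step is a direct computation. A minimal sketch, assuming the positions X_i are already available from the optimization and that each edge carries its G_ij value; the function name and tuple layout are hypothetical:

```python
from typing import Dict, List, Tuple

PosEdge = Tuple[str, str, float, float]  # (c_i, c_j, G_ij, weight)


def remove_position_conflicts(edges: List[PosEdge], position: Dict[str, int],
                              beta: float = 3.0) -> List[PosEdge]:
    """Keep only the edges whose distance agrees with the assigned positions."""
    kept = []
    for ci, cj, g, weight in edges:
        # Deviation between the assigned offset and the edge distance, in kb.
        deviation = abs(abs(position[ci] - position[cj]) - g) / 1000
        if deviation <= beta:
            kept.append((ci, cj, g, weight))
    return kept
```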
3.2.3) Eliminating branching edges
After the steps above, if the 5' end (or 3' end) of a node is connected by edges to several other nodes, only the edge with the largest weight is retained and the remaining edges are deleted.
After these steps, the scaffold graph contains only simple paths.
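One way to read the branch-removal rule is sketched below: for every (node, end) pair the heaviest incident edge is selected, and an edge survives only if it is the selected edge at both of its ends. The single-pass strategy, the name `prune_branches` and the edge layout are assumptions; the text does not say how an edge that wins at one end but loses at the other is handled.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int, float]  # (c_i, c_j, relative direction 1-4, weight)

# Relative direction -> (end of c_i, end of c_j) joined by the edge
ENDS = {1: ("3'", "5'"), 2: ("3'", "3'"), 3: ("5'", "3'"), 4: ("5'", "5'")}


def prune_branches(edges: List[Edge]) -> List[Edge]:
    """Keep an edge only if it is the heaviest edge at both contig ends it touches."""
    best: Dict[Tuple[str, str], Edge] = {}
    for e in edges:
        ci, cj, rel, w = e
        for key in ((ci, ENDS[rel][0]), (cj, ENDS[rel][1])):
            if key not in best or w > best[key][3]:
                best[key] = e
    return [e for e in edges
            if best[(e[0], ENDS[e[2]][0])] is e and best[(e[1], ENDS[e[2]][1])] is e]
```

After pruning, every contig end has at most one incident edge, which is the property the simple-path claim relies on.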
IV. Generating scaffolds
The edges of a simple path in the scaffold graph contain the order and direction information of the nodes and the distance information between adjacent nodes. Each simple path therefore corresponds to one scaffold, and all simple paths together form a scaffold set. At this point the scaffold set contains only non-repeat contigs.
Next, the method inserts repeat contigs into the scaffold set. For two adjacent non-repeat contigs in a scaffold, the method searches the local scaffold set for all local scaffolds that contain them. When a local scaffold contains both, and all the contigs between them in that local scaffold are repeat contigs, these repeat contigs, with their determined directions and order, form one insertion candidate. All local scaffolds containing the two non-repeat contigs are found and all insertion candidates are determined. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs. If several insertion candidates share the highest frequency, no insertion is performed between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
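Selecting the insertion candidate for one pair of adjacent non-repeat contigs amounts to a majority vote with a tie rule. A minimal sketch, in which each candidate is represented as a tuple of (repeat contig, direction) pairs taken from one local scaffold; the name `choose_insertion` and this representation are hypothetical:

```python
from collections import Counter
from typing import List, Optional, Tuple

Candidate = Tuple[Tuple[str, int], ...]  # ordered (repeat contig, direction) pairs


def choose_insertion(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the most frequent candidate, or None when the top frequency is tied."""
    counts = Counter(candidates).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # several candidates share the highest frequency
    return counts[0][0]
```

Two candidates count as the same only if they list the same repeat contigs in the same order and directions.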
V. Experimental verification
5.1 Data sets and evaluation metrics
To verify the effectiveness of the method, it was tested on contig data sets and long read data sets from two species and compared with three other currently popular scaffolding methods. The two species are Escherichia coli (E. coli) and Saccharomyces cerevisiae (S. cerevisiae). Each species has two different contig sets, generated by different assembly tools. The two contig sets for E. coli are contig set 1 and contig set 2, and the two contig sets for S. cerevisiae are contig set 3 and contig set 4. Each species also has two long read sets, obtained with two different third-generation sequencing technologies: Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. Details of the long read sets are given in Table 1. To evaluate the accuracy of the scaffolding methods, the scaffolding results were evaluated with the QUAST tool. The main evaluation metrics are listed in Table 2.
Table 1 Long read data sets
5.2 Comparison of scaffolding methods
The method is compared with three other popular scaffolding methods: SSPACE-LongRead, LINKS and npScarf. The method is named SLR.
Table 2 QUAST evaluation metrics
Table 3 Scaffolding evaluation results based on the single-molecule real-time sequencing long read sets
5.2.1 Scaffolding evaluation results based on the single-molecule real-time sequencing long read sets
Scaffolding was first carried out with the long read sets obtained by single-molecule real-time sequencing; the evaluation results are given in Table 3. The method has the fewest errors on all data sets, indicating that it is more accurate than the other methods. The continuity of scaffolding results is commonly evaluated by NA50 and NGA50. The method achieves the best NA50 and NGA50 on the first and third data sets and the best NA50 on the fourth data set.
5.2.2 Scaffolding evaluation results based on the Oxford Nanopore sequencing long read sets
Scaffolding was then carried out with the long read sets obtained by Oxford Nanopore sequencing; the evaluation results are given in Table 4. The method again has the fewest errors on all data sets. It achieves the best NA50 and NGA50 on the first and second data sets and the best NGA50 on the third data set.
Table 4 Scaffolding evaluation results based on the Oxford Nanopore sequencing long read sets
Claims (5)
1. A scaffolding method based on long reads and contig classification, characterized by comprising the following steps:
1) aligning the long reads to the contig set and generating a local scaffold set;
1.1) Using the alignment tool BWA, the long read set is aligned to the contig set to produce alignment results. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
1.2) For each long read, all contigs that align to it are extracted and the positions of the aligned intervals are calculated. If no contig or only one contig aligns, the long read is not processed further. If two or more contigs align, the order and directions of these contigs are determined from the alignment position and direction information between the long read and the contigs, and a local scaffold is generated. After all long reads have been processed, a local scaffold set is obtained.
2) classifying the contigs;
If a contig appears in a middle position of two or more local scaffolds (i.e. in a local scaffold it is neither the first nor the last contig), and the contigs immediately adjacent to its 5' end (or 3' end) are not all identical across the different local scaffolds, the contig is a repeat contig. A contig whose length is less than MIN, where MIN = 2000, is also considered a repeat contig. The remaining contigs are non-repeat contigs. After all contigs have been processed, the contigs are divided into two classes: repeat contigs and non-repeat contigs.
3) constructing and optimizing the scaffold graph;
3.1) A node is first created for each non-repeat contig. For two non-repeat contigs, it is determined whether they appear together in the same local scaffold; if so, the direction and order information between the two non-repeat contigs is determined from the alignment information and the distance between them is calculated. It is then decided whether an edge can be added between them, and the weight of the edge is determined. After all pairs of nodes have been processed, the construction of the initial scaffold graph is complete.
3.2) Each edge in the scaffold graph constrains the direction, order and distance information between the two nodes it connects. A linear programming model is therefore built from all edges in the scaffold graph to detect and remove the edges that cause direction and order conflicts, ensuring that the scaffold graph contains no direction or order conflicts.
3.3) After the conflicts have been eliminated, if several nodes in the scaffold graph are connected to the 5' end (or 3' end) of the same node, only the edge with the largest weight is retained and the remaining edges are removed. After these steps, the scaffold graph contains only simple paths.
4) generating the scaffold set;
A simple path in the scaffold graph contains the order and direction information of the nodes and the distance information between adjacent nodes. Each simple path therefore corresponds to one scaffold, and a scaffold set is generated. For two adjacent non-repeat contigs in a scaffold, if a local scaffold contains both and all the contigs between them in that local scaffold are repeat contigs, these repeat contigs, with their determined directions and order, form one insertion candidate. If the two non-repeat contigs appear in several local scaffolds, each local scaffold yields one insertion candidate, and the insertion candidate with the highest frequency is inserted between the two non-repeat contigs. If several insertion candidates share the highest frequency, no insertion is performed between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
Here a contig is a genomic sequence fragment; a scaffold is a very long genomic sequence fragment; and a scaffolding method is a method that determines the direction of each contig and the order of the contigs on the genome sequence so as to generate very long genomic sequence fragments, i.e. scaffolds. The left end of a sequence is its 5' end and the right end is its 3' end.
2. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 1.2) is specifically:
1.2.1) According to the alignment results produced by the alignment tool BWA, if a long read and a contig can be aligned, the positions of the aligned intervals and the alignment direction are obtained. Suppose the j-th long read (lr_j) and the i-th contig (c_i) can be aligned, the aligned interval on lr_j being [SPR(c_i, lr_j), EPR(c_i, lr_j)] and the aligned interval on c_i being [SPC(c_i, lr_j), EPC(c_i, lr_j)]. Because long reads have a relatively high sequencing error rate, the interval positions reported by the alignment tool often deviate somewhat. The method corrects the aligned intervals with the following steps.
If SPR(c_i, lr_j) < SPC(c_i, lr_j), then SPC'(c_i, lr_j) = SPC(c_i, lr_j) - SPR(c_i, lr_j) and SPR'(c_i, lr_j) = 0.
If SPR(c_i, lr_j) >= SPC(c_i, lr_j), then SPR'(c_i, lr_j) = SPR(c_i, lr_j) - SPC(c_i, lr_j) and SPC'(c_i, lr_j) = 0.
If LEN(lr_j) - EPR(c_i, lr_j) > LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = LEN(c_i) - 1 and EPR'(c_i, lr_j) = EPR(c_i, lr_j) + LEN(c_i) - EPC(c_i, lr_j).
If LEN(lr_j) - EPR(c_i, lr_j) <= LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = EPC(c_i, lr_j) + LEN(lr_j) - EPR(c_i, lr_j) and EPR'(c_i, lr_j) = LEN(lr_j) - 1.
If |SPR(c_i, lr_j) - SPR'(c_i, lr_j)| < α, or |SPC(c_i, lr_j) - SPC'(c_i, lr_j)| < α, or |EPC(c_i, lr_j) - EPC'(c_i, lr_j)| < α, or |EPR(c_i, lr_j) - EPR'(c_i, lr_j)| < α, then lr_j and c_i are considered not to align, and the method ignores this alignment, where α is a parameter (α = 500).
After the above correction, the method obtains the aligned intervals between the long read and the contig, together with the alignment direction.
1.2.2) If only one contig aligns to a long read, that long read is not processed further. If two or more contigs align to the long read, they are sorted in ascending order of the start positions of their aligned intervals on the long read. If several contigs align to the 5' end (or the 3' end) of the long read at the same time, only the contig with the longest aligned interval is retained and the remaining contigs are removed. Suppose a long read aligns with several contigs; it then corresponds to one local scaffold, which can be represented as a sequence of nodes, each of which is a four-tuple. The i-th local scaffold can be written as (s_i1, s_i2, s_i3, ..., s_im), where m is the number of contigs the scaffold contains. s_ij is a four-tuple (sc_ij, sco_ij, scg_ij, scl_ij), where sc_ij denotes the corresponding contig; sco_ij denotes the alignment direction between that contig and lr_i, the long read from which the local scaffold was generated, taking the value 0 for a reverse alignment and 1 for a forward alignment; scg_ij denotes the distance on lr_i between that contig and the next contig, i.e. SPR(sc_i(j+1), lr_i) - EPR(sc_ij, lr_i), with scg_im set to 0; and scl_ij is the size of the aligned interval on the contig, EPC(sc_ij, lr_i) - SPC(sc_ij, lr_i).
In this scaffold, for two adjacent contigs sc_ij and sc_i(j+1): if sco_ij = 1 and sco_i(j+1) = 1, the 3' end of sc_ij is connected to the 5' end of sc_i(j+1); if sco_ij = 1 and sco_i(j+1) = 0, the 3' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 0, the 5' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 1, the 5' end of sc_ij is connected to the 5' end of sc_i(j+1).
After the alignment information between all long reads and contigs has been processed, a local scaffold set is obtained. Each local scaffold contains the direction, order and distance information between some of the contigs.
3. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 2) is specifically:
Based on the local scaffold set generated in the steps above, contigs that satisfy either of the following two conditions are searched for: (1) the contig appears in a middle position of at least two local scaffolds, i.e. in a local scaffold it is neither the first nor the last contig, and the contigs immediately adjacent to its 5' end (or 3' end) are not all identical across the local scaffolds in which it appears; or (2) the length of the contig is less than MIN, where MIN = 2000. These contigs are classified as repeat contigs, and the remaining contigs are classified as non-repeat contigs.
4. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 3.1) is specifically:
3.1.1) A scaffold graph consists of a node set and an edge set. Each node corresponds to one contig. An edge e_ij is represented by a five-tuple (c_i, c_j, o_ij, g_ij, w_ij), where c_i and c_j are the two contigs connected by the edge, o_ij is the relative direction between the two contigs, g_ij is the distance between the two contigs, and w_ij is the weight of the edge. A node is first created for each non-repeat contig.
3.1.2) The local scaffold set is searched for pairs of non-repeat contigs that satisfy the following condition: they are adjacent in some local scaffold, or all the contigs lying between them are repeat contigs.
Consider two non-repeat contigs c_a and c_b that satisfy this condition. Suppose the i-th local scaffold ls_i contains c_a and c_b, ls_i is written as (s_i1, s_i2, s_i3, ..., s_im), and c_a and c_b correspond to sc_ip and sc_is respectively. If repeat contigs lie between the two contigs in ls_i, the distance between them is calculated with the following formula.
[Formula not reproduced in this text.]
Here GD(sc_ip, sc_is, lr_i) denotes the distance between sc_ip and sc_is on the corresponding long read, and LEN(sc_ij) is the length of the contig sc_ij.
If no repeat contig lies between them, i.e. s = p+1, the distance between them is scg_ip. A weight is obtained at the same time, equal to the minimum of scl_ip and scl_is. The relative direction between them is also obtained: if sco_ip = 1 and sco_is = 1, the relative direction is set to 1; if sco_ip = 1 and sco_is = 0, the relative direction is set to 2; if sco_ip = 0 and sco_is = 0, the relative direction is set to 3; if sco_ip = 0 and sco_is = 1, the relative direction is set to 4.
Next, all local scaffolds containing c_a and c_b are found, and for each of them the relative direction, distance and weight between the two contigs are calculated. Because the direction and order between two non-repeat contigs should be unique, if the relative directions obtained from these local scaffolds differ, only the local scaffolds supporting the most frequent relative direction are retained, and the information of the remaining local scaffolds is disregarded. The distance between the two contigs is then calculated with the following formula:
[Formula not reproduced in this text.]
Here ls_i is a retained local scaffold and n is the number of retained local scaffolds. If several relative directions share the highest frequency, no further processing is done, i.e. no edge is added between the two contigs.
Each retained local scaffold yields a weight, and the method takes the maximum of these as the edge weight. Finally, an edge is added in the scaffold graph between the nodes corresponding to c_a and c_b, with the relative direction, distance and weight obtained by the steps above.
After all pairs of non-repeat contigs have been processed, the construction of the initial scaffold graph is complete.
5. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 4) is specifically:
The edges of a simple path in the scaffold graph contain the order and direction information of the nodes and the distance information between adjacent nodes. Each simple path therefore corresponds to one scaffold, and all simple paths together form a scaffold set. At this point the scaffold set contains only non-repeat contigs. For two adjacent non-repeat contigs in a scaffold, the method searches the local scaffold set for all local scaffolds that contain them. When a local scaffold contains both, and all the contigs between them in that local scaffold are repeat contigs, these repeat contigs, with their determined directions and order, form one insertion candidate. All insertion candidates are then collected from the local scaffolds that contain the two contigs. In the scaffold, the insertion candidate with the highest frequency is inserted between the two non-repeat contigs. If several insertion candidates share the highest frequency, no insertion is performed between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810642753.0A CN108830047A (en) | 2018-06-21 | 2018-06-21 | A kind of scaffolding method based on long reading and contig classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108830047A true CN108830047A (en) | 2018-11-16 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109935274A (en) * | 2019-03-01 | 2019-06-25 | 河南大学 | A kind of long reading overlay region detection method based on k-mer distribution characteristics |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110257889A1 (en) * | 2010-02-24 | 2011-10-20 | Pacific Biosciences Of California, Inc. | Sequence assembly and consensus sequence determination |
US20130317755A1 (en) * | 2012-05-04 | 2013-11-28 | New York University | Methods, computer-accessible medium, and systems for score-driven whole-genome shotgun sequence assembly |
CN104200133A (en) * | 2014-09-19 | 2014-12-10 | 中南大学 | Read and distance distribution based genome De novo sequence splicing method |
US20150178446A1 (en) * | 2013-12-18 | 2015-06-25 | Pacific Biosciences Of California, Inc. | Iterative clustering of sequence reads for error correction |
CN104951672A (en) * | 2015-06-19 | 2015-09-30 | 中国科学院计算技术研究所 | Splicing method and system of second generation and third generation genomic sequencing data combination |
CN106021978A (en) * | 2016-04-06 | 2016-10-12 | 晶能生物技术(上海)有限公司 | Assembling method for de novo sequencing data based on optics map platform Irys |
CN106355000A (en) * | 2016-08-25 | 2017-01-25 | 中南大学 | Scaffolding method based on statistical characteristic of double-end insert size |
-
2018
- 2018-06-21 CN CN201810642753.0A patent/CN108830047A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110257889A1 (en) * | 2010-02-24 | 2011-10-20 | Pacific Biosciences Of California, Inc. | Sequence assembly and consensus sequence determination |
US20130317755A1 (en) * | 2012-05-04 | 2013-11-28 | New York University | Methods, computer-accessible medium, and systems for score-driven whole-genome shotgun sequence assembly |
US20150178446A1 (en) * | 2013-12-18 | 2015-06-25 | Pacific Biosciences Of California, Inc. | Iterative clustering of sequence reads for error correction |
CN104200133A (en) * | 2014-09-19 | 2014-12-10 | 中南大学 | Read and distance distribution based genome De novo sequence splicing method |
CN104951672A (en) * | 2015-06-19 | 2015-09-30 | 中国科学院计算技术研究所 | Splicing method and system of second generation and third generation genomic sequencing data combination |
CN106021978A (en) * | 2016-04-06 | 2016-10-12 | 晶能生物技术(上海)有限公司 | Assembling method for de novo sequencing data based on optics map platform Irys |
CN106355000A (en) * | 2016-08-25 | 2017-01-25 | 中南大学 | Scaffolding method based on statistical characteristic of double-end insert size |
Non-Patent Citations (5)
Title |
---|
HSIN-HUNG LIN等: ""Evaluation and Validation of Assembling Corrected PacBio Long Reads for Microbial Genome Completion via Hybrid Approaches"", 《PLOS ONE》 * |
LUO J: ""BOSS: a novel scaffolding algorithm based on an optimized scaffold graph"", 《BIOINFORMATICS》 * |
RAJAT S. ROY et al.: ""SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding"", 《ARXIV:1111.1426V2》 * |
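YANG FAN: ""Research on a BWT-based algorithm for merging DNA contig sequences"", 《China Master's Theses Full-text Database, Basic Sciences》 * |
MA YUNYUN: ""Research and implementation of a contig assembly algorithm for next-generation DNA sequencing data"", 《China Master's Theses Full-text Database, Information Science and Technology》 * |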
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109935274A (en) * | 2019-03-01 | 2019-06-25 | Henan University | A long-read overlap region detection method based on k-mer distribution characteristics |
CN109935274B (en) * | 2019-03-01 | 2021-04-30 | Henan University | Long-read overlap region detection method based on k-mer distribution characteristics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2424031C (en) | System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map | |
Jansson et al. | Algorithms for combining rooted triplets into a galled phylogenetic network | |
CN109241355A (en) | Accessibility querying method, system and the readable storage medium storing program for executing of directed acyclic graph | |
WO2018218788A1 (en) | Third-generation sequencing sequence alignment method based on global seed scoring optimization | |
CN104700311B (en) | A kind of neighborhood in community network follows community discovery method | |
CN108830047A (en) | A kind of scaffolding method based on long reading and contig classification | |
El-Mabrouk et al. | Gene family evolution—an algorithmic framework | |
CN106355000B (en) | The scaffolding methods of insert size statistical natures are read based on both-end | |
CN110046265B (en) | Subgraph query method based on double-layer index | |
CN108491687B (en) | Scafffolding method based on contig quality evaluation classification and graph optimization | |
Walve et al. | Kermit: linkage map guided long read assembly | |
Horesh et al. | Designing an A* algorithm for calculating edit distance between rooted-unordered trees | |
CN111462812A (en) | Multi-target phylogenetic tree construction method based on feature hierarchy | |
CN114896480B (en) | Top-K space keyword query method based on road network index | |
Ndiaye et al. | When less is more: sketching with minimizers in genomics | |
CN113312488B (en) | Knowledge graph processing method and device | |
EP3663890A1 (en) | Alignment method, device and system | |
CN102750460A (en) | Operational method of layering simplifying large-scale graph data | |
CN113392279A (en) | Similar directed subgraph searching method and system based on subjective logic and feedforward neural network | |
Nakhleh et al. | Phylogenetic networks: Properties and relationship to trees and clusters | |
US8428885B2 (en) | Virtual screening of chemical spaces | |
Sundararajan et al. | Chaining algorithms for alignment of draft sequence | |
CN114580779B (en) | Block chain transaction behavior prediction method based on graph feature extraction | |
Zou et al. | The Improved Needleman-Wunsch Algorithm Based on the Genetic Process of Sequences | |
Oehl | A combinatorial approach for reconstructing rDNA repeats |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181116 |