CN108830047A - A scaffolding method based on long reads and contig classification - Google Patents


Info

Publication number
CN108830047A
Authority
CN
China
Prior art keywords
contig
scaffold
duplicate
sco
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810642753.0A
Other languages
Chinese (zh)
Inventor
罗军伟
王俊峰
张波
张霄宏
贾利琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN201810642753.0A
Publication of CN108830047A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a scaffolding method based on long reads and contig classification. The method first aligns the long reads to the contig set and generates a set of local scaffolds from the alignment result; a local scaffold consists of the contigs aligned to the same long read. Based on the positions at which each contig occurs in the local scaffolds, all contigs are divided into two classes: repeat contigs and non-repeat contigs. The scaffold graph is built from non-repeat contigs only, and each node in the graph represents one non-repeat contig. Orientation and order conflicts in the scaffold graph are then eliminated by linear programming, and the graph is reduced so that it contains only simple paths, each of which corresponds to one scaffold. Finally, the repeat contigs are inserted into the scaffolds to form the final scaffolding result. The invention is easy to use, gives good scaffolding results on different real data sets, and achieves higher accuracy and contiguity than other scaffolding methods.

Description

A scaffolding method based on long reads and contig classification
Technical field
The present invention relates to the field of sequence assembly in bioinformatics, and in particular to a scaffolding method based on long reads and contig classification.
Background art
A genome generally refers to the complete coding and non-coding DNA (deoxyribonucleic acid) sequence of an organism. It is a sequence composed of four bases: adenine (A), thymine (T), cytosine (C) and guanine (G). A genome sequence is therefore a character string over the four characters A, T, G and C. Real genome sequences also contain a further character, N, which indicates that the base at that position could not be determined. The genomic DNA sequence carries hereditary and regulatory information and directs the development and vital functions of the organism. In basic biological research and in many applied fields, such as diagnostics, biotechnology, forensic biology and systems biology, a complete and correct genomic DNA sequence has become indispensable. Genome sequencing yields a large number of base-sequence fragments (reads) from the genome sequence, and sequence assembly is the process of reconstructing the whole genomic DNA sequence from these fragments. Because of repeat regions, sequencing errors and uneven sequencing coverage, assembly methods usually first produce relatively independent and scattered sequence fragments, called contigs. These contigs may come from any region of the genomic DNA sequence and, since DNA is double-stranded, from either of the two strands. Scaffolding determines the orientation and order relations among these contigs and thereby produces longer scaffolds. Scaffolding makes the assembly result more contiguous and complete, which helps subsequent studies such as gene identification, genome alignment and structural-variation detection, and it is one of the active topics in sequence-assembly research.
At present, second-generation sequencing technologies, represented by Illumina/Solexa and AB SOLiD, have sharply reduced cost while producing, in a single run, a massive number of reads with a low error rate, and they are therefore widely used at home and abroad. The paired-end short reads (paired reads) obtained with second-generation sequencing are two sequence fragments taken from the two ends of a longer fragment of the original genome sequence. The distance between the two reads of a pair (insert size) can reach several thousand bases, so paired-end short reads can span a longer region and overcome some of the repeat-region problems in sequence assembly. Scaffolding methods based on paired-end short reads have consequently attracted wide attention from researchers. They usually first generate contigs with an existing assembly tool, then align the paired-end short reads to the contigs, then build a scaffold graph (also called a bidirected graph) from the alignment information, and finally infer the orientation and order relations among the contigs.
With the rapid development of sequencing technology, third-generation sequencing, which is faster and has higher throughput, is gradually maturing. The main third-generation technologies are the single-molecule real-time sequencing of Pacific Biosciences and the nanopore single-molecule sequencing of Oxford Nanopore Technologies. The long reads produced by third-generation sequencing can reach tens of thousands of bases. Such reads can span most of the repeat regions in a genome and thus help researchers obtain a complete genome sequence. However, the sequencing error rate of long reads is high, commonly around 15%.
Second-generation sequencing is comparatively mature and offers high accuracy, low cost and high throughput, so it is widely used at home and abroad. Although the reads it produces are short, the insert size of paired-end short reads can reach several thousand bases, which mitigates some of the problems caused by repeat regions. Existing scaffolding methods based on paired-end short reads generally comprise two steps. (1) Building the scaffold graph. In the scaffold graph, a node usually represents a contig, and an edge indicates that the two corresponding contigs are adjacent in the genome sequence. (2) Inferring the orientation and order among the contigs. Based on the topology of the scaffold graph, different scaffolding methods use different strategies to extract paths, and each path corresponds to one scaffold.
SCARPA first retains only paired-end short reads that align uniquely. When it builds the scaffold graph, each contig corresponds to two nodes, which represent the 3' end and the 5' end of the contig. SCARPA converts the problem of orienting the contigs into a minimum odd cycle transversal problem and solves it with a fixed-parameter algorithm, deleting some edges so that the contig association graph contains no odd cycle. It then assigns a start coordinate to each contig by linear programming and deletes the edges that violate the distance constraints. Finally, it determines the order of the contigs with a heuristic on the scaffold graph. SSPACE builds the scaffold graph with contigs as nodes and adds an edge between two contigs if the number of paired-end short reads linking them exceeds a threshold. Following the topology of the scaffold graph, SSPACE extends a path starting from the longest contig; when several edges are available for the next extension, it applies a greedy strategy based on edge weight. MIP partitions the scaffold graph by connectivity and then scaffolds each subgraph. For every edge in the scaffold graph it designs a combined constraint that merges the orientation and order relations; the edge weight is again the number of paired-end short reads linking the two contigs. SOPRA derives orientation constraints and distance constraints between contigs from the edges of the scaffold graph. It then uses a greedy heuristic that deletes a minimum-weight edge set to find an orientation assignment for the contigs that satisfies all remaining orientation constraints. Finally, when determining the order of the contigs, it again deletes a minimum-weight edge set to find a position assignment for the contigs that satisfies the remaining distance constraints. BESST uses the paired-end short reads linking two contigs to compute the difference between the theoretical and the observed standard deviation of the distance between the two contigs, together with the difference in the position distribution of the reads on the two contigs, and from these infers whether an edge should be added between the two contigs. In the methods above, the edge weight is usually the number of paired-end short reads linking the two nodes. In the scaffold graph it builds, BESST selects the path supported by the most paired-end short reads as the final result. ScaffMatch converts the whole orientation and order inference problem into a maximum-weight acyclic 2-matching problem on the graph and iteratively deletes edges until the graph contains no cycle; the contig nodes are then linearized in the resulting acyclic graph. GRASS provides an integer programming formulation that unifies orientation and order inference among contigs into a single optimization objective. SLIQ designs a set of linear inequalities that constrain the orientation and order relations among contigs. SILP2 uses a maximum-likelihood model to remove alignment information that is likely to be wrong and to find regions of abnormal read coverage, and solves the problem by integer linear programming. Briot et al. determine the orientation and order relations among contigs by iteratively breaking cycles in the scaffold graph. WiseScaffolder improves scaffolding with an approach that requires manual intervention. Weller et al. solve the orientation and order relations among contigs with a tree decomposition method. ScaffoldScaffolder converts the contig orientation problem into the problem of retaining at least k directed edges in a bidirected graph and proves that this problem is NP-hard. It then designs a new greedy algorithm based on the maximum spanning tree algorithm to solve it and compares its performance with several other heuristic algorithms. Bambus converts the contig orientation inference problem into a graph coloring problem. Bambus2 deletes minimum-weight edges so that the graph contains no inconsistent orientations, then solves the orientation and order relations among contigs with an optimal linear combination model, and uses the output to analyze sites where genome variation may have occurred.
The long reads produced by third-generation sequencing can reach tens of thousands of bases and can therefore span most repeat regions, but their sequencing error rate is high. Scaffolding methods based on long reads mainly comprise two steps. (1) Aligning the contigs and the long reads. Because long reads contain many base substitution, insertion and deletion errors, obtaining an accurate alignment result is the key to this step. (2) Inferring the orientation and order relations among the contigs. Based on the alignment result, different methods are used to linearize the contigs and thereby determine their orientation and order.
SSPACE-LongRead is a scaffolding tool designed specifically for long reads. It takes contigs (which may be produced by any assembly tool) and long reads as input. It aligns the long reads to the contigs with BLASR, uses the alignment positions of the contigs on the long reads to identify the contigs that align to the same long read, and detects contigs that align to multiple long reads. The contigs are then ordered on the basis of the alignment result. LINKS first extracts paired k-mers (sequence fragments of length k) from the long reads; the distance between the two k-mers of a pair on the long read is a fixed value. The paired k-mers are then aligned to the contigs, which determines the positional relations between the contigs and the long reads. On the basis of this alignment information, a heuristic determines the left and right neighbors of each contig and infers the orientation relations. OPERA-LG takes an indirect approach to scaffolding with long reads. It first aligns the long reads to the contigs with BLASR, then converts the long reads that satisfy the alignment criteria into paired-end short reads, and finally scaffolds with those paired-end short reads to infer the orientation and order relations among the contigs. DBG2OLC first aligns the contigs and the long reads and converts each long read, together with the contigs aligned to it, into a compressed read. It then determines the overlaps between compressed reads, computes a consensus sequence, and finally infers the orientation and order among the contigs. AHA first identifies the long reads that match multiple contigs, uses these long reads to determine the connections between contigs and to build an association graph (scaffold graph), and then optimizes the scaffold graph, linearizes the nodes and outputs the result.
Scaffolding methods based on the paired-end short reads of second-generation sequencing or on the long reads of third-generation sequencing have achieved good results. However, the following problems have not yet been adequately solved and need further study:
(1) When the alignment information between paired-end short reads and contigs is used, the reads are short and therefore tend to align to multiple positions, especially in repeat regions. This increases the connectivity of the scaffold graph and in turn reduces the accuracy of scaffolding.
(2) When the alignment information between long reads and contigs is used, the relatively high sequencing error rate of long reads makes this alignment information noisy. Achieving accurate alignment between long reads and contigs is a difficult problem.
(3) Existing scaffolding methods usually assume that each contig occurs only once in the scaffolds. Some contigs, however, may come from repeat regions, and such contigs need to occur several times, possibly in several scaffolds.
These problems prevent existing scaffolding methods from obtaining more satisfactory results.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above shortcomings of the prior art, to provide a scaffolding method based on long reads and contig classification that is easy to use and highly accurate.
To solve the above technical problem, the technical solution adopted by the present invention is:
A scaffolding method based on long reads and contig classification, comprising the following steps:
1) Align the long reads to the contig set and generate local scaffolds;
1.1) Using the alignment tool BWA, align the long-read set to the contig set and produce the alignment result. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
1.2) For each long read, extract the set of all contigs that align to it and compute the positions of the alignment intervals. If no contig or only one contig aligns to the long read, the long read is not processed further. If two or more contigs align to it, determine the order and orientation of these contigs from the alignment positions and directions between the long read and the contigs, and generate one local scaffold. Once all long reads have been processed, a set of local scaffolds has been generated.
2) Classify the contigs;
If a contig occurs at an internal position of two or more local scaffolds (that is, it is neither the first nor the last contig of those local scaffolds), and the contigs adjacent to its 5' end (or its 3' end) are not the same in all of these local scaffolds, the contig is a repeat contig; all other contigs are non-repeat contigs. Once all contigs have been processed, they are divided into two classes: repeat contigs and non-repeat contigs.
3) Build and optimize the scaffold graph;
3.1) First create a node for each non-repeat contig. For each pair of non-repeat contigs, check whether both occur in the same local scaffold; if so, determine the orientation and order between the two contigs from the alignment information and compute the distance between them. Then add an edge between them and determine its weight. Once all pairs of nodes have been processed, the initial scaffold graph is complete.
3.2) Each edge in the scaffold graph constrains the orientation, order and distance between the two nodes it connects. A linear programming model is therefore built from all edges of the scaffold graph to detect and remove the edges that cause orientation or order conflicts, so that no such conflict remains in the scaffold graph.
3.3) After the conflicts have been eliminated, if several nodes are connected to the 5' end (or the 3' end) of the same node, only the edge with the largest weight is kept and the other edges are removed. After these steps the scaffold graph contains only simple paths.
4) Generate the scaffold set;
A simple path in the scaffold graph carries the order and orientation of its nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and together the paths yield a scaffold set. For two adjacent non-repeat contigs in a scaffold, if a local scaffold contains both and everything between them in that local scaffold is repeat contigs, then those repeat contigs, with their orientation and order determined, form one insertion candidate. If the two non-repeat contigs occur together in several local scaffolds, each such local scaffold yields one insertion candidate. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs in the scaffold. If several candidates share the highest frequency, no insertion is performed for these two contigs. Once all adjacent non-repeat contigs have been processed, the final scaffold set is produced.
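Step 3.3) can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the patent's implementation: the edge layout `(contig_a, end_a, contig_b, end_b, weight)` and the function name `prune_to_simple_paths` are assumptions, and an edge is kept only when it is the heaviest edge at both of the contig ends it touches.

```python
from typing import Dict, List, Tuple

# Hypothetical edge layout: (contig_a, end_a, contig_b, end_b, weight),
# where end_a / end_b are the strings "5'" or "3'".
Edge = Tuple[str, str, str, str, int]


def prune_to_simple_paths(edges: List[Edge]) -> List[Edge]:
    """Keep, for every contig end, only its maximum-weight edge."""
    best: Dict[Tuple[str, str], Edge] = {}
    for edge in edges:
        a, end_a, b, end_b, weight = edge
        for key in ((a, end_a), (b, end_b)):
            # Remember the heaviest edge seen so far at this contig end.
            if key not in best or weight > best[key][4]:
                best[key] = edge
    # An edge survives only if both of its contig ends prefer it, so every
    # contig end keeps at most one edge.
    return [
        e for e in edges
        if best[(e[0], e[1])] is e and best[(e[2], e[3])] is e
    ]


edges = [
    ("a", "3'", "b", "5'", 10),
    ("a", "3'", "c", "5'", 4),   # competes with the edge above at a's 3' end
    ("b", "3'", "c", "5'", 7),
]
kept = prune_to_simple_paths(edges)
```

With at most one edge per contig end, every node has degree at most two, which is what makes each connected component a simple path (a closed cycle would still need to be broken separately).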
Step 1.2) specifically comprises the following steps:
1.2.1) From the alignment result produced by the alignment tool BWA, the position of the alignment interval and the alignment direction are available for every long read and contig that align. Suppose the j-th long read (lr_j) and the i-th contig (c_i) align, the alignment interval on lr_j is [SPR(c_i, lr_j), EPR(c_i, lr_j)], and the alignment interval on c_i is [SPC(c_i, lr_j), EPC(c_i, lr_j)]. Because the sequencing error rate of long reads is relatively high, the interval positions reported by the alignment tool often deviate somewhat. The method corrects the alignment intervals as follows.
If SPR(c_i, lr_j) < SPC(c_i, lr_j), then SPC'(c_i, lr_j) = SPC(c_i, lr_j) - SPR(c_i, lr_j) and SPR'(c_i, lr_j) = 0.
If SPR(c_i, lr_j) >= SPC(c_i, lr_j), then SPR'(c_i, lr_j) = SPR(c_i, lr_j) - SPC(c_i, lr_j) and SPC'(c_i, lr_j) = 0.
If LEN(lr_j) - EPR(c_i, lr_j) > LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = LEN(c_i) - 1 and EPR'(c_i, lr_j) = EPR(c_i, lr_j) + LEN(c_i) - EPC(c_i, lr_j).
If LEN(lr_j) - EPR(c_i, lr_j) <= LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = EPC(c_i, lr_j) + LEN(lr_j) - EPR(c_i, lr_j) and EPR'(c_i, lr_j) = LEN(lr_j) - 1.
If |SPR(c_i, lr_j) - SPR'(c_i, lr_j)| > α, or |SPC(c_i, lr_j) - SPC'(c_i, lr_j)| > α, or |EPC(c_i, lr_j) - EPC'(c_i, lr_j)| > α, or |EPR(c_i, lr_j) - EPR'(c_i, lr_j)| > α, then lr_j and c_i are considered not to align and the method ignores this alignment, where α = 500.
After this correction, the method has the alignment intervals between each long read and contig, together with the alignment direction.
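The interval correction in step 1.2.1) can be sketched as follows. The `Alignment` record and the function name `correct_alignment` are hypothetical. The sketch also assumes that an alignment is discarded when the correction moves any endpoint by more than α: a correction that has to stretch an interval a long way indicates that the read and the contig do not overlap end to end.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 500  # threshold alpha given in the text


@dataclass
class Alignment:
    spr: int  # start of the alignment interval on the long read
    epr: int  # end of the alignment interval on the long read
    spc: int  # start of the alignment interval on the contig
    epc: int  # end of the alignment interval on the contig


def correct_alignment(a: Alignment, read_len: int, contig_len: int,
                      alpha: int = ALPHA) -> Optional[Alignment]:
    """Extend the interval until the read or the contig runs out on each side."""
    # Left side: whichever sequence has the shorter overhang reaches position 0.
    if a.spr < a.spc:
        spc2, spr2 = a.spc - a.spr, 0
    else:
        spr2, spc2 = a.spr - a.spc, 0
    # Right side: whichever sequence has less left over reaches its last base.
    if read_len - a.epr > contig_len - a.epc:
        epc2 = contig_len - 1
        epr2 = a.epr + contig_len - a.epc
    else:
        epc2 = a.epc + read_len - a.epr
        epr2 = read_len - 1
    shifts = (abs(a.spr - spr2), abs(a.spc - spc2),
              abs(a.epr - epr2), abs(a.epc - epc2))
    # Assumption of this sketch: a large shift marks an unreliable alignment.
    if max(shifts) > alpha:
        return None
    return Alignment(spr2, epr2, spc2, epc2)


good = correct_alignment(Alignment(2000, 6900, 50, 4950), 10000, 5000)
bad = correct_alignment(Alignment(2000, 4000, 1000, 3000), 10000, 5000)
```

In the first call the contig lies inside the read with 50-base overhangs, so the interval is stretched to the contig ends; in the second the alignment stops 1000 bases short of the contig start and is rejected.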
1.2.2) If only one contig aligns to a long read, the long read is not processed further. If two or more contigs align to the long read, they are sorted in ascending order of the start position of their alignment intervals on the long read. If several contigs align simultaneously to the 5' end (or the 3' end) of the long read, only the contig with the longest alignment interval is kept and the others are removed. Suppose the long read lr_i aligns with several contigs; it then corresponds to one local scaffold, which can be expressed as a sequence of nodes, each node being a four-tuple. The i-th local scaffold is written (s_i1, s_i2, s_i3, ..., s_im), where m is the number of contigs in this local scaffold. s_ij is a four-tuple (sc_ij, sco_ij, scg_ij, scl_ij), where sc_ij is the corresponding contig and sco_ij is the alignment direction of this contig on lr_i, with value 0 for a reverse alignment and 1 for a forward alignment. scg_ij is the distance on lr_i between this contig and the next contig, i.e. SPR(sc_i(j+1), lr_i) - EPR(sc_ij, lr_i), and scg_im is set to 0. scl_ij is the size of the alignment interval on this contig, EPC(sc_ij, lr_i) - SPC(sc_ij, lr_i).
In this local scaffold, for two adjacent contigs sc_ij and sc_i(j+1): if sco_ij = 1 and sco_i(j+1) = 1, the 3' end of sc_ij is connected to the 5' end of sc_i(j+1); if sco_ij = 1 and sco_i(j+1) = 0, the 3' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 0, the 5' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 1, the 5' end of sc_ij is connected to the 5' end of sc_i(j+1).
Once the alignment information between all long reads and contigs has been processed, a set of local scaffolds has been generated. Each local scaffold carries the orientation, order and distance information for a subset of the contigs.
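A minimal sketch of step 1.2.2) follows, assuming the intervals have already been corrected. The `Hit` record and both function names are hypothetical. "Aligns to the 5' end (or 3' end) of the long read" is interpreted here as a corrected interval that starts at read position 0 (or ends at the last read position); that interpretation is an assumption of the sketch.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    contig: str
    forward: int  # 1 = forward alignment, 0 = reverse alignment (sco)
    spr: int      # corrected interval start on the long read
    epr: int      # corrected interval end on the long read
    spc: int      # corrected interval start on the contig
    epc: int      # corrected interval end on the contig


Node = Tuple[str, int, int, int]  # (sc, sco, scg, scl)


def build_local_scaffold(hits: List[Hit], read_len: int) -> List[Node]:
    """Turn the contigs aligned to one long read into a local scaffold."""
    dropped = set()
    at_start = [h for h in hits if h.spr == 0]
    at_end = [h for h in hits if h.epr == read_len - 1]
    for group in (at_start, at_end):
        if len(group) > 1:
            # Several contigs compete for one read end: keep the longest.
            best = max(group, key=lambda h: h.epr - h.spr)
            dropped.update(h.contig for h in group if h is not best)
    kept = sorted((h for h in hits if h.contig not in dropped),
                  key=lambda h: h.spr)
    if len(kept) < 2:
        return []  # fewer than two contigs: the read is not used
    nodes: List[Node] = []
    for cur, nxt in zip(kept, kept[1:] + [None]):
        gap = nxt.spr - cur.epr if nxt is not None else 0  # scg, 0 for last
        nodes.append((cur.contig, cur.forward, gap, cur.epc - cur.spc))
    return nodes


def joined_ends(sco_a: int, sco_b: int) -> Tuple[str, str]:
    """Which end of contig a touches which end of the next contig b."""
    return ("3'" if sco_a == 1 else "5'", "5'" if sco_b == 1 else "3'")


hits = [
    Hit("c2", 0, 3000, 5000, 0, 2000),
    Hit("c1", 1, 0, 2500, 500, 3000),
    Hit("cX", 1, 0, 1000, 2000, 3000),  # shorter rival at the read's 5' end
]
local_scaffold = build_local_scaffold(hits, read_len=8000)
```

`joined_ends` encodes the four orientation cases listed above: a forward contig exposes its 3' end to its successor, and a forward successor is entered at its 5' end.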
Step 2) is specifically:
From the local scaffold set generated in the steps above, find the contigs that satisfy either of the following two conditions. (1) The contig occurs at an internal position of at least two local scaffolds, i.e. in each of those local scaffolds it is neither the first nor the last contig, and the contigs adjacent to its 5' end (or its 3' end) are not the same in all the local scaffolds in which it occurs. (2) The length of the contig is less than MIN, where MIN = 2000. These contigs are classified as repeat contigs, and the remaining contigs are classified as non-repeat contigs.
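The classification rule can be sketched as below. The data layout (each local scaffold as a list of `(contig, orientation)` pairs) and the function name `classify_contigs` are assumptions, and neighbors are compared by contig name only, which is a simplification.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

MIN_LEN = 2000  # MIN from the text

LocalScaffold = List[Tuple[str, int]]  # (contig, orientation: 1 fwd, 0 rev)


def classify_contigs(local_scaffolds: List[LocalScaffold],
                     contig_len: Dict[str, int]) -> Set[str]:
    """Return the set of repeat contigs; every other contig is non-repeat."""
    five = defaultdict(set)      # contigs seen next to each contig's 5' end
    three = defaultdict(set)     # contigs seen next to each contig's 3' end
    internal = defaultdict(int)  # number of internal occurrences
    for ls in local_scaffolds:
        for k in range(1, len(ls) - 1):  # skip first and last positions
            name, fwd = ls[k]
            prev_c, next_c = ls[k - 1][0], ls[k + 1][0]
            internal[name] += 1
            if fwd == 1:   # forward: predecessor sits at the 5' end
                five[name].add(prev_c)
                three[name].add(next_c)
            else:          # reverse: predecessor sits at the 3' end
                five[name].add(next_c)
                three[name].add(prev_c)
    repeats = set()
    for name, length in contig_len.items():
        if length < MIN_LEN:
            repeats.add(name)  # condition (2): short contig
        elif internal[name] >= 2 and (len(five[name]) > 1
                                      or len(three[name]) > 1):
            repeats.add(name)  # condition (1): inconsistent neighbors
    return repeats


scaffolds = [
    [("a", 1), ("r", 1), ("b", 1)],
    [("c", 1), ("r", 1), ("d", 1)],  # r has different neighbors here
    [("a", 1), ("u", 1), ("b", 1)],
    [("b", 0), ("u", 0), ("a", 0)],  # same layout as above, read reversed
]
lengths = {"a": 5000, "b": 5000, "c": 5000, "d": 5000,
           "r": 4000, "u": 4000, "s": 1500}
repeat_set = classify_contigs(scaffolds, lengths)
```

Contig `u` is not flagged because the fourth local scaffold is the third one read from the opposite strand, so its 5' and 3' neighbors agree once orientation is taken into account.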
Step 3.1) specifically comprises the following steps:
3.1.1) A scaffold graph consists of a node set and an edge set. Each node corresponds to one contig. An edge e_ij is represented by a five-tuple (c_i, c_j, o_ij, g_ij, w_ij), where c_i and c_j are the two contigs joined by the edge, o_ij is the relative orientation of the two contigs, g_ij is the distance between them, and w_ij is the weight of the edge. First, one node is created for each non-repeat contig.
3.1.2) In the local scaffold set, find the pairs of non-repeat contigs that satisfy the following condition: in some local scaffold they are adjacent, or all the contigs lying between them are repeat contigs.
For two non-repeat contigs c_a and c_b that satisfy this condition, suppose the i-th local scaffold ls_i contains both. ls_i is written (s_i1, s_i2, s_i3, ..., s_im); once the repeat contigs are ignored, c_a and c_b are adjacent in ls_i, and they correspond to sc_ip and sc_is respectively. If repeat contigs lie between the two contigs in ls_i, the distance between them is computed with the following formula:
[formula image not reproduced]
where GD(sc_ip, sc_is, lr_i) denotes the distance between sc_ip and sc_is on the corresponding long read, and LEN(sc_ij) is the length of contig sc_ij.
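A reconstruction consistent with these definitions, offered as an assumption rather than as the patent's verbatim formula, adds the gaps between consecutive contigs from sc_ip to sc_is and the lengths of the repeat contigs lying between them:

```latex
GD(sc_{ip}, sc_{is}, lr_i) \;=\; \sum_{j=p}^{s-1} scg_{ij} \;+\; \sum_{j=p+1}^{s-1} \mathrm{LEN}(sc_{ij})
```

When s = p + 1 the second sum is empty and the expression reduces to scg_ip, which matches the distance the text gives for two directly adjacent contigs.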
If no repeat contig lies between them, i.e. s = p + 1, the distance between them is scg_ip. A weight is obtained at the same time, equal to the minimum of scl_ip and scl_is. The relative orientation between them is also obtained: if sco_ip = 1 and sco_is = 1, the relative orientation is set to 1; if sco_ip = 1 and sco_is = 0, it is set to 2; if sco_ip = 0 and sco_is = 0, it is set to 3; if sco_ip = 0 and sco_is = 1, it is set to 4.
Next, all local scaffolds that contain both c_a and c_b are found, and a relative orientation, a distance and a weight are computed from each of them. Since the orientation and order between two non-repeat contigs should be unique, if the relative orientations obtained from the local scaffolds differ, only the local scaffolds that show the most frequent relative orientation are retained and the information from the others is disregarded. The distance between the two contigs is then computed with the following formula:
[formula image not reproduced]
where ls_i is a retained local scaffold and n is the number of retained local scaffolds. If several relative orientations share the maximum frequency, no further processing is done; that is, no edge is added between the two contigs.
Because each retained local scaffold yields a weight, the method takes the maximum of these weights as the edge weight. Finally, an edge is added in the scaffold graph between the nodes corresponding to c_a and c_b, with the relative orientation, distance and weight obtained in the steps above.
Once all pairs of non-repeat contigs have been processed, the initial scaffold graph is complete.
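The way the observations from several local scaffolds are merged into one edge can be sketched as follows. The function names and the `(orientation, distance, weight)` observation layout are hypothetical. Because the distance formula is not reproduced in this text, the sketch assumes that the edge distance is the arithmetic mean over the n retained local scaffolds.

```python
from collections import Counter
from statistics import mean
from typing import List, Optional, Tuple

Observation = Tuple[int, float, int]  # (relative orientation, distance, weight)


def relative_direction(sco_a: int, sco_b: int) -> int:
    """Encode the pair of alignment directions as the codes 1..4 of the text."""
    return {(1, 1): 1, (1, 0): 2, (0, 0): 3, (0, 1): 4}[(sco_a, sco_b)]


def merge_observations(obs: List[Observation]) -> Optional[Tuple[int, float, int]]:
    """Combine per-local-scaffold observations for one contig pair."""
    ranked = Counter(direction for direction, _, _ in obs).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between orientations: no edge is added
    best = ranked[0][0]
    kept = [(g, w) for direction, g, w in obs if direction == best]
    distance = mean(g for g, _ in kept)  # assumed: mean over retained scaffolds
    weight = max(w for _, w in kept)     # maximum weight, as stated in the text
    return best, distance, weight


edge = merge_observations([(1, 100, 3000), (1, 140, 2500), (2, 90, 4000)])
tied = merge_observations([(1, 100, 1), (2, 90, 1)])
```

The minority observation with orientation 2 is discarded even though it carries the largest weight, because orientation is decided by frequency before weights are compared.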
Step 4) specifically comprises the following steps:
The edges of a simple path in the scaffold graph carry the order and orientation of the nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and all simple paths together form a scaffold set. At this point the scaffold set contains only non-repeat contigs. For two adjacent non-repeat contigs in a scaffold, the method searches the local scaffold set for all local scaffolds that contain both. When a local scaffold contains both and everything between them in that local scaffold is repeat contigs, those repeat contigs, with their orientation and order determined, form one insertion candidate. All insertion candidates are collected in this way from the local scaffolds that contain the two contigs. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs in the scaffold. If several candidates share the highest frequency, no insertion is performed for these two contigs. Once all adjacent non-repeat contigs have been processed, the final scaffold set is produced.
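The selection of an insertion candidate can be sketched as below. The function names and the `(contig, orientation)` layout are hypothetical. The sketch matches local scaffolds by the names of the two flanking contigs only, re-orients a run that was observed on the opposite strand, and counts only non-empty runs as candidates; all three are simplifying assumptions.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

LocalScaffold = List[Tuple[str, int]]  # (contig, orientation: 1 fwd, 0 rev)
Run = Tuple[Tuple[str, int], ...]      # repeat contigs to insert, in order


def insertion_candidate(ls: LocalScaffold, a: str, b: str,
                        repeats: Set[str]) -> Optional[Run]:
    """Repeat contigs lying between a and b in one local scaffold, if any."""
    names = [c for c, _ in ls]
    if a not in names or b not in names:
        return None
    i, j = names.index(a), names.index(b)
    if i < j:
        between = ls[i + 1:j]
    else:
        # Read seen from the other strand: reverse the run, flip orientations.
        between = [(c, 1 - o) for c, o in reversed(ls[j + 1:i])]
    if not between or any(c not in repeats for c, _ in between):
        return None  # empty run or a non-repeat contig in between
    return tuple(between)


def choose_insertion(local_scaffolds: List[LocalScaffold], a: str, b: str,
                     repeats: Set[str]) -> Optional[Run]:
    """Most frequent candidate, or None when there is none or a tie."""
    found = (insertion_candidate(ls, a, b, repeats) for ls in local_scaffolds)
    ranked = Counter(c for c in found if c is not None).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # several candidates share the top frequency
    return ranked[0][0]


repeats = {"r1", "r2"}
scaffolds = [
    [("a", 1), ("r1", 1), ("b", 1)],
    [("a", 1), ("r1", 1), ("b", 1)],
    [("a", 1), ("r2", 0), ("b", 1)],
    [("b", 0), ("r1", 0), ("a", 0)],  # same run, observed on the other strand
    [("a", 1), ("x", 1), ("b", 1)],   # x is not a repeat: no candidate
]
chosen = choose_insertion(scaffolds, "a", "b", repeats)
```

Requiring a unique winner mirrors the text: when the evidence is split evenly, leaving a gap between the two non-repeat contigs is preferred to inserting a possibly wrong repeat.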
Compared with the prior art, the advantageous effects of the present invention are as follows:
The invention discloses a scaffolding method based on long reads and contig classification. The method first aligns the long reads to the contig set and generates a set of local scaffolds from the alignment result; a local scaffold consists of the contigs aligned to the same long read. Based on the positions at which each contig occurs in the local scaffolds, all contigs are divided into two classes: repeat contigs and non-repeat contigs. A scaffold graph containing only non-repeat contigs is built, in which each node represents one non-repeat contig and each edge indicates that the two corresponding nodes occur together in the same local scaffold. Orientation and order conflicts in the scaffold graph are then eliminated by linear programming, and the graph is reduced so that it contains only simple paths, each of which corresponds to one scaffold. Finally, the repeat contigs are inserted into the scaffolds to form the final scaffolding result.
The invention is easy to use, gives good scaffolding results on different real data sets, and achieves higher accuracy and contiguity than other scaffolding methods.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 shows the correction of the alignment intervals between a long read and contigs in one embodiment of the present invention;
Fig. 3 shows, for one embodiment of the present invention, that only the contig with the longest alignment interval is kept.
Detailed description of the embodiments
As shown in Fig. 1, the present invention is implemented as follows:
1. Generate the local scaffold set
1.1 The method takes a contig file and a long-read file as input. The long reads are first aligned to the contigs with the alignment tool BWA to obtain the alignment result. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
1.2 If a long read and a contig align, the position of the alignment interval and the alignment direction are available. Suppose the j-th long read (lr_j) and the i-th contig (c_i) align; this means that an interval on lr_j aligns to an interval on c_i. The method uses SPR(c_i, lr_j) for the start position of the alignment interval on lr_j, EPR(c_i, lr_j) for the end position of the alignment interval on lr_j, SPC(c_i, lr_j) for the start position of the alignment interval on c_i, and EPC(c_i, lr_j) for the end position of the alignment interval on c_i. The alignment interval on lr_j is thus [SPR(c_i, lr_j), EPR(c_i, lr_j)] and the alignment interval on c_i is [SPC(c_i, lr_j), EPC(c_i, lr_j)]. Because the sequencing error rate of long reads is relatively high, the interval positions reported by the alignment tool often deviate somewhat. The method corrects the alignment intervals as follows.
If SPR (ci,lrj)<SPC(ci,lrj), then SPC ' (ci,lrj)=SPC (ci,lrj)-SPR(ci,lrj), SPR ' (ci,lrj)=0.
If SPR (ci,lrj)>=SPC (ci,lrj), then SPR ' (ci,lrj)=SPR (ci,lrj)-SPC(ci,lrj), SPC’(ci, lrj)=0.
If LEN (lrj)-EPR(ci,lrj)>LEN(ci)-EPC(ci,lrj), then EPC ' (ci,lrj)=LEN (ci) -1, EPR’(ci,lrj)=EPR (ci,lrj)+LEN(ci)-EPC(ci,lrj)。
If LEN (lrj)-EPR(ci,lrj)<=LEN (ci)-EPC(ci,lrj), then EPC ' (ci,lrj)=EPC (ci, lrj)+ LEN(lrj)-EPR(ci,lrj), EPR ' (ci,lrj)=LEN (lrj)-1.Fig. 2 is the long reading of the present invention and contig ratio To the amendment schematic diagram in section, in Fig. 2 (a), two contig:c1And c2Lr can be read with long1In comparison, wherein c1On Section [s1, e1] and lr1On section [s2, e2] corresponding, c2On section [s3, e3] and lr1On section [s4, e4] corresponding.Figure In 2 (b), the initial position and final position that compare section are modified.Due to s2<s1, so s1'=s1-s2, s2'=0; Due to LEN (c1)-e1<LEN(lr1)-e2, so e1'=LEN (c1) -1, e2'=e2+LEN(c1)-e1;Due to s3<s4, so s3'=0, s4'=s4-s3;Due to LEN (lr1)-e4<LEN(c2)-e3, so e3'=e3+ LEN(lr1)-e4, e4'=LEN (lr1)-1;
If | SPR (ci,lrj)-SPR’(ci,lrj)|<α, or | SPC (ci,lrj)-SPC’(ci,lrj)|<α, or | EPC(ci,lrj) -EPC’(ci,lrj)|<α, or | EPR (ci,lrj)-EPR’(ci,lrj)|<α then recognizes lrjAnd ciCan not compare To upper, this method ignores the comparison, wherein α=500.
Comparison section [SPR ' (c after above-mentioned amendment, between the available long reading of this method and contigi, lrj), EPR ' (ci, lrj)] and [SPC ' (ci,lrj), EPC ' (ci,lrj)], and compare direction.The value for comparing direction is 0 Or 1,0, which indicates reversed, compares, and 1, which indicates positive, compares.
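The four correction rules can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `correct_intervals` and its argument names are hypothetical, and the α filter is left out on purpose so that the sketch covers only the interval arithmetic.

```python
def correct_intervals(spr, epr, spc, epc, len_read, len_contig):
    """Apply the four correction rules to one read/contig alignment.

    spr, epr: start and end of the alignment interval on the long read.
    spc, epc: start and end of the alignment interval on the contig.
    Returns the corrected values (spr2, epr2, spc2, epc2).
    """
    # Start side: shift both starts left until one sequence reaches position 0.
    if spr < spc:
        spc2, spr2 = spc - spr, 0
    else:
        spr2, spc2 = spr - spc, 0
    # End side: shift both ends right until one sequence reaches its last base.
    if len_read - epr > len_contig - epc:
        epc2 = len_contig - 1
        epr2 = epr + len_contig - epc
    else:
        epc2 = epc + len_read - epr
        epr2 = len_read - 1
    return spr2, epr2, spc2, epc2
```

For example, a 5000-base contig whose interval [200, 3300] aligns to the interval [3000, 6000] of a 10000-base read is extended to cover the whole contig, which then occupies [2800, 7700] on the read.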
When only one contig aligns to a long read, that long read is not processed further. When two or more contigs align to the long read, they are sorted in ascending order of the start positions of their alignment intervals on the read. If several contigs align to the 5' end (or the 3' end) of the long read at the same time, only the contig with the longest alignment interval on the read is retained and the others are removed. Fig. 3 is a schematic of this step, in which six contigs, c_1, c_2, c_3, c_4, c_5 and c_6, align to the long read lr_1. Because c_1 and c_2 both align to the 5' end of lr_1, only c_1, which has the longest alignment interval, is retained and the alignment information of c_2 is removed. Because c_5 and c_6 both align to the 3' end of lr_1, only c_5, which has the longest alignment interval, is retained and the alignment information of c_6 is removed.
When a long read aligns to several contigs, a local scaffold is generated. The local scaffold consists of the contigs aligned to that read, and the order and direction of these contigs are thereby determined. A local scaffold can be represented as a sequence of nodes, each of which is a four-tuple. The i-th local scaffold is written (s_i1, s_i2, s_i3, ..., s_im), where m is the number of contigs it contains and lr_i is its long read. s_ij is a four-tuple (sc_ij, sco_ij, scg_ij, scl_ij), where sc_ij is the corresponding contig; sco_ij is the alignment direction between this contig and lr_i; scg_ij is the distance on lr_i between this contig and the next contig, i.e. SPR(sc_i(j+1), lr_i) - EPR(sc_ij, lr_i); and scl_ij is the length of the alignment interval on lr_i, i.e. EPR(sc_ij, lr_i) - SPR(sc_ij, lr_i).
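The construction of one local scaffold from the alignments of a single read can be sketched as follows. The `Hit` record and `build_local_scaffold` are hypothetical names. Two points are assumptions of this sketch rather than statements of the text: a contig is treated as aligning to the 5' end of the read when its corrected read start is 0 (and to the 3' end when its corrected read end is the last base), and the gap of the last node is set to 0.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    contig: str   # contig identifier
    forward: int  # alignment direction: 1 = forward, 0 = reverse
    spr: int      # corrected start of the alignment interval on the read
    epr: int      # corrected end of the alignment interval on the read

def build_local_scaffold(hits, read_len):
    """Return a list of (contig, direction, gap_to_next, aligned_len), or None."""
    def longest(group):
        return max(group, key=lambda h: h.epr - h.spr)

    # Assumed reading of "aligns to the 5'/3' end": the corrected interval
    # touches the first or the last base of the read.
    at_5 = [h for h in hits if h.spr == 0]
    at_3 = [h for h in hits if h.epr == read_len - 1]
    dropped = set()
    if len(at_5) > 1:
        dropped |= set(at_5) - {longest(at_5)}
    if len(at_3) > 1:
        dropped |= set(at_3) - {longest(at_3)}

    kept = sorted((h for h in hits if h not in dropped), key=lambda h: h.spr)
    if len(kept) < 2:
        return None  # fewer than two contigs: the read is not used further

    scaffold = []
    for k, h in enumerate(kept):
        # Gap on the read to the next contig; 0 for the last node (assumption).
        gap = kept[k + 1].spr - h.epr if k + 1 < len(kept) else 0
        scaffold.append((h.contig, h.forward, gap, h.epr - h.spr))
    return scaffold
```

Applied to a read like the one in Fig. 3, the shorter of the two contigs competing at each read end is discarded before the four-tuples are built.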
In this local scaffold, for two adjacent contigs sc_ij and sc_i(j+1): if sco_ij = 1 and sco_i(j+1) = 1, the 3' end of sc_ij is connected to the 5' end of sc_i(j+1); if sco_ij = 1 and sco_i(j+1) = 0, the 3' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 0, the 5' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 1, the 5' end of sc_ij is connected to the 5' end of sc_i(j+1).
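The four cases reduce to one observation: a forward-aligned contig leaves the read through its 3' end and is entered through its 5' end, and a reverse-aligned contig does the opposite. A minimal sketch, with the hypothetical helper name `joined_ends`:

```python
def joined_ends(sco_a, sco_b):
    """Return the facing ends of two consecutive contigs in a local scaffold."""
    end_a = "3'" if sco_a == 1 else "5'"  # trailing end of the first contig
    end_b = "5'" if sco_b == 1 else "3'"  # leading end of the second contig
    return end_a, end_b
```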
After the alignment information between all long reads and contigs has been processed, a set of local scaffolds has been generated. Each local scaffold records the direction, order and distance information among some of the contigs.
II. Contig classification
From the local scaffold set generated in the steps above, the method finds the contigs that satisfy either of the following two conditions. (1) The contig occurs at an internal position of at least two local scaffolds, i.e. in such a local scaffold it is neither the first nor the last contig, and the contigs adjacent to its 5' end (or 3' end) are not all the same across the local scaffolds in which it occurs. (2) The length of the contig is less than MIN, where MIN = 2000. These contigs are classified as repeat contigs, and the remaining contigs are classified as non-repeat contigs.
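The two conditions can be sketched as a single pass over the local scaffolds. The names `classify_contigs` and `MIN_LEN` are hypothetical. The sketch assumes that, for a reverse-aligned contig, the left neighbour on the read sits at its 3' end; neighbours are compared by contig identifier only.

```python
from collections import defaultdict

MIN_LEN = 2000  # MIN in the text

def classify_contigs(local_scaffolds, lengths):
    """local_scaffolds: list of lists of (contig, direction) in read order.
    lengths: {contig: length}. Returns (repeats, non_repeats) as sets."""
    middle_count = defaultdict(int)
    nbr5 = defaultdict(set)  # contigs seen next to the 5' end
    nbr3 = defaultdict(set)  # contigs seen next to the 3' end
    for ls in local_scaffolds:
        for k in range(1, len(ls) - 1):  # internal positions only
            contig, direction = ls[k]
            left, right = ls[k - 1][0], ls[k + 1][0]
            if direction == 0:  # reverse alignment: ends are swapped (assumption)
                left, right = right, left
            middle_count[contig] += 1
            nbr5[contig].add(left)
            nbr3[contig].add(right)
    repeats = set()
    for contig, length in lengths.items():
        ambiguous = middle_count[contig] >= 2 and (
            len(nbr5[contig]) > 1 or len(nbr3[contig]) > 1)
        if length < MIN_LEN or ambiguous:
            repeats.add(contig)
    return repeats, set(lengths) - repeats
```

A contig that always has the same neighbours stays non-repeat even when some reads cover it in the reverse direction, because the neighbours are swapped back before they are recorded.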
III. Building and optimizing the scaffold graph
3.1 Building the initial scaffold graph
3.1.1) A scaffold graph consists of a node set and an edge set. Each node corresponds to one contig. An edge e_ij is represented by a five-tuple (c_i, c_j, o_ij, g_ij, w_ij), where c_i and c_j are the two contigs connected by the edge; o_ij is the relative direction between the two contigs, i.e. which end (5' or 3') of c_i is adjacent to which end (5' or 3') of c_j; g_ij is the distance between the two contigs; and w_ij is the weight of the edge. A node is first created for each non-repeat contig, forming the node set of the initial scaffold graph.
3.1.2) The local scaffold set is searched for pairs of non-repeat contigs that satisfy the following condition: they are adjacent in some local scaffold, or all the contigs lying between them are repeat contigs.
Consider two non-repeat contigs c_a and c_b that satisfy this condition, and suppose the i-th local scaffold ls_i contains both of them. ls_i is written (s_i1, s_i2, s_i3, ..., s_im), and c_a and c_b correspond to sc_ip and sc_is respectively. If repeat contigs lie between the two contigs in ls_i, the distance between them is calculated with the following formula.
where GD(sc_ip, sc_is, lr_i) denotes the distance between sc_ip and sc_is on the corresponding long read, and LEN(sc_ij) is the length of the contig sc_ij.
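The formula itself does not survive in this text. A reconstruction consistent with the definitions of scg_ij and LEN, offered as an assumption rather than as the original expression, adds the gaps along the read from sc_ip to sc_is to the lengths of the repeat contigs lying between them:

```latex
GD(sc_{ip}, sc_{is}, lr_i) \;=\; \sum_{j=p}^{s-1} scg_{ij} \;+\; \sum_{j=p+1}^{s-1} LEN(sc_{ij})
```

When s = p + 1 the second sum is empty and the expression reduces to scg_ip, which matches the adjacent case.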
If no repeat contig lies between them, i.e. s = p + 1, the distance between them is scg_ip. A weight is obtained at the same time, equal to the minimum of scl_ip and scl_is. The relative direction between them is also obtained: if sco_ip = 1 and sco_is = 1, the relative direction is set to 1, i.e. the 3' end of c_a is adjacent to the 5' end of c_b; if sco_ip = 1 and sco_is = 0, the relative direction is set to 2, i.e. the 3' end of c_a is adjacent to the 3' end of c_b; if sco_ip = 0 and sco_is = 0, the relative direction is set to 3, i.e. the 5' end of c_a is adjacent to the 3' end of c_b; if sco_ip = 0 and sco_is = 1, the relative direction is set to 4, i.e. the 5' end of c_a is adjacent to the 5' end of c_b. When sco_ip and sco_is are equal, c_a and c_b are in the same direction; otherwise they are in different directions.
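The three quantities obtained from one local scaffold can be computed together, as in the sketch below. The function name `edge_from_local_scaffold` is hypothetical. The distance for the non-adjacent case is an assumption of this sketch (the gaps along the read plus the lengths of the contigs in between), since the corresponding formula is not reproduced in this text.

```python
def edge_from_local_scaffold(ls, p, s, contig_len):
    """ls: list of (contig, direction, gap_to_next, aligned_len) in read order.
    p < s are the positions of two non-repeat contigs in ls.
    Returns (relative_direction, distance, weight)."""
    between = ls[p + 1:s]
    # Assumed distance: all gaps from p to s plus lengths of contigs in between.
    distance = (sum(node[2] for node in ls[p:s])
                + sum(contig_len[node[0]] for node in between))
    weight = min(ls[p][3], ls[s][3])  # smaller of the two aligned lengths
    code = {(1, 1): 1, (1, 0): 2, (0, 0): 3, (0, 1): 4}
    return code[(ls[p][1], ls[s][1])], distance, weight
```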
All local scaffolds that contain both c_a and c_b are then found, and the relative direction, distance and weight between the two contigs are calculated from each of them with the method above. Because the direction and order between two non-repeat contigs should be unique, if the relative directions obtained from these local scaffolds are not all the same, only the local scaffolds that give the most frequent relative direction are retained and the information from the remaining local scaffolds is disregarded. The distance between the two contigs is then calculated with the following formula:
where ls_i is a retained local scaffold and n is the number of retained local scaffolds. If several relative directions share the highest frequency, no further processing is done, i.e. no edge is added between the two contigs.
Because each retained local scaffold yields a weight, the method takes the maximum of these weights as the weight. Finally, an edge is added in the scaffold graph between the nodes corresponding to c_a and c_b, with the relative direction, distance and weight obtained in the steps above.
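Combining the per-scaffold observations into one edge can be sketched as follows. The name `consensus_edge` is hypothetical, and taking the arithmetic mean of the retained distances is an assumption, because the distance formula is not reproduced in this text; the tie rule and the maximum weight follow the description.

```python
from collections import Counter

def consensus_edge(observations):
    """observations: list of (relative_direction, distance, weight), one per
    local scaffold containing both contigs.
    Returns (direction, distance, weight), or None when no edge is added."""
    if not observations:
        return None
    ranked = Counter(direction for direction, _, _ in observations).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # several directions share the highest frequency
    best = ranked[0][0]
    kept = [obs for obs in observations if obs[0] == best]
    distance = sum(d for _, d, _ in kept) / len(kept)  # assumed: mean distance
    weight = max(w for _, _, w in kept)
    return best, distance, weight
```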
After all pairs of non-repeat contigs have been processed, the initial scaffold graph is complete.
3.2 Optimizing the scaffold graph
3.2.1) Eliminating direction conflicts
Let O_i ∈ {0, 1} denote the direction of c_i, where 0 is forward and 1 is reverse. In the scaffold graph, if the relative direction of an edge equals 1 or 3, the edge constrains its two contigs to have the same direction; otherwise the edge constrains the two contigs to have opposite directions. Once the direction of one node is fixed, the directions of the other nodes connected to it can be derived along the paths of the scaffold graph. The scaffold graph, however, often contains direction conflicts, i.e. different paths often imply different directions for the same node. The method uses integer linear programming to detect direction conflicts and deletes some edges so that the scaffold graph contains no direction conflict.
Each edge constrains the directions of its two nodes. If c_i and c_j are in different directions, the method sets the constraint:
If c_i and c_j are in the same direction, the constraint is:
where η_ij is a slack variable, η_ij ∈ {0, 1};
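The two constraint expressions do not survive in this text. One formulation consistent with the definitions of O_i and η_ij is given below as an assumption, not as the original expressions: with η_ij = 1 the edge's constraint is enforced, and with η_ij = 0 it is relaxed.

```latex
% c_i and c_j in different directions
O_i + O_j \le 2 - \eta_{ij}, \qquad O_i + O_j \ge \eta_{ij}
% c_i and c_j in the same direction
O_i - O_j \le 1 - \eta_{ij}, \qquad O_j - O_i \le 1 - \eta_{ij}
```

With η_ij = 1 the first pair forces O_i + O_j = 1 and the second pair forces O_i = O_j; with η_ij = 0 both pairs hold for any 0/1 assignment.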
The objective function is:
MAX(∑ w_ij · η_ij)
where w_ij is the weight of the edge between c_i and c_j, and MAX(∑ w_ij · η_ij) means finding the values of η_ij that maximize the function.
After the optimal solution is obtained, every node has been assigned a direction. An edge whose η_ij is not equal to 1 is considered to cause a direction conflict and is deleted.
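The effect of this optimization can be illustrated on a tiny graph without an ILP solver. The sketch below replaces integer linear programming with exhaustive search over all 0/1 direction assignments, which pursues the same objective (maximum total weight of satisfied edges) but is only feasible for a handful of nodes. The name `resolve_directions` and the edge encoding are hypothetical.

```python
from itertools import product

def resolve_directions(nodes, edges):
    """edges: list of (ci, cj, same_direction, weight), where same_direction
    is True for relative direction 1 or 3 and False for 2 or 4.
    Returns (direction of each node, edges that conflict and are deleted)."""
    best_score, best_assign = -1, None
    for bits in product((0, 1), repeat=len(nodes)):
        assign = dict(zip(nodes, bits))
        # Total weight of the edges whose constraint this assignment satisfies.
        score = sum(w for ci, cj, same, w in edges
                    if (assign[ci] == assign[cj]) == same)
        if score > best_score:
            best_score, best_assign = score, assign
    conflicts = [e for e in edges
                 if (best_assign[e[0]] == best_assign[e[1]]) != e[2]]
    return best_assign, conflicts
```

In a triangle where two heavy edges require equal directions and one light edge requires opposite directions, the light edge is the one reported as conflicting.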
3.2.2) Eliminating position conflicts
In the scaffold graph, each edge also specifies the distance between two contigs. The method assigns a start position to each contig such that the assigned positions satisfy, as far as possible, the distances between contigs specified by the edges.
Let X_i ∈ [0, C] denote the start coordinate of c_i in the forward direction. Because the direction of every contig was computed when the direction conflicts were resolved in the previous step, for a contig with direction 0 the start position is the position of its 5' end, and for a contig with direction 1 the start position is the position of its 3' end. X_i is an integer, and C is twice the sum of the lengths of all nodes.
For the edge between c_i and c_j, the order constraint is:
The objective function is:
MAX(∑ w_ij · Φ_ij)
where X_i, X_j ∈ [0, C] are the integer forward start coordinates assigned to nodes c_i and c_j, and Φ_ij ∈ [0, 1] is a slack variable that reflects the difference between the distance between c_i and c_j specified by the corresponding edge and the distance obtained from the assigned coordinates. The smaller this difference, the closer Φ_ij is to 1.
After the optimal solution is obtained, an edge e_ij is considered to cause a conflict and is deleted if ||X_i - X_j| - G_ij| / 1000 > β, where β = 3.
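The final deletion test is a simple filter once the coordinates are known. The sketch below assumes the coordinates X have already been produced by the solver and that G_ij is the distance value stored for the edge; `position_conflicts` is a hypothetical name.

```python
BETA = 3  # β in the text

def position_conflicts(x, edges, beta=BETA):
    """x: {contig: assigned start coordinate}; edges: list of (ci, cj, G_ij).
    Returns the edges that disagree with the coordinates and are deleted."""
    return [(ci, cj, g) for ci, cj, g in edges
            if abs(abs(x[ci] - x[cj]) - g) / 1000 > beta]
```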
3.2.3) Eliminating branching edges
After the steps above, if the 5' end (or the 3' end) of a node is connected by edges to several other nodes, only the edge with the largest weight is retained and the remaining edges are deleted.
After this processing, the scaffold graph contains only simple paths.
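Branch removal can be sketched by grouping edges by the contig end they attach to. The names `prune_branches` and `END_PAIRS` are hypothetical; the mapping from relative direction to contig ends follows the four cases defined for edges in section 3.1.

```python
from collections import defaultdict

# Relative direction -> (end of ci, end of cj) joined by the edge.
END_PAIRS = {1: ("3'", "5'"), 2: ("3'", "3'"), 3: ("5'", "3'"), 4: ("5'", "5'")}

def prune_branches(edges):
    """edges: list of (ci, cj, relative_direction, distance, weight).
    Keeps, at every contig end, only the incident edge with the largest weight."""
    by_end = defaultdict(list)
    for idx, (ci, cj, rel, _, _) in enumerate(edges):
        end_i, end_j = END_PAIRS[rel]
        by_end[(ci, end_i)].append(idx)
        by_end[(cj, end_j)].append(idx)
    removed = set()
    for incident in by_end.values():
        if len(incident) > 1:
            keep = max(incident, key=lambda k: edges[k][4])
            removed.update(k for k in incident if k != keep)
    return [e for k, e in enumerate(edges) if k not in removed]
```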
IV. Generating the scaffolds
The edges of a simple path in the scaffold graph carry the order and direction of the nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and all simple paths together form a scaffold set. At this point the scaffold set contains only non-repeat contigs.
The method next inserts the repeat contigs into the scaffold set. For two adjacent non-repeat contigs in a scaffold, the method searches the local scaffold set for all local scaffolds that contain both of them. When a local scaffold contains both and everything between them in that local scaffold is a repeat contig, these repeat contigs, with their determined direction and order, form one insertion candidate. All local scaffolds containing the two non-repeat contigs are examined and all insertion candidates are determined. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs. If several insertion candidates share the highest frequency, nothing is inserted between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
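The choice among insertion candidates is a majority vote with a tie rule. A minimal sketch, with the hypothetical name `choose_insertion` and each candidate encoded as a tuple of (repeat contig, direction) pairs so that it can be counted:

```python
from collections import Counter

def choose_insertion(candidates):
    """candidates: one tuple of (repeat_contig, direction) pairs per supporting
    local scaffold. Returns the uniquely most frequent candidate, or None."""
    if not candidates:
        return None
    ranked = Counter(candidates).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # several candidates share the highest frequency
    return ranked[0][0]
```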
V. Experimental verification
5.1 Datasets and evaluation metrics
To verify the effectiveness of the method, it was tested on the contig datasets and long-read datasets of two species and compared with three other currently popular scaffolding methods. The two species are Escherichia coli (E. coli) and Saccharomyces cerevisiae (S. cerevisiae). Each species has two different contig sets, generated by different assembly tools. The two contig sets for E. coli are contig set 1 and contig set 2, and the two contig sets for S. cerevisiae are contig set 3 and contig set 4. Each species also has two long-read sets, obtained with two different third-generation sequencing technologies (Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing). Details of the long-read sets are given in Table 1. To evaluate the accuracy of the scaffolding methods, the scaffolding results are evaluated with the QUAST tool. The main evaluation metrics are given in Table 2.
Table 1. Long-read datasets
5.2 Comparison between scaffolding methods
The method is compared with three other popular scaffolding methods: SSPACE-LongRead, LINKS and npScarf. The method of the invention is named SLR.
Table 2. QUAST evaluation metrics
Table 3. Scaffolding evaluation results based on the single-molecule real-time sequencing long-read sets
5.2.1 Scaffolding evaluation results based on the single-molecule real-time sequencing long-read sets
Scaffolding was first carried out with the long-read sets obtained by single-molecule real-time sequencing; the evaluation results are given in Table 3. The method has the fewest errors on all datasets, which shows that it is more accurate than the other methods. The contiguity of a scaffolding result is usually evaluated with NA50 and NGA50. The method has the best NA50 and NGA50 on the first and third datasets and the best NA50 on the fourth dataset.
5.2.2 Scaffolding evaluation results based on the Oxford Nanopore long-read sets
Scaffolding was then carried out with the long-read sets obtained by Oxford Nanopore sequencing; the evaluation results are given in Table 4. The method again has the fewest errors on all datasets. It has the best NA50 and NGA50 on the first and second datasets and the best NGA50 on the third dataset.
Table 4. Scaffolding evaluation results based on the Oxford Nanopore long-read sets

Claims (5)

1. A scaffolding method based on long reads and contig classification, characterized by comprising the following steps:
1) aligning the long reads to the contig set and generating a local scaffold set;
1.1) Using the alignment tool BWA, the long-read set is aligned to the contig set to produce the alignment results. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
1.2) For each long read, all contigs that align to it are extracted and the positions of the alignment intervals are calculated. If no contig or only one contig aligns, the long read is not processed further. If two or more contigs align, the order and direction of these contigs are determined from the alignment positions and directions between the long read and the contigs, and a local scaffold is generated. After all long reads have been processed, a local scaffold set has been generated.
2) classifying the contigs;
If a contig occurs at an internal position of two or more local scaffolds (i.e. in such a local scaffold it is neither the first nor the last contig), and the contigs immediately adjacent to its 5' end (or 3' end) are not all the same in the different local scaffolds, the contig is a repeat contig. A contig whose length is less than MIN, where MIN = 2000, is also considered a repeat contig. The remaining contigs are non-repeat contigs. After all contigs have been processed, they are divided into two classes: repeat contigs and non-repeat contigs.
3) building and optimizing the scaffold graph;
3.1) A node is first created for each non-repeat contig. For each pair of non-repeat contigs, it is determined whether they occur together in the same local scaffold; if so, the direction and order between the two non-repeat contigs are determined from the alignment information and the distance between them is calculated. It is then decided whether an edge can be added between them, and the weight of the edge is determined. After all pairs of nodes have been processed, the initial scaffold graph is complete.
3.2) Each edge of the scaffold graph constrains the direction, order and distance between the two nodes it connects. A linear programming model is therefore built from all edges of the scaffold graph to detect and remove the edges that cause direction and order conflicts, ensuring that the scaffold graph contains no direction or order conflict.
3.3) After the conflicts have been eliminated, if several nodes in the scaffold graph are connected to the 5' end (or the 3' end) of the same node, only the edge with the largest weight is retained and the remaining edges are removed. After these steps, the scaffold graph contains only simple paths.
4) generating the scaffold set;
A simple path in the scaffold graph carries the order and direction of the nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and a scaffold set is generated. For two adjacent non-repeat contigs in a scaffold, if a local scaffold contains both and everything between them in that local scaffold is a repeat contig, these repeat contigs, with their determined direction and order, form one insertion candidate. If the two non-repeat contigs occur in several local scaffolds, each local scaffold yields one insertion candidate, and the insertion candidate with the highest frequency is inserted between the two non-repeat contigs. If several insertion candidates share the highest frequency, nothing is inserted between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
Here a contig is a genomic sequence fragment; a scaffold is an extra-long genomic sequence fragment; and a scaffolding method is a method that determines the direction of each contig and the order of the contigs on the genome sequence so as to produce extra-long genomic sequence fragments, i.e. scaffolds. The left end of a sequence is its 5' end and the right end is its 3' end.
2. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 1.2) is specifically:
1.2.1) From the alignment results produced by the alignment tool BWA, if a long read and a contig can be aligned, the positions of the alignment intervals and the alignment direction are obtained. Suppose the j-th long read (lr_j) and the i-th contig (c_i) can be aligned, the alignment interval on lr_j is [SPR(c_i, lr_j), EPR(c_i, lr_j)], and the alignment interval on c_i is [SPC(c_i, lr_j), EPC(c_i, lr_j)]. Because the sequencing error rate of long reads is relatively high, the interval positions reported by the alignment tool often deviate somewhat. The method corrects the alignment intervals with the following steps.
If SPR(c_i, lr_j) < SPC(c_i, lr_j), then SPC'(c_i, lr_j) = SPC(c_i, lr_j) - SPR(c_i, lr_j) and SPR'(c_i, lr_j) = 0.
If SPR(c_i, lr_j) >= SPC(c_i, lr_j), then SPR'(c_i, lr_j) = SPR(c_i, lr_j) - SPC(c_i, lr_j) and SPC'(c_i, lr_j) = 0.
If LEN(lr_j) - EPR(c_i, lr_j) > LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = LEN(c_i) - 1 and EPR'(c_i, lr_j) = EPR(c_i, lr_j) + LEN(c_i) - EPC(c_i, lr_j).
If LEN(lr_j) - EPR(c_i, lr_j) <= LEN(c_i) - EPC(c_i, lr_j), then EPC'(c_i, lr_j) = EPC(c_i, lr_j) + LEN(lr_j) - EPR(c_i, lr_j) and EPR'(c_i, lr_j) = LEN(lr_j) - 1.
If |SPR(c_i, lr_j) - SPR'(c_i, lr_j)| < α, or |SPC(c_i, lr_j) - SPC'(c_i, lr_j)| < α, or |EPC(c_i, lr_j) - EPC'(c_i, lr_j)| < α, or |EPR(c_i, lr_j) - EPR'(c_i, lr_j)| < α, then lr_j and c_i are considered not to align and the method ignores the alignment, where α is a parameter (α = 500).
After this correction, the method has the alignment intervals between the long read and the contig, together with the alignment direction.
1.2.2) If only one contig aligns to a long read, the long read is not processed further. If two or more contigs align to the long read, they are sorted in ascending order of the start positions of their alignment intervals on the read. If several contigs align to the 5' end (or the 3' end) of the long read at the same time, only the contig with the longest alignment interval is retained and the others are removed. Suppose the long read lr_j aligns to several contigs; it then corresponds to one local scaffold, which can be represented as a sequence of nodes, each of which is a four-tuple. The i-th local scaffold is written (s_i1, s_i2, s_i3, ..., s_im), where m is the number of contigs it contains. s_ij is a four-tuple (sc_ij, sco_ij, scg_ij, scl_ij), where sc_ij is the corresponding contig; sco_ij is the alignment direction between this contig and the long read, with value 0 or 1, where 0 denotes a reverse alignment and 1 a forward alignment; scg_ij is the distance on the long read between this contig and the next contig, i.e. SPR(sc_i(j+1), lr_j) - EPR(sc_ij, lr_j), with scg_im set to 0; and scl_ij is the size of the alignment interval on this contig, EPC(sc_ij, lr_j) - SPC(sc_ij, lr_j).
In this local scaffold, for two adjacent contigs sc_ij and sc_i(j+1): if sco_ij = 1 and sco_i(j+1) = 1, the 3' end of sc_ij is connected to the 5' end of sc_i(j+1); if sco_ij = 1 and sco_i(j+1) = 0, the 3' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 0, the 5' end of sc_ij is connected to the 3' end of sc_i(j+1); if sco_ij = 0 and sco_i(j+1) = 1, the 5' end of sc_ij is connected to the 5' end of sc_i(j+1).
After the alignment information between all long reads and contigs has been processed, a local scaffold set has been generated. Each local scaffold records the direction, order and distance information among some of the contigs.
3. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 2) is specifically:
From the local scaffold set generated in the steps above, the contigs that satisfy either of the following two conditions are found. (1) The contig occurs at an internal position of at least two local scaffolds, i.e. in such a local scaffold it is neither the first nor the last contig, and the contigs immediately adjacent to its 5' end (or 3' end) are not all the same across the local scaffolds in which it occurs. (2) The length of the contig is less than MIN, where MIN = 2000. These contigs are classified as repeat contigs, and the remaining contigs are classified as non-repeat contigs.
4. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 3.1) is specifically:
3.1.1) A scaffold graph consists of a node set and an edge set. Each node corresponds to one contig. An edge e_ij is represented by a five-tuple (c_i, c_j, o_ij, g_ij, w_ij), where c_i and c_j are the two contigs connected by the edge, o_ij is the relative direction between the two contigs, g_ij is the distance between the two contigs, and w_ij is the weight of the edge. A node is first created for each non-repeat contig.
3.1.2) The local scaffold set is searched for pairs of non-repeat contigs that satisfy the following condition: they are adjacent in some local scaffold, or all the contigs lying between them are repeat contigs.
Consider two non-repeat contigs c_a and c_b that satisfy this condition, and suppose the i-th local scaffold ls_i contains both of them. ls_i is written (s_i1, s_i2, s_i3, ..., s_im), and c_a and c_b correspond to sc_ip and sc_is respectively. If repeat contigs lie between the two contigs in ls_i, the distance between them is calculated with the following formula.
where GD(sc_ip, sc_is, lr_i) denotes the distance between sc_ip and sc_is on the corresponding long read, and LEN(sc_ij) is the length of the contig sc_ij.
If no repeat contig lies between them, i.e. s = p + 1, the distance between them is scg_ip. A weight is obtained at the same time, equal to the minimum of scl_ip and scl_is. The relative direction between them is also obtained: if sco_ip = 1 and sco_is = 1, the relative direction is set to 1; if sco_ip = 1 and sco_is = 0, the relative direction is set to 2; if sco_ip = 0 and sco_is = 0, the relative direction is set to 3; if sco_ip = 0 and sco_is = 1, the relative direction is set to 4.
All local scaffolds that contain both c_a and c_b are then found, and the relative direction, distance and weight between the two contigs are calculated from each of them. Because the direction and order between two non-repeat contigs should be unique, if the relative directions obtained from these local scaffolds differ, only the local scaffolds that give the most frequent relative direction are retained and the information from the remaining local scaffolds is disregarded. The distance between the two contigs is then calculated with the following formula:
where ls_i is a retained local scaffold and n is the number of retained local scaffolds. If several relative directions share the highest frequency, no further processing is done, i.e. no edge is added between the two contigs.
Because each retained local scaffold yields a weight, the method takes the maximum of these weights as the weight. Finally, an edge is added in the scaffold graph between the nodes corresponding to c_a and c_b, with the relative direction, distance and weight obtained in the steps above.
After all pairs of non-repeat contigs have been processed, the initial scaffold graph is complete.
5. The scaffolding method based on long reads and contig classification according to claim 1, characterized in that step 4) is specifically:
The edges of a simple path in the scaffold graph carry the order and direction of the nodes and the distances between adjacent nodes, so each simple path corresponds to one scaffold, and all simple paths together form a scaffold set. At this point the scaffold set contains only non-repeat contigs. For two adjacent non-repeat contigs in a scaffold, the method searches the local scaffold set for all local scaffolds that contain both of them. When a local scaffold contains both and everything between them in that local scaffold is a repeat contig, these repeat contigs, with their determined direction and order, form one insertion candidate. All insertion candidates are found in the local scaffolds that contain the two contigs. The insertion candidate with the highest frequency is inserted between the two non-repeat contigs in the scaffold. If several insertion candidates share the highest frequency, nothing is inserted between the two non-repeat contigs. After all adjacent non-repeat contigs have been processed, the final scaffold set is generated.
Application number: CN201810642753.0A
Priority date: 2018-06-21
Filing date: 2018-06-21
Title: A scaffolding method based on long reads and contig classification
Publication number: CN108830047A (en)
Publication date: 2018-11-16
Legal status: Pending
Family ID: 64142845
Country: CN
Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109935274A (en) * 2019-03-01 2019-06-25 河南大学 A kind of long reading overlay region detection method based on k-mer distribution characteristics

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110257889A1 (en) * 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
US20130317755A1 (en) * 2012-05-04 2013-11-28 New York University Methods, computer-accessible medium, and systems for score-driven whole-genome shotgun sequence assembly
CN104200133A (en) * 2014-09-19 2014-12-10 中南大学 Read and distance distribution based genome De novo sequence splicing method
US20150178446A1 (en) * 2013-12-18 2015-06-25 Pacific Biosciences Of California, Inc. Iterative clustering of sequence reads for error correction
CN104951672A (en) * 2015-06-19 2015-09-30 中国科学院计算技术研究所 Splicing method and system of second generation and third generation genomic sequencing data combination
CN106021978A (en) * 2016-04-06 2016-10-12 晶能生物技术(上海)有限公司 Assembling method for de novo sequencing data based on optics map platform Irys
CN106355000A (en) * 2016-08-25 2017-01-25 中南大学 Scaffolding method based on statistical characteristic of double-end insert size

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110257889A1 (en) * 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
US20130317755A1 (en) * 2012-05-04 2013-11-28 New York University Methods, computer-accessible medium, and systems for score-driven whole-genome shotgun sequence assembly
US20150178446A1 (en) * 2013-12-18 2015-06-25 Pacific Biosciences Of California, Inc. Iterative clustering of sequence reads for error correction
CN104200133A (en) * 2014-09-19 2014-12-10 中南大学 Read and distance distribution based genome De novo sequence splicing method
CN104951672A (en) * 2015-06-19 2015-09-30 中国科学院计算技术研究所 Splicing method and system of second generation and third generation genomic sequencing data combination
CN106021978A (en) * 2016-04-06 2016-10-12 晶能生物技术(上海)有限公司 Assembling method for de novo sequencing data based on optics map platform Irys
CN106355000A (en) * 2016-08-25 2017-01-25 Central South University Scaffolding method based on statistical characteristic of double-end insert size

Non-Patent Citations (5)

Title
HSIN-HUNG LIN et al.: "Evaluation and Validation of Assembling Corrected PacBio Long Reads for Microbial Genome Completion via Hybrid Approaches", PLOS ONE *
LUO J: "BOSS: a novel scaffolding algorithm based on an optimized scaffold graph", Bioinformatics *
RAJAT S. ROY et al.: "SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding", arXiv:1111.1426v2 *
YANG Fan (杨帆): "Research on a BWT-based DNA contig sequence merging algorithm", China Master's Theses Full-text Database, Basic Sciences *
MA Yunyun (马云云): "Research and implementation of a contig assembly algorithm for next-generation DNA sequencing data", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN109935274A (en) * 2019-03-01 2019-06-25 Henan University Long-read overlap region detection method based on k-mer distribution characteristics
CN109935274B (en) * 2019-03-01 2021-04-30 Henan University Long-read overlap region detection method based on k-mer distribution characteristics

Similar Documents

Publication Publication Date Title
CA2424031C (en) System and process for validating, aligning and reordering genetic sequence maps using ordered restriction map
Jansson et al. Algorithms for combining rooted triplets into a galled phylogenetic network
CN109241355A (en) Reachability query method, system, and readable storage medium for directed acyclic graphs
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
CN104700311B (en) Neighborhood-following community discovery method in social networks
CN108830047A (en) Scaffolding method based on long reads and contig classification
El-Mabrouk et al. Gene family evolution—an algorithmic framework
CN106355000B (en) Scaffolding method based on statistical characteristics of paired-end read insert size
CN110046265B (en) Subgraph query method based on double-layer index
CN108491687B (en) Scaffolding method based on contig quality evaluation classification and graph optimization
Walve et al. Kermit: linkage map guided long read assembly
Horesh et al. Designing an A* algorithm for calculating edit distance between rooted-unordered trees
CN111462812A (en) Multi-target phylogenetic tree construction method based on feature hierarchy
CN114896480B (en) Top-K space keyword query method based on road network index
Ndiaye et al. When less is more: sketching with minimizers in genomics
CN113312488B (en) Knowledge graph processing method and device
EP3663890A1 (en) Alignment method, device and system
CN102750460A (en) Method for hierarchical simplification of large-scale graph data
CN113392279A (en) Similar directed subgraph searching method and system based on subjective logic and feedforward neural network
Nakhleh et al. Phylogenetic networks: Properties and relationship to trees and clusters
US8428885B2 (en) Virtual screening of chemical spaces
Sundararajan et al. Chaining algorithms for alignment of draft sequence
CN114580779B (en) Block chain transaction behavior prediction method based on graph feature extraction
Zou et al. The Improved Needleman-Wunsch Algorithm Based on the Genetic Process of Sequences
Oehl A combinatorial approach for reconstructing rDNA repeats

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181116