CN104750765A

CN104750765A - Genome sequencing data sequence assembling method

Info

Publication number: CN104750765A
Application number: CN201410096283.4A
Authority: CN
Inventors: 孙际宾; 李澎鹏; 郑平; 马延和
Original assignee: Tianjin Institute of Industrial Biotechnology of CAS
Current assignee: Tiangong Biotechnology (Tianjin) Co.,Ltd.
Priority date: 2013-12-30
Filing date: 2014-03-17
Publication date: 2015-07-01
Anticipated expiration: 2034-03-17
Also published as: CN104750765B

Abstract

An embodiment of the invention provides a genome sequencing data sequence assembly method that combines the advantages of de novo sequencing with those of resequencing to assemble genome sequencing data sequences effectively. Two inputs are known in advance: a draft sequence traversal path of the genome sequencing sequences, obtained on the basis of a reference sequence after the sequencing data are aligned to that reference sequence, and an overlap relation set of the genome sequencing data, which comprises a determined relation subset and an undetermined relation subset. In the method, after the sequencing data sequences are aligned to a closely related reference genome and the draft sequence traversal path is obtained on the basis of the reference sequence, the nodes in the draft sequence traversal path are checked one by one; the draft sequence traversal path is revised iteratively according to the connection relations in the determined relation subset and/or the undetermined relation subset of the overlap relation set, and the overlap relation set is updated; the next node is then checked on the basis of the updated draft sequence traversal path and the updated overlap relation set, until the last node has been checked.

Description

A genome sequencing data sequence assembly method

Technical field

The present invention relates to genome sequence assembly technology, and in particular to a genome assembly method for the case where a closely related reference sequence is available.

Technical background

With the continuous progress of sequencing technology, a large number of microbial genomes have been completed and submitted to databases. For microorganisms with industrial uses, most industrial strains are obtained by repeatedly screening and modifying existing strains. The genome sequences of these starting strains or closely related strains can therefore provide guidance and reference for the genome assembly process.

To obtain the complete genome map of an industrial strain, the conventional analysis scheme at present is de novo sequencing. De novo sequencing refers to the technical workflow in which, without any background information, the genome of the target species is sequenced using sequencing and conventional molecular-biology laboratory techniques and then assembled, scaffolded and gap-filled. For complex or large genomes this scheme consumes a great deal of time and money, but its results are the most reliable: plasmids, plastids, species-specific sequences and mutations can all be obtained, from which the overall functional sequences of the species can be obtained and analyzed to derive its physiological and biochemical capabilities and to reconstruct its life history.

The most commonly used de novo scheme at present is "overlap-layout-consensus": sequence alignment is used to examine the border sequences of all reads obtained by sequencing (a read is a sequence produced by sequencing) and to find the overlap regions that may exist. The reads are then merged according to these overlap relations to form contigs, which completes the assembly.

Fig. 1 is a schematic diagram of the "overlap-layout-consensus" algorithm in prior-art de novo assembly. As shown in Fig. 1, suppose the genome contains two highly similar sequences, REP1 and REP2. Read1 and Read2 lie on either side of REP1, their overlap region lies within REP1, and the overlap length is L1. Read3 and Read4 lie on either side of REP2, their overlap region lies within REP2, and the overlap length is L2, with L2 > L1.

If the assembly program uses a greedy algorithm, suppose Read1 is traversed first. Because the sequence with which it has the best overlap is Read4, the connection Read1->Read4 is included in the final result, and the assembly is wrong. The correct result is obtained only when Read3 or Read4 is traversed ahead of Read1 and Read2.

If the assembly program uses graph theory, the errors that a greedy algorithm may cause can be identified, but limitations remain. Suppose a parameter L is set in the read overlap detection (overlap) step, so that two sequences are considered to overlap only if their overlap length exceeds L, and suppose L1 < L < L2. Because of this parameter choice, the relation Read1->Read2 is not identified during overlap detection. Then, in the read layout (layout) step, if Read1 or Read2 is traversed first, only one connection is identified for it (Read1->Read4, Read2->Read3), so that connection is considered trustworthy and is included in the final result, causing an assembly error.
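The two failure modes above can be reproduced with a toy example. The following is a minimal sketch, not the prior-art assemblers themselves: the overlap lengths (30, 38, 40), the threshold 35, and the helper names `greedy_layout` and `filter_by_threshold` are all assumptions chosen only to satisfy L1 < L < L2.

```python
# Illustrative overlap lengths for the Fig. 1 scenario; every number is assumed.
L1, L2 = 30, 40
overlaps = {
    ("Read1", "Read2"): L1,  # true junction inside REP1
    ("Read3", "Read4"): L2,  # true junction inside REP2
    ("Read1", "Read4"): 38,  # spurious, caused by the REP1/REP2 similarity
}

def greedy_layout(order, overlaps):
    """Visit reads in `order`; link each to its longest still-unused successor."""
    taken, links = set(), {}
    for read in order:
        candidates = [(length, nxt) for (src, nxt), length in overlaps.items()
                      if src == read and nxt not in taken]
        if candidates:
            _, best = max(candidates)
            links[read] = best
            taken.add(best)
    return links

def filter_by_threshold(overlaps, min_len):
    """Keep only overlaps strictly longer than `min_len` (the parameter L)."""
    return {pair for pair, length in overlaps.items() if length > min_len}

wrong = greedy_layout(["Read1", "Read3"], overlaps)  # Read1 visited first
right = greedy_layout(["Read3", "Read1"], overlaps)  # Read3 visited first
kept = filter_by_threshold(overlaps, 35)             # L1 < 35 < L2
```

With Read1 visited first, the greedy pass links Read1 to Read4; with Read3 visited first, both true junctions are recovered. After thresholding, the true relation Read1->Read2 is gone, so Read1->Read4 is the only candidate left for Read1.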

In addition, for long repeat regions whose length far exceeds the sequencing read length, the assembly algorithms commonly used at present can only force the related sequenced fragments together into a consensus fragment. Although other published techniques can estimate from the coverage depth how many times a repeated fragment occurs in the genome, they cannot determine the exact position of the fragment on the genome.

Summary of the invention

Embodiments of the present invention provide a genome sequencing data sequence assembly method that can simply and accurately confirm the "uncertain" connection relations in the overlap relation set of the sequencing data and accurately reconstruct the genome sequence.

To achieve the above object, an embodiment of the present invention provides a genome sequencing data sequence assembly method. A draft sequence traversal path of the genome sequencing sequences, obtained on the basis of a reference sequence after the sequencing data are aligned to that reference sequence, and an overlap relation set of the genome sequencing data are known; the overlap relation set comprises a "determined" relation subset and an "uncertain" relation subset. The method comprises:

checking each node in the draft sequence traversal path one by one in the order of the draft sequence traversal path, revising the draft sequence traversal path according to the connection relations in the "determined" relation subset and/or the "uncertain" relation subset of the overlap relation set, and updating the overlap relation set;

checking the next node on the basis of the updated draft sequence traversal path and overlap relation set, until the last node.

Wherein, when a given node is checked, the method comprises:

if the "determined" relation subset of the overlap relation set contains a "determined" relation that starts from the current node and that relation does not exist in the draft sequence traversal path, adding the relation to the draft sequence traversal path and deleting it from the "determined" relation subset.

Wherein, when a given node is checked, the method further comprises:

if the relation in the draft sequence traversal path that starts from the current node exists in the overlap relation set, retaining that relation in the draft sequence traversal path and deleting the corresponding connection relation from the overlap relation set; and/or

if the relation in the draft sequence traversal path that starts from the current node does not exist in the overlap relation set, deleting that relation from the draft sequence traversal path.

Wherein, when a given node is checked, the method further comprises:

if the "uncertain" relation subset of the overlap relation set contains exactly one overlap relation that starts from the current node, and the end node of that connection relation is the start node of another connection relation in the "uncertain" relation subset, adding the connection relation in the "uncertain" relation subset that starts from the node to the draft sequence traversal path and deleting the corresponding connection relation from the "uncertain" relation subset.

Wherein, when a given node is checked, the method further comprises:

deleting the node from the draft sequence traversal path when the overlap relation set contains no record of the node, and/or the "uncertain" relation subset does not contain exactly one connection relation that starts from the node, and/or the "uncertain" relation subset contains exactly one connection relation that starts from the node but the end node of that connection relation is not the start node of another connection relation in the "uncertain" relation subset.

Wherein, after the traversal flow ends, the method further comprises:

taking the finally revised draft sequence traversal path, together with the connection relations in the "determined" relation subset, as the sequence traversal path of the genome sequencing data under test.

With the technical solution provided by the embodiments of the present invention, the "uncertain" connection relations in the suspicious sequence overlap relation set can be confirmed during genome sequencing data sequence assembly, and the number of repeats can be determined so as to further recover how the repeats are arranged. Experimental results show that the method provided by the embodiments of the present invention obtains more accurate and effective results.

Brief description of the drawings

Fig. 1 is a schematic diagram of the "overlap-layout-consensus" algorithm in prior-art de novo assembly.

Fig. 2 is a schematic flowchart of the genome sequencing data sequence assembly method in an embodiment of the present invention.

Fig. 3 is a schematic flowchart of the method of using a reference sequence in an embodiment of the present invention.

Fig. 4a and Fig. 4b are schematic diagrams of the construction of the reference sequence overlap relation graph provided by an embodiment of the present invention.

Fig. 5 is a schematic example of the method of using a reference sequence in an embodiment of the present invention.

Fig. 6 is a schematic example of the genome sequencing data sequence assembly method in an embodiment of the present invention.

Fig. 7 is a schematic flowchart of the method of using a reference sequence in another embodiment of the present invention.

Fig. 8 is a schematic diagram of the reference-sequence read overlap relations in an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Fig. 2 is a schematic flowchart of the genome sequencing data sequence assembly method in an embodiment of the present invention. As shown in Fig. 2, the method comprises:

Step 201: According to the overlap relations among the reads obtained by sequencing, construct an overlap relation graph and its reverse-complement graph. Every pair of corresponding nodes in the overlap relation graph and its reverse-complement graph are reverse-complement equivalents of each other. Because we usually know only whether two sequences overlap, but not the order in which the sequences are finally arranged in the assembly result, two graphs must be built at the same time: the overlap relation graph G and its reverse-complement sequence graph G'. Any overlap relation between two sequence fragments can be marked in the overlap relation graph.

Step 202: Determine whether every node in overlap relation graph G has been checked; if so, end the whole flow; otherwise go to step 203.

Step 203: Take any unchecked node n_x in overlap relation graph G, and traverse G and G' in an arbitrary direction D. The direction D may be the out-degree direction (leaving this node) or the in-degree direction (pointing to this node).

Step 204: Determine whether, in direction D, there is a node n_y that has a connection relation with node n_x; if so, go to step 205; otherwise go to step 206.

Step 205: If node n_y and node n_x have a bidirectional unique relation, go to step 208; otherwise go to step 209.

Here, node n_y and node n_x are deemed to have a bidirectional unique relation if and only if, in G, node n_y is the only path downstream of node n_x in direction D and, in G', node n_x is the only path downstream of node n_y in direction D.

Step 206: Determine whether, in direction D', there is a node n_z that has a connection relation with node n_x; if so, go to step 207. Because n_x is necessarily connected to some node, if step 206 is reached the determination here is bound to find that such a node n_z exists.

Step 207: If node n_z and node n_x have a bidirectional unique relation, go to step 208; otherwise go to step 209.

Step 208: Confirm the relation between n_y and n_x, and/or the relation between n_z and n_x, as a trusted connection relation, and put this trusted connection relation n_x->n_y and/or n_z->n_x into the reliable sequence fragment contig; then go to step 210.

Step 209: Confirm the relation between n_y and n_x, and/or the relation between n_z and n_x, as an "uncertain" connection relation, and put this "uncertain" connection relation n_x->n_y and/or n_z->n_x into the suspicious sequence fragment relation set; then go to step 210.

Step 210: Delete the checked relations, including relation n_x->n_y and relation n_y->n_x, and/or relation n_x->n_z and relation n_z->n_x, from G and G' respectively, and mark node n_x as "checked".

In this way, the overlap relation graph is pruned by the bidirectional check, yielding the reliable sequence fragment contig and the suspicious sequence fragment relation set.
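The bidirectional check can be illustrated with a simplified, static sketch. It is an assumption-laden reduction of steps 201 to 210: strand labels are ignored, G' is modeled simply as G with every edge reversed, and edges are classified in one pass rather than being deleted node by node. The function name `classify_overlaps` is hypothetical, and the edge list is read off the textual walk-through of Fig. 6 later in this description, since the figure itself is not reproduced here.

```python
from collections import defaultdict

def classify_overlaps(edges):
    """Split edges into (reliable, suspicious) by bidirectional uniqueness.

    An edge a->b is treated as reliable only if b is the sole successor of a
    (uniqueness seen from G) and a is the sole predecessor of b (uniqueness
    seen from the reversed graph standing in for G').
    """
    successors = defaultdict(set)    # out-degree direction in G
    predecessors = defaultdict(set)  # out-degree direction in the reversed graph
    for a, b in edges:
        successors[a].add(b)
        predecessors[b].add(a)
    reliable, suspicious = [], []
    for a, b in edges:
        if len(successors[a]) == 1 and len(predecessors[b]) == 1:
            reliable.append((a, b))
        else:
            suspicious.append((a, b))
    return reliable, suspicious

# Edge list assumed from the Fig. 6 walk-through text.
fig6_edges = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 7), (6, 8)]
reliable, suspicious = classify_overlaps(fig6_edges)
```

On this edge list the sketch marks 3->4, 5->7 and 6->8 as reliable and 1->3, 2->3, 4->5 and 4->6 as suspicious, which agrees with the relations named in the walk-through.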

In an embodiment of the present invention, steps 204 and 205 may be executed in parallel with steps 206 and 207, which further improves computational efficiency. An embodiment of the present invention also provides a method for further confirming the "uncertain" connection relations in the suspicious sequence fragment relation set as "determined" connection relations. The set of "uncertain" connection relations can be consolidated by means of a reference sequence to obtain more accurate connection relations. The sequence fragments involved in all "uncertain" connection relations are aligned to the reference sequence; if an "uncertain" connection relation can be found in the reference sequence, it is considered a "determined" connection relation. If several "uncertain" connection relations involving the same node can be found in the reference sequence (that is, a read can be aligned to several positions on the reference sequence), the result whose alignment score meets a given score threshold is used as its aligned position. Since a metric is usually used to measure how close or distant two reads or sequences are, such a metric can usually evaluate the alignment score numerically.

As shown in Fig. 3, the method of using a reference sequence provided by an embodiment of the present invention comprises:

Step 301: Using an existing reference sequence and the reads obtained from the initial sequencing, construct the overlap relation graph R of the reference sequence and its reverse-complement graph R'. The reference sequence here may be a sequence disclosed in the prior art whose assembly relations are already determined. When a reference sequence is available, it is aligned against the reads obtained by sequencing, the reads are sorted by their aligned positions, and a reference sequence overlap relation graph R and its reverse-complement graph R' are thereby obtained.

In an embodiment of the present invention, when the reads obtained from the initial sequencing are aligned to the reference sequence, an alignment program that tolerates large-fragment rearrangements and deletions, such as BLAT, may be used. In addition, because the sequencing data set is very large and read lengths are not necessarily uniform, all the sequenced reads are clustered before alignment: sequences are clustered at 95% similarity, and only the one or few longest sequences in each cluster are kept for subsequent analysis. The reduced set after clustering is aligned to the reference sequence, and the alignment results are then sorted by direction and aligned position.

Fig. 4 is a schematic diagram of the construction of the reference sequence overlap relation graph provided by an embodiment of the present invention. As shown in Fig. 4a, suppose reads 1, 2, 3 and 4 are obtained from the initial sequencing. Because reads 2 and 3 are completely contained in read 1, cluster analysis performed before the reference sequence overlap relation graph R is constructed groups reads 1, 2 and 3 into one cluster, from which read 1 is selected and reads 2 and 3 are discarded. The reduced set after clustering (read 1 and read 4) is aligned to the reference sequence, as shown in Fig. 4b, and read 1 and read 4 are then sorted by their aligned positions on the reference sequence. Read 1 and read 4 are found to be adjacent, so the relation 1->4 becomes part of the reference sequence overlap relation graph.
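The Fig. 4 construction can be sketched in a few lines. This is only an approximation under stated assumptions: the 95% similarity clustering is replaced by a containment test on reference coordinates, strand direction is ignored, and the coordinates and the function name `reference_relations` are hypothetical.

```python
def reference_relations(alignments):
    """Derive ordered neighbor relations from read positions on the reference.

    `alignments` maps a read name to its (start, end) interval on the
    reference. Reads whose interval lies entirely inside another read's
    interval are dropped (a stand-in for the clustering step), the rest are
    sorted by start position, and consecutive reads become relations.
    """
    kept = []
    for read, (start, end) in alignments.items():
        contained = any(
            other != read and o_start <= start and end <= o_end
            and (o_start, o_end) != (start, end)
            for other, (o_start, o_end) in alignments.items()
        )
        if not contained:
            kept.append(read)
    kept.sort(key=lambda read: alignments[read][0])
    return list(zip(kept, kept[1:]))

# Hypothetical coordinates in which reads 2 and 3 lie inside read 1.
fig4_alignments = {1: (0, 100), 2: (10, 40), 3: (50, 90), 4: (95, 180)}
relations = reference_relations(fig4_alignments)
```

For these coordinates only reads 1 and 4 survive, and the single relation produced is 1->4, as in the Fig. 4 description.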

Step 302: In the suspicious sequence fragment relation set, if one connection relation (M->X) among the multiple "uncertain" connection relations (M->X or M->Y) that start from any node M exists in R or R', store that connection relation in the reliable sequence fragment contig and remove the remaining connection relations involving M from the suspicious sequence fragment relation set.

Step 303: Assemble according to the trusted connection relations in the reliable sequence fragment contig to obtain the assembly result sequence. In an embodiment of the present invention, the assembly result may be a single result sequence or a series of multiple result sequences.
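Step 302 can be sketched as a set comparison. The sketch below makes several assumptions: `reference_relations` is taken to already contain the relations of both R and R', "relations involving M" is read as relations sharing either the start node or the end node of a confirmed relation, and no alignment-score tie-breaking is modeled. The function name `resolve_with_reference` is hypothetical; the sample data follow the 2->3, 3->4 reference example given for Fig. 6k later in this description.

```python
def resolve_with_reference(suspicious, reference_relations):
    """Confirm suspicious relations found in the reference; drop their rivals.

    Returns (confirmed, remaining): `confirmed` holds suspicious relations
    that also occur in the reference, and `remaining` holds suspicious
    relations that neither occur in the reference nor share a start or end
    node with a confirmed relation.
    """
    reference = set(reference_relations)
    confirmed = [rel for rel in suspicious if rel in reference]
    starts = {a for a, _ in confirmed}
    ends = {b for _, b in confirmed}
    remaining = [
        (a, b) for a, b in suspicious
        if (a, b) not in reference and a not in starts and b not in ends
    ]
    return confirmed, remaining

suspicious = [(1, 3), (2, 3), (4, 5), (4, 6)]
reference = [(2, 3), (3, 4)]
confirmed, remaining = resolve_with_reference(suspicious, reference)
```

Here 2->3 is confirmed, its rival 1->3 is discarded, and 4->5 and 4->6 stay unresolved because the reference says nothing about them.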

Fig. 5 is a schematic example of assembly using a reference sequence in an embodiment of the present invention. As shown in Fig. 5, a reference overlap graph (second column of Fig. 5) is formed from an existing reference sequence (first column of Fig. 5), and the relations in this reference overlap graph are compared with the suspicious sequence fragment relation set (third column of Fig. 5). Relations in the suspicious sequence fragment relation set that appear in the reference overlap graph are confirmed as trusted connection relations and the other relations are deleted (in the fourth column of Fig. 5, relations B, C and E are trusted connection relations, while relations D and F are deleted), finally giving the correct result (fifth column of Fig. 5).

When the user provides an overlap relation graph or raw data, the technical solution provided by the embodiments of the present invention can automatically construct and check the overlap relation graph and generate a preliminary assembly result, which can correct the errors of the prior-art greedy or graph-theoretic algorithms.

The above genome sequencing data sequence assembly method is described below with a concrete example. Fig. 6a shows an example of the overlap relation graph G of a group of sequences and its reverse-complement sequence graph G'. Each node in the graph represents one read and is denoted n, with {n₁, n₂, n₃, ..., n_n} ∈ G. Each edge in the graph is denoted e, with {e₁, e₂, e₃, ..., e_n} ∈ G. Taking node 5 as an example, each in-degree direction (pointing to node 5) indicates that the read represented by node 5 can overlap with the read represented by node 4, and each out-degree direction (leaving node 5) indicates that the read represented by node 5 can overlap with the read represented by node 7.

Take any node, for example node 4, as the start node, and check graphs G and G' in any direction, for example the out-degree direction of node 4. In G there are two relations in the out-degree direction of node 4: the relation with node 5 and the relation with node 6. A unique extension relation therefore cannot be determined for this node, and its relations are recorded in the suspicious sequence fragment relation set. Because graph G already shows that node 4 has no unique trusted connection relation in the out-degree direction, G' does not need to be checked. This process is shown in Fig. 6b.

Next, the in-degree direction of node 4 is checked. In graph G, node 3 is found to be the only node in the in-degree direction of node 4; node 3 is then checked in G', and node 4 is found to be the only node upstream of node 3. Node 4 and node 3 are therefore each other's unique extension node in both G and G'. The relation 3->4 is therefore considered trustworthy and is put into the reliable sequence fragment contig. This process is shown in Fig. 6c.

The relations associated with the checked node n_x are deleted from G and G'. As shown in Fig. 6d, the edges centered on node 4 are deleted from graphs G and G'.

Another node is then chosen from among the unchecked nodes, for example node 1. Checking the out-degree direction of node 1 in G shows that node 3 is its best solution; checking node 3 in graph G' shows that node 3 is not unique in the out-degree direction, so 1->3 and 2->3 are put into the suspicious sequence fragment relation set. This process is shown in Fig. 6e.

The checked relations are deleted from G and G'. As shown in Fig. 6f, the edges centered on node 1, node 2 and node 3 are deleted from graphs G and G'.

Another of the remaining nodes is then taken, for example node 5. In both G and G' its relation with node 7 is found to be unique, so 5->7 is put into the reliable sequence fragment contig. This process is shown in Fig. 6g.

The checked relations are deleted from G and G'. As shown in Fig. 6h, the edges centered on node 5 and node 7 are deleted from graphs G and G'.

Repeating the preceding process shows that the relation between node 6 and node 8 is bidirectionally unique, so 6->8 is put into the reliable sequence fragment contig. This process is shown in Fig. 6i.

The checked relations are deleted from G and G'. As shown in Fig. 6j, the edges centered on node 6 and node 8 are deleted from graphs G and G'.

Once all nodes have been checked and no edges remain in the graphs, the whole bidirectional checking process ends.

Suppose a reference sequence r exists, and in this reference sequence r we find a stretch of sequence containing 2->3 and 3->4. From the analysis of this stretch, the formerly suspicious relation 2->3 can be confirmed as a trusted connection relation and the relation 1->3 deleted at the same time, as shown in Fig. 6k.

In the above embodiment, working from the overlap relation graph, we judge step by step whether each "uncertain" relation in the overlap relation graph appears in the reference sequence, and confirm it as a determined relation if it does. Another embodiment of the present invention provides a further way of using the reference sequence: after all the sequencing data have been aligned to a reference sequence similar to the genome under test, a draft sequence traversal path of the sequences relative to the reference sequence is obtained; taking this traversal path as the basis, the draft sequence traversal path is revised step by step according to the overlap relation graph.

The prerequisites of this method are as follows. The overlap relation check result of the genome sequencing data has been obtained by the method shown in Fig. 2; that is, all overlap relations have been classified as "determined" relations or "uncertain" relations, forming a "determined" relation subset and an "uncertain" relation subset. A draft sequence traversal path of the sequencing data relative to the reference genome is also known. This draft sequence traversal path is obtained by aligning the sequencing data to a reference sequence, and the reference sequence is required to have a certain similarity to the sequenced species.

The method revises the draft sequence traversal path step by step according to the following principles:

Principle 1): if a connection relation exists both in the draft sequence traversal path and in the "uncertain" relation subset of the overlap relation set, it is retained in the draft sequence traversal path and deleted from the "uncertain" relation subset;

Principle 2): if a connection relation exists in the draft sequence traversal path but not in the overlap relation set, it is deleted from the draft sequence traversal path;

Principle 3): for any node in the draft sequence traversal path, if the "determined" relation subset contains a "determined" connection relation that starts from this node and that "determined" connection relation does not exist in the draft sequence traversal path, the "determined" connection relation is added to the draft sequence traversal path and the corresponding connection relation is deleted from the "determined" relation subset;

Principle 4): for any node in the draft sequence traversal path, if the "uncertain" relation subset contains exactly one connection relation that starts from this node and the end node of that connection relation is the start node of another connection relation in the "uncertain" relation subset, the connection relation in the "uncertain" relation subset that starts from this node is added to the draft sequence traversal path and deleted from the overlap relation set; otherwise the node is deleted directly from the draft sequence traversal path. For example, for node B, if the "uncertain" relation subset contains only the single path B+ → A+ and also contains at least A+ → X, then B+ → A+ is added to the draft sequence traversal path; otherwise node B is deleted from the draft sequence traversal path;

Principle 5): if the current node is not the last node in the draft sequence traversal path and the overlap relation set contains no record of this node, the node is deleted from the draft sequence traversal path.

Note that each step must update the draft sequence traversal path and/or the overlap relation set accordingly, and the next step is then carried out on the basis of the draft sequence traversal path and/or overlap relation set as updated by the previous step.
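The condition in principle 4) is the least obvious of the five, so a small sketch may help. It assumes the "uncertain" relation subset is stored as a collection of (start, end) tuples; the function name `unique_uncertain_extension` is hypothetical, and the node labels follow the B+ → A+ example in principle 4).

```python
def unique_uncertain_extension(node, uncertain):
    """Return the relation to add under principle 4), or None.

    A relation qualifies only if it is the single "uncertain" relation that
    starts from `node` and its end node starts some other "uncertain"
    relation.
    """
    outgoing = [(a, b) for a, b in uncertain if a == node]
    if len(outgoing) != 1:
        return None
    end = outgoing[0][1]
    if any(a == end for a, _ in uncertain):
        return outgoing[0]
    return None

# B+ has a single extension and A+ can itself be extended, so it is accepted.
accepted = unique_uncertain_extension("B+", [("B+", "A+"), ("A+", "X")])
# Without any relation starting from A+, B+ would be deleted from the path.
dead_end = unique_uncertain_extension("B+", [("B+", "A+")])
# Two competing extensions from B+ are also rejected.
ambiguous = unique_uncertain_extension("B+", [("B+", "A+"), ("B+", "C+"), ("A+", "X")])
```

A return value of None corresponds to the "otherwise" branch of principle 4), in which the node is removed from the draft sequence traversal path.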

In an embodiment of the present invention, the above traversal principles also differ in how strongly they affect the final result, wherein

Fig. 7 is a schematic flowchart of the method of using a reference sequence in another embodiment of the present invention. As shown in Fig. 7, provided that the prerequisites of the method are met, the method comprises:

Step 701: Determine whether the current node in the draft sequence traversal path is the last node; if not, perform step 702; otherwise perform step 711;

Step 702: Determine whether the relation that starts from the current node in the draft sequence traversal path exists in the overlap relation set; if so, perform step 703; otherwise perform step 704;

Step 703: Retain this relation in the draft sequence traversal path and delete it from the overlap relation set; then go to step 716;

Step 704: Delete the relation from the draft sequence traversal path, and go to step 705;

Step 705: Determine whether the "determined" relation subset contains a connection relation that starts from this node; if so, perform step 706; otherwise perform step 707;

Step 706: Add this "determined" relation to the draft sequence traversal path and delete it from the overlap relation set; then go to step 716;

Step 707: Determine whether the "uncertain" relations contain any information about this node; if so, perform step 708; otherwise perform step 710;

Step 708: Determine whether the "uncertain" relation subset contains exactly one connection relation that starts from this node and whose end node is the start node of another connection relation in the "uncertain" relation subset; if so, perform step 709; otherwise perform step 710;

Step 709: Add the connection relation in the "uncertain" relation subset that starts from this node to the draft sequence traversal path and delete it from the overlap relation set; then go to step 716;

Step 710: Delete the relation associated with this node from the draft sequence traversal path; then go to step 716;

Step 711: Determine whether the "determined" relation subset contains a connection relation that starts from this node; if so, perform step 712; otherwise perform step 713;

Step 712: Add this "determined" relation to the draft sequence traversal path and delete it from the overlap relation set; then go to step 716;

Step 713: Determine whether the "uncertain" relations contain any information about this node; if so, perform step 714; otherwise perform step 717;

Step 714: Determine whether the "uncertain" relation subset contains exactly one connection relation that starts from this node and whose end node is the start node of another connection relation in the "uncertain" relation subset; if so, perform step 715; otherwise perform step 717;

Step 715: Add the connection relation in the "uncertain" relation subset that starts from this node to the draft sequence traversal path and delete it from the overlap relation set; then go to step 716;

Step 716: On the basis of the updated draft sequence traversal path and overlap relation set, move on to the next node; that is, return to step 701;

Step 717: End the whole flow, and take the current draft sequence traversal path together with the "determined" relations in the overlap relation set check result as the final result.
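The branching in steps 701 to 717 can be condensed into a single decision function. The sketch below only decides which action step applies to one node; it does not mutate the path or the relation sets, since the text does not spell out how the path is rewritten. It assumes relations are (start, end) tuples held in two sets, ignores reverse-complement equivalence, folds step 707/713 into the uniqueness test of step 708/714 (both failures lead to the same step), and picks the first relation in sorted order if several "determined" relations start from the node. The name `decide` and the node labels are hypothetical.

```python
def decide(node, next_node, determined, uncertain):
    """Return (step, relation) selected by the Fig. 7 flow for one node.

    `next_node` is the node that follows `node` in the draft path, or None
    when `node` is the last node (the step 711 branch).
    """
    last = next_node is None
    if not last:
        relation = (node, next_node)
        if relation in determined or relation in uncertain:
            return 703, relation  # keep it; the caller removes it from its set
        # Step 704: the caller drops `relation`; a replacement is sought below.
    from_determined = sorted(r for r in determined if r[0] == node)
    if from_determined:
        return (712 if last else 706), from_determined[0]
    outgoing = [r for r in uncertain if r[0] == node]
    if len(outgoing) == 1 and any(a == outgoing[0][1] for a, _ in uncertain):
        return (715 if last else 709), outgoing[0]
    return (717 if last else 710), None

# Hypothetical relation sets used only to exercise each branch.
determined = {("C", "D")}
uncertain = {("A", "B"), ("E", "F"), ("F", "G"), ("H", "I"), ("H", "J")}

kept = decide("A", "B", determined, uncertain)
replaced = decide("C", "X", determined, uncertain)
extended = decide("E", "X", determined, uncertain)
dropped = decide("H", "X", determined, uncertain)
tail_added = decide("C", None, determined, uncertain)
finished = decide("Z", None, determined, uncertain)
```

A driver loop would apply the returned action, update the path and the sets, and call `decide` again for the next node, which is the role of step 716.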

Those skilled in the art will understand that the order of some of the above determination steps may be exchanged. In practice, depending on the data being processed, the complexity of the algorithm may be reduced by adjusting the order of the determination steps. Where the order of the determination steps has little effect on the complexity of the algorithm, those skilled in the art may adjust it freely as the case requires. For example, the determination of step 707 may be made first and, if its result is negative, the determination of step 705 made next. As a further example, steps 705 and 707 may be performed simultaneously and their results combined.

The reference sequence utilization method described in the above embodiment is illustrated below with an example. Suppose a genome sequence is randomly fragmented and sequenced, yielding a series of reads (1, 2, 3, ..., 11). A + or - after a read number indicates the orientation of the read: + means the read is placed in forward orientation, and - means it is placed as its reverse complement. Read 1 is a repetitive sequence that occurs several times in the genome. The end of read 6+ overlaps the front of read 1+, but this relation only indicates that the end of 6+ and the front of 1+ are similar in sequence; in the real genome, read 6+ and read 1+ are not adjacent, so 6+→1+ is a false-positive connection. From the overlaps between the reads, the overlap relation graph shown in Figure 8 is obtained.

Using the method shown in Fig. 2, it can be determined which of these sequence relations are "determined" relations and which are "uncertain" relations. The result is as follows:

"Determined" relations:

3+→4+→5+, 7+→8+, 9+→10+→11+→9+

"Uncertain" relations:

6+→7+, 6+→1+, 1+→2+, 5+→2+, 2+→6+, 2+→3+, 8+→1+

After these reads are aligned to the reference sequence, the following draft sequence traversal path is obtained:

1+→2+→3+→2+→4+→5+→7-→6-

Here, 7-→6- represents a reversed sequence; for example, ATC is reversed to GAT according to the base-complementarity principle (A pairs with T, and G pairs with C).
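The reversal described above, in which ATC becomes GAT, is the usual reverse complement of a DNA string. A minimal sketch follows; the function name `reverse_complement` is chosen here for illustration only.

```python
# Complement table for the four DNA bases (A<->T, G<->C).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the resulting string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATC"))  # prints GAT
```

Applying the function twice returns the original sequence, which is why a read in the - orientation carries the same information as the read in the + orientation.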

First, checking starts from 1+, the starting position of the draft sequence traversal path in the reference sequence graph; that is, the relation 1+→2+ in the draft sequence traversal path is examined. This relation is present in both the draft sequence traversal path and the overlap relation graph, so it is retained in the path and deleted from the overlap relation graph. The draft sequence traversal path is unchanged at this point, and the check result of the overlap relation graph becomes:

-----------------------------------------------

"Determined" relation subset:

3+→4+→5+, 7+→8+, 9+→10+→11+→9+

"Uncertain" relation subset:

6+→7+, 6+→1+, 5+→2+, 2+→6+, 2+→3+, 8+→1+

-------------------------------------------------

Then the relation 2+→3+ starting at node 2+ is checked. It is likewise present in both the draft sequence traversal path and the overlap relation graph, so it is retained in the path and deleted from the overlap relation graph. The draft sequence traversal path is still unchanged, and the check result of the overlap relation graph becomes:

---------------------------------------------------------

"Determined" relation subset:

3+→4+→5+, 7+→8+, 9+→10+→11+→9+

"Uncertain" relation subset:

6+→7+, 6+→1+, 5+→2+, 2+→6+, 8+→1+

------------------------------------------------------------

Next, 3+→2+ is checked. This relation is not present in the updated overlap relation graph, so it is deleted from the draft sequence traversal path. The draft sequence traversal path now consists of two parts: 1+→2+→3+ and 2+→4+→5+→7-→6-.
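The three checks performed so far (keep 1+→2+, keep 2+→3+, cut at 3+→2+) can be reproduced with a short sketch. The function name `check_until_break` and the storage of every relation as a `(start, end)` tuple in a Python set are assumptions made for this illustration; the "determined" chains are split into single edges for the same reason.

```python
def check_until_break(path, overlaps):
    """Walk `path` edge by edge and stop at the first unsupported edge.

    Supported edges are kept in the path and removed from `overlaps`.
    Returns (confirmed_prefix, remainder).
    """
    for i in range(len(path) - 1):
        edge = (path[i], path[i + 1])
        if edge not in overlaps:
            # Unsupported relation: cut the draft path at this point.
            return path[: i + 1], path[i + 1:]
        overlaps.remove(edge)
    return path, []


draft = ["1+", "2+", "3+", "2+", "4+", "5+", "7-", "6-"]
determined = {("3+", "4+"), ("4+", "5+"), ("7+", "8+"),
              ("9+", "10+"), ("10+", "11+"), ("11+", "9+")}
uncertain = {("6+", "7+"), ("6+", "1+"), ("1+", "2+"), ("5+", "2+"),
             ("2+", "6+"), ("2+", "3+"), ("8+", "1+")}
overlaps = determined | uncertain

confirmed, remainder = check_until_break(draft, overlaps)
print(confirmed)   # ['1+', '2+', '3+']
print(remainder)   # ['2+', '4+', '5+', '7-', '6-']
```

After the call, `overlaps` no longer contains 1+→2+ or 2+→3+, which matches the "uncertain" relation subset listed before the 3+→2+ check.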

At the same time, the "determined" relations contain a chain of reliable relations, 3+→4+→5+, that starts at node 3+. This chain is therefore added to the draft sequence traversal path and deleted from the overlap relation graph. The draft sequence traversal path now consists of two parts, 1+→2+→3+→4+→5+ and 2+→4+→5+→7-→6-, and the check result of the overlap relation graph becomes:

------------------------------------------------------------

"Determined" relation subset:

7+→8+, 9+→10+→11+→9+

"Uncertain" relation subset:

8+→1+, 6+→7+, 6+→1+, 5+→2+, 2+→6+

---------------------------------------------------------------

Checking continues from 5+ in the draft sequence traversal path (3+→4+ and 4+→5+ in the updated path have already been examined as "determined" relations and need not be rechecked). The "uncertain" relations contain 5+→2+. Furthermore, in the updated overlap relation graph, 5+→2+ is the only connection that starts at node 5+, and an edge starting at 2+, namely 2+→6+, also exists. Therefore 5+→2+ is added to the draft sequence traversal path and deleted from the check result of the overlap relation graph. The draft sequence traversal path now consists of two parts, 1+→2+→3+→4+→5+→2+ and 2+→4+→5+→7-→6-, and the check result of the overlap relation graph becomes:

------------------------------------------------------------

"Determined" relation subset:

7+→8+, 9+→10+→11+→9+

"Uncertain" relation subset:

8+→1+, 6+→7+, 6+→1+, 2+→6+

---------------------------------------------------------------

Next, 2+ in the draft sequence traversal path is checked. The overlap relation check result contains only one relation starting at 2+, namely 2+→6+, and two relations, 6+→1+ and 6+→7+, start at 6+. Therefore 2+→6+ is added to the draft sequence traversal path and deleted from the overlap relation graph. The draft sequence traversal path now consists of two parts, 1+→2+→3+→4+→5+→2+→6+ and 2+→4+→5+→7-→6-, and the check result of the overlap relation graph becomes:

------------------------------------------------------------

"Determined" relation subset:

7+→8+, 9+→10+→11+→9+

"Uncertain" relation subset:

8+→1+, 6+→7+, 6+→1+

---------------------------------------------------------------

Then 6+ is checked. The overlap relation check result contains two connections starting at 6+, namely 6+→7+ and 6+→1+, so extension stops, and the draft sequence traversal path is unchanged.
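The extension from 3+ up to the stop at 6+ can be traced with a short loop. This sketch assumes that relations are stored as lists of `(start, end)` tuples, that "determined" chains are split into single edges, and that a "determined" edge is preferred over an "uncertain" one; the function name `extend` is hypothetical.

```python
def extend(path, determined, uncertain):
    """Grow `path` from its last node until no rule applies.

    Both relation lists are modified in place: every edge that is added
    to the path is removed from the list it came from.
    """
    while True:
        node = path[-1]
        det_out = [e for e in determined if e[0] == node]
        unc_out = [e for e in uncertain if e[0] == node]
        if det_out:
            # A "determined" relation starting at the node is followed.
            edge = det_out[0]
            determined.remove(edge)
        elif len(unc_out) == 1 and any(
            e[0] == unc_out[0][1] and e != unc_out[0] for e in uncertain
        ):
            # Unique "uncertain" relation whose end node continues onward.
            edge = unc_out[0]
            uncertain.remove(edge)
        else:
            # No relation, a dead end, or a branch: extension stops.
            return path
        path.append(edge[1])


# State of the example after 1+→2+ and 2+→3+ have been confirmed.
determined = [("3+", "4+"), ("4+", "5+"), ("7+", "8+"),
              ("9+", "10+"), ("10+", "11+"), ("11+", "9+")]
uncertain = [("6+", "7+"), ("6+", "1+"), ("5+", "2+"),
             ("2+", "6+"), ("8+", "1+")]
path = extend(["1+", "2+", "3+"], determined, uncertain)
print("→".join(path))
```

The loop halts at 6+ because two "uncertain" relations start there, leaving 6+→7+, 6+→1+ and 8+→1+ unused, as in the check result above.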

Next, the traversal path 2+→4+→5+→7-→6- is checked, beginning with its first node, 2+. The overlap relation check result contains no record of 2+, and 2+ is not the last node, so node 2+ is deleted, and the relation starting at 2+ is lost along with it. The draft sequence traversal path now consists of two parts, 1+→2+→3+→4+→5+→2+→6+ and 4+→5+→7-→6-, and the check result of the overlap relation graph is unchanged.

Then 4+ is checked. The overlap relation check result contains no record of 4+, and 4+ is not the last node, so node 4+ is deleted, and the relation starting at 4+ is lost along with it. The draft sequence traversal path now consists of two parts, 1+→2+→3+→4+→5+→2+→6+ and 5+→7-→6-, and the check result of the overlap relation graph is unchanged.

Next, 5+ is checked. For the same reason, 5+ is deleted. The draft sequence traversal path now consists of two parts, 1+→2+→3+→4+→5+→2+→6+ and 7-→6-, and the check result of the overlap relation graph is unchanged.

Then 7- is checked. 7-→6- is equivalent to 6+→7+, and 6+→7+ exists in the overlap relation check result, so this relation is retained in the draft sequence traversal path and deleted from the overlap relation graph. The draft sequence traversal path then consists of two parts, 1+→2+→3+→4+→5+→2+→6+ and 7-→6-, where 7-→6- is replaced by its equivalent 6+→7+. The whole draft sequence traversal path finally becomes:

1+→2+→3+→4+→5+→2+→6+→7+

The check result of the overlap relation graph becomes:

------------------------------------------------------------

"Determined" relation subset:

7+→8+, 9+→10+→11+→9+

"Uncertain" relation subset:

8+→1+, 6+→1+

---------------------------------------------------------------
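The replacement of 7-→6- by its equivalent 6+→7+, followed by joining it to the main path, can be sketched as follows. The helper names `flip_read`, `flip_fragment` and `join_if_supported` are hypothetical, and treating the merge as "flip the fragment, then append it if it starts at the last node of the main path and is supported by the overlap relations" is one reading of the example rather than a rule stated in the embodiment.

```python
def flip_read(read):
    """Flip the orientation sign of a read label such as '7-'."""
    return read[:-1] + ("-" if read.endswith("+") else "+")


def flip_fragment(fragment):
    """Reverse the order of a path fragment and flip every read in it."""
    return [flip_read(r) for r in reversed(fragment)]


def join_if_supported(main, fragment, overlaps):
    """Append the flipped fragment to `main` if the overlaps support it.

    The flipped fragment must start at the last node of `main`, and each
    of its edges must be present in `overlaps`; used edges are removed.
    """
    flipped = flip_fragment(fragment)
    edges = list(zip(flipped, flipped[1:]))
    if flipped[0] == main[-1] and all(e in overlaps for e in edges):
        for e in edges:
            overlaps.remove(e)
        return main + flipped[1:]
    return main


main = ["1+", "2+", "3+", "4+", "5+", "2+", "6+"]
overlaps = {("8+", "1+"), ("6+", "7+"), ("6+", "1+")}
merged = join_if_supported(main, ["7-", "6-"], overlaps)
print("→".join(merged))
```

In this reading, the branch at 6+ that stopped the extension is resolved by the reference-derived fragment: the draft path supplies 6+→7+, so the false-positive 6+→1+ is never followed.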

Then node 7+ is checked. The overlap relation check result contains a "determined" relation, 7+→8+, that starts at 7+, so this relation is added to the draft sequence traversal path. The whole traversal path becomes 1+→2+→3+→4+→5+→2+→6+→7+→8+, and the check result of the overlap relation graph becomes:

------------------------------------------------------------

"Determined" relations:

9+→10+→11+→9+

"Uncertain" relations:

8+→1+, 6+→1+

---------------------------------------------------------------

Then 8+ is checked. The draft sequence traversal path contains no unchecked relation for node 8+, and the overlap relation graph contains no "determined" relation starting at 8+. The "uncertain" relations contain a connection 8+→1+ starting at 8+, but the updated relation graph contains no relation starting at 1+. Since 8+ is the last node of the draft sequence traversal path, this node is retained.

Finally, the "determined" relation subset of the overlap relation check result is examined. One chain, 9+→10+→11+→9+, has not yet been used, so it is added to the traversal result. The traversal result is therefore 1+→2+→3+→4+→5+→2+→6+→7+→8+ and 9+→10+→11+→9+.

At this point the "determined" relation subset is empty, and every node of the whole traversal path has been checked from beginning to end.

The genome sequencing data sequence assembly method provided by the embodiment of the present invention builds the sequence overlap graph with a bidirectional-optimal algorithm, which effectively addresses the misassembly problem of de novo assembly. Combined with a reference sequence, it uses resequencing-related techniques to simplify the de novo overlap graph, so that the assembly result is as close as possible to the actual sequence. The method makes thorough use of the reference sequence and compensates for and simplifies the limitations and complexity of the de novo approach. Based on the optimized overlap graph, the method can also automatically check existing sequence fragments (contigs) for errors and split them where errors are found, preventing misassembly.

In addition, the method also helps in assembling plasmid sequences in cells. In microorganisms, most plasmids exist as circular sequences, which appear in the overlap relation graph in the form n₁→n₂→n₃→n₁; that is, an Euler circuit exists in the overlap relation graph. Therefore, while constructing the result sequence, the method can automatically select longer sequences that cannot be aligned to the reference sequence as seeds for plasmid discovery, and use the Fleury algorithm to find Euler circuits passing through these seed nodes as candidate plasmids. The plasmid sequences are then screened by the length of the assembly result (no more than 1 MB).

Therefore, for an overlap graph that still contains uncertain connections, the method of the embodiment of the present invention can additionally perform plasmid splitting and plasmid inference: the plasmid sequences most likely to exist are screened, separated and automatically circularized, which facilitates further analysis.
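The search for a circular path among unaligned reads can be illustrated with a small Euler circuit routine. Note that this sketch uses Hierholzer's algorithm on a directed graph in place of the Fleury algorithm named above, because it is shorter to write down; the function name `euler_circuit` and the `(start, end)` tuple representation are assumptions, and the length screening step is not included.

```python
from collections import defaultdict


def euler_circuit(edges):
    """Return a closed walk that uses every directed edge once, or None."""
    out = defaultdict(list)
    indeg = defaultdict(int)
    for a, b in edges:
        out[a].append(b)
        indeg[b] += 1
    nodes = set(out) | set(indeg)
    # A directed Euler circuit requires in-degree == out-degree everywhere.
    if not edges or any(len(out[n]) != indeg[n] for n in nodes):
        return None
    stack, circuit = [edges[0][0]], []
    while stack:
        node = stack[-1]
        if out[node]:
            stack.append(out[node].pop())   # follow an unused edge
        else:
            circuit.append(stack.pop())     # dead end: emit the node
    circuit.reverse()
    # If some edges were never reached, the graph was not connected.
    return circuit if len(circuit) == len(edges) + 1 else None


# The unused "determined" chain from the worked example forms a circle.
print(euler_circuit([("9+", "10+"), ("10+", "11+"), ("11+", "9+")]))
# A linear chain has no Euler circuit.
print(euler_circuit([("7+", "8+")]))
```

A circuit found this way would only be a plasmid candidate; screening by assembled length, as described above, would still have to follow.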

The genomes of Escherichia coli K-12 strain MG1655, an L-threonine-producing Escherichia coli strain, Meiothermus ruber DSM1299, Pedobacter heparinus DSM2366 and Brevibacterium flavum Zl5 were sequenced, and the genome sequencing data assembly method of the invention was tested on them. The test results are evaluated in the following respects:

The completeness of the assembled genome: whether the assembly result recovers all genome regions, and whether a complete microbial genome can be finished;

The overall reliability of the assembled genome: whether the assembly result shows structural differences compared with the reference sequence, and whether any region is missed or wrongly included;

The acquisition cost of the sequencing data needed for assembly, that is, the lower limit of the data volume required to complete the assembly of genome sequencing data for the same sample.

The test results show that the preliminary assembly obtained with the method provided by the embodiment of the present invention contains almost no structural errors compared with the true sequence, and is clearly better than the results of other published tools. When a closely related reference sequence is available, the method provided by the embodiment of the present invention needs only about 2/3 of the data volume required by other methods to obtain the same or a better analysis result.

Those skilled in the art will understand that the terms suspicious, not credible and unconfirmed used above can have the same meaning; likewise, credible, reliable and confirmed can have the same meaning. Depending on context, sequence relation combination, sequence set, relation graph check result and relation set can also have the same meaning; the present invention does not strictly distinguish between these terms. Nor does the present invention strictly distinguish between the different wordings used for the draft sequence traversal path; those skilled in the art can tell from context where the term refers to a single connection and where it refers to a set of connections.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method of using a reference sequence, wherein a draft sequence traversal path of genome sequencing sequences, obtained on the basis of the reference sequence after genome sequencing data are aligned to a closely related reference genome, and an overlap relation set of the genome sequencing data are known, the overlap relation set comprising a "determined" relation subset and an "uncertain" relation subset, characterized in that the method comprises:

checking each node in the draft sequence traversal path iteratively, in the order of the path, revising the draft sequence traversal path according to the connections in the "determined" relation subset and/or the "uncertain" relation subset of the overlap relation set, and updating the overlap relation set at the same time;

wherein, when a given node is checked, the method comprises:

if the "determined" relation subset of the overlap relation set contains an overlap relation starting at the current node, and that overlap relation is not present in the draft sequence traversal path, adding that connection to the draft sequence traversal path and deleting it from the "determined" relation subset.

2. The method of claim 1, characterized in that, when a given node is checked, the method further comprises:

if the overlap relation set contains an overlap relation starting at the current node, and that connection is also present in the draft sequence traversal path, retaining the relation in the draft sequence traversal path and deleting the connection from the overlap relation set; and/or

if a relation in the draft sequence traversal path that starts at the current node is not present in the overlap relation set, deleting that relation from the draft sequence traversal path.

3. The method of claim 1 or 2, characterized in that, when a given node is checked, the method further comprises:

if the "uncertain" relation subset of the updated overlap relation set contains exactly one connection whose start node is said node, and the end node of that connection is at the same time the start node of another connection in the "uncertain" relation subset, adding the connection whose start node is said node to the draft sequence traversal path of the genome sequencing sequences and deleting that connection from the "uncertain" relation subset.

4. The method of claim 1 or 2, characterized in that, when a given node is checked, the method further comprises:

5. The method of claim 1 or 2, characterized in that, after the traversal flow ends, the method further comprises:

taking the draft sequence traversal path and the connections in the "determined" relation subset as the final draft sequence traversal path of the sequenced genome data.