CN105989249B

CN105989249B - For assembling the method, system and device of genome sequence

Info

Publication number: CN105989249B
Application number: CN201510084489.XA
Authority: CN
Inventors: 叶承羲; 马占山
Original assignee: Nanjing Endless Biotechnology Co ltd; Kunming Institute of Zoology of CAS
Current assignee: Yizhi (Nanjing) Bioinformation Technology Co.,Ltd.; Kunming Institute of Zoology of CAS
Priority date: 2014-09-26
Filing date: 2015-02-16
Publication date: 2019-03-15
Anticipated expiration: 2035-02-16
Also published as: CN105989249A

Abstract

The application provides a kind of method compressed using anchor point sequence to sequence to be assembled and carry out gene order assembling, including obtaining the data of a sequence sets to be assembled and the data of multiple anchor point sequences；Using the multiple anchor point sequence, each of the sequence sets to be assembled sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein the concordance list is indicated with the sequence of the label of anchor point sequence；And simultaneously component composition skeleton is compared using the concordance list that each sequence compaction to be assembled in the sequence sets to be assembled obtains.

Description

For assembling the method, system and device of genome sequence

Technical field

This application involves gene sequencing field, in particular to the method, system and device of assembling gene order, The speed and precision of gene order assembling, the especially speed and precision of genome sequence assembling can be improved.

Background technique

The Human Genome Project is one of biological medicine project maximum so far, greatly accelerates DNA sequencing skill The development of art.In the past thirty years, three generations's DNA sequencing technology has been developed, is currently under and skill is sequenced in the second generation The crossroad that art and third generation sequencing technologies substitute.

Wherein, sequence length caused by first generation Sanger sequencing technologies is generally less than 800bp.It is sequenced according to the first generation Technology searches for shared region using overlay chart (Overlapping Graph) among the sequence sets that sequencing obtains to assemble base Because of sequence, but this search and alignment algorithm based on overlay chart expended on time and storage it is higher.And the second generation is sequenced Sequence caused by technology (also referred to as next-generation sequencing technologies, NGS) is shorter, generally in the range of 30-300bp.Second In the case where for sequencing technologies, the de Brujin graph (de Bruijn Graph, DBG) for being based on nucleosides k conjuncted (k-mer) is utilized Algorithm (Conway, T.C.&Bromage, A.J.Succinct data structures for assembling large genomes,Bioinformatics 27,479-486(2011)；Melsted,P.&Pritchard,J.Efficient counting of k-mers in DNA sequences using a bloom filter,BMC Bioinformatics12,333(2011)；Ye,C.,Ma,Z.S.,Cannon,C.H.,Pop,M.&Yu,D.W.Exploiting sparseness in de novo genome assembly,BMC Bioinformatics 13 Suppl 6,S1(2012)) The sequence sets obtained to sequencing assemble.Time associated with overlay chart and storage problem are in processing first generation sequencing technologies It can also bear, but be generated by second generation sequenator in processing huge number of short when the sequence of the relatively small amount of generation When sequence, with regard to becoming very big problem.Although the problem of avoiding paired comparison using DBG mode, but have to face The problem of huge EMS memory occupation.DBG become standard method before, be originally used for assembling the first generation sequence based on overlapping The software of (Overlap-Layout-Consensus, OLC) is integrated in graph model or overlapping-arrangement-, as Celera (Venter, J.C.et al.The sequence of the human genome, Science 291,1304-1351 (2001)), AMOS (Treangen,T.J.,Sommer,D.D.,Angly,F.E.,Koren,S.&Pop,M.Next generation sequence assembly with AMOS,Current protocols in bioinformatics Chapter 11,Unit 11.18 (2011)) and ARACHNE (Batzoglou, S.et al.ARACHNE:a whole-genome shotgun assembler, Genome research 12,177-189 (2002)), also it is applied to the assembling of sequence obtained by second generation sequencing technologies.Word string Figure and optimum superposing figure are the particular forms of OLC method, non-in terms of assembling long sequence by simplifying whole overlapping map Chang Youxiao.

Currently, having developed third generation sequencing technologies (it is also referred to as single-molecule sequencing technology) (Levene, M.J.et al.Zero-mode waveguides for single-molecule analysis at high concentrations, Science299,682-686(2003)；Gu,L.Q.,Cheley,S.&Bayley,H.Capture of a single Molecule in a nanocavity, Science291,636-640 (2001)), so that producible maximum length sequence range is Between 500bp-120kbp.Third generation sequencing technologies biggest advantage is that by high-throughput long sequence.On the one hand, increase Add sequencing length have assist distinguish gene order repeat region significant advantage, thus facilitate the mankind can more directly and have Biological question is explored on effect ground, such as the problems such as Genotyping and mutation.However, on the other hand, increasing sequencing length will mention High error rate, so that there are great problems in terms of the detection and amendment of mistake.

The high error rate of third generation sequencing to be originally developed for relatively accurate first generation sequencing technologies based on weight The OLC method of stacked group dress is not applicable.Equally, the long sequence sets of tape error that the third generation is sequenced to be initially designed to the second generation The DBG method of the short sequence assembling of sequencing technologies is also no longer applicable in.

Therefore, at this stage, the very big obstacle that gene sequencing technology is faced is the absence of a kind of efficient gene sequence group Dress method.

Summary of the invention

Therefore, this application provides the method, system and device for assembling gene order, and it is suitable for random length surveys Sequence sequence is particularly suitable for the long gene order that assembling is measured by third generation sequencing technologies, so as to substantially reduce gene Sequence assembling is calculating the high complexity on time and memory, and improves the speed and precision of gene order assembling.

The one aspect of the application provides a kind of method for assembling gene order, comprising: obtains a sequence to be assembled Arrange the data of collection and the data of multiple anchor point sequences；Using the multiple anchor point sequence, to every in the sequence sets to be assembled One sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein the concordance list is with the mark of anchor point sequence It number indicates；And the concordance list component composition bone obtained using each sequence compaction to be assembled in the sequence sets to be assembled Frame.

In some embodiments, above-mentioned compression comprises determining that each anchor point sequence to obtain the concordance list of sequence to be assembled The arrangement position being listed in sequence to be assembled, and the arrangement position according to anchor point sequence in sequence to be assembled, are corresponded to The concordance list of the sequence to be assembled.In some embodiments, row of each anchor point sequence of above-mentioned determination in sequence to be assembled Column position includes: that the anchor point sequence and each sequence to be assembled are broken into a series of nucleosides k respectively is conjuncted, by the anchor The nucleosides k of point sequence it is conjuncted respectively with the nucleosides k of the sequence to be assembled is conjuncted is compared, when the core of the sequence to be assembled It with the conjuncted number to match of the nucleosides k of anchor point sequence is more than predetermined threshold during glycosides k is conjuncted, then it is assumed that the anchor point sequence can be with It is matched to the sequence to be assembled；It is conjuncted respectively in the anchor point sequence and described to be assembled by each matched nucleosides k Arrangement position of the matched anchor point sequence in sequence to be assembled is calculated in the corresponding relationship of position in sequence.It is right In the anchor point sequence for having multiple matched nucleosides k conjuncted, arrangement position of the anchor point sequence in sequence to be assembled be can be According to being averaged for arrangement position of the conjuncted anchor point sequence being calculated of each matched nucleosides k in sequence to be assembled Value.In some embodiments, the length of the k-mer are as follows: 15bp-51bp.In some embodiments, the threshold value is institute State the 0.1% to 2% of the length of anchor point sequence.

In some embodiments, above-mentioned component composition skeleton includes: by sequence to be assembled each in the sequence sets to be assembled The concordance list that column compression obtains is compared one by one, filters out the sequence subset to be assembled with optimum superposing, with building Optimum superposing map between sequence to be assembled, then according to anchor point sequential labeling in the sequence index table to be assembled in the subset Position corresponding relationship arrange to each sequence to be assembled in the subset, to form assembling skeleton.In some implementations In mode, above-mentioned component composition skeleton includes: to choose a sequence in the sequence sets to be assembled as current sequence, by institute The concordance list of the concordance list and all other sequence in sequence sets to be assembled of stating current sequence is compared one by one, is found out and institute The alternative sequence that current sequence has overlapping relation is stated, alternative sequence and current sequence are indexed table and compared or base ratio pair To find out the sequence that there is optimum superposing and extension with current sequence, and it is added to relative to the part that current sequence extends and is worked as The sequence spreading that current sequence is obtained on presequence, by using every sequence in the sequence sets to be assembled as working as preamble Column repeat the above steps to obtain one group of sequence spreading, then corresponding according to the position of anchor point sequential labeling in the sequence spreading Relationship arranges to each sequence spreading, to form assembling skeleton.In some embodiments, provided by the present application to be used for group The method of dress gene order further comprises: after the concordance list for obtaining every sequence to be assembled, treating group using anchor point sequence The concordance list of dress sequence is reversely retrieved, and the concordance list subset of sequence to be assembled corresponding with the anchor point sequence is obtained, By comparing the concordance list of every other sequence in the concordance list subset to repairing on the concordance list of current sequence to be assembled The concordance list of sequence to be assembled in the concordance list subset of the just described sequence to be assembled.In some embodiments, above by It compares to correct the rope that compressed sequence to be assembled includes: every sequence to be assembled in the concordance list subset to sequence to be assembled Draw table, carries out one as the concordance list of every other sequence to be assembled in current sequence concordance list and the concordance list subset One compares, and is voted according to the frequency that anchor point sequential labeling different on each position occurs, utilizes the ballot of each position As a result the concordance list of every current sequence is modified, to obtain the concordance list of revised sequence to be assembled.Some In embodiment, the above-mentioned concordance list to sequence to be assembled be modified include: by comparison result in the rope of sequence to be assembled Draw the anchor point for occurring not more than 5 times or not more than 4 times or not more than 3 times or not more than 2 times or not more than 1 time in table subset It is deleted from the concordance list comprising its sequence to be assembled；The heterozygous sequence that breakpoint will be present is deleted；And it will be in comparison result By comprising sequence delete.In some embodiments, component composition skeleton is carried out in the concordance list to sequence to be assembled It is carried out after amendment comprising: the concordance list of all above-mentioned revised sequences to be assembled is compared one by one, screening is provided There is the subset of the revised sequence to be assembled of optimum superposing, then according to the revised sequence rope to be assembled in the subset The position corresponding relationship for drawing anchor point sequential labeling in table arranges to sequence to be assembled, to form assembling skeleton.Some In embodiment, the method provided by the present application for assembling gene order further comprises that the assembling skeleton to building carries out such as Lower described corrects again: after component composition skeleton, all sequences to be assembled being compared with assembling skeleton, for assembling bone Any site on frame occurs by all every kind of bases compared into each sequence in the site in the site of calculating The frequency is voted, and is corrected again using the voting results in each site to assembling skeleton, to obtain again revised base Because of sequence assembling result.In some embodiments, above-mentioned ballot is by utilizing the conjuncted realization of great-jump-forward nucleosides k.One In a little embodiments, above-mentioned corrected again to assembling skeleton includes: to obtain the group for covering gene order different zones to be assembled Skeleton is filled, corrects the assembling bone of the covering gene order different zones to be assembled respectively simultaneously using multiple computer processors Frame, and revised assembling skeleton will be integrated to obtain complete gene order assembling result again.

In some embodiments, sequence sets to be assembled described herein include passing through second generation genomic sequencing technique Obtained sequence is sequenced.In some embodiments, the sequence length in sequence sets to be assembled described herein is in 100bp- Between 120kbp or between 500bp-120kbp or between 500bp-100kbp or 1kbp-100kbp it Between or between 100bp-1kbp or between 100bp-300bp.In some embodiments, it is described herein to Assembling sequence sets includes by the obtained sequence of third generation genomic sequencing technique, and the sequence in the sequence sets to be assembled Column length is between 500bp-100kbp.In some embodiments, the length of anchor point sequence described herein is in 2bp- Between 30bp or between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or Person is between 300bp-100kbp.In some embodiments, anchor point sequence described herein is selected from the group: precision is greater than Short sequence of 95% sequence length between 100bp-300bp, the precision in the domain of genome high conserved region are greater than 95% sequence The column length length of sequence length between 1kbp-120kbp in the short sequence and sequence sets to be assembled between 100bp-300bp The partial cut away of sequence.In some embodiments, the length of anchor point sequence described herein is between 2-30bp.

Further aspect of the application provides a kind of for assembling the device of gene order, comprising: module is obtained, it is described Module is obtained for obtaining the data of a sequence sets to be assembled and the data of multiple anchor point sequences；Compression module, the compression mould Block is used to utilize the multiple anchor point sequence, is compressed each of the sequence sets to be assembled sequence to be assembled to obtain The sequence to be assembled concordance list, wherein the concordance list is indicated with the label of anchor point sequence；And assembling module, it is described The concordance list component composition bone that assembling module is used to obtain using each sequence compaction to be assembled in the sequence sets to be assembled Frame.

In some embodiments, provided by the present application for assembling the device of gene order, further comprise: to be assembled The concordance list correction module of sequence, the concordance list correction module of the sequence to be assembled are used to treat group using the anchor point sequence The concordance list of dress sequence is reversely retrieved, and the concordance list subset of sequence to be assembled corresponding with the anchor point sequence is obtained, By comparing the concordance list of every other sequence in the concordance list subset to repairing on the concordance list of current sequence to be assembled The concordance list of the just described sequence to be assembled.

In some embodiments, provided by the present application for assembling the device of gene order, further comprise: gene sequence Column assembling modified result module, the gene order assembling modified result module are used to carry out following institute to the assembling skeleton of building That states corrects again: after component composition skeleton, all sequences to be assembled being compared with assembling skeleton, on assembling skeleton Any site, pass through calculate it is all compare into each sequence in the site the site every kind of bases appearance the frequencys It votes, assembling skeleton is corrected again using the voting results in each site, to obtain again revised gene sequence Column assembling result.

In some embodiments, provided by the present application for assembling the device of gene order, further comprise: gene sequence Column assembling result integrates module, the gene order assembling result integrate module for obtain cover gene order difference to be assembled The assembling skeleton in region corrects the covering gene order different zones to be assembled using multiple computer processors respectively simultaneously Assembling skeleton, and will be integrated to obtain complete gene order assembling result by revised assembling skeleton again.

Another aspect of the application provides a kind of system for assembling gene order, comprising: input module is used In the data for the data and multiple anchor point sequences for obtaining a sequence sets to be assembled；Memory is used to store the sequence to be assembled Arrange the data of collection and the data of multiple anchor point sequences；And processor, it is configured in；It is right using the multiple anchor point sequence Each of the sequence sets to be assembled sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein institute State concordance list is indicated with the label of anchor point sequence；And it is obtained using each sequence compaction to be assembled in the sequence sets to be assembled The concordance list component composition skeleton.

In some embodiments, in the system provided by the present application for assembling gene order, processor is further It is configured at: after obtaining the concordance list, reversely being retrieved, obtained using concordance list of the anchor point sequence to sequence to be assembled To the concordance list subset of sequence to be assembled corresponding with the anchor point sequence, by will be every other in the concordance list subset In concordance list subset in the concordance list comparison of sequence to the concordance list of current sequence to be assembled to correct the sequence to be assembled The concordance list of every sequence to be assembled.

In some embodiments, in the system provided by the present application for assembling gene order, processor is further Be configured at that the assembling skeleton to building is discussed below corrects again: after component composition skeleton, by all sequences to be assembled It is compared with assembling skeleton, for any site on assembling skeleton, by calculating all compare to each item in the site In sequence the site every kind of base occur the frequency vote, using each site voting results to assembling skeleton into Row is corrected again, to obtain again revised gene order assembling result.

In some embodiments, in the system provided by the present application for assembling gene order, processor is to assembling bone It includes: to obtain the assembling skeleton for covering gene order different zones to be assembled that frame carried out corrects again, at multiple computers Reason device corrects the assembling skeleton of the covering gene order different zones to be assembled respectively simultaneously, and will revised assembling again Skeleton is integrated to obtain complete gene order assembling result.

Correspondingly, by applying mthods, systems and devices described herein, by short sequence or high-precision is sequenced in high precision The assembling segment that short sequence is sequenced in degree forms anchor point sequence, and sequence to be assembled is marked using anchor point sequence, and will be to Sequence compaction is assembled into corresponding concordance list, is then compared and assembles again, to greatly accelerate assembling speed and reduce Calculation amount.

Also, the error rate of the sequence to be assembled in view of being obtained by third generation sequencing technologies is relatively high, therefore into one Step is corrected or removes to compressed sequence to be assembled, so that the mistake of sequence to be assembled is corrected, improves Assemble the accuracy of skeleton.Finally, for any site on assembling skeleton, by calculate it is all compare it is each to the site The frequency that every kind of base in sequence in the site occurs is voted, using the voting results in each site to assembling skeleton It is modified to have obtained the higher gene order assembling result of accuracy rate.

Therefore, mthods, systems and devices described herein only need lower gene order coverage rate to provide for more High continuity, and the smallest calculator memory is only needed, and when handling big gene order-checking sequence sets, speed has The promotion of multiple orders of magnitude, therefore it is suitable for random length sequencing sequence, it is particularly suitable for assembling and passes through third generation sequencing technologies The long gene order measured.

Detailed description of the invention

The above-mentioned and other feature of the application will by with reference to the accompanying drawing and its detailed description be described further.It should Understand, these attached drawings illustrate only several illustrative embodiments according to the application, therefore are not considered as pair The limitation of the application protection scope.Unless stated otherwise, attached drawing is not necessarily to scale, and wherein similar label indicates class As component.

Fig. 1 is a kind of flow chart of gene order assemble method shown according to an exemplary embodiment.

Fig. 2 is a kind of flow chart of the gene order assemble method shown according to another exemplary embodiment.

Fig. 3 is a kind of flow chart of the gene order assemble method shown according to a further exemplary embodiment.

Fig. 4 A to Fig. 4 G is the schematic diagram that sequence assembling is carried out according to Fig. 1 to method shown in Fig. 3.

Fig. 5 A to Fig. 5 D is the graphical representation of exemplary for illustrating integration step shown in Fig. 3.

Fig. 6 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 1.

Fig. 7 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 2.

Fig. 8 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 3.

Fig. 9 is a kind of block diagram of gene order package system shown according to an exemplary embodiment.

Specific embodiment

Refer to constitute the attached drawing of this specification a part in the following detailed description.Show mentioned by the description and the appended drawings Meaning property embodiment is not intended to the protection scope of limitation the application only merely for being illustrative purpose.Those skilled in the art Member, which is appreciated that, can also be used many other embodiments, and can make various changes to described embodiment, without Away from the purport and protection scope of the application.It should be understood that the various aspects for the application for illustrating and illustrating herein can be with It arranges, replace, combine, separate and design according to many different configurations, these difference configurations are included in the application.

Definition

" assembling " of gene order is referred in the application, multiple sequences to be assembled are compared, and using between sequence Overlapping relation construct the genome based on the sequence to be assembled.

Term " comparison " in the application is when public for that can pass through those skilled in the art when comparison to nucleic acid sequence Any method for knowing is completed, for example, can by existing software BLASR, DALIGN, BLAST, BLAST-2, ALIGN, ALIGN-2BWA, BOWTIE or Megalign (DNASTAR) etc..Term " comparison " in the application when for concordance list it Between comparison when can be by the way that well known to a person skilled in the art the completions of any other method.(such as Smith, Temple F.and Waterman,Michael S.(1981),"Identification of Common Molecular Subsequences", Journal of Molecular Biology 147:195–197；Or Lipman, DJ；Pearson,WR(1985)," Rapid and sensitive protein similarity searches",Science 227(4693):1435–41；Or Person Pearson, WR；Lipman,DJ(1988),"Improved tools for biological sequence comparison",Proceedings of the National Academy of Sciences of the United States of America 85(8):2444–8)。

" OLC algorithm " in the application refers to algorithm (the Staden R.A new computer developed by Staden method for the storage and manipulation of DNA gel reading data.Nucleic Acids Res.8 (16): it is less to be mainly used for data volume for the related algorithm developed 3673-3694. (1980)) and on its basis Long sequence (especially first generation sequencing sequence) splicing.Common software based on OLC algorithm have Arachne (referring to Batzoglou S,Jaffe DB,Stanley K,et al.ARACHNE:a whole-genome shotgun Assembler.Genome Res12:177-89. (2002)), Celera Assembler is (referring to Myers EW, Sutton GG,Delcher AL,etal.A whole-genome assembly of Drosophila.Science 287:2196– 204. (2000)), CAP3 is (referring to Huang X, Madan A.CAP3:A DNA sequence assembly Program.GenomeRes 9:868-77. (1999)), PCAP is (referring to Huang X, Yang SP.Generating a genome assembly with PCAP.Curr Protoc Bioinformatics.Chapter 11:Unit11.3. (2005)), Phrap is (referring to de la Bastide M, McCombie WR.Assembling genomic DNA sequences with PHRAP.Curr Protoc Bioinformatics.Chapter11:Unit11.4.(2007)), Phusion is (referring to Mullikin JC, Ning Z.The phusion assembler.Genome Res 13:81-90. (2003)) and Newbler is (referring to Marcel Margulies1ME, William EA, Said A, et al.Genome sequencing in open microfabricated high density picoliter reactors.Nature437: 376–80.(2005))。

" DBG algorithm " in the application refers to by algorithm (the Idury RM, Waterman of Idury and Waterman exploitation MS.A new algorithm for DNA sequence assembly.JComput Biol 2:291-306. (1995)) with And the related algorithm developed on its basis, it was widely used in bioinformatics in the assembling of genetic fragment later, it is main It is suitble to the splicing of the short sequence of high coverage rate.Its corresponding software has Euler-USR (referring to Chaisson MJ, Brinza D,Pevzner PA.De novo fragment assembly with short mate-paired reads:does the Read length matter? Genome Res 19:336-46. (2009)), Velvet is (referring to Zerbino DR, Birney E.Velvet:algorithms for de novo short read assembly using de Bruijn Graphs.Genome Res18:821-9. (2008)), ABySS is (referring to Simpson JT, Wong K, Jackman SD, et al.ABySS:a parallel assembler for short read sequence data.GenomeRes 19:1117– (2009)), 23. AllPath-LG is (referring to Gnerre S, Maccallum I, Przybylski D, et al.High- quality draft assemblies of mammalian genomes from massively parallel Sequence data.Proc Natl Acad Sci USA108:1513-18. (2011)), SOAPdenovo (referring to Li R, Zhu H,Ruan J,et al.De novo assembly of human genomes with massively parallel short read sequencing.Genome Res20:265–72.(2009))。

" k-mer " i.e. nucleosides k in the application is conjuncted, refers to and continuously cuts gene order with certain length k, according to every A base sequentially moves the nucleotide sequence that the identical sequence length of a cutting is k, such as sequence are as follows: ATCGTTGCTTA The nucleotide sequence of ATGACGTCAGTCGAATGCGATGACGTGACTGACTG can be with if k-mer is 13-mer Obtain the nucleotide sequence that a series of following length are 13:

ATCGTTGCTTAAT

TCGTTGCTTAATG

CGTTGCTTAATGA

GTTGCTTAATGAC

…

ACGTGACTGACTG。

" Sparse k-mer " i.e. great-jump-forward nucleosides k in the application is conjuncted, refers in analyses and comparison result, for base Because sequence is with the segmentation of k length progress jumping characteristic, rather than successively cut along moving one.

Term " coverage rate " or " depth " in the application refer to the base total amount that sequencing obtains when for being sequenced (bp) with the ratio of tested gene order size, it is one of the index for evaluating sequencing amount.Bring error rate or false sun is sequenced Property result can decline with the promotion of sequencing depth, therefore primary sample would generally be repeated several times in gene order-checking Amplification to increase sequencing depth.

Embodiment

As described in the background art, the high error rate of third generation sequencing to be originally developed for the relatively accurate first generation The OLC algorithm based on overlapping assembling of sequencing technologies is not applicable.Equally, the long sequence sets of tape error that the third generation is sequenced make most Just also no longer it is applicable in designed for the DBG algorithm of the short sequence assembling of second generation sequencing technologies.Therefore, present applicant proposes one kind The method that gene order assembling is carried out using mixed strategy.Also, required for the sequence sets obtained as third generation sequencing approach Data capacity it is very big, even therefore being required in the genome of small microorganism for the error correction of these long sequences very high Computing resource.It is assembled in reducing gene order for this purpose, the application further passes through introducing " anchor point sequence " and calculates time and interior The high complexity deposited, and ensure the precision of gene order assembling.

Hereinafter, will illustratively be illustrated with reference to Fig. 1 for the application.Fig. 1 show the exemplary reality of the application Apply a kind of flow chart of gene order assemble method of example.Firstly, gene order assemble method described herein can be in office What realized on the digital terrace system of form, which includes desktop computer, notebook, Portable palm computer, clothes Business device, computer cluster, network workstation etc., a kind of exemplary implementation later in association with Fig. 9 to the present processes are realized Example system 900 is described in detail, and wouldn't be first described in detail here.

As shown in Figure 1, the gene order assemble method of the application mainly includes tri- steps of S101, S102 and S103.With Under three steps will be introduced respectively in conjunction with attached drawing.

Firstly, in step s101, obtaining the data of a sequence sets to be assembled and the data of multiple anchor point sequences.

Specifically, in step s101, by system 900 (such as computer), by input module, (such as data are transmitted Interface) obtain gene sequencing data.The gene sequencing data had both included the original series obtained by gene order-checking instrument Data also may include the data obtained from the data based on original series are assembled after pretreatment.The warp The data crossed after pre-processing include: for example, by the available high-precision short sequence of second generation genomic sequencing technique, so The sequence (contig) of segment is assembled them into using existing composite software afterwards.In some embodiments, sequence to be assembled Collection can be any sequence sets.In some embodiments, sequence sets to be assembled are obtained by second generation genomic sequencing technique Sequence sets.In other embodiments, sequence sets to be assembled are the obtained sequence sets of third generation genomic sequencing technique. Hereinafter, will be illustrated so that sequence sets to be assembled are the obtained sequence sets of third generation genomic sequencing technique as an example.

Anchor point sequence is sequence corresponding to a certain segment area in testing gene group, anchor point sequence it is corresponding to The segment of cls gene group is having the same or similar nucleic acid sequence.Anchor point sequence, which can correspond to testing gene group, representative The region of property, such as the high conserved region domain of testing gene group.In some embodiments, the anchor point sequence is by available data The sequence obtained in library.In other embodiments, the anchor point sequence is that second generation genomic sequencing technique is obtained Short sequence, or by utilizing existing composite software (such as SOAPdenovo composite software of Huada gene company offer) Sequence obtained from short sequence obtained to second generation genomic sequencing technique assembles or the two combination. In other embodiments, the anchor point sequence is the partial cut away of sequence to be assembled itself, such as by by sequence to be assembled It is a series of short sequence that column, which concentrate longest one section of sequence truncation, will be owned in above-mentioned short sequence and sequence sets to be assembled later Other sequences are compared, and the part of those unfolded sequences of short sequence obtained with above-mentioned truncation are carried out further Interception, be then added in the set of above-mentioned short sequence, until obtained short sequence sets can have with all sequences it is overlapping.

Persons skilled in the art can select the length of anchor point sequence according to the length of sequence to be assembled, for third Generation sequencing, the length of anchor point sequence are generally set to the 10-30% of the average length of sequence in sequence sets to be assembled.For the second generation Sequencing technologies, anchor point length can be longer than average sequencing length.Sequence sets to be assembled for one group can choose fixed anchor point Sequence length also can choose the anchor point sequence of different length.It should be appreciated that the average length of anchor point sequence is longer, then compress The average length of the concordance list of sequence to be assembled afterwards is shorter, and data processing amount is smaller；The anchor point sequence quantity the how then right simultaneously The coverage area of its entire gene order to be assembled is higher, the anchor point sequence quantity being typically chosen multiplied by anchor point sequence average length Degree should be not less than the 30% of the total length of gene order to be assembled.In some embodiments, the anchor point sequence length exists Between 2bp-30bp or between 30bp-100bp or between 100bp-300bp or 30bp-100kbp it Between or between 300bp-100kbp.Different anchor point sequences can indicate that the label can be number by different labels Word perhaps letter perhaps combination or other suitable representations of the number with letter.For example, several bit digitals can be used Number 001,002,003 etc. can be respectively labeled as anchor point sequential labeling, such as multiple anchor point sequences.In another example in generation, can be used Number label, such as x₁、x₂、x₃Etc. respectively indicating different anchor point sequences.

Then, in step s 102, the multiple anchor point sequence will be utilized, to each of described sequence sets to be assembled Sequence to be assembled is compressed, to obtain the label with anchor point sequence of the sequence to be assembled the concordance list that indicates.

Specifically, anchor point sequence is compared with sequence to be assembled, anchor point sequence is found out in sequence to be assembled Exhaust position, and the arrangement position according to anchor point sequence in sequence to be assembled obtain the index for corresponding to the sequence to be assembled Thus table is indicated every sequence to be assembled with the label of anchor point sequence, that is, indicate gene order boil down in the form of label String number or character, complete sequence compaction, and obtain correspond to multiple anchor point sequences multiple concordance lists, the multiple rope Draw table and forms the data set for corresponding to the compression of the sequence sets to be assembled.Such as: if anchor point sequence is contig_331, Contig_101, contig_17, when these contig are successively compared onto a sequence to be assembled, and anchor point sequence to Arrangement position in assembling sequence is sequentially contig_331, contig_101, contig_17, then the sequence to be assembled is corresponding Concordance list be [331,101,17].

In some embodiments, each sequence to be assembled is compressed by k-mer mode below, and obtains each The concordance list of sequence to be assembled: by anchor point sequence and sequence to be assembled be broken into equal length k-mer cis-position set (for example, If k=13, anchor point sequence and sequence to be assembled are lighted to the cis-position set for being broken into 13-mer from its start bit respectively).It will obtain The k-mer of the anchor point sequence and sequence to be assembled that obtain is compared one by one respectively, and selects to close according to the length of anchor point sequence Suitable threshold value, when with the conjuncted number to match of nucleosides k of anchor point sequence being more than pre- during the nucleosides k of the sequence to be assembled is conjuncted Determine threshold value, then it is assumed that the anchor point sequence can be matched to the sequence to be assembled；Pass through each conjuncted point of matched nucleosides k It is conjuncted corresponding that the matched nucleosides k is calculated in relative position not in the anchor point sequence and the sequence to be assembled Arrangement position of the anchor point sequence in sequence to be assembled.For the anchor point sequence for having multiple matched nucleosides k conjuncted, the anchor Arrangement position of the point sequence in sequence to be assembled can be according to the conjuncted anchor point being calculated of each matched nucleosides k The average value of arrangement position of the sequence in sequence to be assembled.Wherein, in certain embodiments of the present invention, the predetermined threshold Value is the 0.1% to 2% of the length of the anchor point sequence.Those skilled in the art can be according to each anchor point sequence being calculated The arrangement position being listed in sequence to be assembled obtains the concordance list of sequence to be assembled.The wherein length of the k-mer are as follows: 2bp- 51bp, 15bp-19bp, 15bp-25bp, 2bp-15bp, 2bp-19bp, 31bp-51bp or 15bp-51bp.

Furthermore it is possible to which the sequence to be assembled concordance list being fully contained within the concordance list of another sequence to be assembled is deleted It removes, with simplification sequence data collection to be assembled.Also, using a sequence to be assembled as current sequence, by the index of the current sequence Table and the concordance list of all other sequence in sequence sets to be assembled are compared one by one, and finding out has with current sequence concordance list Alternative sequence and current sequence are indexed table later and compared or alkali by the sequence (alternatively referred to as " alternative sequence ") of overlapping relation Base compares, and finds the compressed sequence to be assembled wherein relative to current sequence comprising the extension of concordance list in one direction Column, and the sequence for being included by current sequence completely is removed；Picking out in remaining alternative sequence has with current sequence The sequence of leftward or rightward optimum superposing, removes the sequence for not having Maximum overlap degree in one direction, only retains and work as Presequence has the sequence (its optimal extension sequence for being referred to as current sequence) of leftward or rightward optimum superposing, and by institute It states optimal extension sequence and is added to the sequence spreading for obtaining current sequence in current sequence relative to the part that current sequence extends. In above-mentioned comparison process, if current sequence is completely included by other with its sequence with overlapping relation, preamble is worked as in deletion Column, do not generate the sequence spreading of current sequence；If not finding has the sequence extended to the left or to the right relative to current sequence, Obtained sequence spreading is then compared using the current sequence as the wheel.By by every sequence in the sequence sets to be assembled Column repeat the above steps as current sequence to obtain one group of sequence spreading, then according to anchor point sequence mark in the sequence spreading Number position corresponding relationship arrange to each sequence spreading, to complete component composition skeleton.Art technology can be used Any suitable method and software known to personnel carry out the overlapping between the sequence of calculation, for example, by BLASR, DALIGN, BLAST, BLAST-2, ALIGN, ALIGN-2BWA, BOWTIE, SOAP or Megalign (DNASTAR) etc. compare software to working as Presequence is compared one by one with the every other sequence in sequence sets to be assembled, is found out based on comparison result and is had with current sequence There is the sequence of overlapping relation.

In some embodiments, in the construction step of optimum superposing figure or in the screening step to optimal extension sequence In rapid, one sequence to be assembled of each node on behalf, for the combination of one or more continuous anchor point sequential labelings.To obtain Optimum superposing figure can carry out two-wheeled operation, and the 1st wheel deletes all nodes for all being included, for example, if there is node [x₁,x₂,x₃], then delete the node [x being fully contained in above-mentioned node₁,x₂] and node [x₂,x₃], the purpose done so Be in order to avoid unnecessary repetition and comprising node comparison.2nd wheel calculates all between the node for mutually not including Suffix-prefix overlapping.Then optimum superposing figure is simplified, it is all when being higher than from node A to the degree of overlapping of the extension of node B When other overlappings extended from node A to same direction, there are advantages for the extension of node A to node B.In certain embodiments In, in degree of overlapping total length of multiple anchor points as involved by the length of corresponding overlapping region or overlapping region or overlapping region Sequence to be assembled is measured with the matched k-mer quantity of anchor point.In general, using every sequence as current sequence and other sequences Column are compared, may all find one have to the left with the optimal extension sequence of current sequence Maximum overlap and one to The right side has optimal extension sequence with current sequence Maximum overlap, and by the optimal extension sequence to the left and to the right optimal prolongs It stretches sequence and is added to the sequence spreading for obtaining the sequence in the current sequence relative to the part that current sequence extends.But due to Sequence length to be assembled is limited, it is thus possible to the repetition that occurs can not handling (for example a plurality of sequence to be assembled is a certain current The optimum superposing of sequence, referred to as optimal side by side) will lead to occur branch (Miller, J.R.et on map at this time al.Aggressive assembly of pyrosequencing reads with mates.Bioinformatics 24, 2818-2824(2008)).When there is branch, is there is bifurcation disconnection in packing algorithm, or selects phase between a wherein sequence Mutually optimal (i.e. the extension of A to B is that A extends in the optimum superposing of dextrad, and the extension of B to A is that B extends in the optimum superposing of left-hand) Overlapping extended.To not occur each linear region of bifurcated (there is no optimal side by side i.e. between sequence) in optimum superposing figure Individually output is assembling result.The compressed sequence indicated with indexing sheet form is together in series, is re-converted into later with base Assembling skeleton of the unpressed sequence to be assembled that form indicates as genome profile.

Therefore, by utilizing assembling gene sequence method as described in Figure 1, by anchor point sequence (for example, being sequenced in high precision short The assembling segment of short sequence is sequenced in sequence or high-precision) to sequence to be assembled (for example, by second generation genomic sequencing technique or The obtained long sequence of person's third generation genomic sequencing technique) it is marked, and by sequence compaction to be assembled at corresponding index Then table is compared and assembles again, to greatly accelerate assembling speed and reduce calculation amount.

In some embodiments, the concordance list of above-mentioned to be assembled sequence, tool can be relied on the assembling of sequence to be assembled It include: to body a sequence in the selection sequence sets to be assembled as current sequence, by the concordance list of the current sequence It is compared, is found out with the current sequence with Chong Die one by one with the concordance list of all other sequence in sequence sets to be assembled The current sequence and the sequence with current sequence with overlapping relation are indexed table and compared or alkali by the sequence of relationship Base compares to find out the optimal extension sequence of current sequence, and its part extended relative to current sequence is added to current sequence The upper sequence spreading (i.e. leftward or rightward optimum superposing extends) for obtaining current sequence, and pass through the sequence sets to be assembled In every sequence repeated the above steps as current sequence to obtain one group of sequence spreading, then according in the sequence spreading The position corresponding relationship of anchor point sequential labeling arranges to each sequence spreading, to form assembling skeleton.

Fig. 2 show a kind of flow chart of gene order assemble method according to the application another exemplary embodiment.Fig. 2 Shown in gene order assemble method mainly include tetra- steps of S201, S202, S203 and S204.Wherein S201 step is used for A sequence sets to be assembled and multiple anchor point sequences are obtained, it is substantially similar to the S101 in figure 1 described above, herein just no longer It repeats.Equally, S202 step (compressing sequence to be assembled to obtain the concordance list being made of the label of anchor point sequence) and above S102 is similar, is also not repeated herein.

S203 step described in detail below.It is obtained by S202 step with the label of anchor point sequence the concordance list that indicates Afterwards, compressed sequence to be assembled is further corrected by comparing in S203 step.

Specifically, it is reversely retrieved using concordance list of the anchor point sequence to sequence to be assembled, is repaired by comparing Sequence to be assembled being compressed includes: that the anchor point sequence according to obtained in S201 carries out reversely the concordance list of sequence to be assembled Retrieval obtains a plurality of continuous row in concordance list described in S202 and in an anchor point sequence or the anchor point sequence The set (that is, concordance list subset of sequence to be assembled) of the corresponding sequence to be assembled of combination of the anchor point sequence of column；To utilization The reversed concordance list for retrieving every sequence to be assembled in obtained arrangement set to be assembled, as current sequence rope Drawing that table compared one by one with every other sequence to be assembled in the concordance list subset (i.e. will be right in every two sequences to be assembled The label of the anchor point sequence of position is answered to be compared one by one), the frequency occurred according to anchor point sequential labeling different on each position It is secondary to vote, correct every current sequence in occur mistake anchor point sequential labeling, and if current sequence concordance list In with it is duplicate or by comprising anchor point series arrangement relationship, then the current sequence is deleted, thus obtain it is more accurate and The concordance list collection for the sequence to be assembled simplified.

It is contig-x with anchor point sequence₁、-x₂、-x₃。。。-x_iFor, it is compressed to group to correct above by comparing The step of filling sequence includes: the label x using contig₁、x₂、x₃。。。x_iConcordance list is constructed, so as to every anchor point sequence, all All sequences to be assembled containing the anchor point sequence can be found rapidly by concordance list.To each contig anchor point sequential labeling x_i, record includes the x_iSequence to be assembled sequence index table, obtain the set of the concordance list of corresponding sequence to be assembled, This process can be looked at as the process of a sequence of packets.Later, with dynamic programming algorithm or other more efficient calculations Method compare one by one in group to each item sequence to be assembled in obtained arrangement set to be assembled.Above-mentioned compare one by one can be in quilt Can also carry out in unpressed sequence to be assembled in the sequence to be assembled of compression.

For compressed sequence to be assembled, the comparison one by one in group refers to the rope of every compressed sequence (current sequence) Draw table and be compared with the concordance list for organizing interior other sequences: on the one hand, will only be occurred very in current sequence in Multiple Sequence Alignment Few number (for example, only occur 1-5 time or not more than 5 times or not more than 4 times, not more than 3 times, not more than 2 times, be not more than 1 It is secondary) anchor point sequential labeling deleted from the concordance list of the sequence, such as current sequence concordance list be [331,101,21,17], And the anchor point sequence in Multiple Sequence Alignment marked as 21 only occurs in the concordance list of current sequence once, and not any Occurred in the concordance list of other sequences, then deletes the anchor point sequential labeling in current sequence concordance list [331,101,21,17] 21, i.e., the concordance list of former sequence becomes [331,101,17] by [331,101,21,17]；On the other hand, if in multisequencing ratio Centering finds that (i.e. all sequences Chong Die with breakpoint left end are not be overlapped on the right of breakpoint, on the contrary there are breakpoint in current sequence And in this way, then determine that the point is breakpoint), then this whole sequence is known as heterozygous sequence, and it is deleted.Comparing one by one simultaneously To in the process, if current sequence includes by other sequences, delete this by comprising current sequence (for example, if current sequence Concordance list is [x₁,x₂] or [x₂,x₃], when there are concordance list be [x₁,x₂,x₃] sequence, then delete current sequence).It is deleted The sequence removed will be no longer participate in the building of subsequent assembling skeleton.

It is walked in view of the error rate of the sequence to be assembled obtained by third generation sequencing technologies is relatively high, therefore by S203 Suddenly compressed sequence to be assembled is corrected or is removed, it is very big to may make that the accuracy of subsequent assembling skeleton has obtained It improves.

Then, in step S204, component composition skeleton.Specifically, using in S203 it is revised it is compressed to Sequence construct optimum superposing map is assembled, to complete component composition skeleton.Because of step S204 and step described above S104 is similar, is just not repeated here.

Also, present invention also provides the embodiments of another gene order assemble method.Fig. 3, which is that the application is another, to be shown A kind of flow chart of gene order assemble method of example property embodiment.Gene order assemble method shown in Fig. 3 mainly includes Five steps of S301, S302, S303, S304 and S305.

Wherein in step S301, the data of a sequence sets to be assembled and the data of multiple anchor point sequences are obtained；Then exist In step S302, compressing sequence to be assembled is the concordance list indicated with the label of anchor point sequence；In S303 step, pass through comparison Delete heterozygous sequence in compressed sequence to be assembled and by comprising sequence, and will be in the Multiple Sequence Alignment of sequence to be assembled Only occur seldom number (for example, only occur 1-5 times or not more than 5 times or not more than 4 times, not more than 3 times, not more than 2 It is secondary, not more than 1 time) the label of anchor point sequence deleted from the concordance list comprising its sequence；In step s 304, building group Fill skeleton；And in step S305, the higher gene order of accuracy rate is obtained by integration and assembles result.In present embodiment In, step S301, S302, S303 and S304 are similar with above step S101, S102, S103 and S104 respectively, herein Just it is not repeated.Step S305 described in detail below.

Specifically, being corrected again to what the assembling skeleton of building was discussed below: in step S305 by needed group Dress sequence is compared with the obtained assembling skeleton of step S304, for any site on skeleton, by calculating is all can It compares the frequency that every kind of base occurs into each sequence in the site to vote, using the voting results in each site to group Dress skeleton is corrected again, so that finally integration obtains the higher gene order assembling result of accuracy rate.

In some embodiments, above-mentioned ballot integration be by by all sequence alignments to be assembled to assembling skeleton, And according to comparison result construct great-jump-forward k-mer figure (Ye, C., Ma, Z.S., Cannon, C.H., Pop, M.&Yu, D.W.Exploiting sparseness in de novo genome assembly.BMC Bioinformatics13Suppl 6, S1 (2012)), ballot integration is carried out using a plurality of sequence in same site, then correct bone Mistake in frame, and obtain the higher gene order of final accuracy rate using the structure of great-jump-forward k-mer figure and assemble result.Jump The purpose of formula k-mer be in big data quantity by jumping mode to save calculating when the memory that occupies.K value generally uses 1 ~5, the slightly big k value of use can get around the mistake in alignment algorithm and compare.The step-length g that jumps is typically also 1~5, slightly big Step-length can substantially reduce the memory occupied when calculating.In the case where integrating the comparable situation of precision, with existing PBagcon algorithm (Chin,C.S.et al.Nonhybrid,finished microbial genome assemblies from long-read SMRT sequencing data.Nature methods10,563-569 (2013)) it compares, use Sparc as described below Integration algorithm speed is its 5 times or so, when calculating occupy in save as its 1/2-1/5.

Sparc integration algorithm is a kind of method based on great-jump-forward k-mer, can specifically include three steps, step 1: Construct skeleton k-mer figure；Step 2: carrying out sequence alignment and recording new node and side according to comparison result in comparison process And update the weight on each side；And step 3: search maximum path.Above steps is made below with reference to attached drawing 5A-5D It is described in detail out.

As shown in Figure 5A, assembling frame sequence is ACTGGACTAAA, and with k-mer length for 2, jump step-length g is 3 buildings K-mer figure, then: the initial position of the k-mer is respectively Isosorbide-5-Nitrae, 7,10, corresponding k-mer be respectively as follows: AC, GG, CT, AA is chosen as node.Record the connection relationship between k-mer: such as first k-mer AC, which turns right, extends past TGG tri- After base, the end position of second k-mer GG is reached.TGG is then registered as the side of connection AC node and GG node (edge), at this time because TGG thus only occurred once, so the weight on side is 1, and so on.

The comparison result of several sequences and skeleton will be illustratively analyzed below.

As shown in Figure 5 B, first original series (sequence or its segment i.e. to be assembled) is ACTGGACCAAA, by itself and group Skeleton is filled to compare.In comparison result, AC, GG node of initiation site 1,4 can compare completely, then to the node be not required into Row operation bidirectional, only increases the weight on side, i.e. the weight of path TGG becomes 2.When reaching 7 site, due to CC with CT Incomplete matching in skeleton then using CC as new node, and records one from the 4th node GG to the 7th node CC New route ACC, as new side, weight 1.Due to the AA and skeleton track of the 10th node, so not needing to generate new Node, only record the new side AAA from the 7th new node CC to existing tenth site before, weight 1.

As shown in Figure 5 C, Article 2 original series are ACGGACCAAA, it is compared with skeleton.In comparison result, rise The node AC matching in beginning site 1, reaches the 4th node GG by two bases G G, the side generated due to GG node and front TGG is different, so generating new edge GG, weight 1.Next, when from the 4th node GG to the 7th node, due to Node CC presequence in generated, path has also been recorded, then increases the weight on the corresponding side ACC, weight 2.

The process can be carried out by sequence, the maximum path of weight then can be searched in final result and by the path phase It combines (as shown in Figure 5 D), to obtain integration sequence: ACTGGACCAAA.In some embodiments, all weights in figure It needs to be overregulated, for example subtracts some small numerical value (such as 10%-30% of region sequencing depth), to avoid by being sequenced The erroneous path of biggish weight summation caused by long insertion property mistake in region.

Therefore, by comparing all sequences to be assembled and assembling skeleton, the nucleotide to a plurality of sequence in same site Ballot integration (find out the highest nucleotide of the frequency of occurrences and choose nucleotide of the nucleotide as the site) is carried out, preferably , the assembling skeleton is corrected by Sparc integration algorithm again, to obtain the higher gene order assembling result of accuracy rate.

Fig. 4 A to Fig. 4 G is the schematic diagram that the method according to Fig. 1 to Fig. 3 carries out sequence assembling.Firstly, as shown in Figure 4 A, In step S101, S201 or S301, sequence sets and anchor point sequence to be assembled is obtained.Wherein the left Fig. 4 A is various types of thin Lines indicate to have the anchor point sequence of number designation, and the various thick lines of right then indicate that each sequence to be assembled (such as is schemed Original long sequence shown in 4A).As described above, the sequence sets to be assembled can be the original obtained by gene order-checking instrument The data or the data based on original series of beginning sequence data pretreated obtained from being assembled；The anchor point sequence Column can be to be obtained by screening in existing database, or by the way that using existing composite software, (such as Hua Da gene is public The composite software provided is provided) obtained from obtained to second generation genomic sequencing technique short sequence assembles, either The partial cut away of sequence itself to be assembled in sequencing result.

Fig. 4 B shows sequence mark corresponding with S102, S202, S302 step and compression process: i.e. by anchor point sequence Gene order is carried out with sequence to be assembled to compare, and calculates exhaust position of the anchor point sequence in sequence to be assembled, and such as Fig. 4 B institute Show, with the labelled notation of anchor point sequence sequence to be assembled；Labeled each item sequence to be assembled is compressed to by each anchor point There is the list of sequential arrangement with it in long sequence in the label of sequence, completes sequence compaction and obtain with anchor point sequence to be label Sequence to be assembled concordance list.

Step corresponding with S203, S303 as shown in figures 4 c and 4d: it is obtained using from S102, S202, S302 Concordance list, the grouping that reversed retrieval obtains all sequences to be assembled relevant to a certain anchor point sequence are (as shown in Figure 4 C, reversed to examine Suo Suoyou includes the sequence of packets that the sequence of anchor point sequence 1 obtains)；Sequence in obtained group will be reversely retrieved later to carry out one by one It compares, for every sequence (current sequence), its concordance list is compared with the concordance list of other sequences in all groups: one Aspect few number will only occur (for example, only occurring 1-5 times or not more than 5 in Multiple Sequence Alignment in current sequence It is secondary or not more than 4 times, not more than 3 times, not more than 2 times, not more than 1 time) anchor point sequential labeling from the concordance list of the sequence It deletes；On the other hand, if in Multiple Sequence Alignment, it is (i.e. all Chong Die with breakpoint left end that there are breakpoints in discovery current sequence Sequence with breakpoint on the right of it is not be overlapped, otherwise be also in this way, then determining that the point is breakpoint), then using the current sequence as miscellaneous Sequence is closed to delete；In another aspect, in comparison process one by one, if current sequence includes by other sequences, delete this by comprising Current sequence (Fig. 4 D).Deleted anchor point and heterozygous sequence or by comprising sequence will be no longer participate in subsequent assembling skeleton Building.Sequence to be assembled is grouped after above-mentioned reversed retrieval, and the concordance list of every sequence to be assembled is compared Modified process, it is finally obtained the result is that the concordance list collection of sequence to be assembled that is more accurate and simplifying, and this is precisely and essence The concordance list of letter concentrates all sequences that will be used in the building process of subsequent assembling skeleton.

And Fig. 4 E- Fig. 4 F then illustrates the process of component composition skeleton corresponding with S104, S204, S304: being based on The concordance list, by the concordance list of the compressed sequence (current sequence) to be assembled of each and other quilts for assembling skeleton The concordance list of compressed sequence is compared one by one, finds out the alternative sequence for having overlapping relation with the current sequence, will be alternative Sequence and current sequence are indexed table and compare or base ratio pair, and concordance list is entirely by other sequences first in removal comparison result The sequence that completely includes of concordance list, searching out later has Chong Die anchor point sequence with it and can prolong in one direction Concordance list in comparison result is not had the sequence of Maximum overlap in one direction by a plurality of compressed sequence to be assembled of exhibition Column remove, and only retain the optimal extension sequence of current sequence, and it is added to relative to the part that current sequence extends and works as preamble The sequence spreading of current sequence is obtained on column, and by using every sequence in the sequence sets to be assembled as current sequence It repeats the above steps to obtain one group of sequence spreading, then according to the position corresponding relationship pair of anchor point sequential labeling in sequence spreading Each sequence spreading is arranged (as shown in Figure 4 E)；To complete component composition skeleton (as illustrated in figure 4f).

It is as shown in Figure 4 G step corresponding with S305: after component composition skeleton, by all sequence alignments to be assembled Onto assembling skeleton, for any site on assembling skeleton, by calculating all compare into each sequence in the site The frequency that every kind of base in the site occurs is voted, and is repaired again using the voting results in each site to assembling skeleton Just, to obtain accurate gene order assembling result.

Fig. 6-8 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 1-3.Referring to Fig. 6, the dress It sets including obtaining module 601, compression module 602 and assembling module 604.Wherein obtain module 601 be configured as obtain one to Assemble the data of sequence sets and the data of multiple anchor point sequences；Compression module 602 is configured as using the multiple anchor point sequence, Each of the sequence sets to be assembled sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein The concordance list is indicated with the label of anchor point sequence；Assembling module 603 is configured as using each in the sequence sets to be assembled The concordance list component composition skeleton that sequence compaction to be assembled obtains.

As shown in fig. 7, the device of above-mentioned realization gene order assemble method is in addition to including and 601,602,604 tool in Fig. 6 Have other than acquisition module 701, compression module 702 and the assembling module 704 of identical function, can further include to be assembled The concordance list correction module 703 of sequence.Wherein the concordance list correction module 703 of the sequence to be assembled is configured as utilizing the anchor Point sequence reversely retrieves the concordance list of sequence to be assembled, obtains sequence to be assembled corresponding with the anchor point sequence Concordance list subset, by the way that the concordance list of every other sequence in the concordance list subset is compared the rope to current sequence to be assembled Draw on table the concordance list of every sequence to be assembled in the concordance list subset for correcting the sequence to be assembled.

As shown in figure 8, the device of above-mentioned realization gene order assemble method in addition to include with 701 in Fig. 7,702,703, The 704 concordance list correction modules 803 and group with the same function for obtaining module 801, compression module 802, sequence to be assembled Other than die-filling piece 804, gene order assembling modified result module 805 can further include.Wherein the gene order assembles What modified result module 805 was configured as that the assembling skeleton to building is discussed below corrects again: after component composition skeleton, All sequences to be assembled are compared with assembling skeleton, for any site on skeleton, by calculate it is all compare to The frequency that every kind of base in each sequence in the site in the site occurs is voted, and the voting results in each site are utilized Assembling skeleton is corrected again, to obtain again revised gene order assembling result.In some embodiments, as schemed Device shown in 8 can further include: gene order assembling result integrates module 806, and the gene order assembles result Integrate module for obtaining the assembling skeleton for covering gene order different zones to be assembled, simultaneously using multiple computer processors The assembling skeleton of the covering gene order different zones to be assembled is corrected respectively, and revised assembling skeleton will be carried out again Integration obtains complete gene order assembling result.

The device of the realization gene order assemble method of the application may include module corresponding with above method step And/or block combiner, and with the change of above method step, above-mentioned apparatus can also carry out changing accordingly to realize this The gene order assemble method of application.

Also, present invention also provides a kind of systems for realizing method described in above-mentioned Fig. 1-3.Fig. 9 is according to one The system 900 of the application shown in exemplary embodiment a kind of.System 900 may be any type of digital platform, including platform Formula computer, notebook, Portable palm computer, digital platform, computer cluster, network workstation etc..

System 900 as shown in Figure 9 includes: one or more input modules 910 by connecting inside bus 950；One A or multiple output precisions 920；One or more central processing units (CPU) 930；And one or more storage assemblies 940, institute It states storage assembly 940 and stores one or more computer programs 944, one or more operating systems 946 and optional wherein One or more databases 948.

Wherein, input module 910 allows to provide data input to system 900.Common input module 910 includes number According to input interface, network transport interface or other types of input part.In this application, input module 910 is for obtaining The data of the data of one sequence sets to be assembled and multiple anchor point sequences.

Also, output precision 920 includes graphical user for providing a user calculated result, common output precision 920 Interface (GUI), three-dimensional display interface, output data interface, network transport interface or other types of output block etc..

The integrated operation of the usual control system 980 of central processing unit (CPU) 930, such as with display, data processing, data Storage, data communication and the associated operation of record operation etc..Central processing unit (CPU) 930 may include one or more Processor executes instruction, to perform all or part of the steps of the methods described above.In addition, central processing unit (CPU) 930 can be with Including one or more modules, convenient for interaction of processing central processing unit (CPU) between 930 and other assemblies.For example, centre Managing device (CPU) 930 may include input/output module, to facilitate input module 910, output precision 920 and central processing unit (CPU) interaction between 930.

Memory 940 can be by any kind of direct access storage device RAM 941 or read only memory devices ROM 942 Or their combination is realized, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), Erasable Programmable Read Only Memory EPROM (EPROM), programmable read only memory (PROM), read-only memory (ROM), CD-ROM, tape, floppy disk and optical data storage devices etc..It is various types of that memory 940 can be configured as storage Data are to support the operation in system 900.The example of these data includes any operating system for operating in system 900 946, computer program 944, database 948 etc..The computer program 944 can be wherein executed in the system 900, counted Calculation machine program 944 executes gene order described herein by the corresponding module of commands direct central processing unit 930 and assembles Method.Also, in the present invention, memory 940 is also configured in the data for storing the sequence sets to be assembled and multiple anchors The data of point sequence.

Wherein, method as shown in Figs. 1-3 can be with such as computer software, hardware or a combination thereof come can in computer It reads to realize in medium.For hardware realization, the embodiments described herein can pass through specific integrated circuit (ASIC), number Signal processor (DSP), digital signal processing appts (DSPD), programmable logic device (PLD), field programmable gate array (FPGA), processor, controller, microcontroller, microprocessor, the other electronic units for being designed to carry out function described here or Its one or more selectively in combination of person is realized.

For for software implementations, the embodiments described herein can use individual software module, such as procedure module and Functional module realizes that each of which is carried out one or more function and operations described here.Software code, which can be used, appoints Software application that programming language appropriate is write realizes, and can store dedicated computer system memory or It in other computer-readable mediums of person, and is executed by the processor of computer system, also may be mounted at and have data and deposit In other electronic equipments of storage and processing function, such as tablet computer, server.

In some embodiments, it can be realized by system 900: the multiple anchor point sequence be utilized, to described to group Each of dress sequence sets sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein the concordance list It is indicated with the label of anchor point sequence；And the rope obtained using each sequence compaction to be assembled in the sequence sets to be assembled Draw table component composition skeleton.In some embodiments, it can be further realized by system 900: to be assembled in acquisition every It after the concordance list of sequence, is reversely retrieved, is obtained and the anchor point sequence using concordance list of the anchor point sequence to sequence to be assembled The concordance list subset for arranging corresponding sequence to be assembled, by by the concordance list ratio of every other sequence in the concordance list subset To the rope of every sequence to be assembled in the concordance list subset on the concordance list to current sequence to be assembled to correct sequence to be assembled Draw table.In other embodiments, the assembling skeleton that can be further realized by system 900 to building is discussed below Correct again: after component composition skeleton, by all sequences to be assembled with assembling skeleton be compared, for assembling skeleton on Any site, by calculate it is all compare into each sequence in the site the site every kind of bases appearance the frequency into Row ballot is corrected assembling skeleton using the voting results in each site, again to obtain again revised gene order Assemble result.In some embodiments, it can be further realized by system 900: obtain and cover gene order to be assembled not With the assembling skeleton in region, the covering gene order to be assembled not same district is corrected respectively simultaneously using multiple computer processors The assembling skeleton in domain, and revised assembling skeleton will be integrated to obtain complete gene order assembling result again.

System 900 can be configured in realize the application it is recorded and comprising method either step and either step Combination, be not repeated herein.

In conclusion this application provides it is a kind of efficient and support various sequencing technologies from the beginning (de novo) packing algorithm.With for the third generation sequencing some existing algorithmic tools compared with, the present processes need compared with Low sequential covering rate, the higher continuity of offer, the smallest memory of needs and the gene order-checking sequence sets big in processing Shi Sudu has the promotion of multiple orders of magnitude.For example, can be in office according to the Genome Size that the present processes assemble Anticipate range, such as between 30kbp-300Gbp or between 1Mbp-4.6Mbp or between 1Mbp-12Mbp or Between 1Mbp-120Mbp or between 1Mbp-3Gbp.And be sequenced depth can between 5x-200x, 5x-20x it Between or between 5x-30x or between 5x-40x or between 5x-50x or between 5x-128x or Between 5x-237x.And the assembling time-consuming of the genome by being carried out using the present processes, system and device is existing There are the 10 of assembling tool^-1-10^-5Times or 10^-1-10^-3Times or 10^-2-10^-3Times or 10^-2-10^-5Times or 10^-3- 10^-5Times.

Method, system and device disclosed in the present application and company (such as the Pacific Ocean for carrying out third generation sequencing technologies Biological Science Co., Ltd (Pacific Biosciences) and Oxford nano-pore company (Oxford Nanopore)) sequenator it is equal Compatible, also with second generation sequencing technologies, the various sequencing technologies such as Illumina company are compatible.Also, it is disclosed herein Method, system and device can be used for big animal, plant, microorganism or people genome research.Further, this Shen Method, system and device that please be disclosed allow possibly even high on common desktop computer on the computer of work station rank Effect assembles large-scale genome, and existing assembling tool usually requires supercomputer or computer cluster.Therefore, the application institute Disclosed method, system and device, which will greatly speed up third generation sequencing technologies and extend to, is related to plant, animal, microorganism Field is more widely applied with the research of human genome etc..

Method described herein will be compared with other existing assemble methods by multiple embodiments below.

Embodiment 1:

Third generation sequencing sequence collection is assembled

In the present embodiment, it compares based on program designed by the present processes, system and device (hereinafter referred to as For " DBG2OLC " program) with other assembly programs PacBio2CA (referring to Koren, S.et al.Hybrid error correction and de novo assembly of single-molecule sequencing reads.Nature Biotechnology 30,693-700 (2012)), HGAP is (referring to Chin, C.S.et al.Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.Nature methods 10,563-569(2013)；Berlin,K.et al.Assembling Large Genomes with Single- Molecule Sequencing and Locality Sensitive Hashing, (2014)) and ECtools (referring to Lee, H.et al.Error correction and assembly complexity of single molecule Sequencing reads, (2014)) obtain as a result, wherein other above-mentioned assembly programs be for the third generation be sequenced in difference Sequencing sequence concentrate design genome assembly program.

The sequence to be assembled selected in the present embodiment is the S.cerevisiae measured based on PacBio microarray dataset The sequencing sequence collection of w303 genome (Genome Size 12Mbp), sequencing sequence concentration sequence average length to be assembled are 5kbp.Anchor point sequence comes from: Illumina Miseq sequencing sequence collection is utilized, with the jump side k-mer of SparseAssembler Method is assembled, and about 2500 anchor point sequences are obtained, and anchor point sequence average length is about 4.3kbp, and anchor point sequence total length is about 11.4Mbp, the coverage rate to testing gene group sequence is about 95%.Based on the DBG2OLC of the application to second generation gene sequencing The sequence sets that (50x depth/coverage rate) and third generation gene sequencing (10x/20x depth/coverage rate) obtain are assembled, with k- The parameter that mer length is 17bp, threshold value is (length of each anchor point sequence of 0.1%*) is by anchor point sequence alignment to sequence to be assembled The set of to be assembled sequence index table of the column to be compressed, corrects the mistake in the sequence to be assembled of compression later, and The step of completing the building of assembling skeleton one has shared 10CPU minutes general, has then been run twice with Sparc to gene order The amendment step of result is assembled to obtain suitable sequence assembling result.In the present embodiment, by all sequences to be assembled in Sparc Column are with assembling skeleton to pass through existing Blasr program (Chaisson, M.& the step of corrective gene sequence assembling result Tesler,G.Mapping single molecule sequencing reads using basic local alignment with successive refinement(BLASR):application and theory.BMC Bioinformatics13,238 (2012)) realize, wherein carrying out sequence alignment using Blasr program with corrective gene sequence The process of assembling result occupies in entire assembling process most operation time (about 2 CPU hours).

The present processes are compared with other three existing softwares, and more revised than right W303PacBio assembles sequence results.

Table 1.S.cerevisiae genome assembling property compares (Genome Size: 12M bp)

Term " identity " in table 1 refers to that the alkali of the identical part of time series is compared in two or more pieces nucleic acid sequence Radix accounts for the percentage of whole base numbers of whole sequence.Term " NG50 " in table 1 refers to works as in gene order assembling The minimal size of the long assembling sequence being included of 50% testing gene original.Term " amendment NG50 " in table 1 refers to The minimal size of the long revised assembling sequence being included of testing gene original in gene order assembling when 50%.Unless It is otherwise noted, above-mentioned term is in following embodiment or present specification other parts indicate identical meaning.

It is as shown in Table 1 the result shows that DBG2OLC needs least coverage rate (depth) and calculating time, and can obtain To the assembling result of quality similar with other assembly programs.Under normal circumstances, DBG2OLC can be with than existing most array The dress program low time and memory once to two orders of magnitude generates relatively good as a result, and can be in lower (10x- 20x) the PacBio sequencing sequence of coverage rate (depth), which is concentrated, realizes preferable assembling quality.In the present embodiment, it is noted that big Part-time, which is used in, is compared all obtained original series that are sequenced to the assembling modified result portion of assembling skeleton using Blasr Divide (2CPU hours).Therefore, the CPU time of this part needs is separately labelled in table 1.It is tied since this assembles gene order The comparison amendment step of fruit can be divided into multiple and different regions by that will assemble skeleton, simultaneously using multiple computer processors The different zones of amendment assembling skeleton respectively, are integrated revised different zones to obtain complete gene order group later Dress is as a result, so its elapsed-time standards can be shorter.

Embodiment 2:

The sequence to be assembled selected in the present embodiment is the A.thaliana ler- measured based on PacBio microarray dataset The sequencing sequence collection of 0 genome (Genome Size 118Mbp), sequencing sequence concentration sequence average length to be assembled are 5kbp.Anchor point sequence comes from: Illumina Miseq sequencing sequence collection is utilized, with the jump k-mer of Sparse Assembler Method is assembled, and about 83000 anchor point sequences are obtained, and anchor point sequence average length is about 1.3kbp, anchor point sequence total length About 115Mbp, the coverage rate to testing gene group sequence is about 97%.Sparse Assembler is used in the present embodiment Composite software (Ye, C., Ma, Z.S., Cannon, C.H., Pop, M.&Yu, D.W.Exploiting sparseness in de Novo genome assembly.BMC Bioinformatics13Suppl 6, S1 (2012)) (k-mer length is 51bp； Sparse step-length g=15) the pre-assembled Illumina of 50x depth (the second generation gene order-checking platform) sequencing in 2 hours Sequence sets.DBG2OLC based on the application, using above-mentioned assembling result as anchor point, to 20x/40x PacBio sequencing sequence Collection is assembled, and with k-mer length is 17bp, parameter that threshold value is (every anchor point sequence length of 0.5%*) is by anchor point sequence ratio To the set of the sequence index table to be assembled to sequence to be assembled to be compressed, corrected in the sequence to be assembled of compression later Mistake, and the step of completing the building of assembling skeleton one has shared general 1 CPU hours.And Sparc (k-mer is used later Length 1, step-length g=1) gene order assembling skeleton is modified.Take again 18 CPU hours it is final to group to complete Fill the makeover process of result.In the whole process, maximum memory usage amount is 6GB.

In comparison, due to needing more than 1,000 CPU of time-consuming when existing software is the problem of handling identical scale Hour (Berlin, K.et al.Assembling Large Genomes with Single-Molecule Sequencing And Locality Sensitive Hashing. (2014)), the following are the data comparison of collection (Lee, H.et al.Error correction and assembly complexity of single molecule sequencing reads. (2014)).For the assembling of assessment sequence, the assembling result that the present processes are obtained and the PacBio using high coverage rate It sequencing sequence collection and is compared through Quiver revised HGAP assembling result, calculates error correction Orders Corrected.

Table 2.A.thaliana genome assembling property compares (Genome Size: 118M bp)

Embodiment 3:

In the present embodiment, sequence sets to be assembled are the mankind of the 54x coverage rate measured based on PacBio microarray dataset (H.sapiens) sequence sets for the longer 30x coverage rate sequence that (Genome Size: 3.0Gbp) sequencing sequence is concentrated, the survey Sequence average length to be assembled is 14.5kbp in sequence sequence sets.Anchor point sequence comes from: utilizing the Illumina of 80x coverage rate Miseq sequencing sequence collection, in the jump k-mer method of Sparse Assembler, (k-mer length is 51bp；Sparse step-length g =20) it is assembled, obtains about 2,200,000 anchor point sequences, anchor point sequence average length is about 1.1kbp, anchor point sequence Total length is about 2.6Gbp, and the coverage rate to testing gene group sequence is about 85%.Above-mentioned pre-assembled 80x coverage rate The step of sequencing sequence collection of Illumina (second generation sequencing) obtains anchor point sequence spends 34 CPU hours.

Later based on the DBG2OLC of the application, using the pre-assembled anchor point sequence of second generation Illumina, with k-mer long The parameter that degree is 17bp, threshold value is (every anchor point sequence length of 1%*) covers anchor point sequence alignment to above-mentioned longer 30x In the sequence sets of rate sequence (average length 14.5kbp), thus the set of the sequence index table to be assembled compressed, this step Suddenly 3 are taken CPU hours, and the storage of 17-mer occupies the memory of 70GB altogether.The results show, even in 54x The sufficient sequence concentration of coverage rate also need to only spend the process on anchor point sequence alignment to all sequences to be assembled less than 6 hours (data are not displayed in Table 3).The finally step that gene order assembling skeleton is modified of (k-mer length 1, step-length g=1) Suddenly about 2000 are taken CPU hours.In initial report (http://blog.pacificbiosciences.com/ In 2014/02/data-release-54x-long-read-coverage-for.html), the process flower of building overlapping map 405000 CPU hours, and local susceptibility splicing (locality sensitive hash) can be used in this process (Berlin,K.et al.Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. (2014)) method reduce approximately half of time.And it is directly high at another Spend (Myers, G.in Algorithms in Bioinformatics, Vol.8701. in the embodiment of optimization (eds.D.Brown&B.Morgenstern), p52-67 (Springer Berlin Heidelberg, 2014)), it can be in benefit With the time required to being further reduced on the basis of more memories to 15600CPU hours.

The quality of DBG2OLC final assembling result can be obtained with currently advanced assembling tool using a greater amount of resources Result compare favourably.For the accuracy of assessment result, result will be assembled in below table and the mankind refer to genome chromosome 14 are compared, to obtain identity and correct the comparison result of NG50.

Table 3.H.sapiens genome assembling property compares (Genome Size: 3.0G bp)

The problem of due to sequencing sequence collection scale, in table identity and amendment NG50 be used only the sequence sets of chromosome 14 into Row calculates.

Embodiment 4:

Third generation sequencing sequence collection based on Oxford Nanopore platform is assembled

DBG2OLC can be normally used for calculating the sequence of 30% or more error rate.However, in order to guarantee accurately to integrate knot Fruit then needs higher sequencing depth.In order to avoid this problem, in an experiment to high-precision degree series (such as the second generation sequencing in Contig) be assigned with higher weight.

In the present embodiment, sequence sets to be assembled are the Escherichia coli of the 30x coverage rate based on Nanopore microarray dataset K12 genome (Genome Size: 4.6Mbp) sequencing sequence collection, sequencing sequence concentration sequence average length to be assembled are 6.5kbp.Anchor point sequence comes from: using the Illumina Miseq sequencing sequence collection of 30x coverage rate, with Sparse The jump k-mer method of Assembler is assembled, and about 2500 anchor point sequences are obtained, and anchor point sequence average length is about 1.7kbp, anchor point sequence total length is about 115Mbp, and the coverage rate to testing gene group sequence is about 97%.Based on the application's DBG2OLC assembles the Nanopore sequencing sequence collection of 30x coverage rate, using above-mentioned assembling result as anchor point with k- The parameter that mer length is 15bp, threshold value is (every anchor point sequence length of 2%*) by anchor point sequence alignment to sequence to be assembled from And the set of the sequence index table to be assembled compressed, the mistake in the sequence to be assembled of compression is corrected later, and is completed Assemble the building of skeleton.And 3 wheels have been run with Sparc (k-mer length 1, step-length g=1) later, skeleton is assembled to gene order The step of being modified, the results are shown in Table 4.

Table 4.E.coli K12 genome assembling property compares (Genome Size: 4,6M bp.)

Embodiment 5:

Second generation sequencing data based on Illumina platform is assembled

In order to assess short sequence sets assembling tool in the long sequence that processing is obtained by accurate second generation sequencing technologies Performance, downloaded for E. coli K-12 bacterial strain MG1655 clone (DNAnexus database (specifically refers to Http:// sra.anexus.com) accession number: SRA073308) 50x coverage rate Illumina Miseq sequencing sequence Collection.In this experiment, result assembling tool described herein assembled longer Illumina sequencing sequence collection With other several common assembling tools (including SGA (Simpson, J.T.&Durbin, R.Efficient de novo assembly of large genomes using compressed data structures.Genome research22, 549-556(2012)),SOAPdenovo2(Luo,R.et al.SOAPdenovo2:an empirically improved memory-efficient short-read de novo assembler.GigaScience1,18(2012)),SPAdes3 (Bankevich,A.et al.SPAdes:a new genome assembly algorithm and its applications to single-cell sequencing.Journal of computational biology:a Journal of computational molecular cell biology19,455-477 (2012)) assembling result into It has gone and has compared.

Ironically, for the relatively long sequencing sequence collection generated in third generation sequencing technologies, traditional tool There is the assembling result of the DBG of fixed k-mer length and non-optimal, has exposed and be difficult to differentiate repeat region and when making The shortcomings that sequencing sequence collection of higher precision and more high coverage rate is needed when with biggish k value.This makes contradictory requirement It is obtained to be difficult to obtain the optimum combination of limited computing resource.Iteration DBG is more to calculate time, expense and more complicated calculation Method design and implementation is cost, partially solves the above problem.For example, must be generated using an error-correcting program it is long and Correct k-mer word string.SGA finds accurate match using FM-index indexing means, this also brings the limit to sequence quality System.

Opposite, the algorithm of the application is stable and looser to the limitation of quality, thus its can it is more effective and Accurately find overlapping region.K=31 is arranged using Sparse Assembler assembly program to assemble in the DBG2OLC of the application Initial anchor point sequence.Compressed sequence is calculated using the anchor point sequence and sequence sets to be assembled.

Wherein anchor point sequence comes from: the Illumina Miseq sequencing sequence that the length using 50x coverage rate is 150bp Collection, is assembled in the jump k-mer method of SparseAssembler, obtains about 400 anchor point sequences, anchor point sequence average Length is about 13kbp, and anchor point sequence total length is about 4.6Mbp, and the coverage rate to testing gene group sequence is about 100%.It is based on The DBG2OLC of the application, using above-mentioned assembling result as anchor point sequence, to the Illumina sequencing sequence collection of 50x coverage rate It is assembled, is pressed anchor point sequence alignment to sequence to be assembled with the parameter that k-mer length is 31bp, threshold value is 0 The set of the sequence index table to be assembled of contracting, and complete the building of assembling skeleton.For Illumina data, due to precision Height, assembling skeleton can be used as final result.Anchor point sequence and sequence to be assembled come from same data set in this example.It is main It wants not can solve than k long in genome the reason is that traditional k-mer method only can solve the repeat region shorter than k, but than reading All repeat regions of length.So while anchor point sequence average length is very long, but still there are a large amount of repeat regions that cannot unlock. DBG2OLC then compresses sequence to be assembled with anchor point, and has utmostly dissolved these repeat regions than reading length.

In this sequence sets, the building of the overlapping map in herein described method only takes the general 1 second time, and Wherein the compression of sequence takes when most calculating m- about two minutes.

5. pairs of the table assembling properties using the Miseq E.coli genome sequence read compare (Genome Size: 4.6M bp)

Embodiment 6:

In the present embodiment, Sparc integration algorithm described in this application is various from bacterial genomes to mammal It is verified in the data set of Genome Size.Sparc integration algorithm PacBio data set (http: // Schatzlab.cshl.edu/data/ectools/ in) and Oxford Nanopore data set (http: // Gigadb.org/dataset/100102 the verification result in) is as follows.

It is PBdagcon with the most similar existing algorithm of Sparc integration algorithm, will be shown below in above-mentioned two data Concentrate the comparison result of PBdagcon algorithm and Sparc integration algorithm.It is inputted based on identical data, uses such as the application The method generates assembling skeleton first with DBG2OLC, finally uses MUMmer 3 (Kurtz, S.et al.Versatile and open software for comparing large genomes.Genome Biology5, R12 (2004)) dnadiff function calculate final integration identity.All experiments are with AMD Opteron It is carried out in the work station of 2425 HE CPUs processors (under the frequency of 800MHz).

Sparc is designed to allow using a kind of (such as second generation or the third generation are sequenced) or a variety of mixing sequencing datas are (such as Simultaneously using second generation sequencing and third generation sequencing sequence) and assign corresponding path higher weight.By the knot of integration Fruit is re-used as input and is iterated to rerun integrating operation and helping to improve identity.In the integration experiment of mixed data set, The Illumina assembled using Sparse Assembler assembling tool is assembled anchor point sequence and the weight on side is increased to 5. In PacBio data set, two-wheeled integration algorithm, existing after the first round and the second wheel by base accuracy rate are run with k=1, g=1 It is indicated in table 6 with identity 1 and identity 2.In first experiment, E.coli sequencing data obtained in PacBio is used Collect and identity is detected under different sequencing depth profiles.It is used in the sequencing data of 10x/30x depth The longest assembling skeleton that DBG2OLC is obtained is respectively 1.3Mbp and 4.6Mbp.Registration number can be found in existing database is NC_000913, the E.coli that length is 4.6Mbp refer to genome.It is concentrated in blended data, using only the data of 10x depth Sparc integration algorithm can reach 99.91% identity, and work as and the data Shi Qineng of 30x depth is used to reach better Integrate quality (99.98% identity).And the identity obtained in PacBio data set is only slightly below in mixed data set Obtained in result.

The calculated result that the E.coli sequencing data obtained in PacBio of table 6. is concentrated

It is sequenced using the A.thaliana (Genome Size 120Mbp) that sequencing depth is 20x obtained in PacBio Data set detects performance of the Sparc integration algorithm when calculating more large data capacity.It is assembled using the longest that DBG2OLC is obtained Skeleton is 7.1Mbp.With HGAP (Chin, C.S.et al.Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.Nature methods 10,563-569 (2013)) the pure PacBio data set genome assembling that method obtains is by as referring to calculate identity.

The calculated result that the A.thaliana sequencing data obtained in PacBio of table 7. is concentrated

In Oxford Nanopore data set, it is based on its higher error rate, four-wheel integration is run with k=2, g=2 It is indicated after algorithm, the first round and fourth round by base accuracy rate in table 8 with identity 1 and identity 2.In blended data The results are shown in Table 8 with PBdagcon integration algorithm with Sparc is concentrated use in Oxford Nanopore data for concentration.Fortune It can be higher than in the Oxford Nanopore data set with about 40% original error rate with Sparc integration algorithm 99.5% identity is higher than the result identity obtained in mixed data set.In this experiment, obtained longest group Dress skeleton is 4.6Mbp.

The calculated result that the E.coli sequencing data obtained in Oxford Nanopore of table 8. is concentrated

It concentrates in the blended data of 30x PacBio E.coli and memory and quality is compared using different parameters, As shown in table 9, in addition to other than setting higher g value and can reduce the consumption of memory, the variation of other parameters fails to draw Play the larger change of result.

Table 9. is compared memory and quality using different parameters

Above-mentioned experiment as the result is shown when using Sparc integration algorithm, can generally set k=1-3, g=1-5.It uses Mixed data set can reduce the demand to third generation sequencing data depth, this will substantially reduce sequencing cost.Due to the second generation Sequencing has higher accuracy rate, and the result that the second generation is sequenced is used to integrate the quality that will be helpful to improve gene assembling.

Although the various aspects and embodiment of the application are disclosed, other aspect and embodiment are for this field skill It is also obvious for art personnel.Various aspects and embodiment disclosed herein are for illustration purposes only, rather than limit mesh 's.The protection scope and purport of the application is only determined by appended claims.

Equally, each chart can show the exemplary architecture or other configurations of disclosed method and system, help It may include the feature and function in disclosed method and system in understanding.Claimed invention is shown shown in being not limited to Example property framework or configuration, and desired feature can be realized with various alternative architectures and configuration.In addition to this, for process Figure, functional descriptions and claim to a method, box sequence described herein, which should not necessarily be limited by, to be implemented with similarly sequence to hold The various embodiments of the row function, unless explicitly pointing out within a context.

Unless otherwise expressly stated, term and phrase and its variant used herein are interpreted as open, Rather than it is restrictive.In some instances, such as " one or more ", " at least ", scalability word as " but being not limited to " It converges and the appearance of phrase or other similar term should not be construed as to be intended in the example without this scalability term Or need the case where indicating constriction.

Claims

1. a kind of method for assembling gene order, comprising:

Obtain the data of a sequence sets to be assembled and the data of multiple anchor point sequences；

Using the multiple anchor point sequence, each of the sequence sets to be assembled sequence to be assembled is compressed to obtain The concordance list of the sequence to be assembled, wherein the concordance list is indicated with the label of anchor point sequence；And

The concordance list component composition skeleton obtained using each sequence compaction to be assembled in the sequence sets to be assembled.

2. the method for claim 1, wherein the compression includes: with the concordance list for obtaining sequence to be assembled

Determine arrangement position of each anchor point sequence in sequence to be assembled, and the row according to anchor point sequence in sequence to be assembled Column position obtains the concordance list for corresponding to the sequence to be assembled.

3. method according to claim 2, wherein arrangement position of each anchor point sequence of determination in sequence to be assembled Include:

It is conjuncted that the anchor point sequence and each sequence to be assembled are broken into a series of nucleosides k respectively, by the anchor point sequence Nucleosides k it is conjuncted respectively with the nucleosides k of the sequence to be assembled is conjuncted is compared, when the nucleosides k of the sequence to be assembled joins With the nucleosides k of anchor point sequence conjuncted matching rate more than predetermined threshold in body, then it is assumed that the anchor point sequence can be matched to this Sequence to be assembled；

By the conjuncted position in the anchor point sequence matched nucleosides k and the matched nucleosides k it is conjuncted described to group The reciprocal correspondence relationship between the position in sequence is filled, determines arrangement position of the matched anchor point sequence in sequence to be assembled.

4. method as claimed in claim 3, wherein the nucleosides k conjuncted length are as follows: 15bp-51bp, 15bp-25bp or 31bp-51bp。

5. method as claimed in claim 3, wherein the threshold value is the 0.1% to 2% of the length of the anchor point sequence.

6. the method for claim 1, wherein the component composition skeleton includes:

The concordance list that sequence compaction to be assembled each in the sequence sets to be assembled obtains is compared one by one, screening is provided There is the sequence subset to be assembled of optimum superposing, to construct the optimum superposing map between sequence to be assembled, then according to the subset In sequence to be assembled in the position corresponding relationship of anchor point sequential labeling arrange to each sequence to be assembled in the subset, To form assembling skeleton.

7. the method for claim 1, wherein the component composition skeleton includes:

A sequence in the sequence sets to be assembled is chosen as current sequence, by the concordance list of the current sequence and to group The concordance list of all other sequence in dress sequence sets is compared one by one, and finding out has overlapping relation with the current sequence Alternative sequence, by alternative sequence and current sequence be indexed table compare or base ratio to find out with current sequence have it is optimal Overlapping and the sequence extended, and it is added in current sequence relative to the part that current sequence extends and obtains current sequence Sequence spreading, by repeating the above steps using every sequence in the sequence sets to be assembled as current sequence to obtain one group Then sequence spreading is arranged each sequence spreading according to the position corresponding relationship of anchor point sequential labeling in the sequence spreading Cloth, to form assembling skeleton.

8. the method as described in claim 1 further comprises: after the concordance list for obtaining every sequence to be assembled, utilizing institute It states anchor point sequence reversely to retrieve the concordance list of sequence to be assembled, obtains sequence to be assembled corresponding with the anchor point sequence The concordance list subset of column, by comparing the concordance list of every other sequence in the concordance list subset to current sequence to be assembled Concordance list on every sequence to be assembled in concordance list subset to correct the sequence to be assembled concordance list.

9. method according to claim 8, wherein in the concordance list subset for correcting the sequence to be assembled by comparing The concordance list of every sequence to be assembled includes:

To the concordance list of every sequence to be assembled in the concordance list subset of sequence to be assembled, as current sequence concordance list with The concordance list of every other sequence to be assembled is compared one by one in the concordance list subset, according to anchor different on each position Point sequence label occur the frequency vote, using each position voting results to current sequence concordance list described in every into Row amendment, to obtain the concordance list of revised sequence to be assembled.

10. method as claimed in claim 9, wherein be modified to the concordance list of sequence to be assembled and include:

To occur in the concordance list subset of sequence to be assembled in comparison result not more than 5 times or not more than 4 times or not more than 3 Secondary or not more than 2 times or not more than 1 time anchor point sequential labelings are deleted from the concordance list comprising its sequence；

The heterozygous sequence that breakpoint will be present is deleted；And

By in comparison result by comprising sequence delete.

11. method as claimed in claim 9, wherein the component composition skeleton includes: will be all described revised to group The concordance list filled in the concordance list subset of sequence is compared one by one, filters out the revised sequence to be assembled with optimum superposing The subset of column, it is then corresponding according to the position of anchor point sequential labeling in the revised sequence index table to be assembled in the subset Relationship arranges to sequence to be assembled, to form assembling skeleton.

12. further comprising that the assembling skeleton to building carries out following institute such as method of any of claims 1-11 That states corrects again: after component composition skeleton, all sequences to be assembled being compared with assembling skeleton, on assembling skeleton Any site, pass through calculate it is all compare into each sequence in the site the site every kind of bases appearance the frequencys It votes, assembling skeleton is corrected again using the voting results in each site, to obtain again revised gene sequence Column assembling result.

13. method as claimed in claim 12, wherein the ballot is by utilizing the conjuncted realization of great-jump-forward nucleosides k.

14. method as claimed in claim 12, wherein include: to assembling skeleton and being corrected again

The assembling skeleton for covering gene order different zones to be assembled is obtained, is corrected respectively simultaneously using multiple computer processors The assembling skeleton of the covering gene order different zones to be assembled, and revised assembling skeleton will be integrated to obtain again Complete gene order assembles result.

15. method as claimed in claim 13, wherein include: to assembling skeleton and being corrected again

16. such as method of any of claims 1-11, wherein the sequence sets to be assembled include passing through second generation base The sequence being sequenced by group sequencing technologies.

17. method as claimed in claim 12, wherein the sequence sets to be assembled include passing through second generation gene order-checking skill The sequence that art is sequenced.

18. the method as described in any one of claim 13-15, wherein the sequence sets to be assembled include passing through the second generation The sequence that genomic sequencing technique is sequenced.

19. such as method of any of claims 1-11, wherein the sequence length in the sequence sets to be assembled exists Between 100bp-120kbp or between 500bp-120kbp or between 500bp-100kbp or in 1kbp- Between 100kbp or between 100bp-1kbp or between 100bp-300bp.

20. method as claimed in claim 12, wherein the sequence length in the sequence sets to be assembled is in 100bp-120kbp Between or between 500bp-120kbp or between 500bp-100kbp or between 1kbp-100kbp or Between 100bp-1kbp or between 100bp-300bp.

21. the method as described in any one of claim 13-15, wherein the sequence length in the sequence sets to be assembled exists Between 100bp-120kbp or between 500bp-120kbp or between 500bp-100kbp or in 1kbp- Between 100kbp or between 100bp-1kbp or between 100bp-300bp.

22. the method described in claim 16, wherein the sequence length in the sequence sets to be assembled is in 100bp-120kbp Between or between 500bp-120kbp or between 500bp-100kbp or between 1kbp-100kbp or Between 100bp-1kbp or between 100bp-300bp.

23. method as claimed in claim 17, wherein the sequence length in the sequence sets to be assembled is in 100bp-120kbp Between or between 500bp-120kbp or between 500bp-100kbp or between 1kbp-100kbp or Between 100bp-1kbp or between 100bp-300bp.

24. method as claimed in claim 18, wherein the sequence length in the sequence sets to be assembled is in 100bp-120kbp Between or between 500bp-120kbp or between 500bp-100kbp or between 1kbp-100kbp or Between 100bp-1kbp or between 100bp-300bp.

25. such as method of any of claims 1-11, wherein the sequence sets to be assembled include passing through third generation base Because of the obtained sequence of group sequencing technologies, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

26. method as claimed in claim 12, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

27. the method as described in any one of claim 13-15, wherein the sequence sets to be assembled include passing through the third generation The obtained sequence of genomic sequencing technique, and the sequence length in the sequence sets to be assembled 500bp-100kbp it Between.

28. the method described in claim 16, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

29. method as claimed in claim 17, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

30. method as claimed in claim 18, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

31. method as claimed in claim 19, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

32. method as claimed in claim 20, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

33. method as claimed in claim 21, wherein the sequence sets to be assembled include passing through third generation gene order-checking skill The obtained sequence of art, and the sequence length in the sequence sets to be assembled is between 500bp-100kbp.

34. the method as described in any one of claim 22-24, wherein the sequence sets to be assembled include passing through the third generation The obtained sequence of genomic sequencing technique, and the sequence length in the sequence sets to be assembled 500bp-100kbp it Between.

35. such as method of any of claims 1-11, wherein the length of the anchor point sequence 2bp-30bp it Between or between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or Between 300bp-100kbp.

36. method as claimed in claim 12, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

37. the method as described in any one of claim 13-15, wherein the length of the anchor point sequence 2bp-30bp it Between or between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or Between 300bp-100kbp.

38. the method described in claim 16, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

39. method as claimed in claim 17, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

40. method as claimed in claim 18, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

41. method as claimed in claim 19, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

42. method as claimed in claim 20, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

43. method as claimed in claim 21, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

44. the method as described in any one of claim 22-24, wherein the length of the anchor point sequence 2bp-30bp it Between or between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or Between 300bp-100kbp.

45. method as claimed in claim 25, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

46. method as claimed in claim 26, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

47. method as claimed in claim 27, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

48. the method as described in any one of claim 28-33, wherein the length of the anchor point sequence 2bp-30bp it Between or between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or Between 300bp-100kbp.

49. method as claimed in claim 34, wherein the length of the anchor point sequence between 2bp-30bp or Between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or in 300bp-100kbp Between.

50. method as claimed in claim 35, wherein the anchor point sequence is selected from the group: precision is long greater than 95% sequence Short sequence of the degree between 100bp-300bp, sequence length of the precision in the domain of genome high conserved region greater than 95% exist The office of long sequence of the sequence length between 1kbp-120kbp in short sequence and sequence sets to be assembled between 100bp-300bp Portion's interception.

51. method as claimed in claim 36, wherein the anchor point sequence is selected from the group: precision is long greater than 95% sequence Short sequence of the degree between 100bp-300bp, sequence length of the precision in the domain of genome high conserved region greater than 95% exist The office of long sequence of the sequence length between 1kbp-120kbp in short sequence and sequence sets to be assembled between 100bp-300bp Portion's interception.

52. method as claimed in claim 37, wherein the anchor point sequence is selected from the group: precision is long greater than 95% sequence Short sequence of the degree between 100bp-300bp, sequence length of the precision in the domain of genome high conserved region greater than 95% exist The office of long sequence of the sequence length between 1kbp-120kbp in short sequence and sequence sets to be assembled between 100bp-300bp Portion's interception.

53. the method as described in any one of claim 38-43, wherein the anchor point sequence is selected from the group: precision is greater than Short sequence of 95% sequence length between 100bp-300bp, the precision in the domain of genome high conserved region are greater than 95% sequence The column length length of sequence length between 1kbp-120kbp in the short sequence and sequence sets to be assembled between 100bp-300bp The partial cut away of sequence.

54. method as claimed in claim 44, wherein the anchor point sequence is selected from the group: precision is long greater than 95% sequence Short sequence of the degree between 100bp-300bp, sequence length of the precision in the domain of genome high conserved region greater than 95% exist The office of long sequence of the sequence length between 1kbp-120kbp in short sequence and sequence sets to be assembled between 100bp-300bp Portion's interception.

55. the method as described in any one of claim 45-47, wherein the anchor point sequence is selected from the group: precision is greater than Short sequence of 95% sequence length between 100bp-300bp, the precision in the domain of genome high conserved region are greater than 95% sequence The column length length of sequence length between 1kbp-120kbp in the short sequence and sequence sets to be assembled between 100bp-300bp The partial cut away of sequence.

56. method as claimed in claim 48, wherein the anchor point sequence is selected from the group: precision is long greater than 95% sequence Short sequence of the degree between 100bp-300bp, sequence length of the precision in the domain of genome high conserved region greater than 95% exist The office of long sequence of the sequence length between 1kbp-120kbp in short sequence and sequence sets to be assembled between 100bp-300bp Portion's interception.

57. method as claimed in claim 49, wherein the anchor point sequence is selected from the group: precision is long greater than 95% sequence Short sequence of the degree between 100bp-300bp, sequence length of the precision in the domain of genome high conserved region greater than 95% exist The office of long sequence of the sequence length between 1kbp-120kbp in short sequence and sequence sets to be assembled between 100bp-300bp Portion's interception.

58. a kind of system for assembling gene order, comprising:

Input module is used to obtain the data of a sequence sets to be assembled and the data of multiple anchor point sequences；

Memory is used to store the data of the sequence sets to be assembled and the data of multiple anchor point sequences；And

Processor is configured in:

Using the multiple anchor point sequence, each of the sequence sets to be assembled sequence to be assembled is compressed to obtain The concordance list of the sequence to be assembled, wherein the concordance list is indicated with the label of anchor point sequence；And it utilizes described to be assembled The concordance list component composition skeleton that each sequence compaction to be assembled obtains in sequence sets.

59. system as claimed in claim 58, wherein the processor be further configured in:

After the concordance list for obtaining every sequence to be assembled, carried out using concordance list of the anchor point sequence to sequence to be assembled anti- To retrieval, the concordance list subset of sequence to be assembled corresponding with the anchor point sequence is obtained, by by the concordance list subset In the concordance list of every other sequence compare to the rope for correcting the sequence to be assembled on the concordance list of current sequence to be assembled Draw the concordance list of every sequence to be assembled in table subset.

60. system as claimed in claim 58, wherein the processor be further configured in the assembling skeleton of building into Row is as described below to be corrected again:

After component composition skeleton, all sequences to be assembled are compared with assembling skeleton, for any site on skeleton, It is voted by calculating the frequency that all every kind of bases compared into each sequence in the site in the site occur, benefit Assembling skeleton is corrected again with the voting results in each site, to obtain again revised gene order assembling result.

61. system as claimed in claim 60, wherein the processor be further configured in: