For assembling the method, system and device of genome sequence
Technical field
This application involves gene sequencing field, in particular to the method, system and device of assembling gene order,
The speed and precision of gene order assembling, the especially speed and precision of genome sequence assembling can be improved.
Background technique
The Human Genome Project is one of biological medicine project maximum so far, greatly accelerates DNA sequencing skill
The development of art.In the past thirty years, three generations's DNA sequencing technology has been developed, is currently under and skill is sequenced in the second generation
The crossroad that art and third generation sequencing technologies substitute.
Wherein, sequence length caused by first generation Sanger sequencing technologies is generally less than 800bp.It is sequenced according to the first generation
Technology searches for shared region using overlay chart (Overlapping Graph) among the sequence sets that sequencing obtains to assemble base
Because of sequence, but this search and alignment algorithm based on overlay chart expended on time and storage it is higher.And the second generation is sequenced
Sequence caused by technology (also referred to as next-generation sequencing technologies, NGS) is shorter, generally in the range of 30-300bp.Second
In the case where for sequencing technologies, the de Brujin graph (de Bruijn Graph, DBG) for being based on nucleosides k conjuncted (k-mer) is utilized
Algorithm (Conway, T.C.&Bromage, A.J.Succinct data structures for assembling large
genomes,Bioinformatics 27,479-486(2011);Melsted,P.&Pritchard,J.Efficient
counting of k-mers in DNA sequences using a bloom filter,BMC
Bioinformatics12,333(2011);Ye,C.,Ma,Z.S.,Cannon,C.H.,Pop,M.&Yu,D.W.Exploiting
sparseness in de novo genome assembly,BMC Bioinformatics 13 Suppl 6,S1(2012))
The sequence sets obtained to sequencing assemble.Time associated with overlay chart and storage problem are in processing first generation sequencing technologies
It can also bear, but be generated by second generation sequenator in processing huge number of short when the sequence of the relatively small amount of generation
When sequence, with regard to becoming very big problem.Although the problem of avoiding paired comparison using DBG mode, but have to face
The problem of huge EMS memory occupation.DBG become standard method before, be originally used for assembling the first generation sequence based on overlapping
The software of (Overlap-Layout-Consensus, OLC) is integrated in graph model or overlapping-arrangement-, as Celera (Venter,
J.C.et al.The sequence of the human genome, Science 291,1304-1351 (2001)), AMOS
(Treangen,T.J.,Sommer,D.D.,Angly,F.E.,Koren,S.&Pop,M.Next generation sequence
assembly with AMOS,Current protocols in bioinformatics Chapter 11,Unit 11.18
(2011)) and ARACHNE (Batzoglou, S.et al.ARACHNE:a whole-genome shotgun assembler,
Genome research 12,177-189 (2002)), also it is applied to the assembling of sequence obtained by second generation sequencing technologies.Word string
Figure and optimum superposing figure are the particular forms of OLC method, non-in terms of assembling long sequence by simplifying whole overlapping map
Chang Youxiao.
Currently, having developed third generation sequencing technologies (it is also referred to as single-molecule sequencing technology) (Levene, M.J.et
al.Zero-mode waveguides for single-molecule analysis at high concentrations,
Science299,682-686(2003);Gu,L.Q.,Cheley,S.&Bayley,H.Capture of a single
Molecule in a nanocavity, Science291,636-640 (2001)), so that producible maximum length sequence range is
Between 500bp-120kbp.Third generation sequencing technologies biggest advantage is that by high-throughput long sequence.On the one hand, increase
Add sequencing length have assist distinguish gene order repeat region significant advantage, thus facilitate the mankind can more directly and have
Biological question is explored on effect ground, such as the problems such as Genotyping and mutation.However, on the other hand, increasing sequencing length will mention
High error rate, so that there are great problems in terms of the detection and amendment of mistake.
The high error rate of third generation sequencing to be originally developed for relatively accurate first generation sequencing technologies based on weight
The OLC method of stacked group dress is not applicable.Equally, the long sequence sets of tape error that the third generation is sequenced to be initially designed to the second generation
The DBG method of the short sequence assembling of sequencing technologies is also no longer applicable in.
Therefore, at this stage, the very big obstacle that gene sequencing technology is faced is the absence of a kind of efficient gene sequence group
Dress method.
Summary of the invention
Therefore, this application provides the method, system and device for assembling gene order, and it is suitable for random length surveys
Sequence sequence is particularly suitable for the long gene order that assembling is measured by third generation sequencing technologies, so as to substantially reduce gene
Sequence assembling is calculating the high complexity on time and memory, and improves the speed and precision of gene order assembling.
The one aspect of the application provides a kind of method for assembling gene order, comprising: obtains a sequence to be assembled
Arrange the data of collection and the data of multiple anchor point sequences;Using the multiple anchor point sequence, to every in the sequence sets to be assembled
One sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein the concordance list is with the mark of anchor point sequence
It number indicates;And the concordance list component composition bone obtained using each sequence compaction to be assembled in the sequence sets to be assembled
Frame.
In some embodiments, above-mentioned compression comprises determining that each anchor point sequence to obtain the concordance list of sequence to be assembled
The arrangement position being listed in sequence to be assembled, and the arrangement position according to anchor point sequence in sequence to be assembled, are corresponded to
The concordance list of the sequence to be assembled.In some embodiments, row of each anchor point sequence of above-mentioned determination in sequence to be assembled
Column position includes: that the anchor point sequence and each sequence to be assembled are broken into a series of nucleosides k respectively is conjuncted, by the anchor
The nucleosides k of point sequence it is conjuncted respectively with the nucleosides k of the sequence to be assembled is conjuncted is compared, when the core of the sequence to be assembled
It with the conjuncted number to match of the nucleosides k of anchor point sequence is more than predetermined threshold during glycosides k is conjuncted, then it is assumed that the anchor point sequence can be with
It is matched to the sequence to be assembled;It is conjuncted respectively in the anchor point sequence and described to be assembled by each matched nucleosides k
Arrangement position of the matched anchor point sequence in sequence to be assembled is calculated in the corresponding relationship of position in sequence.It is right
In the anchor point sequence for having multiple matched nucleosides k conjuncted, arrangement position of the anchor point sequence in sequence to be assembled be can be
According to being averaged for arrangement position of the conjuncted anchor point sequence being calculated of each matched nucleosides k in sequence to be assembled
Value.In some embodiments, the length of the k-mer are as follows: 15bp-51bp.In some embodiments, the threshold value is institute
State the 0.1% to 2% of the length of anchor point sequence.
In some embodiments, above-mentioned component composition skeleton includes: by sequence to be assembled each in the sequence sets to be assembled
The concordance list that column compression obtains is compared one by one, filters out the sequence subset to be assembled with optimum superposing, with building
Optimum superposing map between sequence to be assembled, then according to anchor point sequential labeling in the sequence index table to be assembled in the subset
Position corresponding relationship arrange to each sequence to be assembled in the subset, to form assembling skeleton.In some implementations
In mode, above-mentioned component composition skeleton includes: to choose a sequence in the sequence sets to be assembled as current sequence, by institute
The concordance list of the concordance list and all other sequence in sequence sets to be assembled of stating current sequence is compared one by one, is found out and institute
The alternative sequence that current sequence has overlapping relation is stated, alternative sequence and current sequence are indexed table and compared or base ratio pair
To find out the sequence that there is optimum superposing and extension with current sequence, and it is added to relative to the part that current sequence extends and is worked as
The sequence spreading that current sequence is obtained on presequence, by using every sequence in the sequence sets to be assembled as working as preamble
Column repeat the above steps to obtain one group of sequence spreading, then corresponding according to the position of anchor point sequential labeling in the sequence spreading
Relationship arranges to each sequence spreading, to form assembling skeleton.In some embodiments, provided by the present application to be used for group
The method of dress gene order further comprises: after the concordance list for obtaining every sequence to be assembled, treating group using anchor point sequence
The concordance list of dress sequence is reversely retrieved, and the concordance list subset of sequence to be assembled corresponding with the anchor point sequence is obtained,
By comparing the concordance list of every other sequence in the concordance list subset to repairing on the concordance list of current sequence to be assembled
The concordance list of sequence to be assembled in the concordance list subset of the just described sequence to be assembled.In some embodiments, above by
It compares to correct the rope that compressed sequence to be assembled includes: every sequence to be assembled in the concordance list subset to sequence to be assembled
Draw table, carries out one as the concordance list of every other sequence to be assembled in current sequence concordance list and the concordance list subset
One compares, and is voted according to the frequency that anchor point sequential labeling different on each position occurs, utilizes the ballot of each position
As a result the concordance list of every current sequence is modified, to obtain the concordance list of revised sequence to be assembled.Some
In embodiment, the above-mentioned concordance list to sequence to be assembled be modified include: by comparison result in the rope of sequence to be assembled
Draw the anchor point for occurring not more than 5 times or not more than 4 times or not more than 3 times or not more than 2 times or not more than 1 time in table subset
It is deleted from the concordance list comprising its sequence to be assembled;The heterozygous sequence that breakpoint will be present is deleted;And it will be in comparison result
By comprising sequence delete.In some embodiments, component composition skeleton is carried out in the concordance list to sequence to be assembled
It is carried out after amendment comprising: the concordance list of all above-mentioned revised sequences to be assembled is compared one by one, screening is provided
There is the subset of the revised sequence to be assembled of optimum superposing, then according to the revised sequence rope to be assembled in the subset
The position corresponding relationship for drawing anchor point sequential labeling in table arranges to sequence to be assembled, to form assembling skeleton.Some
In embodiment, the method provided by the present application for assembling gene order further comprises that the assembling skeleton to building carries out such as
Lower described corrects again: after component composition skeleton, all sequences to be assembled being compared with assembling skeleton, for assembling bone
Any site on frame occurs by all every kind of bases compared into each sequence in the site in the site of calculating
The frequency is voted, and is corrected again using the voting results in each site to assembling skeleton, to obtain again revised base
Because of sequence assembling result.In some embodiments, above-mentioned ballot is by utilizing the conjuncted realization of great-jump-forward nucleosides k.One
In a little embodiments, above-mentioned corrected again to assembling skeleton includes: to obtain the group for covering gene order different zones to be assembled
Skeleton is filled, corrects the assembling bone of the covering gene order different zones to be assembled respectively simultaneously using multiple computer processors
Frame, and revised assembling skeleton will be integrated to obtain complete gene order assembling result again.
In some embodiments, sequence sets to be assembled described herein include passing through second generation genomic sequencing technique
Obtained sequence is sequenced.In some embodiments, the sequence length in sequence sets to be assembled described herein is in 100bp-
Between 120kbp or between 500bp-120kbp or between 500bp-100kbp or 1kbp-100kbp it
Between or between 100bp-1kbp or between 100bp-300bp.In some embodiments, it is described herein to
Assembling sequence sets includes by the obtained sequence of third generation genomic sequencing technique, and the sequence in the sequence sets to be assembled
Column length is between 500bp-100kbp.In some embodiments, the length of anchor point sequence described herein is in 2bp-
Between 30bp or between 30bp-100bp or between 100bp-300bp or between 30bp-100kbp or
Person is between 300bp-100kbp.In some embodiments, anchor point sequence described herein is selected from the group: precision is greater than
Short sequence of 95% sequence length between 100bp-300bp, the precision in the domain of genome high conserved region are greater than 95% sequence
The column length length of sequence length between 1kbp-120kbp in the short sequence and sequence sets to be assembled between 100bp-300bp
The partial cut away of sequence.In some embodiments, the length of anchor point sequence described herein is between 2-30bp.
Further aspect of the application provides a kind of for assembling the device of gene order, comprising: module is obtained, it is described
Module is obtained for obtaining the data of a sequence sets to be assembled and the data of multiple anchor point sequences;Compression module, the compression mould
Block is used to utilize the multiple anchor point sequence, is compressed each of the sequence sets to be assembled sequence to be assembled to obtain
The sequence to be assembled concordance list, wherein the concordance list is indicated with the label of anchor point sequence;And assembling module, it is described
The concordance list component composition bone that assembling module is used to obtain using each sequence compaction to be assembled in the sequence sets to be assembled
Frame.
In some embodiments, provided by the present application for assembling the device of gene order, further comprise: to be assembled
The concordance list correction module of sequence, the concordance list correction module of the sequence to be assembled are used to treat group using the anchor point sequence
The concordance list of dress sequence is reversely retrieved, and the concordance list subset of sequence to be assembled corresponding with the anchor point sequence is obtained,
By comparing the concordance list of every other sequence in the concordance list subset to repairing on the concordance list of current sequence to be assembled
The concordance list of the just described sequence to be assembled.
In some embodiments, provided by the present application for assembling the device of gene order, further comprise: gene sequence
Column assembling modified result module, the gene order assembling modified result module are used to carry out following institute to the assembling skeleton of building
That states corrects again: after component composition skeleton, all sequences to be assembled being compared with assembling skeleton, on assembling skeleton
Any site, pass through calculate it is all compare into each sequence in the site the site every kind of bases appearance the frequencys
It votes, assembling skeleton is corrected again using the voting results in each site, to obtain again revised gene sequence
Column assembling result.
In some embodiments, provided by the present application for assembling the device of gene order, further comprise: gene sequence
Column assembling result integrates module, the gene order assembling result integrate module for obtain cover gene order difference to be assembled
The assembling skeleton in region corrects the covering gene order different zones to be assembled using multiple computer processors respectively simultaneously
Assembling skeleton, and will be integrated to obtain complete gene order assembling result by revised assembling skeleton again.
Another aspect of the application provides a kind of system for assembling gene order, comprising: input module is used
In the data for the data and multiple anchor point sequences for obtaining a sequence sets to be assembled;Memory is used to store the sequence to be assembled
Arrange the data of collection and the data of multiple anchor point sequences;And processor, it is configured in;It is right using the multiple anchor point sequence
Each of the sequence sets to be assembled sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein institute
State concordance list is indicated with the label of anchor point sequence;And it is obtained using each sequence compaction to be assembled in the sequence sets to be assembled
The concordance list component composition skeleton.
In some embodiments, in the system provided by the present application for assembling gene order, processor is further
It is configured at: after obtaining the concordance list, reversely being retrieved, obtained using concordance list of the anchor point sequence to sequence to be assembled
To the concordance list subset of sequence to be assembled corresponding with the anchor point sequence, by will be every other in the concordance list subset
In concordance list subset in the concordance list comparison of sequence to the concordance list of current sequence to be assembled to correct the sequence to be assembled
The concordance list of every sequence to be assembled.
In some embodiments, in the system provided by the present application for assembling gene order, processor is further
Be configured at that the assembling skeleton to building is discussed below corrects again: after component composition skeleton, by all sequences to be assembled
It is compared with assembling skeleton, for any site on assembling skeleton, by calculating all compare to each item in the site
In sequence the site every kind of base occur the frequency vote, using each site voting results to assembling skeleton into
Row is corrected again, to obtain again revised gene order assembling result.
In some embodiments, in the system provided by the present application for assembling gene order, processor is to assembling bone
It includes: to obtain the assembling skeleton for covering gene order different zones to be assembled that frame carried out corrects again, at multiple computers
Reason device corrects the assembling skeleton of the covering gene order different zones to be assembled respectively simultaneously, and will revised assembling again
Skeleton is integrated to obtain complete gene order assembling result.
Correspondingly, by applying mthods, systems and devices described herein, by short sequence or high-precision is sequenced in high precision
The assembling segment that short sequence is sequenced in degree forms anchor point sequence, and sequence to be assembled is marked using anchor point sequence, and will be to
Sequence compaction is assembled into corresponding concordance list, is then compared and assembles again, to greatly accelerate assembling speed and reduce
Calculation amount.
Also, the error rate of the sequence to be assembled in view of being obtained by third generation sequencing technologies is relatively high, therefore into one
Step is corrected or removes to compressed sequence to be assembled, so that the mistake of sequence to be assembled is corrected, improves
Assemble the accuracy of skeleton.Finally, for any site on assembling skeleton, by calculate it is all compare it is each to the site
The frequency that every kind of base in sequence in the site occurs is voted, using the voting results in each site to assembling skeleton
It is modified to have obtained the higher gene order assembling result of accuracy rate.
Therefore, mthods, systems and devices described herein only need lower gene order coverage rate to provide for more
High continuity, and the smallest calculator memory is only needed, and when handling big gene order-checking sequence sets, speed has
The promotion of multiple orders of magnitude, therefore it is suitable for random length sequencing sequence, it is particularly suitable for assembling and passes through third generation sequencing technologies
The long gene order measured.
Detailed description of the invention
The above-mentioned and other feature of the application will by with reference to the accompanying drawing and its detailed description be described further.It should
Understand, these attached drawings illustrate only several illustrative embodiments according to the application, therefore are not considered as pair
The limitation of the application protection scope.Unless stated otherwise, attached drawing is not necessarily to scale, and wherein similar label indicates class
As component.
Fig. 1 is a kind of flow chart of gene order assemble method shown according to an exemplary embodiment.
Fig. 2 is a kind of flow chart of the gene order assemble method shown according to another exemplary embodiment.
Fig. 3 is a kind of flow chart of the gene order assemble method shown according to a further exemplary embodiment.
Fig. 4 A to Fig. 4 G is the schematic diagram that sequence assembling is carried out according to Fig. 1 to method shown in Fig. 3.
Fig. 5 A to Fig. 5 D is the graphical representation of exemplary for illustrating integration step shown in Fig. 3.
Fig. 6 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 1.
Fig. 7 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 2.
Fig. 8 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 3.
Fig. 9 is a kind of block diagram of gene order package system shown according to an exemplary embodiment.
Specific embodiment
Refer to constitute the attached drawing of this specification a part in the following detailed description.Show mentioned by the description and the appended drawings
Meaning property embodiment is not intended to the protection scope of limitation the application only merely for being illustrative purpose.Those skilled in the art
Member, which is appreciated that, can also be used many other embodiments, and can make various changes to described embodiment, without
Away from the purport and protection scope of the application.It should be understood that the various aspects for the application for illustrating and illustrating herein can be with
It arranges, replace, combine, separate and design according to many different configurations, these difference configurations are included in the application.
Definition
" assembling " of gene order is referred in the application, multiple sequences to be assembled are compared, and using between sequence
Overlapping relation construct the genome based on the sequence to be assembled.
Term " comparison " in the application is when public for that can pass through those skilled in the art when comparison to nucleic acid sequence
Any method for knowing is completed, for example, can by existing software BLASR, DALIGN, BLAST, BLAST-2, ALIGN,
ALIGN-2BWA, BOWTIE or Megalign (DNASTAR) etc..Term " comparison " in the application when for concordance list it
Between comparison when can be by the way that well known to a person skilled in the art the completions of any other method.(such as Smith, Temple F.and
Waterman,Michael S.(1981),"Identification of Common Molecular Subsequences",
Journal of Molecular Biology 147:195–197;Or Lipman, DJ;Pearson,WR(1985),"
Rapid and sensitive protein similarity searches",Science 227(4693):1435–41;Or
Person Pearson, WR;Lipman,DJ(1988),"Improved tools for biological sequence
comparison",Proceedings of the National Academy of Sciences of the United
States of America 85(8):2444–8)。
" OLC algorithm " in the application refers to algorithm (the Staden R.A new computer developed by Staden
method for the storage and manipulation of DNA gel reading data.Nucleic Acids
Res.8 (16): it is less to be mainly used for data volume for the related algorithm developed 3673-3694. (1980)) and on its basis
Long sequence (especially first generation sequencing sequence) splicing.Common software based on OLC algorithm have Arachne (referring to
Batzoglou S,Jaffe DB,Stanley K,et al.ARACHNE:a whole-genome shotgun
Assembler.Genome Res12:177-89. (2002)), Celera Assembler is (referring to Myers EW, Sutton
GG,Delcher AL,etal.A whole-genome assembly of Drosophila.Science 287:2196–
204. (2000)), CAP3 is (referring to Huang X, Madan A.CAP3:A DNA sequence assembly
Program.GenomeRes 9:868-77. (1999)), PCAP is (referring to Huang X, Yang SP.Generating a
genome assembly with PCAP.Curr Protoc Bioinformatics.Chapter 11:Unit11.3.
(2005)), Phrap is (referring to de la Bastide M, McCombie WR.Assembling genomic DNA
sequences with PHRAP.Curr Protoc Bioinformatics.Chapter11:Unit11.4.(2007)),
Phusion is (referring to Mullikin JC, Ning Z.The phusion assembler.Genome Res 13:81-90.
(2003)) and Newbler is (referring to Marcel Margulies1ME, William EA, Said A, et al.Genome
sequencing in open microfabricated high density picoliter reactors.Nature437:
376–80.(2005))。
" DBG algorithm " in the application refers to by algorithm (the Idury RM, Waterman of Idury and Waterman exploitation
MS.A new algorithm for DNA sequence assembly.JComput Biol 2:291-306. (1995)) with
And the related algorithm developed on its basis, it was widely used in bioinformatics in the assembling of genetic fragment later, it is main
It is suitble to the splicing of the short sequence of high coverage rate.Its corresponding software has Euler-USR (referring to Chaisson MJ, Brinza
D,Pevzner PA.De novo fragment assembly with short mate-paired reads:does the
Read length matter? Genome Res 19:336-46. (2009)), Velvet is (referring to Zerbino DR, Birney
E.Velvet:algorithms for de novo short read assembly using de Bruijn
Graphs.Genome Res18:821-9. (2008)), ABySS is (referring to Simpson JT, Wong K, Jackman SD, et
al.ABySS:a parallel assembler for short read sequence data.GenomeRes 19:1117–
(2009)), 23. AllPath-LG is (referring to Gnerre S, Maccallum I, Przybylski D, et al.High-
quality draft assemblies of mammalian genomes from massively parallel
Sequence data.Proc Natl Acad Sci USA108:1513-18. (2011)), SOAPdenovo (referring to Li R,
Zhu H,Ruan J,et al.De novo assembly of human genomes with massively parallel
short read sequencing.Genome Res20:265–72.(2009))。
" k-mer " i.e. nucleosides k in the application is conjuncted, refers to and continuously cuts gene order with certain length k, according to every
A base sequentially moves the nucleotide sequence that the identical sequence length of a cutting is k, such as sequence are as follows: ATCGTTGCTTA
The nucleotide sequence of ATGACGTCAGTCGAATGCGATGACGTGACTGACTG can be with if k-mer is 13-mer
Obtain the nucleotide sequence that a series of following length are 13:
ATCGTTGCTTAAT
TCGTTGCTTAATG
CGTTGCTTAATGA
GTTGCTTAATGAC
…
ACGTGACTGACTG。
" Sparse k-mer " i.e. great-jump-forward nucleosides k in the application is conjuncted, refers in analyses and comparison result, for base
Because sequence is with the segmentation of k length progress jumping characteristic, rather than successively cut along moving one.
Term " coverage rate " or " depth " in the application refer to the base total amount that sequencing obtains when for being sequenced
(bp) with the ratio of tested gene order size, it is one of the index for evaluating sequencing amount.Bring error rate or false sun is sequenced
Property result can decline with the promotion of sequencing depth, therefore primary sample would generally be repeated several times in gene order-checking
Amplification to increase sequencing depth.
Embodiment
As described in the background art, the high error rate of third generation sequencing to be originally developed for the relatively accurate first generation
The OLC algorithm based on overlapping assembling of sequencing technologies is not applicable.Equally, the long sequence sets of tape error that the third generation is sequenced make most
Just also no longer it is applicable in designed for the DBG algorithm of the short sequence assembling of second generation sequencing technologies.Therefore, present applicant proposes one kind
The method that gene order assembling is carried out using mixed strategy.Also, required for the sequence sets obtained as third generation sequencing approach
Data capacity it is very big, even therefore being required in the genome of small microorganism for the error correction of these long sequences very high
Computing resource.It is assembled in reducing gene order for this purpose, the application further passes through introducing " anchor point sequence " and calculates time and interior
The high complexity deposited, and ensure the precision of gene order assembling.
Hereinafter, will illustratively be illustrated with reference to Fig. 1 for the application.Fig. 1 show the exemplary reality of the application
Apply a kind of flow chart of gene order assemble method of example.Firstly, gene order assemble method described herein can be in office
What realized on the digital terrace system of form, which includes desktop computer, notebook, Portable palm computer, clothes
Business device, computer cluster, network workstation etc., a kind of exemplary implementation later in association with Fig. 9 to the present processes are realized
Example system 900 is described in detail, and wouldn't be first described in detail here.
As shown in Figure 1, the gene order assemble method of the application mainly includes tri- steps of S101, S102 and S103.With
Under three steps will be introduced respectively in conjunction with attached drawing.
Firstly, in step s101, obtaining the data of a sequence sets to be assembled and the data of multiple anchor point sequences.
Specifically, in step s101, by system 900 (such as computer), by input module, (such as data are transmitted
Interface) obtain gene sequencing data.The gene sequencing data had both included the original series obtained by gene order-checking instrument
Data also may include the data obtained from the data based on original series are assembled after pretreatment.The warp
The data crossed after pre-processing include: for example, by the available high-precision short sequence of second generation genomic sequencing technique, so
The sequence (contig) of segment is assembled them into using existing composite software afterwards.In some embodiments, sequence to be assembled
Collection can be any sequence sets.In some embodiments, sequence sets to be assembled are obtained by second generation genomic sequencing technique
Sequence sets.In other embodiments, sequence sets to be assembled are the obtained sequence sets of third generation genomic sequencing technique.
Hereinafter, will be illustrated so that sequence sets to be assembled are the obtained sequence sets of third generation genomic sequencing technique as an example.
Anchor point sequence is sequence corresponding to a certain segment area in testing gene group, anchor point sequence it is corresponding to
The segment of cls gene group is having the same or similar nucleic acid sequence.Anchor point sequence, which can correspond to testing gene group, representative
The region of property, such as the high conserved region domain of testing gene group.In some embodiments, the anchor point sequence is by available data
The sequence obtained in library.In other embodiments, the anchor point sequence is that second generation genomic sequencing technique is obtained
Short sequence, or by utilizing existing composite software (such as SOAPdenovo composite software of Huada gene company offer)
Sequence obtained from short sequence obtained to second generation genomic sequencing technique assembles or the two combination.
In other embodiments, the anchor point sequence is the partial cut away of sequence to be assembled itself, such as by by sequence to be assembled
It is a series of short sequence that column, which concentrate longest one section of sequence truncation, will be owned in above-mentioned short sequence and sequence sets to be assembled later
Other sequences are compared, and the part of those unfolded sequences of short sequence obtained with above-mentioned truncation are carried out further
Interception, be then added in the set of above-mentioned short sequence, until obtained short sequence sets can have with all sequences it is overlapping.
Persons skilled in the art can select the length of anchor point sequence according to the length of sequence to be assembled, for third
Generation sequencing, the length of anchor point sequence are generally set to the 10-30% of the average length of sequence in sequence sets to be assembled.For the second generation
Sequencing technologies, anchor point length can be longer than average sequencing length.Sequence sets to be assembled for one group can choose fixed anchor point
Sequence length also can choose the anchor point sequence of different length.It should be appreciated that the average length of anchor point sequence is longer, then compress
The average length of the concordance list of sequence to be assembled afterwards is shorter, and data processing amount is smaller;The anchor point sequence quantity the how then right simultaneously
The coverage area of its entire gene order to be assembled is higher, the anchor point sequence quantity being typically chosen multiplied by anchor point sequence average length
Degree should be not less than the 30% of the total length of gene order to be assembled.In some embodiments, the anchor point sequence length exists
Between 2bp-30bp or between 30bp-100bp or between 100bp-300bp or 30bp-100kbp it
Between or between 300bp-100kbp.Different anchor point sequences can indicate that the label can be number by different labels
Word perhaps letter perhaps combination or other suitable representations of the number with letter.For example, several bit digitals can be used
Number 001,002,003 etc. can be respectively labeled as anchor point sequential labeling, such as multiple anchor point sequences.In another example in generation, can be used
Number label, such as x1、x2、x3Etc. respectively indicating different anchor point sequences.
Then, in step s 102, the multiple anchor point sequence will be utilized, to each of described sequence sets to be assembled
Sequence to be assembled is compressed, to obtain the label with anchor point sequence of the sequence to be assembled the concordance list that indicates.
Specifically, anchor point sequence is compared with sequence to be assembled, anchor point sequence is found out in sequence to be assembled
Exhaust position, and the arrangement position according to anchor point sequence in sequence to be assembled obtain the index for corresponding to the sequence to be assembled
Thus table is indicated every sequence to be assembled with the label of anchor point sequence, that is, indicate gene order boil down in the form of label
String number or character, complete sequence compaction, and obtain correspond to multiple anchor point sequences multiple concordance lists, the multiple rope
Draw table and forms the data set for corresponding to the compression of the sequence sets to be assembled.Such as: if anchor point sequence is contig_331,
Contig_101, contig_17, when these contig are successively compared onto a sequence to be assembled, and anchor point sequence to
Arrangement position in assembling sequence is sequentially contig_331, contig_101, contig_17, then the sequence to be assembled is corresponding
Concordance list be [331,101,17].
In some embodiments, each sequence to be assembled is compressed by k-mer mode below, and obtains each
The concordance list of sequence to be assembled: by anchor point sequence and sequence to be assembled be broken into equal length k-mer cis-position set (for example,
If k=13, anchor point sequence and sequence to be assembled are lighted to the cis-position set for being broken into 13-mer from its start bit respectively).It will obtain
The k-mer of the anchor point sequence and sequence to be assembled that obtain is compared one by one respectively, and selects to close according to the length of anchor point sequence
Suitable threshold value, when with the conjuncted number to match of nucleosides k of anchor point sequence being more than pre- during the nucleosides k of the sequence to be assembled is conjuncted
Determine threshold value, then it is assumed that the anchor point sequence can be matched to the sequence to be assembled;Pass through each conjuncted point of matched nucleosides k
It is conjuncted corresponding that the matched nucleosides k is calculated in relative position not in the anchor point sequence and the sequence to be assembled
Arrangement position of the anchor point sequence in sequence to be assembled.For the anchor point sequence for having multiple matched nucleosides k conjuncted, the anchor
Arrangement position of the point sequence in sequence to be assembled can be according to the conjuncted anchor point being calculated of each matched nucleosides k
The average value of arrangement position of the sequence in sequence to be assembled.Wherein, in certain embodiments of the present invention, the predetermined threshold
Value is the 0.1% to 2% of the length of the anchor point sequence.Those skilled in the art can be according to each anchor point sequence being calculated
The arrangement position being listed in sequence to be assembled obtains the concordance list of sequence to be assembled.The wherein length of the k-mer are as follows: 2bp-
51bp, 15bp-19bp, 15bp-25bp, 2bp-15bp, 2bp-19bp, 31bp-51bp or 15bp-51bp.
Furthermore it is possible to which the sequence to be assembled concordance list being fully contained within the concordance list of another sequence to be assembled is deleted
It removes, with simplification sequence data collection to be assembled.Also, using a sequence to be assembled as current sequence, by the index of the current sequence
Table and the concordance list of all other sequence in sequence sets to be assembled are compared one by one, and finding out has with current sequence concordance list
Alternative sequence and current sequence are indexed table later and compared or alkali by the sequence (alternatively referred to as " alternative sequence ") of overlapping relation
Base compares, and finds the compressed sequence to be assembled wherein relative to current sequence comprising the extension of concordance list in one direction
Column, and the sequence for being included by current sequence completely is removed;Picking out in remaining alternative sequence has with current sequence
The sequence of leftward or rightward optimum superposing, removes the sequence for not having Maximum overlap degree in one direction, only retains and work as
Presequence has the sequence (its optimal extension sequence for being referred to as current sequence) of leftward or rightward optimum superposing, and by institute
It states optimal extension sequence and is added to the sequence spreading for obtaining current sequence in current sequence relative to the part that current sequence extends.
In above-mentioned comparison process, if current sequence is completely included by other with its sequence with overlapping relation, preamble is worked as in deletion
Column, do not generate the sequence spreading of current sequence;If not finding has the sequence extended to the left or to the right relative to current sequence,
Obtained sequence spreading is then compared using the current sequence as the wheel.By by every sequence in the sequence sets to be assembled
Column repeat the above steps as current sequence to obtain one group of sequence spreading, then according to anchor point sequence mark in the sequence spreading
Number position corresponding relationship arrange to each sequence spreading, to complete component composition skeleton.Art technology can be used
Any suitable method and software known to personnel carry out the overlapping between the sequence of calculation, for example, by BLASR, DALIGN,
BLAST, BLAST-2, ALIGN, ALIGN-2BWA, BOWTIE, SOAP or Megalign (DNASTAR) etc. compare software to working as
Presequence is compared one by one with the every other sequence in sequence sets to be assembled, is found out based on comparison result and is had with current sequence
There is the sequence of overlapping relation.
In some embodiments, in the construction step of optimum superposing figure or in the screening step to optimal extension sequence
In rapid, one sequence to be assembled of each node on behalf, for the combination of one or more continuous anchor point sequential labelings.To obtain
Optimum superposing figure can carry out two-wheeled operation, and the 1st wheel deletes all nodes for all being included, for example, if there is node
[x1,x2,x3], then delete the node [x being fully contained in above-mentioned node1,x2] and node [x2,x3], the purpose done so
Be in order to avoid unnecessary repetition and comprising node comparison.2nd wheel calculates all between the node for mutually not including
Suffix-prefix overlapping.Then optimum superposing figure is simplified, it is all when being higher than from node A to the degree of overlapping of the extension of node B
When other overlappings extended from node A to same direction, there are advantages for the extension of node A to node B.In certain embodiments
In, in degree of overlapping total length of multiple anchor points as involved by the length of corresponding overlapping region or overlapping region or overlapping region
Sequence to be assembled is measured with the matched k-mer quantity of anchor point.In general, using every sequence as current sequence and other sequences
Column are compared, may all find one have to the left with the optimal extension sequence of current sequence Maximum overlap and one to
The right side has optimal extension sequence with current sequence Maximum overlap, and by the optimal extension sequence to the left and to the right optimal prolongs
It stretches sequence and is added to the sequence spreading for obtaining the sequence in the current sequence relative to the part that current sequence extends.But due to
Sequence length to be assembled is limited, it is thus possible to the repetition that occurs can not handling (for example a plurality of sequence to be assembled is a certain current
The optimum superposing of sequence, referred to as optimal side by side) will lead to occur branch (Miller, J.R.et on map at this time
al.Aggressive assembly of pyrosequencing reads with mates.Bioinformatics 24,
2818-2824(2008)).When there is branch, is there is bifurcation disconnection in packing algorithm, or selects phase between a wherein sequence
Mutually optimal (i.e. the extension of A to B is that A extends in the optimum superposing of dextrad, and the extension of B to A is that B extends in the optimum superposing of left-hand)
Overlapping extended.To not occur each linear region of bifurcated (there is no optimal side by side i.e. between sequence) in optimum superposing figure
Individually output is assembling result.The compressed sequence indicated with indexing sheet form is together in series, is re-converted into later with base
Assembling skeleton of the unpressed sequence to be assembled that form indicates as genome profile.
Therefore, by utilizing assembling gene sequence method as described in Figure 1, by anchor point sequence (for example, being sequenced in high precision short
The assembling segment of short sequence is sequenced in sequence or high-precision) to sequence to be assembled (for example, by second generation genomic sequencing technique or
The obtained long sequence of person's third generation genomic sequencing technique) it is marked, and by sequence compaction to be assembled at corresponding index
Then table is compared and assembles again, to greatly accelerate assembling speed and reduce calculation amount.
In some embodiments, the concordance list of above-mentioned to be assembled sequence, tool can be relied on the assembling of sequence to be assembled
It include: to body a sequence in the selection sequence sets to be assembled as current sequence, by the concordance list of the current sequence
It is compared, is found out with the current sequence with Chong Die one by one with the concordance list of all other sequence in sequence sets to be assembled
The current sequence and the sequence with current sequence with overlapping relation are indexed table and compared or alkali by the sequence of relationship
Base compares to find out the optimal extension sequence of current sequence, and its part extended relative to current sequence is added to current sequence
The upper sequence spreading (i.e. leftward or rightward optimum superposing extends) for obtaining current sequence, and pass through the sequence sets to be assembled
In every sequence repeated the above steps as current sequence to obtain one group of sequence spreading, then according in the sequence spreading
The position corresponding relationship of anchor point sequential labeling arranges to each sequence spreading, to form assembling skeleton.
Fig. 2 show a kind of flow chart of gene order assemble method according to the application another exemplary embodiment.Fig. 2
Shown in gene order assemble method mainly include tetra- steps of S201, S202, S203 and S204.Wherein S201 step is used for
A sequence sets to be assembled and multiple anchor point sequences are obtained, it is substantially similar to the S101 in figure 1 described above, herein just no longer
It repeats.Equally, S202 step (compressing sequence to be assembled to obtain the concordance list being made of the label of anchor point sequence) and above
S102 is similar, is also not repeated herein.
S203 step described in detail below.It is obtained by S202 step with the label of anchor point sequence the concordance list that indicates
Afterwards, compressed sequence to be assembled is further corrected by comparing in S203 step.
Specifically, it is reversely retrieved using concordance list of the anchor point sequence to sequence to be assembled, is repaired by comparing
Sequence to be assembled being compressed includes: that the anchor point sequence according to obtained in S201 carries out reversely the concordance list of sequence to be assembled
Retrieval obtains a plurality of continuous row in concordance list described in S202 and in an anchor point sequence or the anchor point sequence
The set (that is, concordance list subset of sequence to be assembled) of the corresponding sequence to be assembled of combination of the anchor point sequence of column;To utilization
The reversed concordance list for retrieving every sequence to be assembled in obtained arrangement set to be assembled, as current sequence rope
Drawing that table compared one by one with every other sequence to be assembled in the concordance list subset (i.e. will be right in every two sequences to be assembled
The label of the anchor point sequence of position is answered to be compared one by one), the frequency occurred according to anchor point sequential labeling different on each position
It is secondary to vote, correct every current sequence in occur mistake anchor point sequential labeling, and if current sequence concordance list
In with it is duplicate or by comprising anchor point series arrangement relationship, then the current sequence is deleted, thus obtain it is more accurate and
The concordance list collection for the sequence to be assembled simplified.
It is contig-x with anchor point sequence1、-x2、-x3。。。-xiFor, it is compressed to group to correct above by comparing
The step of filling sequence includes: the label x using contig1、x2、x3。。。xiConcordance list is constructed, so as to every anchor point sequence, all
All sequences to be assembled containing the anchor point sequence can be found rapidly by concordance list.To each contig anchor point sequential labeling
xi, record includes the xiSequence to be assembled sequence index table, obtain the set of the concordance list of corresponding sequence to be assembled,
This process can be looked at as the process of a sequence of packets.Later, with dynamic programming algorithm or other more efficient calculations
Method compare one by one in group to each item sequence to be assembled in obtained arrangement set to be assembled.Above-mentioned compare one by one can be in quilt
Can also carry out in unpressed sequence to be assembled in the sequence to be assembled of compression.
For compressed sequence to be assembled, the comparison one by one in group refers to the rope of every compressed sequence (current sequence)
Draw table and be compared with the concordance list for organizing interior other sequences: on the one hand, will only be occurred very in current sequence in Multiple Sequence Alignment
Few number (for example, only occur 1-5 time or not more than 5 times or not more than 4 times, not more than 3 times, not more than 2 times, be not more than 1
It is secondary) anchor point sequential labeling deleted from the concordance list of the sequence, such as current sequence concordance list be [331,101,21,17],
And the anchor point sequence in Multiple Sequence Alignment marked as 21 only occurs in the concordance list of current sequence once, and not any
Occurred in the concordance list of other sequences, then deletes the anchor point sequential labeling in current sequence concordance list [331,101,21,17]
21, i.e., the concordance list of former sequence becomes [331,101,17] by [331,101,21,17];On the other hand, if in multisequencing ratio
Centering finds that (i.e. all sequences Chong Die with breakpoint left end are not be overlapped on the right of breakpoint, on the contrary there are breakpoint in current sequence
And in this way, then determine that the point is breakpoint), then this whole sequence is known as heterozygous sequence, and it is deleted.Comparing one by one simultaneously
To in the process, if current sequence includes by other sequences, delete this by comprising current sequence (for example, if current sequence
Concordance list is [x1,x2] or [x2,x3], when there are concordance list be [x1,x2,x3] sequence, then delete current sequence).It is deleted
The sequence removed will be no longer participate in the building of subsequent assembling skeleton.
It is walked in view of the error rate of the sequence to be assembled obtained by third generation sequencing technologies is relatively high, therefore by S203
Suddenly compressed sequence to be assembled is corrected or is removed, it is very big to may make that the accuracy of subsequent assembling skeleton has obtained
It improves.
Then, in step S204, component composition skeleton.Specifically, using in S203 it is revised it is compressed to
Sequence construct optimum superposing map is assembled, to complete component composition skeleton.Because of step S204 and step described above
S104 is similar, is just not repeated here.
Also, present invention also provides the embodiments of another gene order assemble method.Fig. 3, which is that the application is another, to be shown
A kind of flow chart of gene order assemble method of example property embodiment.Gene order assemble method shown in Fig. 3 mainly includes
Five steps of S301, S302, S303, S304 and S305.
Wherein in step S301, the data of a sequence sets to be assembled and the data of multiple anchor point sequences are obtained;Then exist
In step S302, compressing sequence to be assembled is the concordance list indicated with the label of anchor point sequence;In S303 step, pass through comparison
Delete heterozygous sequence in compressed sequence to be assembled and by comprising sequence, and will be in the Multiple Sequence Alignment of sequence to be assembled
Only occur seldom number (for example, only occur 1-5 times or not more than 5 times or not more than 4 times, not more than 3 times, not more than 2
It is secondary, not more than 1 time) the label of anchor point sequence deleted from the concordance list comprising its sequence;In step s 304, building group
Fill skeleton;And in step S305, the higher gene order of accuracy rate is obtained by integration and assembles result.In present embodiment
In, step S301, S302, S303 and S304 are similar with above step S101, S102, S103 and S104 respectively, herein
Just it is not repeated.Step S305 described in detail below.
Specifically, being corrected again to what the assembling skeleton of building was discussed below: in step S305 by needed group
Dress sequence is compared with the obtained assembling skeleton of step S304, for any site on skeleton, by calculating is all can
It compares the frequency that every kind of base occurs into each sequence in the site to vote, using the voting results in each site to group
Dress skeleton is corrected again, so that finally integration obtains the higher gene order assembling result of accuracy rate.
In some embodiments, above-mentioned ballot integration be by by all sequence alignments to be assembled to assembling skeleton,
And according to comparison result construct great-jump-forward k-mer figure (Ye, C., Ma, Z.S., Cannon, C.H., Pop, M.&Yu,
D.W.Exploiting sparseness in de novo genome assembly.BMC
Bioinformatics13Suppl 6, S1 (2012)), ballot integration is carried out using a plurality of sequence in same site, then correct bone
Mistake in frame, and obtain the higher gene order of final accuracy rate using the structure of great-jump-forward k-mer figure and assemble result.Jump
The purpose of formula k-mer be in big data quantity by jumping mode to save calculating when the memory that occupies.K value generally uses 1
~5, the slightly big k value of use can get around the mistake in alignment algorithm and compare.The step-length g that jumps is typically also 1~5, slightly big
Step-length can substantially reduce the memory occupied when calculating.In the case where integrating the comparable situation of precision, with existing PBagcon algorithm
(Chin,C.S.et al.Nonhybrid,finished microbial genome assemblies from long-read
SMRT sequencing data.Nature methods10,563-569 (2013)) it compares, use Sparc as described below
Integration algorithm speed is its 5 times or so, when calculating occupy in save as its 1/2-1/5.
Sparc integration algorithm is a kind of method based on great-jump-forward k-mer, can specifically include three steps, step 1:
Construct skeleton k-mer figure;Step 2: carrying out sequence alignment and recording new node and side according to comparison result in comparison process
And update the weight on each side;And step 3: search maximum path.Above steps is made below with reference to attached drawing 5A-5D
It is described in detail out.
As shown in Figure 5A, assembling frame sequence is ACTGGACTAAA, and with k-mer length for 2, jump step-length g is 3 buildings
K-mer figure, then: the initial position of the k-mer is respectively Isosorbide-5-Nitrae, 7,10, corresponding k-mer be respectively as follows: AC, GG, CT,
AA is chosen as node.Record the connection relationship between k-mer: such as first k-mer AC, which turns right, extends past TGG tri-
After base, the end position of second k-mer GG is reached.TGG is then registered as the side of connection AC node and GG node
(edge), at this time because TGG thus only occurred once, so the weight on side is 1, and so on.
The comparison result of several sequences and skeleton will be illustratively analyzed below.
As shown in Figure 5 B, first original series (sequence or its segment i.e. to be assembled) is ACTGGACCAAA, by itself and group
Skeleton is filled to compare.In comparison result, AC, GG node of initiation site 1,4 can compare completely, then to the node be not required into
Row operation bidirectional, only increases the weight on side, i.e. the weight of path TGG becomes 2.When reaching 7 site, due to CC with
CT Incomplete matching in skeleton then using CC as new node, and records one from the 4th node GG to the 7th node CC
New route ACC, as new side, weight 1.Due to the AA and skeleton track of the 10th node, so not needing to generate new
Node, only record the new side AAA from the 7th new node CC to existing tenth site before, weight 1.
As shown in Figure 5 C, Article 2 original series are ACGGACCAAA, it is compared with skeleton.In comparison result, rise
The node AC matching in beginning site 1, reaches the 4th node GG by two bases G G, the side generated due to GG node and front
TGG is different, so generating new edge GG, weight 1.Next, when from the 4th node GG to the 7th node, due to
Node CC presequence in generated, path has also been recorded, then increases the weight on the corresponding side ACC, weight 2.
The process can be carried out by sequence, the maximum path of weight then can be searched in final result and by the path phase
It combines (as shown in Figure 5 D), to obtain integration sequence: ACTGGACCAAA.In some embodiments, all weights in figure
It needs to be overregulated, for example subtracts some small numerical value (such as 10%-30% of region sequencing depth), to avoid by being sequenced
The erroneous path of biggish weight summation caused by long insertion property mistake in region.
Therefore, by comparing all sequences to be assembled and assembling skeleton, the nucleotide to a plurality of sequence in same site
Ballot integration (find out the highest nucleotide of the frequency of occurrences and choose nucleotide of the nucleotide as the site) is carried out, preferably
, the assembling skeleton is corrected by Sparc integration algorithm again, to obtain the higher gene order assembling result of accuracy rate.
Fig. 4 A to Fig. 4 G is the schematic diagram that the method according to Fig. 1 to Fig. 3 carries out sequence assembling.Firstly, as shown in Figure 4 A,
In step S101, S201 or S301, sequence sets and anchor point sequence to be assembled is obtained.Wherein the left Fig. 4 A is various types of thin
Lines indicate to have the anchor point sequence of number designation, and the various thick lines of right then indicate that each sequence to be assembled (such as is schemed
Original long sequence shown in 4A).As described above, the sequence sets to be assembled can be the original obtained by gene order-checking instrument
The data or the data based on original series of beginning sequence data pretreated obtained from being assembled;The anchor point sequence
Column can be to be obtained by screening in existing database, or by the way that using existing composite software, (such as Hua Da gene is public
The composite software provided is provided) obtained from obtained to second generation genomic sequencing technique short sequence assembles, either
The partial cut away of sequence itself to be assembled in sequencing result.
Fig. 4 B shows sequence mark corresponding with S102, S202, S302 step and compression process: i.e. by anchor point sequence
Gene order is carried out with sequence to be assembled to compare, and calculates exhaust position of the anchor point sequence in sequence to be assembled, and such as Fig. 4 B institute
Show, with the labelled notation of anchor point sequence sequence to be assembled;Labeled each item sequence to be assembled is compressed to by each anchor point
There is the list of sequential arrangement with it in long sequence in the label of sequence, completes sequence compaction and obtain with anchor point sequence to be label
Sequence to be assembled concordance list.
Step corresponding with S203, S303 as shown in figures 4 c and 4d: it is obtained using from S102, S202, S302
Concordance list, the grouping that reversed retrieval obtains all sequences to be assembled relevant to a certain anchor point sequence are (as shown in Figure 4 C, reversed to examine
Suo Suoyou includes the sequence of packets that the sequence of anchor point sequence 1 obtains);Sequence in obtained group will be reversely retrieved later to carry out one by one
It compares, for every sequence (current sequence), its concordance list is compared with the concordance list of other sequences in all groups: one
Aspect few number will only occur (for example, only occurring 1-5 times or not more than 5 in Multiple Sequence Alignment in current sequence
It is secondary or not more than 4 times, not more than 3 times, not more than 2 times, not more than 1 time) anchor point sequential labeling from the concordance list of the sequence
It deletes;On the other hand, if in Multiple Sequence Alignment, it is (i.e. all Chong Die with breakpoint left end that there are breakpoints in discovery current sequence
Sequence with breakpoint on the right of it is not be overlapped, otherwise be also in this way, then determining that the point is breakpoint), then using the current sequence as miscellaneous
Sequence is closed to delete;In another aspect, in comparison process one by one, if current sequence includes by other sequences, delete this by comprising
Current sequence (Fig. 4 D).Deleted anchor point and heterozygous sequence or by comprising sequence will be no longer participate in subsequent assembling skeleton
Building.Sequence to be assembled is grouped after above-mentioned reversed retrieval, and the concordance list of every sequence to be assembled is compared
Modified process, it is finally obtained the result is that the concordance list collection of sequence to be assembled that is more accurate and simplifying, and this is precisely and essence
The concordance list of letter concentrates all sequences that will be used in the building process of subsequent assembling skeleton.
And Fig. 4 E- Fig. 4 F then illustrates the process of component composition skeleton corresponding with S104, S204, S304: being based on
The concordance list, by the concordance list of the compressed sequence (current sequence) to be assembled of each and other quilts for assembling skeleton
The concordance list of compressed sequence is compared one by one, finds out the alternative sequence for having overlapping relation with the current sequence, will be alternative
Sequence and current sequence are indexed table and compare or base ratio pair, and concordance list is entirely by other sequences first in removal comparison result
The sequence that completely includes of concordance list, searching out later has Chong Die anchor point sequence with it and can prolong in one direction
Concordance list in comparison result is not had the sequence of Maximum overlap in one direction by a plurality of compressed sequence to be assembled of exhibition
Column remove, and only retain the optimal extension sequence of current sequence, and it is added to relative to the part that current sequence extends and works as preamble
The sequence spreading of current sequence is obtained on column, and by using every sequence in the sequence sets to be assembled as current sequence
It repeats the above steps to obtain one group of sequence spreading, then according to the position corresponding relationship pair of anchor point sequential labeling in sequence spreading
Each sequence spreading is arranged (as shown in Figure 4 E);To complete component composition skeleton (as illustrated in figure 4f).
It is as shown in Figure 4 G step corresponding with S305: after component composition skeleton, by all sequence alignments to be assembled
Onto assembling skeleton, for any site on assembling skeleton, by calculating all compare into each sequence in the site
The frequency that every kind of base in the site occurs is voted, and is repaired again using the voting results in each site to assembling skeleton
Just, to obtain accurate gene order assembling result.
Fig. 6-8 is for realizing the device block diagram of gene order assemble method described in above-mentioned Fig. 1-3.Referring to Fig. 6, the dress
It sets including obtaining module 601, compression module 602 and assembling module 604.Wherein obtain module 601 be configured as obtain one to
Assemble the data of sequence sets and the data of multiple anchor point sequences;Compression module 602 is configured as using the multiple anchor point sequence,
Each of the sequence sets to be assembled sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein
The concordance list is indicated with the label of anchor point sequence;Assembling module 603 is configured as using each in the sequence sets to be assembled
The concordance list component composition skeleton that sequence compaction to be assembled obtains.
As shown in fig. 7, the device of above-mentioned realization gene order assemble method is in addition to including and 601,602,604 tool in Fig. 6
Have other than acquisition module 701, compression module 702 and the assembling module 704 of identical function, can further include to be assembled
The concordance list correction module 703 of sequence.Wherein the concordance list correction module 703 of the sequence to be assembled is configured as utilizing the anchor
Point sequence reversely retrieves the concordance list of sequence to be assembled, obtains sequence to be assembled corresponding with the anchor point sequence
Concordance list subset, by the way that the concordance list of every other sequence in the concordance list subset is compared the rope to current sequence to be assembled
Draw on table the concordance list of every sequence to be assembled in the concordance list subset for correcting the sequence to be assembled.
As shown in figure 8, the device of above-mentioned realization gene order assemble method in addition to include with 701 in Fig. 7,702,703,
The 704 concordance list correction modules 803 and group with the same function for obtaining module 801, compression module 802, sequence to be assembled
Other than die-filling piece 804, gene order assembling modified result module 805 can further include.Wherein the gene order assembles
What modified result module 805 was configured as that the assembling skeleton to building is discussed below corrects again: after component composition skeleton,
All sequences to be assembled are compared with assembling skeleton, for any site on skeleton, by calculate it is all compare to
The frequency that every kind of base in each sequence in the site in the site occurs is voted, and the voting results in each site are utilized
Assembling skeleton is corrected again, to obtain again revised gene order assembling result.In some embodiments, as schemed
Device shown in 8 can further include: gene order assembling result integrates module 806, and the gene order assembles result
Integrate module for obtaining the assembling skeleton for covering gene order different zones to be assembled, simultaneously using multiple computer processors
The assembling skeleton of the covering gene order different zones to be assembled is corrected respectively, and revised assembling skeleton will be carried out again
Integration obtains complete gene order assembling result.
The device of the realization gene order assemble method of the application may include module corresponding with above method step
And/or block combiner, and with the change of above method step, above-mentioned apparatus can also carry out changing accordingly to realize this
The gene order assemble method of application.
Also, present invention also provides a kind of systems for realizing method described in above-mentioned Fig. 1-3.Fig. 9 is according to one
The system 900 of the application shown in exemplary embodiment a kind of.System 900 may be any type of digital platform, including platform
Formula computer, notebook, Portable palm computer, digital platform, computer cluster, network workstation etc..
System 900 as shown in Figure 9 includes: one or more input modules 910 by connecting inside bus 950;One
A or multiple output precisions 920;One or more central processing units (CPU) 930;And one or more storage assemblies 940, institute
It states storage assembly 940 and stores one or more computer programs 944, one or more operating systems 946 and optional wherein
One or more databases 948.
Wherein, input module 910 allows to provide data input to system 900.Common input module 910 includes number
According to input interface, network transport interface or other types of input part.In this application, input module 910 is for obtaining
The data of the data of one sequence sets to be assembled and multiple anchor point sequences.
Also, output precision 920 includes graphical user for providing a user calculated result, common output precision 920
Interface (GUI), three-dimensional display interface, output data interface, network transport interface or other types of output block etc..
The integrated operation of the usual control system 980 of central processing unit (CPU) 930, such as with display, data processing, data
Storage, data communication and the associated operation of record operation etc..Central processing unit (CPU) 930 may include one or more
Processor executes instruction, to perform all or part of the steps of the methods described above.In addition, central processing unit (CPU) 930 can be with
Including one or more modules, convenient for interaction of processing central processing unit (CPU) between 930 and other assemblies.For example, centre
Managing device (CPU) 930 may include input/output module, to facilitate input module 910, output precision 920 and central processing unit
(CPU) interaction between 930.
Memory 940 can be by any kind of direct access storage device RAM 941 or read only memory devices ROM 942
Or their combination is realized, such as static random access memory (SRAM), electrically erasable programmable read-only memory
(EEPROM), Erasable Programmable Read Only Memory EPROM (EPROM), programmable read only memory (PROM), read-only memory
(ROM), CD-ROM, tape, floppy disk and optical data storage devices etc..It is various types of that memory 940 can be configured as storage
Data are to support the operation in system 900.The example of these data includes any operating system for operating in system 900
946, computer program 944, database 948 etc..The computer program 944 can be wherein executed in the system 900, counted
Calculation machine program 944 executes gene order described herein by the corresponding module of commands direct central processing unit 930 and assembles
Method.Also, in the present invention, memory 940 is also configured in the data for storing the sequence sets to be assembled and multiple anchors
The data of point sequence.
Wherein, method as shown in Figs. 1-3 can be with such as computer software, hardware or a combination thereof come can in computer
It reads to realize in medium.For hardware realization, the embodiments described herein can pass through specific integrated circuit (ASIC), number
Signal processor (DSP), digital signal processing appts (DSPD), programmable logic device (PLD), field programmable gate array
(FPGA), processor, controller, microcontroller, microprocessor, the other electronic units for being designed to carry out function described here or
Its one or more selectively in combination of person is realized.
For for software implementations, the embodiments described herein can use individual software module, such as procedure module and
Functional module realizes that each of which is carried out one or more function and operations described here.Software code, which can be used, appoints
Software application that programming language appropriate is write realizes, and can store dedicated computer system memory or
It in other computer-readable mediums of person, and is executed by the processor of computer system, also may be mounted at and have data and deposit
In other electronic equipments of storage and processing function, such as tablet computer, server.
In some embodiments, it can be realized by system 900: the multiple anchor point sequence be utilized, to described to group
Each of dress sequence sets sequence to be assembled is compressed to obtain the concordance list of the sequence to be assembled, wherein the concordance list
It is indicated with the label of anchor point sequence;And the rope obtained using each sequence compaction to be assembled in the sequence sets to be assembled
Draw table component composition skeleton.In some embodiments, it can be further realized by system 900: to be assembled in acquisition every
It after the concordance list of sequence, is reversely retrieved, is obtained and the anchor point sequence using concordance list of the anchor point sequence to sequence to be assembled
The concordance list subset for arranging corresponding sequence to be assembled, by by the concordance list ratio of every other sequence in the concordance list subset
To the rope of every sequence to be assembled in the concordance list subset on the concordance list to current sequence to be assembled to correct sequence to be assembled
Draw table.In other embodiments, the assembling skeleton that can be further realized by system 900 to building is discussed below
Correct again: after component composition skeleton, by all sequences to be assembled with assembling skeleton be compared, for assembling skeleton on
Any site, by calculate it is all compare into each sequence in the site the site every kind of bases appearance the frequency into
Row ballot is corrected assembling skeleton using the voting results in each site, again to obtain again revised gene order
Assemble result.In some embodiments, it can be further realized by system 900: obtain and cover gene order to be assembled not
With the assembling skeleton in region, the covering gene order to be assembled not same district is corrected respectively simultaneously using multiple computer processors
The assembling skeleton in domain, and revised assembling skeleton will be integrated to obtain complete gene order assembling result again.
System 900 can be configured in realize the application it is recorded and comprising method either step and either step
Combination, be not repeated herein.
In conclusion this application provides it is a kind of efficient and support various sequencing technologies from the beginning
(de novo) packing algorithm.With for the third generation sequencing some existing algorithmic tools compared with, the present processes need compared with
Low sequential covering rate, the higher continuity of offer, the smallest memory of needs and the gene order-checking sequence sets big in processing
Shi Sudu has the promotion of multiple orders of magnitude.For example, can be in office according to the Genome Size that the present processes assemble
Anticipate range, such as between 30kbp-300Gbp or between 1Mbp-4.6Mbp or between 1Mbp-12Mbp or
Between 1Mbp-120Mbp or between 1Mbp-3Gbp.And be sequenced depth can between 5x-200x, 5x-20x it
Between or between 5x-30x or between 5x-40x or between 5x-50x or between 5x-128x or
Between 5x-237x.And the assembling time-consuming of the genome by being carried out using the present processes, system and device is existing
There are the 10 of assembling tool-1-10-5Times or 10-1-10-3Times or 10-2-10-3Times or 10-2-10-5Times or 10-3-
10-5Times.
Method, system and device disclosed in the present application and company (such as the Pacific Ocean for carrying out third generation sequencing technologies
Biological Science Co., Ltd (Pacific Biosciences) and Oxford nano-pore company (Oxford Nanopore)) sequenator it is equal
Compatible, also with second generation sequencing technologies, the various sequencing technologies such as Illumina company are compatible.Also, it is disclosed herein
Method, system and device can be used for big animal, plant, microorganism or people genome research.Further, this Shen
Method, system and device that please be disclosed allow possibly even high on common desktop computer on the computer of work station rank
Effect assembles large-scale genome, and existing assembling tool usually requires supercomputer or computer cluster.Therefore, the application institute
Disclosed method, system and device, which will greatly speed up third generation sequencing technologies and extend to, is related to plant, animal, microorganism
Field is more widely applied with the research of human genome etc..
Method described herein will be compared with other existing assemble methods by multiple embodiments below.
Embodiment 1:
Third generation sequencing sequence collection is assembled
In the present embodiment, it compares based on program designed by the present processes, system and device (hereinafter referred to as
For " DBG2OLC " program) with other assembly programs PacBio2CA (referring to Koren, S.et al.Hybrid error
correction and de novo assembly of single-molecule sequencing reads.Nature
Biotechnology 30,693-700 (2012)), HGAP is (referring to Chin, C.S.et al.Nonhybrid, finished
microbial genome assemblies from long-read SMRT sequencing data.Nature
methods 10,563-569(2013);Berlin,K.et al.Assembling Large Genomes with Single-
Molecule Sequencing and Locality Sensitive Hashing, (2014)) and ECtools (referring to Lee,
H.et al.Error correction and assembly complexity of single molecule
Sequencing reads, (2014)) obtain as a result, wherein other above-mentioned assembly programs be for the third generation be sequenced in difference
Sequencing sequence concentrate design genome assembly program.
The sequence to be assembled selected in the present embodiment is the S.cerevisiae measured based on PacBio microarray dataset
The sequencing sequence collection of w303 genome (Genome Size 12Mbp), sequencing sequence concentration sequence average length to be assembled are
5kbp.Anchor point sequence comes from: Illumina Miseq sequencing sequence collection is utilized, with the jump side k-mer of SparseAssembler
Method is assembled, and about 2500 anchor point sequences are obtained, and anchor point sequence average length is about 4.3kbp, and anchor point sequence total length is about
11.4Mbp, the coverage rate to testing gene group sequence is about 95%.Based on the DBG2OLC of the application to second generation gene sequencing
The sequence sets that (50x depth/coverage rate) and third generation gene sequencing (10x/20x depth/coverage rate) obtain are assembled, with k-
The parameter that mer length is 17bp, threshold value is (length of each anchor point sequence of 0.1%*) is by anchor point sequence alignment to sequence to be assembled
The set of to be assembled sequence index table of the column to be compressed, corrects the mistake in the sequence to be assembled of compression later, and
The step of completing the building of assembling skeleton one has shared 10CPU minutes general, has then been run twice with Sparc to gene order
The amendment step of result is assembled to obtain suitable sequence assembling result.In the present embodiment, by all sequences to be assembled in Sparc
Column are with assembling skeleton to pass through existing Blasr program (Chaisson, M.& the step of corrective gene sequence assembling result
Tesler,G.Mapping single molecule sequencing reads using basic local alignment
with successive refinement(BLASR):application and theory.BMC
Bioinformatics13,238 (2012)) realize, wherein carrying out sequence alignment using Blasr program with corrective gene sequence
The process of assembling result occupies in entire assembling process most operation time (about 2 CPU hours).
The present processes are compared with other three existing softwares, and more revised than right
W303PacBio assembles sequence results.
Table 1.S.cerevisiae genome assembling property compares (Genome Size: 12M bp)
Term " identity " in table 1 refers to that the alkali of the identical part of time series is compared in two or more pieces nucleic acid sequence
Radix accounts for the percentage of whole base numbers of whole sequence.Term " NG50 " in table 1 refers to works as in gene order assembling
The minimal size of the long assembling sequence being included of 50% testing gene original.Term " amendment NG50 " in table 1 refers to
The minimal size of the long revised assembling sequence being included of testing gene original in gene order assembling when 50%.Unless
It is otherwise noted, above-mentioned term is in following embodiment or present specification other parts indicate identical meaning.
It is as shown in Table 1 the result shows that DBG2OLC needs least coverage rate (depth) and calculating time, and can obtain
To the assembling result of quality similar with other assembly programs.Under normal circumstances, DBG2OLC can be with than existing most array
The dress program low time and memory once to two orders of magnitude generates relatively good as a result, and can be in lower (10x-
20x) the PacBio sequencing sequence of coverage rate (depth), which is concentrated, realizes preferable assembling quality.In the present embodiment, it is noted that big
Part-time, which is used in, is compared all obtained original series that are sequenced to the assembling modified result portion of assembling skeleton using Blasr
Divide (2CPU hours).Therefore, the CPU time of this part needs is separately labelled in table 1.It is tied since this assembles gene order
The comparison amendment step of fruit can be divided into multiple and different regions by that will assemble skeleton, simultaneously using multiple computer processors
The different zones of amendment assembling skeleton respectively, are integrated revised different zones to obtain complete gene order group later
Dress is as a result, so its elapsed-time standards can be shorter.
Embodiment 2:
The sequence to be assembled selected in the present embodiment is the A.thaliana ler- measured based on PacBio microarray dataset
The sequencing sequence collection of 0 genome (Genome Size 118Mbp), sequencing sequence concentration sequence average length to be assembled are
5kbp.Anchor point sequence comes from: Illumina Miseq sequencing sequence collection is utilized, with the jump k-mer of Sparse Assembler
Method is assembled, and about 83000 anchor point sequences are obtained, and anchor point sequence average length is about 1.3kbp, anchor point sequence total length
About 115Mbp, the coverage rate to testing gene group sequence is about 97%.Sparse Assembler is used in the present embodiment
Composite software (Ye, C., Ma, Z.S., Cannon, C.H., Pop, M.&Yu, D.W.Exploiting sparseness in de
Novo genome assembly.BMC Bioinformatics13Suppl 6, S1 (2012)) (k-mer length is 51bp;
Sparse step-length g=15) the pre-assembled Illumina of 50x depth (the second generation gene order-checking platform) sequencing in 2 hours
Sequence sets.DBG2OLC based on the application, using above-mentioned assembling result as anchor point, to 20x/40x PacBio sequencing sequence
Collection is assembled, and with k-mer length is 17bp, parameter that threshold value is (every anchor point sequence length of 0.5%*) is by anchor point sequence ratio
To the set of the sequence index table to be assembled to sequence to be assembled to be compressed, corrected in the sequence to be assembled of compression later
Mistake, and the step of completing the building of assembling skeleton one has shared general 1 CPU hours.And Sparc (k-mer is used later
Length 1, step-length g=1) gene order assembling skeleton is modified.Take again 18 CPU hours it is final to group to complete
Fill the makeover process of result.In the whole process, maximum memory usage amount is 6GB.
In comparison, due to needing more than 1,000 CPU of time-consuming when existing software is the problem of handling identical scale
Hour (Berlin, K.et al.Assembling Large Genomes with Single-Molecule Sequencing
And Locality Sensitive Hashing. (2014)), the following are the data comparison of collection (Lee, H.et al.Error
correction and assembly complexity of single molecule sequencing reads.
(2014)).For the assembling of assessment sequence, the assembling result that the present processes are obtained and the PacBio using high coverage rate
It sequencing sequence collection and is compared through Quiver revised HGAP assembling result, calculates error correction Orders Corrected.
Table 2.A.thaliana genome assembling property compares (Genome Size: 118M bp)
Embodiment 3:
In the present embodiment, sequence sets to be assembled are the mankind of the 54x coverage rate measured based on PacBio microarray dataset
(H.sapiens) sequence sets for the longer 30x coverage rate sequence that (Genome Size: 3.0Gbp) sequencing sequence is concentrated, the survey
Sequence average length to be assembled is 14.5kbp in sequence sequence sets.Anchor point sequence comes from: utilizing the Illumina of 80x coverage rate
Miseq sequencing sequence collection, in the jump k-mer method of Sparse Assembler, (k-mer length is 51bp;Sparse step-length g
=20) it is assembled, obtains about 2,200,000 anchor point sequences, anchor point sequence average length is about 1.1kbp, anchor point sequence
Total length is about 2.6Gbp, and the coverage rate to testing gene group sequence is about 85%.Above-mentioned pre-assembled 80x coverage rate
The step of sequencing sequence collection of Illumina (second generation sequencing) obtains anchor point sequence spends 34 CPU hours.
Later based on the DBG2OLC of the application, using the pre-assembled anchor point sequence of second generation Illumina, with k-mer long
The parameter that degree is 17bp, threshold value is (every anchor point sequence length of 1%*) covers anchor point sequence alignment to above-mentioned longer 30x
In the sequence sets of rate sequence (average length 14.5kbp), thus the set of the sequence index table to be assembled compressed, this step
Suddenly 3 are taken CPU hours, and the storage of 17-mer occupies the memory of 70GB altogether.The results show, even in 54x
The sufficient sequence concentration of coverage rate also need to only spend the process on anchor point sequence alignment to all sequences to be assembled less than 6 hours
(data are not displayed in Table 3).The finally step that gene order assembling skeleton is modified of (k-mer length 1, step-length g=1)
Suddenly about 2000 are taken CPU hours.In initial report (http://blog.pacificbiosciences.com/
In 2014/02/data-release-54x-long-read-coverage-for.html), the process flower of building overlapping map
405000 CPU hours, and local susceptibility splicing (locality sensitive hash) can be used in this process
(Berlin,K.et al.Assembling Large Genomes with Single-Molecule Sequencing and
Locality Sensitive Hashing. (2014)) method reduce approximately half of time.And it is directly high at another
Spend (Myers, G.in Algorithms in Bioinformatics, Vol.8701. in the embodiment of optimization
(eds.D.Brown&B.Morgenstern), p52-67 (Springer Berlin Heidelberg, 2014)), it can be in benefit
With the time required to being further reduced on the basis of more memories to 15600CPU hours.
The quality of DBG2OLC final assembling result can be obtained with currently advanced assembling tool using a greater amount of resources
Result compare favourably.For the accuracy of assessment result, result will be assembled in below table and the mankind refer to genome chromosome
14 are compared, to obtain identity and correct the comparison result of NG50.
Table 3.H.sapiens genome assembling property compares (Genome Size: 3.0G bp)
The problem of due to sequencing sequence collection scale, in table identity and amendment NG50 be used only the sequence sets of chromosome 14 into
Row calculates.
Embodiment 4:
Third generation sequencing sequence collection based on Oxford Nanopore platform is assembled
DBG2OLC can be normally used for calculating the sequence of 30% or more error rate.However, in order to guarantee accurately to integrate knot
Fruit then needs higher sequencing depth.In order to avoid this problem, in an experiment to high-precision degree series (such as the second generation sequencing in
Contig) be assigned with higher weight.
In the present embodiment, sequence sets to be assembled are the Escherichia coli of the 30x coverage rate based on Nanopore microarray dataset
K12 genome (Genome Size: 4.6Mbp) sequencing sequence collection, sequencing sequence concentration sequence average length to be assembled are
6.5kbp.Anchor point sequence comes from: using the Illumina Miseq sequencing sequence collection of 30x coverage rate, with Sparse
The jump k-mer method of Assembler is assembled, and about 2500 anchor point sequences are obtained, and anchor point sequence average length is about
1.7kbp, anchor point sequence total length is about 115Mbp, and the coverage rate to testing gene group sequence is about 97%.Based on the application's
DBG2OLC assembles the Nanopore sequencing sequence collection of 30x coverage rate, using above-mentioned assembling result as anchor point with k-
The parameter that mer length is 15bp, threshold value is (every anchor point sequence length of 2%*) by anchor point sequence alignment to sequence to be assembled from
And the set of the sequence index table to be assembled compressed, the mistake in the sequence to be assembled of compression is corrected later, and is completed
Assemble the building of skeleton.And 3 wheels have been run with Sparc (k-mer length 1, step-length g=1) later, skeleton is assembled to gene order
The step of being modified, the results are shown in Table 4.
Table 4.E.coli K12 genome assembling property compares (Genome Size: 4,6M bp.)
Embodiment 5:
Second generation sequencing data based on Illumina platform is assembled
In order to assess short sequence sets assembling tool in the long sequence that processing is obtained by accurate second generation sequencing technologies
Performance, downloaded for E. coli K-12 bacterial strain MG1655 clone (DNAnexus database (specifically refers to
Http:// sra.anexus.com) accession number: SRA073308) 50x coverage rate Illumina Miseq sequencing sequence
Collection.In this experiment, result assembling tool described herein assembled longer Illumina sequencing sequence collection
With other several common assembling tools (including SGA (Simpson, J.T.&Durbin, R.Efficient de novo
assembly of large genomes using compressed data structures.Genome research22,
549-556(2012)),SOAPdenovo2(Luo,R.et al.SOAPdenovo2:an empirically improved
memory-efficient short-read de novo assembler.GigaScience1,18(2012)),SPAdes3
(Bankevich,A.et al.SPAdes:a new genome assembly algorithm and its
applications to single-cell sequencing.Journal of computational biology:a
Journal of computational molecular cell biology19,455-477 (2012)) assembling result into
It has gone and has compared.
Ironically, for the relatively long sequencing sequence collection generated in third generation sequencing technologies, traditional tool
There is the assembling result of the DBG of fixed k-mer length and non-optimal, has exposed and be difficult to differentiate repeat region and when making
The shortcomings that sequencing sequence collection of higher precision and more high coverage rate is needed when with biggish k value.This makes contradictory requirement
It is obtained to be difficult to obtain the optimum combination of limited computing resource.Iteration DBG is more to calculate time, expense and more complicated calculation
Method design and implementation is cost, partially solves the above problem.For example, must be generated using an error-correcting program it is long and
Correct k-mer word string.SGA finds accurate match using FM-index indexing means, this also brings the limit to sequence quality
System.
Opposite, the algorithm of the application is stable and looser to the limitation of quality, thus its can it is more effective and
Accurately find overlapping region.K=31 is arranged using Sparse Assembler assembly program to assemble in the DBG2OLC of the application
Initial anchor point sequence.Compressed sequence is calculated using the anchor point sequence and sequence sets to be assembled.
Wherein anchor point sequence comes from: the Illumina Miseq sequencing sequence that the length using 50x coverage rate is 150bp
Collection, is assembled in the jump k-mer method of SparseAssembler, obtains about 400 anchor point sequences, anchor point sequence average
Length is about 13kbp, and anchor point sequence total length is about 4.6Mbp, and the coverage rate to testing gene group sequence is about 100%.It is based on
The DBG2OLC of the application, using above-mentioned assembling result as anchor point sequence, to the Illumina sequencing sequence collection of 50x coverage rate
It is assembled, is pressed anchor point sequence alignment to sequence to be assembled with the parameter that k-mer length is 31bp, threshold value is 0
The set of the sequence index table to be assembled of contracting, and complete the building of assembling skeleton.For Illumina data, due to precision
Height, assembling skeleton can be used as final result.Anchor point sequence and sequence to be assembled come from same data set in this example.It is main
It wants not can solve than k long in genome the reason is that traditional k-mer method only can solve the repeat region shorter than k, but than reading
All repeat regions of length.So while anchor point sequence average length is very long, but still there are a large amount of repeat regions that cannot unlock.
DBG2OLC then compresses sequence to be assembled with anchor point, and has utmostly dissolved these repeat regions than reading length.
In this sequence sets, the building of the overlapping map in herein described method only takes the general 1 second time, and
Wherein the compression of sequence takes when most calculating m- about two minutes.
5. pairs of the table assembling properties using the Miseq E.coli genome sequence read compare (Genome Size: 4.6M
bp)
Embodiment 6:
In the present embodiment, Sparc integration algorithm described in this application is various from bacterial genomes to mammal
It is verified in the data set of Genome Size.Sparc integration algorithm PacBio data set (http: //
Schatzlab.cshl.edu/data/ectools/ in) and Oxford Nanopore data set (http: //
Gigadb.org/dataset/100102 the verification result in) is as follows.
It is PBdagcon with the most similar existing algorithm of Sparc integration algorithm, will be shown below in above-mentioned two data
Concentrate the comparison result of PBdagcon algorithm and Sparc integration algorithm.It is inputted based on identical data, uses such as the application
The method generates assembling skeleton first with DBG2OLC, finally uses MUMmer 3 (Kurtz, S.et
al.Versatile and open software for comparing large genomes.Genome Biology5,
R12 (2004)) dnadiff function calculate final integration identity.All experiments are with AMD Opteron
It is carried out in the work station of 2425 HE CPUs processors (under the frequency of 800MHz).
Sparc is designed to allow using a kind of (such as second generation or the third generation are sequenced) or a variety of mixing sequencing datas are (such as
Simultaneously using second generation sequencing and third generation sequencing sequence) and assign corresponding path higher weight.By the knot of integration
Fruit is re-used as input and is iterated to rerun integrating operation and helping to improve identity.In the integration experiment of mixed data set,
The Illumina assembled using Sparse Assembler assembling tool is assembled anchor point sequence and the weight on side is increased to 5.
In PacBio data set, two-wheeled integration algorithm, existing after the first round and the second wheel by base accuracy rate are run with k=1, g=1
It is indicated in table 6 with identity 1 and identity 2.In first experiment, E.coli sequencing data obtained in PacBio is used
Collect and identity is detected under different sequencing depth profiles.It is used in the sequencing data of 10x/30x depth
The longest assembling skeleton that DBG2OLC is obtained is respectively 1.3Mbp and 4.6Mbp.Registration number can be found in existing database is
NC_000913, the E.coli that length is 4.6Mbp refer to genome.It is concentrated in blended data, using only the data of 10x depth
Sparc integration algorithm can reach 99.91% identity, and work as and the data Shi Qineng of 30x depth is used to reach better
Integrate quality (99.98% identity).And the identity obtained in PacBio data set is only slightly below in mixed data set
Obtained in result.
The calculated result that the E.coli sequencing data obtained in PacBio of table 6. is concentrated
It is sequenced using the A.thaliana (Genome Size 120Mbp) that sequencing depth is 20x obtained in PacBio
Data set detects performance of the Sparc integration algorithm when calculating more large data capacity.It is assembled using the longest that DBG2OLC is obtained
Skeleton is 7.1Mbp.With HGAP (Chin, C.S.et al.Nonhybrid, finished microbial genome
assemblies from long-read SMRT sequencing data.Nature methods 10,563-569
(2013)) the pure PacBio data set genome assembling that method obtains is by as referring to calculate identity.
The calculated result that the A.thaliana sequencing data obtained in PacBio of table 7. is concentrated
In Oxford Nanopore data set, it is based on its higher error rate, four-wheel integration is run with k=2, g=2
It is indicated after algorithm, the first round and fourth round by base accuracy rate in table 8 with identity 1 and identity 2.In blended data
The results are shown in Table 8 with PBdagcon integration algorithm with Sparc is concentrated use in Oxford Nanopore data for concentration.Fortune
It can be higher than in the Oxford Nanopore data set with about 40% original error rate with Sparc integration algorithm
99.5% identity is higher than the result identity obtained in mixed data set.In this experiment, obtained longest group
Dress skeleton is 4.6Mbp.
The calculated result that the E.coli sequencing data obtained in Oxford Nanopore of table 8. is concentrated
It concentrates in the blended data of 30x PacBio E.coli and memory and quality is compared using different parameters,
As shown in table 9, in addition to other than setting higher g value and can reduce the consumption of memory, the variation of other parameters fails to draw
Play the larger change of result.
Table 9. is compared memory and quality using different parameters
Above-mentioned experiment as the result is shown when using Sparc integration algorithm, can generally set k=1-3, g=1-5.It uses
Mixed data set can reduce the demand to third generation sequencing data depth, this will substantially reduce sequencing cost.Due to the second generation
Sequencing has higher accuracy rate, and the result that the second generation is sequenced is used to integrate the quality that will be helpful to improve gene assembling.
Although the various aspects and embodiment of the application are disclosed, other aspect and embodiment are for this field skill
It is also obvious for art personnel.Various aspects and embodiment disclosed herein are for illustration purposes only, rather than limit mesh
's.The protection scope and purport of the application is only determined by appended claims.
Equally, each chart can show the exemplary architecture or other configurations of disclosed method and system, help
It may include the feature and function in disclosed method and system in understanding.Claimed invention is shown shown in being not limited to
Example property framework or configuration, and desired feature can be realized with various alternative architectures and configuration.In addition to this, for process
Figure, functional descriptions and claim to a method, box sequence described herein, which should not necessarily be limited by, to be implemented with similarly sequence to hold
The various embodiments of the row function, unless explicitly pointing out within a context.
Unless otherwise expressly stated, term and phrase and its variant used herein are interpreted as open,
Rather than it is restrictive.In some instances, such as " one or more ", " at least ", scalability word as " but being not limited to "
It converges and the appearance of phrase or other similar term should not be construed as to be intended in the example without this scalability term
Or need the case where indicating constriction.