CN103699818A - Bidirectional edge expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry - Google Patents

Bidirectional edge expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry Download PDF

Info

Publication number
CN103699818A
CN103699818A CN201310670740.1A CN201310670740A CN103699818A CN 103699818 A CN103699818 A CN 103699818A CN 201310670740 A CN201310670740 A CN 201310670740A CN 103699818 A CN103699818 A CN 103699818A
Authority
CN
China
Prior art keywords
kmer
bruijn
sequence
fragment
way limit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310670740.1A
Other languages
Chinese (zh)
Other versions
CN103699818B (en
Inventor
孟金涛
张慧琳
彭丰斌
魏彦杰
冯圣中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310670740.1A priority Critical patent/CN103699818B/en
Publication of CN103699818A publication Critical patent/CN103699818A/en
Application granted granted Critical
Publication of CN103699818B publication Critical patent/CN103699818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention relates to the technical field of gene sequencing and provides a bidirectional edge expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry. The bidirectional edge expanding method comprises the following steps: A) reading a sequencing data source file and constructing a multistep bidirectional De Bruijn image; B) counting forked bidirectional edges in the multistep bidirectional De Bruijn image; C) expanding the bidirectional edges in the multistep bidirectional De Bruijn image-based on the elongating kmer inquiry. According to the bidirectional edge expanding method, an elongating kmer combination is only constructed for the forked edges already existing on the De Bruijn image, the appearing times of the forked edges are inquired in an input sequence, and the combination of the bidirectional edges is selected according to the times; the optimal combination of the bidirectional edges is selected, and the possibility of fault combination of the bidirectional edges is reduced to the lowest; the length of contigs is obviously increased, and the quality loss of contig is reduced to the least.

Description

The two-way limit extended method of the elongated kmer inquiry based on the two-way De Bruijn of multistep figure
[technical field]
The present invention relates to gene sequencing technical field, particularly relate to a kind of two-way limit extended method of the elongated kmer inquiry based on the two-way De Bruijn of multistep figure.
[background technology]
It is core that gene sequencing be take algorithm and mathematical model, research contents relates to many aspects, mainly comprises: the storage of gene data with obtain, sequence alignment, order-checking and splicing, predictive genes, biological evolution and Phylogenetic Analysis, protein structure prediction, RNA structure prediction, MOLECULE DESIGN and drug design, metabolism network analysis, genetic chip, DNA calculating etc.Present combining closely of biotechnology and Computerized Information Processing Tech, has accelerated to process the speed of biological data, makes within the short time of trying one's best, biological significance to be made as far as possible and being annotated accurately, has accelerated the development of bioinformatics.At present, Bioinformatics becomes one of huge challenge that current information technical field faces.
Gene sequencing is that magnanimity gene sequence data is analyzed, thereby extracts and excavate new biological information knowledge.Wherein, relate to the knowledge such as machine learning, pattern-recognition, books analysis and excavation in computer technology, combinatorics, probabilistic model, character string, pattern algorithm, Distributed Calculation, high-performance calculation, parallel computation.Wherein, the research of full genomics is one of core of current bioinformatics research.
Gene is the most basic genetic codes of the mankind, is representing everyone life-information.In gene order, exist the nuance of genetic locus, the polymorphism of these genetic codes and the mankind's health, pathogenesis, therapeutic treatment have suitable close relationship.Wherein, DNA sequencing is one of research whole genome sequence substance that need to complete.
Since Sanger sequencing technologies in 1977 comes out, development through three more than ten years, DNA sequencing technical development is advanced by leaps and bounds, the second generation sequencing technologies that high flux, short sequence be feature of take dominates the market gradually, the third generation sequencing technologies that the single-molecule sequencing of take is feature also engenders, they occupy respectively different advantages in order-checking feature.The data of traditional gene order surveying method are extracted and the research and development of analysis software process over nearly 10 years, comparatively perfect at present.But, the development of sequencing technologies, the variation that has brought sequencing data, makes the data processing software of current existence can not meet the demand of current biomedical research.
The application of a new generation's high-flux sequence method, can complete the mensuration of whole genomic data at short notice.Making rapid progress of high-flux sequence method also proposed challenge to the analysis and processing method of the gene data obtaining simultaneously.In this current very powerful and exceedingly arrogant research field, in the urgent need to exploitation, can meet the wide bioinformatics platform of the mass data processing of high throughput sequencing technologies.In the face of individual genome plan and following personalized medicine prospect, the sequencing technologies of high efficiency, low cost becomes inevitable trend.Meanwhile, simplifying the one-stop complete complete order-checking solutions such as bioinformatic data analysis platform efficiently, is also very important indispensable developing direction.
Although yet the high-flux sequence method sequencing throughput of a new generation is high, but but can introduce sequencing error, the order-checking mistake of sample itself checks order simultaneously, check order inhomogeneous, SNP, and the repetitive sequence Repeat of genome itself, and introduce some wrong two-way limit or summits in the two-way De Bruijn of the multistep figure that these sequencing errors, SNP, repetitive sequence will be constructed when genome is assembled, and make a lot of two-way limits cannot continue expansion.Simultaneously, due to fixing kmer length, make the effective information loss of sequencing sequence, cannot decoupling zero all length over the repetitive sequence of kmer length.These situations make De Bruijn figure cannot continue to shrink above, and contig cannot expand, and finally makes the length of contig and quality all very low.
The assembling of the short genetic fragment that the high-flux sequence method of a new generation produces has caused a large amount of order-checking mistakes, and this has strengthened the calculated amount of packing algorithm greatly.A large amount of order-checking mistakes, increases assembly defect rate, has had a strong impact on assembling result.Can addresses this problem, and becomes the key of evaluating a packing algorithm quality.
The strategy of packing algorithm is mainly divided into two classes at present, and one is exactly the algorithm based on Overlap-Layout-Consensus (OLC), and another one is exactly the algorithm based on De Bruijn figure.Wherein the software based on the exploitation of OLC packing algorithm as SSAKE, VCAKE, SHARCGS etc., more takes advantage in gene length sequence assembling, but is also not exclusively applicable to the short sequence assembling of a new generation.Different from OLC packing algorithm, De Bruijn algorithm no longer be take read as unit organization data, but take k-mers, carry out data assembling as unit, its advantage mainly contains the following aspects: first, the k-mers of take carries out sequence assembling as unit, do not affect the quality of node, reduced redundant data amount.Secondly, repeat region only occurs once in the drawings, is convenient to identification, can avoid wrong assembling, reduces error rate.Finally, taking to have overlapping region to be mapped to the strategy on same arc, thereby has simplified searching route.At present, a lot of short sequence assembling algorithms are all used this framework, as Velvet, IDBA, SOAPdenovo, ABySS etc.
Wherein Velvet has effectively utilized De Bruijn figure, has realized efficient short sequence assembling.Velvet be take k-mer as base unit builds De Bruijn figure, utilizes the structure of figure, and in conjunction with corresponding sequence signature, the structure of reduced graph, finally finds an optimal path to complete assembling process.Velvet concentrates on focus in three kinds of structures that wrong data produce, i.e. tip, bubble, and erroneous connection.It is less than length all removals of 2k according to length principle and minority principle; Utilize the depth-first search strategy in Tour Bus algorithm to merge bubble, finally utilize coverage threshold method to remove erroneous connection.The method also takes full advantage of paired-end both end information, further solves repeat problem, has optimized assembling effect.Velvet makes full use of the structural property of figure, has simplified data redundancy, and speed algorithm has before had very large improvement.Although it does not carry out error correction at pretreatment stage to sequence, it has made up the defect of this respect to a great extent to wrong prevention mechanism.This is better applied in the assembling of large-scale genome sequence it.
IDBA is also based on De Bruijn figure, has realized easy and efficient short sequence assembling.IDBA be take k-mer as base unit, and for a change, it adopts the k codomain (Kmin-Kmax) of a variation, replaces obtaining by fixing k value the length of k-mers.Because gene assembling be take k-mers as unit, conventionally can form a lot of overlapped elements, this makes assembling be faced with errors present assembling, summit disappearance and the low problem of coverage.The size of correct selection k value becomes a key factor of assembling.The generation of the reads that some are wrong, also causes having produced a large amount of branching.K value is less, and branching problem is more serious, and k value is larger, and the repeat region occurring tails off, and this has directly affected the quality of assembling.IDBA adopts unfixed k value to assemble, and can solve well branching problem, thereby has improved the quality of assembling.IDBA obviously reduces the memory usage of IDBA by deleting the wrong k-mers of low coverage rate in addition, has also promoted the processing speed of IDBA simultaneously.
SOAPdenovo can complete the assembling of hundreds of millions of reads high-effect high-quality.SOAPdenovo has inherited the advantage of OLC algorithm and De Bruijn nomography, and its assembling quality is greatly improved.SOAP, by the method for preset k-mer threshold value, takes the mode of filtration, error correction to reduce the generation of faulty sequence.Meanwhile, it has been used for reference the method for Velvet software and has successfully processed bubble, and its average coverage is increased.In addition, SOAPdenovo has utilized both end information to carry out overlapping region coupling, and merges read generation contig fragment, generate the graph structure based on contig, thereby SOAPdenovo has simplified the complicacy of contig figure greatly.
ABySS introduces the thought of parallel computation, has built a Linux system in cluster, has set up a distributed De Bruijn graph structure on cluster, by data distributed storage on each node.It adopts MPI communication mechanism to complete the intercommunication mutually between node.From design of graphics, correction process, to fixed point below, merge, finally complete the reproduction of whole genome sequence, it is occupying very large advantage aspect working time and memory consumption, and its error rate is extremely low, aspect of performance particularly in cluster unit internal memory on using, all have greatly improved, obtaining applying more and more widely.
Existing main sequence composite software, SOAPdenovo for example, Velvet, ABySS, Ray etc., are that the kmer based on given length carries out De Bruijn structure, then shrink.The way of its optimization also only goes to select a best kmer length.This packaging strategy based on regular length kmer cannot decoupling zero for the repetitive sequence of the about kmer length of all length.Although IDBA can carry out iterative shrinkage De Bruijn figure to multiple kmer length, it need to go all sequences decompose storage and calculate to every kind of kmer length, and this strategy will expend huge internal memory and computing time.
Given this, overcoming the existing defect of the prior art is the art problem demanding prompt solution.
[summary of the invention]
The technical problem to be solved in the present invention is to provide a kind of two-way limit extended method of the elongated kmer inquiry based on the two-way De Bruijn of multistep figure.
The present invention adopts following technical scheme:
A two-way limit extended method for elongated kmer inquiry based on the two-way De Bruijn of multistep figure, comprising:
Steps A: read sequencing data source file, the two-way De Bruijn figure of structure multistep;
Step B: add up intersecting two-way limit in the two-way De Bruijn of described multistep figure;
Step C: the two-way limit expansion based on elongated kmer inquiry in the two-way De Bruijn of described multistep figure.
Further, in described step B, the institute that the two-way limit of all the right and lefts summit bifurcated in the two-way De Bruijn of described multistep figure and the bifurcated limit on described two-way limit are merged is situation likely, be assembled into a longer kmer, and the inquiry occurrence number of longer kmer correspondingly, the size of described occurrence number of take is weight, carries out choosing and union operation of bifurcated limit trend.
Further, in described step C, the all possible bifurcated on the two-way limit of each bifurcated is being converted to after longer kmer, height according to described occurrence number, using the highest bifurcated path of occurrence number as final bifurcated Merge Scenarios, and the two-way limit on corresponding bifurcated path is deleted together after merging.
Further, described step B further comprises:
Step B1: each the two-way limit e in traversal De Bruijn figure;
Step B2: add up the maximum length of the sequencing sequence of all input file, and assignment is to Lmax;
Step B3: if the length on two-way limit is greater than the length of Lmax-kmer, return to execution step B1, otherwise execution step B4;
Step B4: the initial vertex of two-way limit e is u, initial direction is Direc_u, and termination summit is v, and termination direction is Direc_v;
Step B5: allly enter summit u, stop limit that direction is Direc_u and get last k+1 the character on its two-way limit and put into character string array m, allly leave two-way limit that vertex v, initial direction are Direc_v and get its last character and put into character string array n;
Step B6: by left and right, two-way limit respectively with m and n array in all combinations of structure of character string, construct elongated kmer array, wherein each element in elongated kmer array is elongated kmer=m+e+n.
Further, described step C further comprises:
Step C1: open sequencing sequence file, read one by one every sequence;
Step C2: the sequence (comprising the coupling to its reverse sequence) that each coupling of elongated kmer set group is read in, and to each elongated kmer counting;
Step C3: each the two-way limit e in traversal De Bruijn figure;
Step C4: add up the maximum length of the sequencing sequence of all input file, and assignment is to Lmax;
Step C5: if the length on two-way limit is greater than the length of Lmax-kmer, return to execution step C3, otherwise execution step C6;
Step C6: the initial vertex of two-way limit e is u, initial direction is Direc_u, and termination summit is v, and termination direction is Direc_v;
Step C7: allly enter summit u, stop limit that direction is Direc_u and get last k+1 the character of its last two-way limit e_i and put into character string array m, allly leave two-way limit e_j that vertex v, initial direction are Direc_v and get its last character and put into character string array n;
Step C8: by two-way limit e left and right respectively with m and n array in all combinations of structure of character string, construct elongated kmer array, wherein each element in elongated kmer array is elongated kmer=m+e+n.In the elongated kmer of counting, inquire about these elongated kmer, and take out one of count value maximum, and by corresponding e_i, e, tri-limits of e_j merge, and then delete e_i, e, e_j.
The sequence of in described step C2, each coupling of elongated kmer set group being read in further, comprises the coupling to reverse sequence.
Further, described steps A further comprises:
Compression storing step, is specially
A11, read a sequence s;
A12, sequence s is cut into a plurality of fragment t with moving window;
A13, to each fragment t, use nucleic acid coding table to encode, and be expressed as the integer a of 64;
A14, fragment t is reversed, use symmetrical complement table that the fragment complementation of reversion is processed, obtain complementary fragment v, and the nucleic acid coding table reusing in steps A 13 encodes complementary fragment, and be expressed as the integer b of 64;
The maximum number of A15, round numbers a and integer b, as the conventional number of the k molecule of fragment t and complementary fragment v;
A16, repeating step A11-A15, until all sequences completes;
With De Bruijn figure constitution step, be specially
A21, read a sequence s;
A22, sequence s is cut into a plurality of fragment t with moving window, the conventional number of choosing its conventional number of fragment t and be cur its forward and backward fragment of mark is respectively pre, lat;
If the coding of A23 t is less than its complementary fragment coding, exchange pre, the value of lat;
A24, in the corresponding bit position 1 of the forward position mapping table of cur, represent to point to the limit of pre;
A25, in the corresponding bit position 1 of the reverse position mapping table of cur, represent to point to the limit of lat;
A26, repeating step A22-A25, process other fragments t of sequence s, until complete whole fragment t of sequence s, execution step A27;
A27, read a new sequence s, repeating step A22-A26; Until handle all sequences, execution step A28;
A28, complete the structure of two-way multistep De Bruijn figure.
Further, the moving window in described steps A 12, A22 is that length is the moving window of k, and wherein 0<k<32 and k are odd number;
Nucleic acid coding table in described steps A 13 is { A:00, C:01, G:10, T:11};
Symmetrical complement table in described steps A 14 is { A->T, C->G, G->C, T->A};
Described steps A 14 is specially, the character string of fragment t is reversed, use symmetrical complement table that each character in the character string of reversion is become to its complementary character, obtain the character string v of complementary character, and the nucleic acid coding table reusing in steps A 13 encodes character string v, and be expressed as the integer b of 64.
Further, in described steps A 22, if the fragment of fragment t before or after not having composed as sky or NULL pre or lat value;
In steps A 24, forward position mapping table is { A:0, C:1, G:2, T:3}, last character that position enquiring character is pre;
In steps A 25, reverse position mapping table is { A:4, C:5, G:6, T:7}, the complementary character of the first character that position enquiring character is lat.
Compared with prior art, beneficial effect of the present invention is:
(1) only for the bifurcated limit having existed on De Bruijn figure, construct elongated kmer combination, and in list entries, inquire about its occurrence number, then according to the merging on its two-way limit of selection of times; IDBA is by all kmer length of iteration, after all possible kmer of each kmer length need to being constructed, then shrinks De Bruijn figure, and its method will cause larger memory consumption and computing time to consume;
(2) select the merging on optimum bifurcated limit, may dropping to that the mistake on two-way limit is merged is minimum;
(3) can significantly improve the length of contigs, also the mass loss of contig can be dropped to minimum; Than other existing methods, improve contig length and will sacrifice contig quality, the present invention has had control and improvement to a certain extent.
[accompanying drawing explanation]
Fig. 1 is the two-way limit extended method process flow diagram of the elongated kmer inquiry of the embodiment of the present invention based on the two-way De Bruijn of multistep figure;
Fig. 2 is the compression storing step process flow diagram in Fig. 1 steps A;
Fig. 3 is De Bruijn figure constitution step process flow diagram in Fig. 1 steps A;
Fig. 4 be step B in the two-way De Bruijn of described multistep figure to intersecting the process flow diagram that two-way limit adds up;
Fig. 5 is the process flow diagram of step C two-way limit expansion based on elongated kmer inquiry in the two-way De Bruijn of described multistep figure.
[embodiment]
In order to make object of the present invention, technical scheme and advantage clearer, below in conjunction with drawings and Examples, the present invention is further elaborated.Should be appreciated that specific embodiment described herein, only in order to explain the present invention, is not intended to limit the present invention.
In addition,, in each embodiment of described the present invention, involved technical characterictic just can not combine mutually as long as do not form each other conflict.
The object of the invention is to design a kind of extended method of the two-way limit of intersection based on elongated kmer inquiry, it will make De Bruijn figure continue to shrink, and contigs continues expansion, can not introduce mistake simultaneously, causes the decline of contig quality, and accuracy reduces.
The two-way limit extended method that the invention provides a kind of elongated kmer inquiry based on the two-way De Bruijn of multistep figure, as shown in Figure 1, the method comprises:
Steps A: read sequencing data source file, the two-way De Bruijn figure of structure multistep;
Step B: add up intersecting two-way limit in the two-way De Bruijn of multistep figure;
Step C: the two-way limit expansion based on elongated kmer inquiry in the two-way De Bruijn of multistep figure.
Wherein, steps A specific implementation in the following way:
Compression storing step, required raw data comprises the first generation, the FASTA formatted file that the second generation and order-checking instrument of new generation generate, cuts into one by one k molecule by the sequence in FASTA file and by binary coding, compresses the conventional number of the long k molecule that is stored as 64.
As shown in Figure 2, be specially
A11, read a sequence s; Wherein, sequence s takes from FASTA formatted file;
A12, sequence s is cut into a plurality of fragment t with moving window;
A13, to each fragment t, use nucleic acid coding table to encode, and be expressed as the integer a of 64;
A14, fragment t is reversed, use symmetrical complement table that the fragment complementation of reversion is processed, obtain complementary fragment, and the nucleic acid coding table reusing in steps A 13 encodes complementary fragment, and be expressed as the integer b of 64;
The maximum number of A15, round numbers a and integer b, as the conventional number of the k molecule of fragment t and complementary fragment v;
A16, repeating step A11-A15, until all sequences completes.
By above-mentioned steps, by the kmer in two traditional De Brujin figure, the conventional number that is converted into the k molecule of 64 is stored.This step can by other softwares for example the compression of two in velvet, IDBA, SOAPdenovo kmer be stored as the conventional number of a compression k molecule, and after the conventional number that obtains k molecule, also can obtain conversely fragment t that the length of this k molecule is k and its complementary fragment v.
Moving window in steps A 12, A22 is that length is the moving window of k, and wherein 0<k<32 and k are odd number; Nucleic acid coding table in steps A 13 is { A:00, C:01, G:10, T:11}; Symmetrical complement table in steps A 14 is { A->T, C->G, G->C, T->A}; Steps A 14 is specially, the character string of fragment t is reversed, use symmetrical complement table that each character in the character string of reversion is become to its complementary character, obtain the character string v of complementary character, and the nucleic acid coding table reusing in steps A 13 encodes character string v, and be expressed as the integer b of 64.
With De Bruijn figure constitution step, 1, use the conventional number that calculates k molecule in above-mentioned compression storing step, 2, using each fragment and with its before and after adjacent fragment escape character (ESC) as this k molecule with its before and after the limit of corresponding k molecule the limit of initialization k molecular data structure of adjacent fragment; 3, the k molecular data structure after initialization being take to the conventional number of k molecule deposits hash_map in as key value.
As shown in Figure 3, be specially
A21, read a sequence s;
A22, sequence s is cut into a plurality of fragment t with moving window, the conventional number of choosing its conventional number of fragment t and be cur its forward and backward fragment of mark is respectively pre, lat;
If the coding of A23 t is less than its complementary fragment coding, exchange pre, the value of lat;
A24, in the corresponding bit position 1 of the forward position mapping table of cur, represent to point to the limit of pre;
A25, in the corresponding bit position 1 of the reverse position mapping table of cur, represent to point to the limit of lat;
A26, repeating step A22-A25, process other fragments t of sequence s, until complete whole fragment t of sequence s, execution step S27;
A27, read a new sequence s, repeating step A22-A26; Until handle all sequences, execution step A28;
A28, complete the structure of two-way multistep De Bruijn figure.
In steps A 22, if the fragment of fragment t before or after not having composed as sky or NULL pre or lat value; In steps A 24, forward position mapping table is { A:0, C:1, G:2, T:3}, last character that position enquiring character is pre; In steps A 25, reverse position mapping table is { A:4, C:5, G:6, T:7}, the complementary character of the first character that position enquiring character is lat.
In step B, the institute that the two-way limit of all the right and lefts summit bifurcated in the two-way De Bruijn of described multistep figure and the bifurcated limit on described two-way limit are merged is situation likely, be assembled into a longer kmer, and the inquiry occurrence number of longer kmer correspondingly, the size of described occurrence number of take is weight, carries out choosing and union operation of bifurcated limit trend.
The data structure of setting each summit in the two-way De Bruijn of multistep figure in this method is:
As shown in Figure 4, step B further comprises:
Step B1: each the two-way limit e in traversal De Bruijn figure;
Step B2: add up the maximum length of the sequencing sequence of all input file, and assignment is to Lmax;
Step B3: if the length on two-way limit is greater than the length of Lmax-kmer, return to execution step B1, otherwise execution step B4;
Step B4: the initial vertex of two-way limit e is u, initial direction is Direc_u, and termination summit is v, and termination direction is Direc_v;
Step B5: allly enter summit u, stop limit that direction is Direc_u and get last k+1 the character on its two-way limit and put into character string array m, allly leave two-way limit that vertex v, initial direction are Direc_v and get its last character and put into character string array n;
Step B6: by left and right, two-way limit respectively with m and n array in all combinations of structure of character string, construct elongated kmer array, wherein each element in elongated kmer array is elongated kmer=m+e+n.When also having two-way limit to process, return to execution step B1; When not having two-way limit to process, finish.
In step C, the all possible bifurcated on the two-way limit of each bifurcated is being converted to after longer kmer, according to the height of described occurrence number, using the highest bifurcated path of occurrence number as final bifurcated Merge Scenarios, and the two-way limit on corresponding bifurcated path is deleted together after merging.
As shown in Figure 5, step C further comprises:
Step C1: open sequencing sequence file, read one by one every sequence;
Step C2: the sequence (comprising the coupling to its reverse sequence) that each coupling of elongated kmer set group is read in, and to each elongated kmer counting;
Step C3: each the two-way limit e in traversal De Bruijn figure;
Step C4: add up the maximum length of the sequencing sequence of all input file, and assignment is to Lmax;
Step C5: if the length on two-way limit is greater than the length of Lmax-kmer, return to execution step C3, otherwise execution step C6;
Step C6: the initial vertex of two-way limit e is u, initial direction is Direc_u, and termination summit is v, and termination direction is Direc_v;
Step C7: allly enter summit u, stop limit that direction is Direc_u and get last k+1 the character of its last two-way limit e_i and put into character string array m, allly leave two-way limit e_j that vertex v, initial direction are Direc_v and get its last character and put into character string array n;
Step C8: by left and right, two-way limit respectively with m and n array in all combinations of structure of character string, construct elongated kmer array, wherein each element in elongated kmer array is elongated kmer=m+e+n.In the elongated kmer of counting, inquire about these elongated kmer, and take out one of count value maximum, and by corresponding e_i, e, tri-limits of e_j merge, and then delete e_i, e, e_j.
When also having two-way limit to process, return to execution step C3; When also having sequence to process, return to execution step C1; When not having two-way limit to need to process, while also not having sequence to process, finish.
Compared with prior art, beneficial effect of the present invention is:
(1) only for the bifurcated limit having existed on De Bruijn figure, construct elongated kmer combination, and in list entries, inquire about its occurrence number, then according to the merging on its two-way limit of selection of times; IDBA is by all kmer length of iteration, after all possible kmer of each kmer length need to being constructed, then shrinks De Bruijn figure, and its method will cause larger memory consumption and computing time to consume;
(2) select the merging on optimum bifurcated limit, may dropping to that the mistake on two-way limit is merged is minimum;
(3) can significantly improve the length of contigs, also the mass loss of contig can be dropped to minimum; Than other existing methods, improve contig length and will sacrifice contig quality, the present invention has had control and improvement to a certain extent.
One of ordinary skill in the art will appreciate that all or part of step in the whole bag of tricks of embodiment is to come the hardware that instruction is relevant to complete by program, this program can be stored in a computer-readable recording medium, storage medium can comprise: ROM (read-only memory) (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), disk or CD etc.
The foregoing is only preferred embodiment of the present invention, not in order to limit the present invention, all any modifications of doing within the spirit and principles in the present invention, be equal to and replace and improvement etc., within all should being included in protection scope of the present invention.

Claims (9)

1. a two-way limit extended method for the elongated kmer inquiry based on the two-way De Bruijn of multistep figure, is characterized in that, comprising:
Steps A: read sequencing data source file, the two-way De Bruijn figure of structure multistep;
Step B: add up intersecting two-way limit in the two-way De Bruijn of described multistep figure;
Step C: the two-way limit expansion based on elongated kmer inquiry in the two-way De Bruijn of described multistep figure.
2. the method for claim 1, it is characterized in that, in described step B, the institute that the two-way limit of all the right and lefts summit bifurcated in the two-way De Bruijn of described multistep figure and the bifurcated limit on described two-way limit are merged is situation likely, be assembled into a longer kmer, and the inquiry occurrence number of longer kmer correspondingly, the size of described occurrence number of take is weight, carries out choosing and union operation of bifurcated limit trend.
3. method as claimed in claim 2, it is characterized in that, in described step C, the all possible bifurcated on the two-way limit of each bifurcated is being converted to after longer kmer, height according to described occurrence number, using the highest bifurcated path of occurrence number as final bifurcated Merge Scenarios, and the two-way limit on corresponding bifurcated path is deleted together after merging.
4. the method for claim 1, is characterized in that, described step B further comprises:
Step B1: each the two-way limit e in traversal De Bruijn figure;
Step B2: add up the maximum length of the sequencing sequence of all input file, and assignment is to Lmax;
Step B3: if the length on two-way limit is greater than the length of Lmax-kmer, return to execution step B1, otherwise execution step B4;
Step B4: the initial vertex of two-way limit e is u, initial direction is Direc_u, and termination summit is v, and termination direction is Direc_v;
Step B5: allly enter summit u, stop limit that direction is Direc_u and get last k+1 the character on its two-way limit and put into character string array m, allly leave two-way limit that vertex v, initial direction are Direc_v and get its last character and put into character string array n;
Step B6: by left and right, two-way limit respectively with m and n array in all combinations of structure of character string, construct elongated kmer array, wherein each element in elongated kmer array is elongated kmer=m+e+n.
5. the method for claim 1, is characterized in that, described step C further comprises:
Step C1: open sequencing sequence file, read one by one every sequence;
Step C2: the sequence (comprising the coupling to its reverse sequence) that each coupling of elongated kmer set group is read in, and to each elongated kmer counting;
Step C3: each the two-way limit e in traversal De Bruijn figure;
Step C4: add up the maximum length of the sequencing sequence of all input file, and assignment is to Lmax;
Step C5: if the length on two-way limit is greater than the length of Lmax-kmer, return to execution step C3, otherwise execution step C6;
Step C6: the initial vertex of two-way limit e is u, initial direction is Direc_u, and termination summit is v, and termination direction is Direc_v;
Step C7: allly enter summit u, stop limit that direction is Direc_u and get last k+1 the character of its last two-way limit e_i and put into character string array m, allly leave two-way limit e_j that vertex v, initial direction are Direc_v and get its last character and put into character string array n;
Step C8: by left and right, two-way limit respectively with m and n array in all combinations of structure of character string, construct elongated kmer array, wherein each element in elongated kmer array is elongated kmer=m+e+n, inquires about these elongated kmer, and take out one of count value maximum in the elongated kmer of counting, and by corresponding e_i, e, tri-limits of e_j merge, and then delete e_i, e, e_j.
6. method as claimed in claim 5, is characterized in that, the sequence of in described step C2, each coupling of elongated kmer set group being read in comprises the coupling to reverse sequence.
7. the method for claim 1, is characterized in that, described steps A further comprises:
Compression storing step, is specially
A11, read a sequence s;
A12, sequence s is cut into a plurality of fragment t with moving window;
A13, to each fragment t, use nucleic acid coding table to encode, and be expressed as the integer a of 64;
A14, fragment t is reversed, use symmetrical complement table that the fragment complementation of reversion is processed, obtain complementary fragment v, and the nucleic acid coding table reusing in steps A 13 encodes complementary fragment, and be expressed as the integer b of 64;
The maximum number of A15, round numbers a and integer b, as the conventional number of the k molecule of fragment t and complementary fragment v;
A16, repeating step A11-A15, until all sequences completes;
With De Bruijn figure constitution step, be specially
A21, read a sequence s;
A22, sequence s is cut into a plurality of fragment t with moving window, the conventional number of choosing its conventional number of fragment t and be cur its forward and backward fragment of mark is respectively pre, lat;
If the coding of A23 t is less than its complementary fragment coding, exchange pre, the value of lat;
A24, in the corresponding bit position 1 of the forward position mapping table of cur, represent to point to the limit of pre;
A25, in the corresponding bit position 1 of the reverse position mapping table of cur, represent to point to the limit of lat;
A26, repeating step A22-A25, process other fragments t of sequence s, until complete whole fragment t of sequence s, execution step A27;
A27, read a new sequence s, repeating step A22-A26; Until handle all sequences, execution step A28;
A28, complete the structure of two-way multistep De Bruijn figure.
8. method as claimed in claim 7, is characterized in that, the moving window in described steps A 12, A22 is that length is the moving window of k, and wherein 0<k<32 and k are odd number;
Nucleic acid coding table in described steps A 13 is { A:00, C:01, G:10, T:11};
Symmetrical complement table in described steps A 14 is { A->T, C->G, G->C, T->A};
Described steps A 14 is specially, the character string of fragment t is reversed, use symmetrical complement table that each character in the character string of reversion is become to its complementary character, obtain the character string v of complementary character, and the nucleic acid coding table reusing in steps A 13 encodes character string v, and be expressed as the integer b of 64.
9. method as claimed in claim 7, is characterized in that, in described steps A 22, if the fragment of fragment t before or after not having composed as sky or NULL pre or lat value;
In steps A 24, forward position mapping table is { A:0, C:1, G:2, T:3}, last character that position enquiring character is pre;
In steps A 25, reverse position mapping table is { A:4, C:5, G:6, T:7}, the complementary character of the first character that position enquiring character is lat.
CN201310670740.1A 2013-12-10 2013-12-10 Two-way side extended method based on the elongated kmer inquiries of the two-way De Bruijns of multistep Active CN103699818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310670740.1A CN103699818B (en) 2013-12-10 2013-12-10 Two-way side extended method based on the elongated kmer inquiries of the two-way De Bruijns of multistep

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310670740.1A CN103699818B (en) 2013-12-10 2013-12-10 Two-way side extended method based on the elongated kmer inquiries of the two-way De Bruijns of multistep

Publications (2)

Publication Number Publication Date
CN103699818A true CN103699818A (en) 2014-04-02
CN103699818B CN103699818B (en) 2017-04-05

Family

ID=50361345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310670740.1A Active CN103699818B (en) 2013-12-10 2013-12-10 Two-way side extended method based on the elongated kmer inquiries of the two-way De Bruijns of multistep

Country Status (1)

Country Link
CN (1) CN103699818B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460246A (en) * 2018-03-08 2018-08-28 北京希望组生物科技有限公司 A kind of HLA methods of genotyping based on three generations's microarray dataset
WO2019076177A1 (en) * 2017-10-20 2019-04-25 人和未来生物科技(长沙)有限公司 Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100063742A1 (en) * 2008-09-10 2010-03-11 Hart Christopher E Multi-scale short read assembly
CN102682225A (en) * 2011-01-21 2012-09-19 国际商业机器公司 Assembly error detection method and system
CN103093121A (en) * 2012-12-28 2013-05-08 深圳先进技术研究院 Compressed storage and construction method of two-way multi-step deBruijn graph
CN103258145A (en) * 2012-12-22 2013-08-21 中国科学院深圳先进技术研究院 Parallel gene splicing method based on De Bruijn graph

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100063742A1 (en) * 2008-09-10 2010-03-11 Hart Christopher E Multi-scale short read assembly
CN102682225A (en) * 2011-01-21 2012-09-19 国际商业机器公司 Assembly error detection method and system
CN103258145A (en) * 2012-12-22 2013-08-21 中国科学院深圳先进技术研究院 Parallel gene splicing method based on De Bruijn graph
CN103093121A (en) * 2012-12-28 2013-05-08 深圳先进技术研究院 Compressed storage and construction method of two-way multi-step deBruijn graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
孟金涛 等: "基于De Bruijn图的De Novo序列组装软件性能分析", 《科研信息化技术与应用》 *
曾培龙 等: "《全基因组序列拼接研究进展》", 《智能计算机与应用》 *
苑建蕊: "基于双向de Bruijn图的序列拼接并行化研究与实现", 《中国优秀硕士学位论文全文数据库 基础科学辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019076177A1 (en) * 2017-10-20 2019-04-25 人和未来生物科技(长沙)有限公司 Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium
US11551785B2 (en) 2017-10-20 2023-01-10 Genetalks Bio-Tech (Changsha) Co., Ltd. Gene sequencing data compression preprocessing, compression and decompression method, system, and computer-readable medium
CN108460246A (en) * 2018-03-08 2018-08-28 北京希望组生物科技有限公司 A kind of HLA methods of genotyping based on three generations's microarray dataset
CN108460246B (en) * 2018-03-08 2022-02-22 北京希望组生物科技有限公司 HLA genotyping method based on third-generation sequencing platform

Also Published As

Publication number Publication date
CN103699818B (en) 2017-04-05

Similar Documents

Publication Publication Date Title
CN103093121B (en) The compression storage of two-way multistep deBruijn figure and building method
US10600217B2 (en) Methods for the graphical representation of genomic sequence data
Stamatakis et al. Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data
Koren et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads
Chapman et al. Meraculous: de novo genome assembly with short paired-end reads
Ma et al. Reconstructing contiguous regions of an ancestral genome
Siddharthan et al. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny
Lin et al. AGORA: assembly guided by optical restriction alignment
WO2017147124A1 (en) Systems and methods for genotyping with graph reference
US20180247016A1 (en) Systems and methods for providing assisted local alignment
CN103258145A (en) Parallel gene splicing method based on De Bruijn graph
CN103699819A (en) Peak expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry
CN103699818A (en) Bidirectional edge expanding method for multistep bidirectional De Bruijn image-based elongating kmer inquiry
Minkin et al. Scalable pairwise whole-genome homology mapping of long genomes with BubbZ
CN103699813B (en) Method for identifying and removing repeated bidirectional edges of bidirectional multistep De Bruijn graph
Saggese et al. STAble: a novel approach to de novo assembly of RNA-seq data and its application in a metabolic model network based metatranscriptomic workflow
Petri et al. isONform: reference-free transcriptome reconstruction from Oxford Nanopore data
Sebastião et al. Hardware accelerator architecture for simultaneous short-read DNA sequences alignment with enhanced traceback phase
CN103699814B (en) Method for identifying and removing tips of bidirectional multistep De Bruijn graph
CN103714263B (en) The wrong two-way side identification of two-way multistep De Bruijns and minimizing technology
CN103699817B (en) Method for identifying and removing self-loop bidirectional edges of bidirectional multistep De Bruijn graph
Wang et al. DUHI: dynamically updated hash index clustering method for DNA storage
Jain et al. GAMS: genome assembly on Multi-GPU using string graph
Köster et al. Massively parallel read mapping on GPUs with the q-group index and PEANUT
Si et al. Survey of gene splicing algorithms based on reads

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant