CN110059228A

CN110059228A - Planted motif search method for DNA data sets, and device and storage medium therefor

Info

Publication number: CN110059228A
Application number: CN201910181475.8A
Authority: CN
Inventors: 于强; 张晓�
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2019-07-26
Anticipated expiration: 2039-03-11
Also published as: CN110059228B

Abstract

The present invention relates to a planted motif search method for DNA data sets, and to a corresponding device and storage medium. The method includes: obtaining a large DNA sequence data set and the planted motif search parameters of that data set; obtaining a first k-mer set according to the large DNA sequence data set and the planted motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set; and determining the planted motif from the second l-mer set according to a first score model. With the APMS method, the present invention not only finds planted motifs in large DNA sequence data sets, but also runs orders of magnitude faster than other planted motif search methods.

Description

Planted motif search method for DNA data sets, and device and storage medium therefor

Technical field

The invention belongs to the field of DNA sequence big data processing, and in particular relates to a planted motif search method for DNA data sets, and a device and storage medium therefor.

Background art

DNA is the carrier of genetic information, which is stored in sequences composed of the four DNA characters. The growth and development of an organism is, in essence, the transmission and expression of genetic information. As the first step in the expression of genetic information, transcription is central to the regulatory mechanism. Transcription factors bind to specific sites in a DNA sequence (about 5 to 20 base pairs long) to initiate gene transcription and control its efficiency. These sites are called transcription factor binding sites (TFBS), and locating them is important for studying the transcriptional regulation of genes.

Quorum planted motif search (qPMS) is one of the best-known computational models for locating TFBS in DNA sequences. Common exact qPMS methods are either sample-pattern-driven or suffix-tree-based. Sample-pattern-driven exact methods, such as PMSprune, StemFinder, qPMS7, TravStrR, PMS8 and qPMS9, comprise a sample-driven stage and a pattern-driven stage: the sample-driven stage selects some reference DNA sequences as constraints so as to generate as few candidate motifs as possible, and the pattern-driven stage verifies the candidate motifs. Suffix-tree-based exact methods, such as Weeder, RISOTTO and FMotif, build a suffix tree index of the input sequences to accelerate the verification of candidate motifs. Approximate qPMS methods aim to find optimal or near-optimal motifs in a relatively short time. The most typical ones refine an initial motif by expectation maximization, Gibbs sampling or genetic algorithms; among them, MEME-ChIP, which is based on expectation maximization, is one of the best-known motif discovery methods. To process large data sets efficiently, motif discovery methods based on new strategies have also been proposed, such as PairMotifChIP, which mines and merges pairs of similar substrings from the input DNA sequences to obtain motifs.

However, exact qPMS methods, approximate qPMS methods and PairMotifChIP share a common problem: they are computationally expensive, so their running time is too long, which creates a bottleneck when processing large DNA sequence data sets.

Summary of the invention

To solve the above problems in the prior art, the present invention provides a planted motif search method for DNA data sets, and a device and storage medium therefor.

An embodiment of the invention provides a planted motif search method for DNA data sets, the method comprising:

Obtaining a large DNA sequence data set and the planted motif search parameters of the large DNA sequence data set;

Obtaining a first k-mer set according to the large DNA sequence data set and the planted motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;

Determining the planted motif from the second l-mer set according to a first score model.

In one embodiment of the invention, obtaining the first k-mer set according to the large DNA sequence data set and the planted motif search parameters comprises:

Obtaining a length k, and obtaining several k-mers from the large DNA sequence data set according to the length k;

Obtaining a first threshold, and obtaining the first k-mer set according to the first threshold and the k-mers.

In one embodiment of the invention, obtaining the length k comprises:

Obtaining a first expected value according to the large DNA sequence data set;

Obtaining a second expected value according to the large DNA sequence data set and the planted motif search parameters;

Obtaining the length k according to the first expected value and the second expected value.

In one embodiment of the invention, obtaining the first threshold comprises:

Obtaining the number of DNA sequences from the large DNA sequence data set;

Obtaining the first threshold according to the second expected value and the number of DNA sequences.

In one embodiment of the invention, obtaining the first l-mer set according to the first k-mer set comprises:

Taking a k-mer from the first k-mer set;

Extending each occurrence of the k-mer in the large DNA sequence data set to obtain an extended first k-mer set;

Performing redundancy removal on the extended first k-mer set according to a second score model to obtain an extended second k-mer set;

Trimming the extended second k-mer set to obtain a first l-mer;

Obtaining the first l-mer set from the first l-mers.

In one embodiment of the invention, trimming the extended second k-mer set to obtain the first l-mer comprises:

Obtaining an alignment from the extended second k-mer set;

Trimming the alignment according to a preset rule to obtain the first l-mer.

In one embodiment of the invention, obtaining the second l-mer set according to the first l-mer set comprises:

Constructing a binomial tree for a first l-mer in the first l-mer set;

Computing a score for every node of the constructed binomial tree according to the first score model, and taking the highest-scoring node as a second l-mer;

Performing redundancy removal on the first k-mer set according to the second l-mer to obtain a second k-mer set;

Processing the first l-mer set according to the second k-mer set to obtain the second l-mer set.

In one embodiment of the invention, performing redundancy removal on the first k-mer set according to the second l-mer to obtain the second k-mer set comprises:

Obtaining a fourth l-mer from the large DNA sequence data set;

Obtaining a third expected value between the k-mers of the second l-mer and the k-mers of the fourth l-mer;

Judging, according to the third expected value, whether a k-mer in the first k-mer set is redundant: if the Hamming distance d between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer is redundant and is deleted from the first k-mer set; otherwise the k-mer is retained in the first k-mer set. The resulting set is the second k-mer set.
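The redundancy test above can be sketched in Python as follows. This is only an illustration under stated assumptions: the third expected value is passed in as a ready-made number `threshold` (its computation is not shown here), "a k-mer in the second l-mer" is read as any length-k substring of the second l-mer, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def kmers_of(s: str, k: int) -> list[str]:
    """All length-k substrings of s, in order of starting position."""
    return [s[j:j + k] for j in range(len(s) - k + 1)]

def remove_redundant(first_kmers: set[str], second_lmer: str,
                     threshold: float) -> set[str]:
    """Drop every k-mer within `threshold` mismatches of a k-mer of second_lmer.

    Assumes every k-mer is no longer than second_lmer.
    """
    kept = set()
    for x in first_kmers:
        reference = kmers_of(second_lmer, len(x))
        if min(hamming(x, y) for y in reference) > threshold:
            kept.add(x)  # not redundant: stays in the second k-mer set
    return kept

second_kmer_set = remove_redundant({"TTG", "TTC", "CCC"}, "ATTGCA", threshold=1)
```

In this toy call, "TTG" and "TTC" lie within one mismatch of the substring "TTG" of the second l-mer and are dropped, while "CCC" is kept.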

Another embodiment of the invention provides a planted motif search device for DNA data sets, the device comprising:

A data acquisition module, configured to obtain the large DNA sequence data set and the planted motif search parameters of the large DNA sequence data set;

A data processing module, configured to obtain the first k-mer set according to the large DNA sequence data set and the planted motif search parameters, obtain the first l-mer set according to the first k-mer set, and obtain the second l-mer set according to the first l-mer set;

A data determining module, configured to determine the planted motif from the second l-mer set according to the first score model.

Yet another embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above embodiments.

Compared with the prior art, the invention has the following beneficial effects:

With the APMS method, the invention not only finds planted motifs in large DNA sequence data sets, but also runs orders of magnitude faster than other planted motif search methods.

Brief description of the drawings

Fig. 1 is a flow diagram of a planted motif search method for DNA data sets provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of planted motif search with a conventional binomial tree provided by an embodiment of the invention;

Fig. 3 is a structural diagram of a planted motif search device for DNA data sets provided by an embodiment of the invention;

Fig. 4 is a schematic diagram comparing the results of the APMS, PairMotifChIP and MEME-ChIP methods provided by an embodiment of the invention on simulated data under different DNA sequences;

Fig. 5 is a schematic diagram of the experimental results on real data of the efficient planted motif search method for large DNA sequence data sets provided by an embodiment of the invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to specific embodiments, but embodiments of the invention are not limited thereto.

Embodiment one

Referring to Fig. 1, Fig. 1 is a flow diagram of a planted motif search method for DNA data sets provided by an embodiment of the invention. An embodiment of the invention provides a planted motif search method for DNA data sets, comprising the following steps:

Step 1: Obtain a large DNA sequence data set and the planted motif search parameters of the large DNA sequence data set.

Step 1.1: Obtain the large DNA sequence data set.

Specifically, the large DNA sequence data set D obtained in this embodiment comprises t DNA sequences and can be expressed as D = {s₁, s₂, …, s_t}, where s_i denotes the i-th DNA sequence and every DNA sequence contains n characters. Each DNA sequence s_i is a string over the alphabet Σ = {A, C, G, T}, i.e. a string of length n composed of A, C, G and T. s_i[j] denotes the j-th character of the i-th DNA sequence, and s_i[j..j'] denotes the substring of the i-th DNA sequence that starts at position j and ends at position j'. Here i takes values from 0 to t-1 and j from 0 to n-1.
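The notation above maps directly onto string slicing. A minimal Python sketch, assuming 0-based positions and an inclusive end position j' (the names `is_dna` and `substring` are illustrative and do not come from the source):

```python
SIGMA = frozenset("ACGT")  # alphabet Σ = {A, C, G, T}

def is_dna(s: str) -> bool:
    """Return True if s is a string over Σ."""
    return set(s) <= SIGMA

def substring(s: str, j: int, j_end: int) -> str:
    """Return s[j..j'] with both end positions inclusive (0-based)."""
    return s[j:j_end + 1]

D = ["ACGTAC", "TTGACA"]  # toy data set: t = 2 sequences, each of length n = 6
assert all(is_dna(s) for s in D)
middle = substring(D[0], 1, 3)  # characters at positions 1, 2 and 3 of s_0
```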

Step 1.2: Obtain the planted motif search parameters of the large DNA sequence data set.

Specifically, in this embodiment the planted motif (l, d) search parameters include the length l of the planted motif (l, d), the Hamming distance d of the planted motif (l, d), the proportion q of sequences in which the planted motif (l, d) occurs, and the conservation parameter g.

In this embodiment, the problem the APMS method solves for a planted motif (l, d) is as follows: given a large DNA sequence data set D = {s₁, s₂, …, s_t} of t DNA sequences of length n, and three parameters l, d and q satisfying 0 < l < n, 0 ≤ d < l and 0 < q ≤ 1, the goal is to find an l-mer (a string of length l) m such that at least qt (qt ≤ t) DNA sequences s_i each contain an l-mer m_i that differs from m in at most d positions (mutations). The number of differing positions is defined as the Hamming distance: d_H(m, m_i) = |{ j : 1 ≤ j ≤ l, m[j] ≠ m_i[j] }|. The l-mer m is called a planted motif (l, d), each such l-mer m_i in the large DNA sequence data set is called a motif instance, and sequences in the data set that do not satisfy the above Hamming distance condition are called background sequences. The APMS method is the planted motif search method for DNA data sets of the present invention.
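The definition above can be checked by brute force on small inputs. The following Python sketch only verifies whether a given l-mer satisfies the quorum condition; it is not the APMS search itself, and the function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """d_H(a, b): number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_distance(m: str, seq: str) -> int:
    """Smallest Hamming distance between m and any l-mer of seq."""
    l = len(m)
    return min(hamming(m, seq[j:j + l]) for j in range(len(seq) - l + 1))

def is_planted_motif(m: str, seqs: list[str], d: int, q: float) -> bool:
    """True if at least q*t sequences contain an l-mer within distance d of m."""
    hits = sum(1 for s in seqs if len(s) >= len(m) and min_distance(m, s) <= d)
    return hits >= q * len(seqs)

seqs = ["TTACGTTT", "GGACCTGG", "CCCCCCCC"]  # toy data set, t = 3
half_quorum = is_planted_motif("ACGT", seqs, d=1, q=0.5)
full_quorum = is_planted_motif("ACGT", seqs, d=1, q=1.0)
```

Here the first two sequences contain instances of "ACGT" within one mutation, so the condition holds for q = 0.5 but not for q = 1.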

Large DNA sequence data sets help in finding high-quality planted motifs (l, d), but most existing qPMS methods are too time-consuming and cannot complete the qPMS computation within a reasonable time. The APMS method of this embodiment builds on qPMS: it not only finds planted motifs (l, d) in large DNA sequence data sets, but its running time is also orders of magnitude shorter than that of existing motif search methods.

Step 2: Obtain a first k-mer set according to the large DNA sequence data set and the planted motif search parameters, obtain a first l-mer set according to the first k-mer set, and obtain a second l-mer set according to the first l-mer set.

Step 2.1: Obtain the first k-mer set according to the large DNA sequence data set and the planted motif (l, d) search parameters. The first k-mer set contains several k-mers, and each k-mer contains k characters.

Specifically, obtaining the first k-mer set according to the large DNA sequence data set and the planted motif (l, d) search parameters comprises:

Obtaining a length k, and obtaining several k-mers from the large DNA sequence data set according to the length k;

Obtaining a first threshold, and obtaining the first k-mer set according to the first threshold and the k-mers.

Further, obtaining the length k comprises:

Obtaining a first expected value according to the large DNA sequence data set;

Obtaining a second expected value according to the large DNA sequence data set and the planted motif (l, d) search parameters;

Obtaining the length k according to the first expected value and the second expected value.

Specifically, this embodiment uses probabilistic analysis to determine a suitable value of k, so that k-mers in background sequences can be better distinguished from k-mers in motif instances. Let f_r(k) be the first expected value: the expected number of occurrences in the large DNA sequence data set D of a k-mer from any background sequence. Let f_m(k) be the second expected value: the expected number of occurrences in D of a k-mer from any motif instance. The larger the ratio of the second expected value f_m(k) to the first expected value f_r(k), the more distinguishable, by occurrence count, the k-mers in background sequences are from those in motif instances. This embodiment therefore determines the value of k using the following formula:

Here k_min denotes the minimum value of k, and ε is a factor that handles the case where the first expected value f_r(k) is less than 1. k_min is preferably 5, because when k is very small it is difficult to distinguish k-mers in background sequences from those in motif instances. ε is empirically set to 1.

In this embodiment, the first expected value f_r(k) in formula (1) is obtained according to the large DNA sequence data set and is specifically designed as follows:

Assume the planted motif (l, d) being searched for is m, with motif instances m₁ and m₂. In the large DNA sequence data set D, for a k-mer x₁ at any starting position in any motif instance m₁ and a k-mer x₂ at the same starting position in any other motif instance m₂, let p_k denote the probability that x₁ and x₂ are equal. The second expected value f_m(k) in formula (1) is then designed as follows:

In formula (3), p_k denotes the probability that k-mer x₁ and k-mer x₂ are equal. By the law of total probability, p_k is designed as follows:

Here Pr_i denotes the probability that the Hamming distance between the planted motif (l, d) m and motif instance m₁ is d_H(m, m₁) = i (0 ≤ i ≤ d), and Pr_j denotes the probability that the Hamming distance between the planted motif (l, d) m and motif instance m₂ is d_H(m, m₂) = j (0 ≤ j ≤ d). Pr_i is designed as follows:

Here g denotes the conservation parameter, with value range 0 ≤ g ≤ 1.

Similarly, Pr_j is designed as follows:

p_ij denotes the probability that k-mer x₁ and k-mer x₂ are equal given d_H(m, m₁) = i and d_H(m, m₂) = j. p_ij is designed as follows:

As formula (7) shows, p_ij is the sum, with a running from 0 to min{i, j}, of the product of three factors. The first factor is the probability that any k-mer x₁ in motif instance m₁ contains a mutations (a being the summation index); the second factor is the probability that k-mer x₂ and k-mer x₁ have the same mutated positions; the third factor is the probability that, when the mutated positions of x₂ and x₁ are the same, the bases they mutate into are also the same.

The first expected value f_r(k) and the second expected value f_m(k) are calculated by formulas (2) and (3), with k ranging from 0 to l. Then, according to formula (1), the k that maximizes the ratio of the second expected value f_m(k) to the first expected value f_r(k) is taken as the length k of each k-mer in the first k-mer set of this embodiment.
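The selection of k can be sketched as a simple argmax. Because formulas (1) to (3) are not reproduced in this text, the sketch below rests on explicit assumptions: the two expected values are supplied by the caller as functions, ε is added to the denominator (one plausible reading of its role), and the two lambdas in the usage lines are placeholders for illustration, not the patent's formulas.

```python
from typing import Callable

def choose_k(f_m: Callable[[int], float], f_r: Callable[[int], float],
             l: int, k_min: int = 5, eps: float = 1.0) -> int:
    """Return the k in [k_min, l] that maximizes f_m(k) / (f_r(k) + eps).

    Assumption: eps enters as an additive guard in the denominator so that
    values of f_r(k) below 1 do not inflate the ratio.
    """
    return max(range(k_min, l + 1), key=lambda k: f_m(k) / (f_r(k) + eps))

# Placeholder expectations for illustration only (NOT formulas (2) and (3)).
t, n = 1000, 200
f_r = lambda k: t * (n - k + 1) / 4 ** k  # uniform random background (assumed)
f_m = lambda k: t * 0.9 ** k              # decays as k grows (assumed)
k = choose_k(f_m, f_r, l=15)
```

Passing the expectations in as callables keeps the selection logic separate from whichever closed forms are actually used for f_r(k) and f_m(k).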

K-mers of length k are then obtained from the large DNA sequence data set D.

Further, obtaining the first threshold comprises:

Obtaining the number of DNA sequences from the large DNA sequence data set;

Obtaining the first threshold according to the second expected value and the number of DNA sequences.

Specifically, this embodiment does not take all k-mers of length k from the large DNA sequence data set D. Instead, it takes the k-mers whose number of occurrences in D is greater than or equal to the first threshold as high-frequency k-mers, which form the first k-mer set. As described above, f_m(k) is the expected number of occurrences in D of any k-mer in an arbitrary motif instance. If the first threshold were set directly to f_m(k), multiple high-frequency k-mers corresponding to the same motif might be obtained. The first threshold is therefore designed by adding to f_m(k) a variable proportional to the number t of DNA sequences, so as to avoid obtaining too many redundant high-frequency k-mers. The first threshold in this embodiment is designed as follows:

Further, according to the first threshold obtained from formula (8), the k-mers whose occurrence count is greater than or equal to the first threshold are taken from each DNA sequence as high-frequency k-mers, forming the first k-mer set.
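The frequency filter can be sketched with a counter over all k-mer occurrences. In this Python sketch the threshold is passed in as a plain number, since formula (8) is not reproduced in this text; the function names are illustrative.

```python
from collections import Counter

def count_kmers(seqs: list[str], k: int) -> Counter:
    """Count every occurrence of every k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for j in range(len(s) - k + 1):
            counts[s[j:j + k]] += 1
    return counts

def high_frequency_kmers(seqs: list[str], k: int, threshold: float) -> set[str]:
    """Keep the k-mers whose occurrence count is >= threshold."""
    return {x for x, c in count_kmers(seqs, k).items() if c >= threshold}

first_kmer_set = high_frequency_kmers(["ACGACG", "TTACGT"], k=3, threshold=2)
```

In the toy input, "ACG" occurs three times and every other 3-mer once, so only "ACG" survives a threshold of 2.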

Step 2.2: Obtain the first l-mer set according to the first k-mer set. The first l-mer set contains several first l-mers, and each first l-mer contains l characters.

Specifically, obtaining the first l-mer set according to the first k-mer set comprises:

Taking a k-mer from the first k-mer set;

Extending each occurrence of the k-mer in the large DNA sequence data set to obtain an extended first k-mer set, in which every extended k-mer has length 2l-k;

Performing redundancy removal on the extended first k-mer set according to a second score model to obtain an extended second k-mer set;

Trimming the extended second k-mer set to obtain a first l-mer;

Obtaining the first l-mer set from the first l-mers.

Further, the k-mers in the large DNA sequence data set are extended to obtain the extended first k-mer set, in which every extended k-mer has length 2l-k.

Specifically, the planted motif (l, d) is searched for via the first k-mer set. A k-mer x is first taken from the first k-mer set. Because the starting position of k-mer x within the planted motif (l, d) is unknown, after locating k-mer x in the large DNA sequence data set D, this embodiment extends it by l-k characters to the left and l-k characters to the right, so that the extended k-mer x becomes a string of length 2l-k. In this way, the extended k-mer x in D is long enough to cover the motif instance of the planted motif (l, d).

For example, if s_i[j..j+k-1] is an exact occurrence of k-mer x in the large DNA sequence data set D, then the corresponding extended k-mer x in D is s_i[j-l+k..j+l-1].
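The extension step can be sketched as a window extraction around each exact occurrence. In this Python sketch, occurrences too close to a sequence boundary for a full window are skipped; that boundary rule is an assumption, since the text does not say how such cases are handled, and the function name is illustrative.

```python
def extend_occurrences(seqs: list[str], x: str, l: int) -> list[str]:
    """Return the (2l-k)-character windows around exact occurrences of x."""
    k = len(x)
    windows = []
    for s in seqs:
        j = s.find(x)
        while j != -1:
            # s[j-l+k .. j+l-1] in the text's inclusive notation
            start, end = j - (l - k), j + l  # end is exclusive here
            if start >= 0 and end <= len(s):  # assumption: skip clipped windows
                windows.append(s[start:end])
            j = s.find(x, j + 1)
    return windows

extended = extend_occurrences(["CCCGATTGCAGTT"], "TTG", l=6)
```

With k = 3 and l = 6, the occurrence of "TTG" at position 5 yields the single 9-character window "CGATTGCAG".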

Further, every occurrence of each k-mer x in the large DNA sequence data set D is extended in this way, giving the extended first k-mer set.

Further, redundancy removal is performed on the extended first k-mer set according to the second score model to obtain the extended second k-mer set, in which every extended k-mer has length 2l-k.

Specifically, if an extended k-mer x in the extended first k-mer set contains no motif instance in the large DNA sequence data set D, i.e. it consists entirely of background sequence, it will degrade the quality of the first l-mer set. Therefore, before generating the first l-mer set, this embodiment evaluates each extended k-mer x with the designed second score model score_i(y) to assess whether it consists of background sequence. As noted above, the first expected value f_r(k) is the expected number of occurrences in D of a k-mer from an arbitrary background sequence, so in this embodiment the second score model is designed as follows:

As formula (9) shows, the smaller the score given by the second score model score_i(y), the more likely the extended k-mer x consists of background sequence. The lowest-scoring extended k-mers x are therefore filtered out of the extended first k-mer set, giving the extended second k-mer set.

By means of the designed second score model, this embodiment filters out of the extended first k-mer set the extended k-mers x that are likely to be background sequence, which reduces the computation required for the subsequent planted motif (l, d) search and shortens the running time of the APMS method.

Further, trimming the extended second k-mer set to obtain the first l-mer comprises:

Trimming the alignment according to a preset rule to obtain the first l-mer.

Specifically, after redundancy removal is performed on the extended first k-mer set in the large DNA sequence data set D, the remaining extended k-mers form the extended second k-mer set. The extended k-mers in the extended second k-mer set are assembled into an alignment align of length 2l-k, where r(align[i]) denotes the information content of the i-th column of align; the alignment is then trimmed according to the preset rule to obtain the first l-mer. The information content is represented with a position weight matrix (PWM), in which each column gives the proportions of the four characters A, C, G and T at that position across the extended k-mers.

The preset rule in this embodiment is as follows: the extended k-mers in the extended second k-mer set are right-aligned to form the alignment align; based on the information content r(align[i]) of each column of align, a consensus sequence of length 2l-k is first obtained; then the columns at the left and right ends of the consensus sequence are compared repeatedly and the one with less information content is removed, until a consensus sequence of length l remains. This consensus sequence of length l is the first l-mer.

For example, in this embodiment let the planted motif (l, d) have length l = 6 and the k-mers have length k = 3, and let the large DNA sequence data set contain 6 extended k-mers: {AGATTGCAG}, {CGATTGCAG}, {CGATTGCAC}, {CGCTTGCAC}, {CGCTTGCAG}, {CTATTGTAG}. These 6 extended k-mers are first right-aligned:

{AGATTGCAG,

CGATTGCAG,

CGATTGCAC,

CGCTTGCAC,

CGCTTGCAG,

CTATTGTAG}, forming the alignment align, in which the information content r(align[i]) of each column is:

{A: 0.17, 0.00, 0.67, 0.17, 0.00, 0.17, 0.00, 1.00, 0.00

C: 0.83, 0.00, 0.33, 0.00, 0.00, 0.00, 0.83, 0.00, 0.33

G: 0.00, 0.83, 0.00, 0.00, 0.00, 0.66, 0.00, 0.00, 0.67

T: 0.00, 0.17, 0.00, 0.83, 1.00, 0.17, 0.17, 0.00, 0.00}. The consensus sequence {CGATTGCAG} is then obtained from the information content r(align[i]) of each column. The proportions of the characters A, C, G and T in the end columns of the consensus sequence {CGATTGCAG} are examined, starting from the left. In the leftmost column C has the largest proportion, so character C is selected on the left; in the rightmost column G has the largest proportion, so character G is selected on the right. The proportion of C in the leftmost column is greater than the proportion of G in the rightmost column, so the leftmost column is retained and the rightmost column is deleted. Next, character C is still selected on the left; in the new rightmost column A has the largest proportion, so character A is selected on the right. The proportion of C in the leftmost column is less than the proportion of A in the rightmost column, so the rightmost column is retained and the leftmost column is deleted. This continues until the consensus sequence has been trimmed to an l-mer of length l; the resulting l-mer is {ATTGCA}, which is the first l-mer.
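The trimming rule in this example can be sketched in Python as follows. The sketch makes two assumptions: the proportion of the most frequent character in a column stands in for that column's information content (the walkthrough above compares exactly these proportions), and when both end columns tie the left one is dropped, a case the text does not specify. The function names are illustrative.

```python
from collections import Counter

def column_profile(rows: list[str]) -> list[Counter]:
    """Per-column character counts of equal-length, right-aligned strings."""
    return [Counter(col) for col in zip(*rows)]

def trim_to_consensus(rows: list[str], l: int) -> str:
    """Build the consensus, then drop the weaker end column until length l."""
    profile = column_profile(rows)
    consensus = [c.most_common(1)[0][0] for c in profile]
    weight = [c.most_common(1)[0][1] / len(rows) for c in profile]
    lo, hi = 0, len(consensus)  # current window is consensus[lo:hi]
    while hi - lo > l:
        if weight[lo] > weight[hi - 1]:
            hi -= 1  # left end is stronger: delete the rightmost column
        else:
            lo += 1  # otherwise delete the leftmost column (ties: assumption)
    return "".join(consensus[lo:hi])

rows = ["AGATTGCAG", "CGATTGCAG", "CGATTGCAC",
        "CGCTTGCAC", "CGCTTGCAG", "CTATTGTAG"]
first_lmer = trim_to_consensus(rows, l=6)
```

Applied to the six right-aligned strings above, the procedure deletes the rightmost column once and the leftmost column twice, leaving "ATTGCA".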

Further, the k-mers in the first k-mer set are traversed, and the first l-mer of each k-mer in the large DNA sequence data set is found, forming the first l-mer set.

Step 2.3: Obtain the second l-mer set according to the first l-mer set. The second l-mer set contains several second l-mers, and each second l-mer contains l characters.

Specifically, obtaining the second l-mer set according to the first l-mer set comprises:

Computing a score for every node of the binomial tree according to the first score model, and taking the highest-scoring node as a second l-mer;

Performing redundancy removal on the first k-mer set according to the second l-mer to obtain a second k-mer set;

Further, constructing a binomial tree for a first l-mer in the first l-mer set comprises:

Choosing the first l-mer as the root node of the binomial tree;

Generating layer i+1 of the binomial tree from layer i, layer by layer, and judging whether the number of nodes in layer i+1 is greater than a second threshold: if it is, the final nodes of layer i+1 are selected according to the first score model so that the number of nodes in layer i+1 equals the second threshold; if the number of nodes in layer i+1 is less than or equal to the second threshold, the nodes of layer i+1 are kept unchanged; here 0 < i < d;

Judging whether each node in layer i of the binomial tree is a planted motif (l, d): if it is, the node is stored in a first array M; if it is not, the node is not stored in the first array M; here 0 < i < d;

Taking the highest-scoring node in the first array M as the second l-mer.

Specifically, refer to Fig. 2, which is a schematic diagram of planted motif search with a conventional binomial tree provided by an embodiment of the invention. As Fig. 2 shows, in the conventional construction the root node of the binomial tree is a first l-mer from the first l-mer set, and the internal or leaf nodes in layer i are the nodes whose Hamming distance from the root's first l-mer is i, where 0 < i ≤ d; the depth of the binomial tree is d. Each layer of the binomial tree corresponds to several extension nodes, which are d-neighbors of the root's first l-mer: they differ from the first l-mer at the positions marked on the path from the root node to the internal or leaf node. Thus each node in the binomial tree represents a d-neighbor whose Hamming distance from the first l-mer is i (0 ≤ i ≤ d). Each extension node is an l-mer of length l.

In this embodiment, by contrast, the binomial tree is constructed as follows: the root node is a first l-mer from the first l-mer set; layer i+1 of the binomial tree is then generated from layer i, layer by layer, and it is judged whether the number of nodes in layer i+1 is greater than the second threshold. If it is, the nodes of layer i+1 are selected according to the first score model so that the number of nodes in that layer equals the second threshold; if the number of nodes in that layer is less than or equal to the second threshold, the nodes of layer i+1 are kept unchanged. Here 0 < i < d.

Specifically, let the second threshold be N_mm(i), which denotes the number of nodes in layer i (0 < i < d) of the binomial tree. To avoid losing extension nodes in each layer of the binomial tree, the number of nodes in layer i is multiplied by a safety factor α (α ≥ 1) when computing N_mm(i). In the implementation of the APMS method, α is empirically set to 2. N_mm(i) is designed as follows:

For example, when the present embodiment constructs a Binomial tree, suppose the implantation die body (l, d) has length l = 5 and Hamming distance d = 3. The root node of the Binomial tree is the first l-mer. The nodes of the first layer are the l-mers whose Hamming distance from the first l-mer of the root node is 1. Because the implantation die body (l, d) is an l-mer of length 5 and each position has 3 possible mutations, there are 5 × 3 = 15 such nodes; the first layer of the present embodiment takes all of these mutations of the first l-mer, so the first layer has 15 nodes. The nodes of the second layer are obtained by extending the nodes of the first layer, on the basis that a first-layer node is taken as the implantation die body (l, d); the Hamming distance between a node and each of its extension nodes is 1, and the total number of second-layer nodes is determined by formula (11) as C_3^2 × 2 = 6. Similarly, the nodes of the third layer are obtained by extending the nodes of the second layer (6 nodes), on the basis that a second-layer node is taken as the implantation die body (l, d); the Hamming distance between a node and each of its extension nodes is 1, and the total number of third-layer nodes is determined by formula (11) as C_3^3 × 2 = 2. The Binomial tree finally constructed therefore has the first l-mer as its root node, 15 nodes on the first layer, 6 nodes on the second layer and 2 nodes on the third layer.
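The arithmetic of this example can be checked with a short Python sketch. The formula for the second threshold is not reproduced in this text, so the form α · C(d, i) used below is an assumption inferred from the counts C_3^2 × 2 = 6 and C_3^3 × 2 = 2 in the example; the function names `layer_cap` and `layer_sizes` are likewise hypothetical.

```python
from math import comb


def layer_cap(d: int, i: int, alpha: int = 2) -> int:
    """Assumed second threshold alpha * C(d, i), inferred from the worked example."""
    return alpha * comb(d, i)


def layer_sizes(l: int, d: int, alpha: int = 2) -> list[int]:
    """Node counts per layer when every layer after the first hits its cap."""
    sizes = [3 * l]  # layer 1 keeps every single-substitution neighbor of the root
    for i in range(2, d + 1):
        sizes.append(layer_cap(d, i, alpha))
    return sizes


print(layer_sizes(5, 3))  # [15, 6, 2]
```

With l = 5, d = 3 and α = 2 this reproduces the 15, 6 and 2 nodes stated above.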

Further, the nodes of layer i+1 of the final Binomial tree are obtained according to the first score model, and the number of nodes on that layer equals the second threshold.

Specifically, under the qPMS model, the present embodiment designs the first score model to assess the score of each node y of the constructed Binomial tree. Here D'(y) is a set of qt DNA sequences selected from the DNA sequence large data set D for calculating the score of node y, and s is the l-mer in a given DNA sequence with the smallest Hamming distance from node y. In general, the higher the score of a node y of the Binomial tree, the closer y is to the implantation die body (l, d). The first score model of the present embodiment is designed as follows:

By formula (11), in the Binomial tree built from any first l-mer, the conventional method assesses the score of each node y as follows. It first calculates the score of node y on each of the t DNA sequences in the DNA sequence large data set: in each DNA sequence, the l-mer with the smallest Hamming distance from the first l-mer is found, and its score is taken as the score of that DNA sequence. It then takes the qt DNA sequences with the highest scores and adds their scores, and the sum is the score of node y. Each node y thus has a corresponding score score_n(y), and the node y with the highest score among these nodes is chosen as the second l-mer.
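The conventional scan-everything scoring can be sketched as follows. Because the score formula itself is not reproduced in this text, the per-sequence score used here (l minus the minimum Hamming distance) is only a placeholder assumption, and the names `best_lmer` and `conventional_score` are hypothetical. The sketch measures distance to node y, following the definition of s given for the first score model, and assumes every sequence has at least l characters.

```python
from math import ceil


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_lmer(seq: str, y: str) -> tuple[int, str]:
    """Return (distance, l-mer) for the l-mer of `seq` closest to `y`."""
    l = len(y)
    return min((hamming(seq[p:p + l], y), seq[p:p + l])
               for p in range(len(seq) - l + 1))


def conventional_score(seqs: list[str], y: str, q: float) -> int:
    """Scan every sequence, then sum the qt best per-sequence scores."""
    l = len(y)
    # Placeholder per-sequence score (assumption): l minus the minimum distance.
    per_seq = sorted((l - best_lmer(s, y)[0] for s in seqs), reverse=True)
    qt = ceil(q * len(seqs))
    return sum(per_seq[:qt])
```

Every call rescans all t sequences in full, which is the cost the queue-based scheme of the present embodiment is meant to avoid.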

However, the conventional method has the drawback that each time the score of a node y is calculated, the whole DNA sequence large data set must be scanned again, which is computationally expensive. To solve this problem, the present embodiment sorts all l-mers of each DNA sequence in ascending order of their Hamming distance from the first l-mer, obtaining a queue. With this ordering, the l-mers nearer the front of the queue are the more likely to be the second l-mer finally obtained. Such a queue reduces the cost of finding the second l-mer: in most cases only the first few l-mers in the queue need to be scanned to find the best-scoring l-mer in that DNA sequence. Here D'(y) in formula (1) is the set of qt DNA sequences selected from the DNA sequence large data set D for calculating the score of node y. In the present embodiment, the set D'(y) is expressed as:

Because all l-mers of each DNA sequence are sorted in ascending order of their Hamming distance from the first l-mer, the l-mer with the smallest Hamming distance from the first l-mer among all l-mers of a DNA sequence is at the front of its queue. The l-mers obtained from each DNA sequence are then again arranged in ascending order of Hamming distance to obtain a new queue, and a row of the new queue is denoted C_i. In the present embodiment, for a first l-mer m' and a d-neighbor y of m', and for a row C_i and a position j (1 ≤ j ≤ |C_i|) in C_i, if d_H(C_i[j], m') - d_H(y, m') ≥ 0, then d_H(C_i[j], m') - d_H(y, m') is the smallest possible value of d_H(y, C_i[j]). This follows from the triangle inequality d_H(C_i[j], m') ≤ d_H(C_i[j], y) + d_H(y, m'). Therefore, when the new queue is scanned to calculate the score, as soon as d_H(C_i[j], m') - d_H(y, m') ≥ dis(y, C_i[j]) holds in a row C_i, the scan of the current row C_i can be ended, because no later position in the row can give a smaller distance. The minimum Hamming distance of the current row C_i is then dis(y, C_i[j]); substituting dis(y, C_i[j]) into formula (11) gives the score score_n(y) of node y on row C_i, and the scan of the next row C_{i+1} begins, until all rows in the new queue have been scanned. The highest of the per-row scores in the new queue is taken as the score score_n(y) of node y.
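The early termination described above can be sketched in Python for a single row. This is a minimal sketch under stated assumptions: the names `build_queue` and `min_distance_pruned` are hypothetical, a row is taken to be the l-mers of one DNA sequence sorted by distance to the root l-mer m', and the running minimum plays the role of dis(y, C_i[j]).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_queue(seq: str, root: str) -> list[tuple[int, str]]:
    """All l-mers of `seq` with their distance to `root`, ascending by distance."""
    l = len(root)
    return sorted((hamming(seq[p:p + l], root), seq[p:p + l])
                  for p in range(len(seq) - l + 1))


def min_distance_pruned(queue: list[tuple[int, str]],
                        y: str, root: str) -> tuple[int, int]:
    """Minimum distance from `y` to the queue, and how many l-mers were compared."""
    dy = hamming(y, root)
    best = len(y) + 1  # larger than any possible Hamming distance
    compared = 0
    for d_root, lmer in queue:
        # Triangle inequality: d_H(y, lmer) >= d_root - dy, and d_root only
        # grows along the queue, so nothing further can beat `best`.
        if d_root - dy >= best:
            break
        compared += 1
        best = min(best, hamming(y, lmer))
    return best, compared
```

The queue is built once per root l-mer and reused for every node y of that Binomial tree, so the sorting cost is shared across all score evaluations.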

The score score_n(y) is calculated as above for every node of layer i+1 of the Binomial tree, the resulting scores are sorted, and the nodes with the highest scores, up to the second threshold in number, are selected as the final nodes of layer i+1, so that the number of nodes on that layer equals the second threshold.
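A minimal Python sketch of this selection step follows. The function name `prune_layer` and the `score` callable are assumptions for illustration; any scoring function, such as one implementing the first score model, could be passed in.

```python
import heapq
from typing import Callable


def prune_layer(nodes: list[str],
                score: Callable[[str], float],
                cap: int) -> list[str]:
    """Keep all nodes if there are at most `cap`, else the `cap` best-scoring ones."""
    if len(nodes) <= cap:
        return list(nodes)
    # nlargest returns the kept nodes ordered from highest to lowest score.
    return heapq.nlargest(cap, nodes, key=score)
```

Using `heapq.nlargest` avoids sorting the whole layer when only a small number of nodes is kept.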

By calculating scores with the first score model, the present embodiment selects the high-scoring nodes in the Binomial tree built from each first l-mer of the first l-mer set and uses them to generate extension nodes. Because a high-scoring node is more likely to be the implantation die body (l, d), the present embodiment generates extension nodes in the direction of the implantation die body (l, d), which reduces the amount of calculation in the subsequent search for the implantation die body (l, d) and shortens the running time of the APMS method.

Further, it is judged whether each node of layer i of the Binomial tree is an implantation die body (l, d). If the node is an implantation die body (l, d), it is stored in a first array M; if it is not, the node is not stored in the first array M.

Specifically, unlike the conventional method, the present embodiment does not use all d-neighbor nodes of the Binomial tree to search for the implantation die body (l, d), but only the nodes similar to the implantation die body (l, d). To judge whether a node of layer i of the Binomial tree is an implantation die body (l, d), the node is checked against the DNA sequence large data set: if at least qt DNA sequences each contain an l-mer whose Hamming distance from the node is less than or equal to d, the node is determined to be an implantation die body (l, d) and is stored in the first array M; otherwise the node is not an implantation die body (l, d) and is not stored in the first array M. The Hamming distance between a node of layer i and each of its extension nodes on layer i+1 is 1, where 0 < i < d.
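The membership test described above can be sketched in Python as follows. The function names are hypothetical, and rounding qt up to an integer with `ceil` is an assumption, since the text only speaks of "at least qt" sequences.

```python
from math import ceil


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(seq: str, y: str, d: int) -> bool:
    """True if `seq` contains an l-mer within Hamming distance d of `y`."""
    l = len(y)
    return any(hamming(seq[p:p + l], y) <= d
               for p in range(len(seq) - l + 1))


def is_implanted_motif(seqs: list[str], y: str, d: int, q: float) -> bool:
    """True if at least qt of the t sequences contain a d-neighbor of `y`."""
    qt = ceil(q * len(seqs))  # assumption: qt is rounded up
    hits = sum(occurs_within(s, y, d) for s in seqs)
    return hits >= qt
```

A node that passes this test would be appended to the first array M; a node that fails is simply skipped.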

Further, according to the scores of the nodes in the first array M, the node with the highest score is taken as the second l-mer.

Specifically, the nodes in the first array M are the nodes close to the implantation die body (l, d) that were selected for the first l-mer. The highest-scoring node among them is the most likely to be the implantation die body being searched for, so that node is selected from the first array M as the second l-mer.

Further, each first l-mer in the first l-mer set is traversed, a Binomial tree model is constructed for it as described above to obtain a second l-mer, the second l-mer set is obtained from the second l-mers, and the final implantation die body (l, d) is obtained from the second l-mer set.

Specifically, a Binomial tree model is constructed as described above for each first l-mer in the first l-mer set. For each Binomial tree model with a first l-mer as its root node, the scores of the nodes in its first array M are calculated with the first score model, and the highest-scoring node in the first array M is selected as the second l-mer of that first l-mer. A second l-mer is thus obtained for each first l-mer in the first l-mer set, and these second l-mers form the second l-mer set. The second l-mers in the second l-mer set are then scored again with the first score model, the scores are sorted from high to low, and the sorted set of nodes is output as the final implantation die body (l, d).

In conclusion the present embodiment, which is based on binomial tree method search implantation die body (l, d), to be opened from the first l-mer of root node Beginning scans for layer by layer.For the first l-mer of root node, first determine whether the first l-mer of root node is an implantation mould Body (l, d), and by be 1 with the Hamming distances of the first l-mer of root node all nodes as the 1st layer of extension node.For I-th (0 < i < d) layer, selects N from the extension node of this layer first_mm(i) the high node of a score node final as this layer, By respectively with this N_mm(i) node of the extension node that the Hamming distances of a node chosen are 1 as i+1 layer.For D layers, directly judge whether this layer of node is an implantation die body (l, d).Judge each layer each extension node whether be One implantation die body (l, d) is stored in the first array M, if the extension node is implantation die body (l, d) if the expansion Exhibition node is not implantation die body (l, d), then does not need to be stored in the first array M.In this search process, if the first l- There are multiple implantation die bodys (l, d) in the first array M in the Binomial trees of mer building, then selects highest scoring from the first array M Node is as the 2nd l-mer.2nd l-mer is obtained to the first l-mer of each of the first l-mer concentration, by these the 2nd l-mer The 2nd l-mer collection is obtained, the 2nd l-mer the 2nd l-mer concentrated is re-started into sequence from high to low by its score again, it is defeated The node set of the rearrangement is as final implantation die body (l, d) out.

Further, performing de-redundancy processing on the first k-mer set according to the second l-mer to obtain a second k-mer set includes:

obtaining a fourth l-mer from the DNA sequence large data set;

judging, according to a third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer in the first k-mer set is redundant and is deleted from the first k-mer set to obtain the second k-mer set; otherwise the k-mer is retained in the first k-mer set to obtain the second k-mer set.

Specifically, the first k-mer set may contain redundant k-mers: a k-mer is redundant if it is a substring of the second l-mer at the same starting position, or if it overlaps the second l-mer over a length k' (k_min ≤ k' < k). Accordingly, each time a first l-mer is to be obtained from a k-mer in the first k-mer set, the present embodiment uses the most recently generated second l-mer to determine whether that k-mer is redundant. If the k-mer is redundant, it is deleted from the first k-mer set to obtain the second k-mer set; if it is not redundant, it is retained in the first k-mer set to obtain the second k-mer set.

Let the third expected value e(k) denote the expected Hamming distance between the k-mer at any starting position in an arbitrary die body instance and the k-mer at the same starting position in the implantation die body (l, d). The present embodiment obtains a fourth l-mer from the DNA sequence large data set D; in the calculation of e(k), the fourth l-mer serves as the die body instance and the second l-mer serves as the implantation die body (l, d). e(l) is calculated from the total probability formula. Take any one mutated position between the fourth l-mer and the second l-mer and assume that this mutation falls at random on one of the l positions; then the third expected value e(k) equals e(l) multiplied by k/l. The third expected value e(k) of the present embodiment is designed as follows:

In the present embodiment, a k-mer x in the first k-mer set is defined to be redundant as follows: there exists a k-mer z in the second l-mer such that d_H(z, x) ≤ e(k), that is, the Hamming distance between the k-mer x in the first k-mer set and the k-mer z in the second l-mer is less than or equal to the third expected value e(k). Such a k-mer is deleted from the first k-mer set and does not go through the implantation die body (l, d) search procedure described above; otherwise the k-mer is retained in the first k-mer set and goes through that procedure. A k-mer x in the first k-mer set may also be defined to be redundant as follows: let pf(x, k') and sf(x, k') denote the prefix of length k' and the suffix of length k' of a string x; x is redundant if there exists k' with k_min ≤ k' < k such that d_H(pf(z, k'), sf(x, k')) ≤ e(k') or d_H(sf(z, k'), pf(x, k')) ≤ e(k').
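Both redundancy definitions can be sketched in Python. Since the formula for e(k) is not reproduced in this text, it is passed in as a callable `e`; the function name `is_redundant` is hypothetical. For the prefix/suffix rule, the sketch assumes that z is the whole second l-mer, so that the rule captures a k-mer hanging off either end of it; the text leaves this reading open.

```python
from typing import Callable


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_redundant(x: str, second_lmer: str,
                 e: Callable[[int], float], k_min: int) -> bool:
    """Apply the two redundancy rules to the k-mer `x` (k_min >= 1 assumed)."""
    k = len(x)
    # Rule 1: some k-mer z inside the second l-mer lies within e(k) of x.
    for p in range(len(second_lmer) - k + 1):
        if hamming(second_lmer[p:p + k], x) <= e(k):
            return True
    # Rule 2 (assumed reading): x overlaps one end of the second l-mer
    # over k' characters, with k_min <= k' < k.
    z = second_lmer
    for kp in range(k_min, k):
        if hamming(z[:kp], x[-kp:]) <= e(kp):  # pf(z, k') against sf(x, k')
            return True
        if hamming(z[-kp:], x[:kp]) <= e(kp):  # sf(z, k') against pf(x, k')
            return True
    return False
```

A caller could build `e` from the stated relation e(k) = e(l) · k / l once a value of e(l) is available, for example `e = lambda k: e_l * k / l` with a hypothetical `e_l`.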

In the present embodiment, designing the third expected value e(k) and performing de-redundancy processing on the first k-mer set reduces the amount of calculation in the subsequent search for the implantation die body (l, d) and shortens the running time of the APMS method.

Further, the first l-mer set is processed according to the second k-mer set to obtain the second l-mer set.

Specifically, after the de-redundancy processing of the first k-mer set described above, the second k-mer set is obtained and is used to update the first k-mer set. Because the redundant k-mers have been deleted from the first k-mer set, they no longer need to be taken from the first k-mer set and used to obtain first l-mers. In each round, the APMS method of the present embodiment therefore takes a k-mer from the first k-mer set, obtains a first l-mer from that k-mer, constructs a Binomial tree from the first l-mer, obtains a second l-mer from the Binomial tree, and then uses the second l-mer to remove the redundant k-mers from the first k-mer set, obtaining the second k-mer set. The first k-mer set is updated with the second k-mer set, a k-mer is taken from the updated first k-mer set, a first l-mer is obtained from it, and the process repeats. For the first l-mer set, a Binomial tree is constructed for each first l-mer, the score of every node in the Binomial tree is calculated, and the highest-scoring node is taken as the second l-mer corresponding to that first l-mer. Each first l-mer in the first l-mer set thus corresponds to one second l-mer, and these second l-mers form the second l-mer set.

Step 3: determine the implantation die body (l, d) from the second l-mer set according to the first score model.

Specifically, the second l-mers in the second l-mer set are sorted from high to low by their scores under the first score model, and the sorted second l-mer set is output, thereby obtaining the implantation die body (l, d).

Refer to Fig. 3, which is a schematic structural diagram of a DNA data set implantation die body searching device provided in an embodiment of the present invention. Another embodiment of the present invention provides a DNA data set implantation die body searching device, which includes:

a data acquisition module, configured to obtain a DNA sequence large data set and the implantation die body search parameters of the DNA sequence large data set;

a data processing module, configured to obtain the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters, obtain the first l-mer set according to the first k-mer set, and obtain the second l-mer set according to the first l-mer set;

a data determining module, configured to determine the implantation die body from the second l-mer set according to the first score model.

The DNA data set implantation die body searching device provided in this embodiment of the present invention can execute the above method embodiment; its implementation principle and technical effect are similar and are not repeated here.

Yet another embodiment of the present invention provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the following steps:

obtaining a DNA sequence large data set and the implantation die body search parameters of the DNA sequence large data set;

obtaining the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters, obtaining the first l-mer set according to the first k-mer set, and obtaining the second l-mer set according to the first l-mer set;

determining the implantation die body from the second l-mer set according to the first score model.

The computer readable storage medium provided in this embodiment of the present invention can execute the above method embodiment; its implementation principle and technical effect are similar and are not repeated here.

To illustrate the advantages of the present invention, the present embodiment verifies the APMS method on both simulated data and real data. The simulated data are mainly used to test the efficiency of the APMS method by comparing its running time with that of existing methods, and to verify that the APMS method can find the implantation die body (l, d). The real data are mainly used to verify the effectiveness of the APMS method, that is, whether it can efficiently find true die bodies in real-world biological data.

For the simulated data, three groups of simulated data sets are generated in the present embodiment for comprehensive testing, and the APMS method is compared with existing methods on each group to verify its advantages. The existing methods chosen for comparison are FMotif, PairMotifChIP and MEME-ChIP: FMotif is the most efficient exact PMS method for DNA sequence large data sets; PairMotifChIP is the most recently proposed approximate PMS method for DNA sequence large data sets; MEME-ChIP is one of the best-known die body discovery methods.

The present embodiment uses the performance coefficient mPC to measure the similarity between the predicted die body (l, d) m_p and the implantation die body (l, d) m_k, where len_overlap(m_p, m_k) denotes the number of overlapping characters between the predicted die body m_p and the implantation die body m_k. mPC is calculated as follows:

(1) The first group of simulated data sets is used for verification on data with different die bodies (l, d). In the DNA sequence large data set, the number of DNA sequences is t = 3000 and each DNA sequence has n = 200 characters. The implantation die body (l, d) search proportion in the first group of tests is q = 0.5, that is, the number of DNA sequences required in the first group of tests is 3000 × 0.5 = 1500, and the conservation parameter is g = 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are then compared under different values of l and d.

Table 1. Comparison results on the first group of simulated data sets

In Table 1, time denotes the running time; s denotes seconds, m minutes and h hours; N indicates that the running time exceeded 48 hours and no prediction could be made. As Table 1 shows, for given t, n, q and g and under different values of l and d, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods. When l and d are large, the FMotif method exceeds 48 hours of running time and cannot make a prediction. The running times of the PairMotifChIP and MEME-ChIP methods are relatively stable as l and d increase. Although the running time of the APMS method increases as l and d increase, it remains on the order of seconds, faster than the PairMotifChIP method and faster still than the MEME-ChIP method.

(2) The second group of simulated data sets is used for verification on data with different die body signal strengths. In the DNA sequence large data set, the number of DNA sequences is t = 3000, each DNA sequence has n = 200 characters, and the implantation die body is (l, d) = (15, 5). The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are compared under different values of the search proportion q and the conservation parameter g. The die body signal strength depends on q and g: when q is small and g is large, the signal strength is low; when q is large and g is small, the signal strength is high.

Table 2. Comparison results on the second group of simulated data sets

In Table 2, time denotes the running time; s denotes seconds, m minutes and h hours; N indicates that the running time exceeded 48 hours and no prediction could be made. As Table 2 shows, for given t, n, l and d and under different values of q and g, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods. When the die body signal strength is low, the FMotif method exceeds 48 hours of running time and cannot make a prediction. The running times of the APMS, PairMotifChIP and MEME-ChIP methods are relatively stable, and the APMS method is faster than the PairMotifChIP method and faster still than the MEME-ChIP method.

(3) The third group of simulated data sets is used for verification on DNA sequence large data sets of different scales. Each DNA sequence has n = 200 characters, the implantation die body is (l, d) = (15, 5), the search proportion is q = 0.5 and the conservation parameter is g = 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are then compared under different values of the number of DNA sequences t.

Table 3. Comparison results on the third group of simulated data sets

In Table 3, time denotes the running time; s denotes seconds, m minutes and h hours; N indicates that the running time exceeded 48 hours and no prediction could be made. As Table 3 shows, for given n, q, g, l and d and under different values of t, the APMS method runs faster than the PairMotifChIP and MEME-ChIP methods. When the DNA sequence large data set is large, the MEME-ChIP method exceeds 48 hours of running time and cannot make a prediction, and the running time of the PairMotifChIP method grows at a higher rate than that of the APMS method. Because FMotif limits the maximum number of DNA sequences it can process to 3000, FMotif does not take part in the comparison on the third group of data sets.

Tables 1, 2 and 3 show that the APMS method completes the prediction of the implantation die body (l, d) in the shortest time in all cases, orders of magnitude faster than the FMotif, PairMotifChIP and MEME-ChIP methods. For all methods, the value of the performance coefficient mPC is 1, which indicates that they all find the implantation die body (l, d) accurately. The main reason is that the three groups of simulated data sets contain ample die body information, so that the implantation die body (l, d) can still be found accurately even when the die body signal strength is very low.

Refer to Fig. 4, which is a schematic diagram of the comparison of the APMS, PairMotifChIP and MEME-ChIP methods provided in an embodiment of the present invention on simulated data with different numbers of DNA sequences. As can be seen, the running time of the APMS method grows approximately linearly with the number of DNA sequences, whereas the running time of PairMotifChIP grows approximately quadratically, and the MEME-ChIP method already exceeds 48 hours of running time and cannot make a prediction when the number of DNA sequences is 12000.

For the real data, the present embodiment uses mouse embryonic stem cell (Mouse Embryonic Stem Cell, mESC for short) ChIP-seq data, which are the most widely used data for verifying the effectiveness of die body search methods. The mESC data comprise 12 data sets (c-Myc, CTCF, Esrrb, Klf4, Nanog, n-Myc, Oct4, Smad1, Sox2, STAT3, Tcfcp2l1, Zfx), each named after its ChIP-ed transcription factor. When the APMS method searches for die bodies, unified implantation die body (l, d) search parameters are used on the 12 data sets: implantation die body (l, d) = (13, 4), search proportion q = 0.3 and conservation parameter g = 0.5. For each data set, the first 3000 DNA sequences are taken as the input of the APMS method.

Refer to Fig. 5, which is a schematic diagram of the experimental results on real data of the efficient implantation die body searching method for DNA sequence large data sets provided in an embodiment of the present invention. For each data set, the figure shows the number of DNA sequences, the running time, and the published die body and the predicted die body in sequence-logo form, with the published die body on top and the predicted die body below. Comparing the predicted die body with the published die body for each data set shows that the APMS method finds a predicted die body similar to the published die body on all 12 data sets, and that the running time on every data set is within 6 minutes.

Thus, the APMS method can be used to process real DNA sequence large data sets efficiently and effectively.

In conclusion, the APMS method processes DNA sequence large data sets efficiently and effectively on both simulated data sets and real data sets. It not only successfully finds the implantation die body (l, d) or the true die body, but also runs much faster than existing implantation die body (l, d) searching methods. On the simulated data sets, the running time of the APMS method increases linearly with the scale of the DNA sequence data set.

The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the invention is not to be regarded as limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or replacements can be made without departing from the inventive concept, and all of these shall be regarded as falling within the protection scope of the invention.

Claims

1. A DNA data set implantation die body searching method, characterized by comprising:

2. The method according to claim 1, characterized in that obtaining the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters comprises:

3. The method according to claim 2, characterized in that obtaining the length k comprises:

4. The method according to claim 3, characterized in that obtaining the first threshold comprises:

5. The method according to claim 4, characterized in that obtaining the first l-mer set according to the first k-mer set comprises:

obtaining a k-mer from the first k-mer set;

performing extension processing on each k-mer in the DNA sequence large data set to obtain an extended first k-mer set;

performing de-redundancy processing on the extended first k-mer set according to a second score model to obtain an extended second k-mer set;

obtaining the first l-mer set according to the first l-mer.

6. The method according to claim 5, characterized in that performing interception processing on the extended second k-mer set to obtain the first l-mer comprises:

obtaining an aligned sequence according to the extended second k-mer set;

7. The method according to claim 6, characterized in that obtaining the second l-mer set according to the first l-mer set comprises:

calculating scores for all nodes of the constructed Binomial tree according to the first score model, and taking the node with the highest score as the second l-mer;

8. The method according to claim 7, characterized in that performing de-redundancy processing on the first k-mer set according to the second l-mer to obtain the second k-mer set comprises:

obtaining a fourth l-mer from the DNA sequence large data set;

judging, according to the third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer in the first k-mer set is redundant and is deleted from the first k-mer set to obtain the second k-mer set; otherwise the k-mer is retained in the first k-mer set to obtain the second k-mer set.

9. A DNA data set implantation die body searching device, characterized in that the device comprises:

a data acquisition module, configured to obtain the DNA sequence large data set and the implantation die body search parameters of the DNA sequence large data set;

a data processing module, configured to obtain the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters, obtain the first l-mer set according to the first k-mer set, and obtain the second l-mer set according to the first l-mer set;

10. A computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.