CN110059228A - A kind of DNA data set implantation die body searching method and its device and storage medium - Google Patents
- Publication number
- CN110059228A CN201910181475.8A
- Authority
- CN
- China
- Prior art keywords
- mer
- collection
- die body
- dna sequence
- data sets
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9027—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90348—Query processing by searching ordered data, e.g. alpha-numerically ordered data
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a method for searching for an implantation die body (planted motif) in a DNA data set, and to a corresponding device and storage medium. The method includes: obtaining a DNA sequence large data set and the implantation die body search parameters for that data set; obtaining a first k-mer collection according to the DNA sequence large data set and the implantation die body search parameters, obtaining a first l-mer collection according to the first k-mer collection, and obtaining a second l-mer collection according to the first l-mer collection; and determining the implantation die body from the second l-mer collection according to a first score model. With this APMS method, the implantation die body can be found in a DNA sequence large data set, and the running time is orders of magnitude shorter than that of other implantation die body search methods.
Description
Technical field
The invention belongs to the field of DNA sequence big data processing, and in particular relates to a DNA data set implantation die body search method, a device thereof, and a storage medium.
Background
DNA is the carrier of hereditary information, which is stored in sequences composed of four characters. The growth and development of an organism is, in essence, the transmission and expression of hereditary information. As the first step of that expression, transcription is central to the regulatory mechanism. A transcription factor binds to specific sites in the DNA sequence (about 5 to 20 base pairs long), initiating transcription of a gene and controlling its transcription efficiency. These sites are called transcription factor binding sites (TFBS), and locating them is important for studying the transcriptional regulation of genes.
Quorum implantation die body search (quorum planted motif search, abbreviated qPMS) is one of the best-known computational models for locating TFBS in DNA sequences. Common exact qPMS methods fall into two groups: those based on sample-pattern driving and those based on suffix trees. Exact methods based on sample-pattern driving, such as PMSprune, StemFinder, qPMS7, TravStrR, PMS8 and qPMS9, comprise a sample-driven stage and a pattern-driven stage: the sample-driven stage selects some reference DNA sequences as constraints so as to generate as few candidate die bodies as possible, and the pattern-driven stage verifies the candidates. Exact methods based on suffix trees, such as Weeder, RISOTTO and FMotif, build a suffix tree index of the input sequences to speed up candidate verification. Approximate qPMS methods aim to find an optimal or near-optimal die body in a relatively short time; the most typical ones use expectation maximization, Gibbs sampling or genetic methods to refine an initial die body. Among them, MEME-ChIP, which is based on expectation maximization, is one of the best-known die body discovery methods. To process large data sets efficiently, some die body discovery methods based on new strategies have also been proposed, such as PairMotifChIP, which mines pairs of similar substrings from the input DNA sequences and merges them to obtain die bodies.
However, the exact qPMS methods, the approximate qPMS methods and PairMotifChIP share a common problem: their computational cost makes the running time too long, which becomes a bottleneck when processing DNA sequence large data sets.
Summary of the invention
To solve the above problems in the prior art, the present invention provides a DNA data set implantation die body search method, a device thereof, and a storage medium.
An embodiment of the invention provides a DNA data set implantation die body search method, comprising:
obtaining a DNA sequence large data set, and obtaining the implantation die body search parameters of the DNA sequence large data set;
obtaining a first k-mer collection according to the DNA sequence large data set and the implantation die body search parameters, obtaining a first l-mer collection according to the first k-mer collection, and obtaining a second l-mer collection according to the first l-mer collection;
determining the implantation die body from the second l-mer collection according to a first score model.
In one embodiment of the invention, obtaining the first k-mer collection according to the DNA sequence large data set and the implantation die body search parameters includes:
obtaining a length k, and obtaining several k-mers from the DNA sequence large data set according to the length k;
obtaining a first threshold, and obtaining the first k-mer collection according to the first threshold and the k-mers.
In one embodiment of the invention, obtaining the length k includes:
obtaining a first expected value according to the DNA sequence large data set;
obtaining a second expected value according to the DNA sequence large data set and the implantation die body search parameters;
obtaining the length k according to the first expected value and the second expected value.
In one embodiment of the invention, obtaining the first threshold includes:
obtaining the number of DNA sequences from the DNA sequence large data set;
obtaining the first threshold according to the second expected value and the number of DNA sequences.
In one embodiment of the invention, obtaining the first l-mer collection according to the first k-mer collection includes:
obtaining a k-mer from the first k-mer collection;
extending each occurrence of the k-mer in the DNA sequence large data set to obtain an extended first k-mer collection;
removing redundancy from the extended first k-mer collection according to a second score model to obtain an extended second k-mer collection;
truncating the extended second k-mer collection to obtain a first l-mer;
obtaining the first l-mer collection from the first l-mers.
In one embodiment of the invention, truncating the extended second k-mer collection to obtain the first l-mer includes:
obtaining an aligned sequence from the extended second k-mer collection;
truncating the aligned sequence according to a preset rule to obtain the first l-mer.
In one embodiment of the invention, obtaining the second l-mer collection according to the first l-mer collection includes:
constructing a binomial tree for a first l-mer in the first l-mer collection;
computing a score for all nodes of the constructed binomial tree according to the first score model, and taking the highest-scoring node as a second l-mer;
removing redundancy from the first k-mer collection according to the second l-mer to obtain a second k-mer collection;
processing the first l-mer collection according to the second k-mer collection to obtain the second l-mer collection.
In one embodiment of the invention, removing redundancy from the first k-mer collection according to the second l-mer to obtain the second k-mer collection includes:
obtaining a fourth l-mer from the DNA sequence large data set;
obtaining a third expected value between the k-mers of the second l-mer and the k-mers of the fourth l-mer;
judging, according to the third expected value, whether each k-mer in the first k-mer collection is redundant: when the Hamming distance d between a k-mer in the first k-mer collection and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer is redundant and is deleted from the first k-mer collection; otherwise it is retained. The k-mers that remain form the second k-mer collection.
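The redundancy test described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the distance bound that the text derives from the second and fourth l-mers is passed in as a plain number `threshold` because its formula is not given here, and "a k-mer in the second l-mer" is assumed to mean any length-k substring of the second l-mer.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def remove_redundant_kmers(first_kmers, second_lmer, threshold):
    """Keep only the k-mers that are NOT within `threshold` mismatches
    of some length-k window of `second_lmer` (assumed interpretation)."""
    kept = []
    for x in first_kmers:
        k = len(x)
        windows = [second_lmer[p:p + k] for p in range(len(second_lmer) - k + 1)]
        if any(hamming(x, w) <= threshold for w in windows):
            continue  # redundant: already explained by the second l-mer
        kept.append(x)
    return kept


# Toy usage: "ATT" and "TTG" occur exactly in "ATTGCA", "CCC" does not.
second_kmers = remove_redundant_kmers(["ATT", "CCC", "TTG"], "ATTGCA", 0)
```

Raising `threshold` makes the filter more aggressive, since k-mers that merely resemble a window of the second l-mer are then also dropped.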
Another embodiment of the present invention provides a DNA data set implantation die body search device, comprising:
a data acquisition module, for obtaining the DNA sequence large data set and obtaining the implantation die body search parameters of the DNA sequence large data set;
a data processing module, for obtaining the first k-mer collection according to the DNA sequence large data set and the implantation die body search parameters, obtaining the first l-mer collection according to the first k-mer collection, and obtaining the second l-mer collection according to the first l-mer collection;
a data determining module, for determining the implantation die body from the second l-mer collection according to the first score model.
Yet another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in any of the above embodiments.
Compared with the prior art, the beneficial effects of the present invention are as follows:
With the APMS method, the present invention can find the implantation die body in a DNA sequence large data set, and its running time is orders of magnitude shorter than that of other implantation die body search methods.
Brief description of the drawings
Fig. 1 is a flow diagram of a DNA data set implantation die body search method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of implantation die body search with a traditional binomial tree, provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a DNA data set implantation die body search device provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram comparing the results of the APMS, PairMotifChIP and MEME-ChIP methods on simulated data under different DNA sequences, provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the experimental results on real data of the efficient DNA sequence large data set implantation die body search method provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to specific embodiments, but the embodiments of the present invention are not limited thereto.
Embodiment one
Referring to Fig. 1, which is a flow diagram of a DNA data set implantation die body search method provided by an embodiment of the present invention, the method comprises the following steps:
Step 1: obtain a DNA sequence large data set, and obtain the implantation die body search parameters of the DNA sequence large data set.
Step 1.1: obtain the DNA sequence large data set.
Specifically, the DNA sequence large data set D obtained in this embodiment includes t DNA sequences and can be expressed as D = {s_1, s_2, …, s_t}, where s_i denotes the i-th DNA sequence and every DNA sequence contains n characters. Each DNA sequence s_i is a string over the alphabet Σ = {A, C, G, T}, that is, a string of length n composed of the characters A, C, G and T. s_i[j] denotes the j-th character of the i-th DNA sequence, and s_i[j..j'] denotes the substring of the i-th DNA sequence that starts at position j and ends at position j'. Here i ranges from 0 to t-1 and j ranges from 0 to n-1.
Step 1.2: obtain the implantation die body search parameters of the DNA sequence large data set.
Specifically, in this embodiment the implantation die body (l, d) search parameters include the length l of the implantation die body (l, d), the Hamming distance d of the implantation die body (l, d), the search proportion q of the implantation die body (l, d), and the conservation parameter g.
In this embodiment, the problem that the APMS method solves for the implantation die body (l, d) is as follows. Given a DNA sequence large data set D = {s_1, s_2, …, s_t} of t sequences of length n, and three parameters l, d and q satisfying 0 < l < n, 0 ≤ d < l and 0 < q ≤ 1, the goal is to find an l-mer m (a string of length l) such that at least qt of the DNA sequences s_i each contain an l-mer m_i that differs from m in at most d positions (mutations). The number of differing positions is defined as the Hamming distance:
$$d_H(m, m_i) = \left|\{\, j : 1 \le j \le l,\ m[j] \ne m_i[j] \,\}\right|$$
The l-mer m is called an implantation die body (l, d), and each l-mer m_i in the DNA sequence large data set is called a die body example. Sequences in the DNA sequence large data set that do not satisfy the above Hamming distance condition are called background sequences. APMS is the DNA data set implantation die body search method of the present invention.
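The problem statement above can be made concrete with a short Python sketch. It is a brute-force check written only for illustration (the function names `hamming`, `has_instance` and `is_motif` are hypothetical, and this is not how APMS itself searches): it tests whether a given l-mer m has an occurrence with at most d mismatches in at least qt of the sequences.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance d_H between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def has_instance(seq: str, m: str, d: int) -> bool:
    """True if some l-mer of `seq` differs from `m` in at most d positions."""
    l = len(m)
    return any(hamming(seq[p:p + l], m) <= d for p in range(len(seq) - l + 1))


def is_motif(D, m: str, d: int, q: float) -> bool:
    """True if at least q*t of the t sequences in D contain an instance of m."""
    hits = sum(has_instance(s, m, d) for s in D)
    return hits >= q * len(D)


# Toy data: the first two sequences contain ACGT with at most one mismatch.
D = ["ACGTACGT", "TTACGAAA", "GGGGGGGG"]
found = is_motif(D, "ACGT", d=1, q=0.5)
```

The check costs time proportional to t·n·l for each candidate m, which is why enumerating all 4^l candidates this way is impractical on large data sets.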
DNA sequence large data sets help in finding high-quality implantation die bodies (l, d), but most existing qPMS methods are too time-consuming and cannot complete the qPMS computation within a reasonable time. The APMS method of this embodiment builds on the qPMS approach: it can find the implantation die body (l, d) in a DNA sequence large data set, and its running time is orders of magnitude shorter than that of existing die body search methods.
Step 2: obtain the first k-mer collection according to the DNA sequence large data set and the implantation die body search parameters, obtain the first l-mer collection according to the first k-mer collection, and obtain the second l-mer collection according to the first l-mer collection.
Step 2.1: obtain the first k-mer collection according to the DNA sequence large data set and the implantation die body (l, d) search parameters. The first k-mer collection includes several k-mers, each containing k characters.
Specifically, obtaining the first k-mer collection according to the DNA sequence large data set and the implantation die body (l, d) search parameters includes:
obtaining a length k, and obtaining several k-mers from the DNA sequence large data set according to the length k;
obtaining a first threshold, and obtaining the first k-mer collection according to the first threshold and the k-mers.
Further, obtaining the length k includes:
obtaining a first expected value according to the DNA sequence large data set;
obtaining a second expected value according to the DNA sequence large data set and the implantation die body (l, d) search parameters;
obtaining the length k according to the first expected value and the second expected value.
Specifically, this embodiment uses probability analysis to determine a suitable value of k, so that k-mers in background sequences can be better distinguished from k-mers in die body examples. Let f_r(k) be the first expected value: the expected number of occurrences in the DNA sequence large data set D of a k-mer taken from any background sequence. Let f_m(k) be the second expected value: the expected number of occurrences in D of a k-mer taken from any die body example. The larger the ratio of f_m(k) to f_r(k), the easier it is to tell k-mers in background sequences from k-mers in die body examples by their number of occurrences. The value of k is therefore determined by formula (1), in which k_min denotes the minimum value of k and ε is a factor for handling the case where f_r(k) is less than 1. k_min is preferably 5, because when k is very small it is difficult to distinguish k-mers in background sequences from those in die body examples. ε is set empirically to 1.
In this embodiment, the first expected value f_r(k) in formula (1) is obtained from the DNA sequence large data set according to formula (2).
Suppose the implantation die body (l, d) being searched for is m, with die body examples m_1 and m_2 in the DNA sequence large data set D. Take a k-mer x_1 at any start position in an arbitrary die body example m_1 and a k-mer x_2 at the same start position in another arbitrary die body example m_2, and let p_k denote the probability that x_1 and x_2 are equal. The second expected value f_m(k) in formula (1) is then given by formula (3).
In formula (3), p_k follows from the law of total probability and is given by formula (4), where Pr_i denotes the probability that the Hamming distance between the implantation die body (l, d) m and the die body example m_1 is d_H(m, m_1) = i (0 ≤ i ≤ d), and Pr_j denotes the probability that the Hamming distance between m and the die body example m_2 is d_H(m, m_2) = j (0 ≤ j ≤ d). Pr_i is given by formula (5), where g denotes the conservation parameter, with range 0 ≤ g ≤ 1. Similarly, Pr_j is given by formula (6).
p_ij denotes the probability that x_1 and x_2 are equal under the conditions d_H(m, m_1) = i and d_H(m, m_2) = j, and is given by formula (7). As formula (7) shows, p_ij is a sum, over a from 0 to min{i, j}, of the product of three factors. The first factor is the probability that any k-mer x_1 in die body example m_1 contains a mutations; the second is the probability that the mutated positions of x_2 are the same as those of x_1; the third is the probability that, given identical mutated positions in x_2 and x_1, the bases they mutate into are also identical.
The first expected value f_r(k) and the second expected value f_m(k) are computed by formulas (2) and (3) for k in the range 0 to l. Then, according to formula (1), the k that maximizes the ratio of f_m(k) to f_r(k) is taken as the length k of each k-mer in the first k-mer collection of this embodiment.
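The selection of k can be sketched in Python as below. The formulas themselves are not reproduced in this text, so the sketch rests on assumptions that should be read as such: `choose_k` maximizes f_m(k) / (f_r(k) + ε), where placing ε in the denominator is a guess at how the factor guards against f_r(k) < 1; `background_expectation` is the textbook estimate for a uniform, independent background and is not necessarily formula (2); and f_m is supplied by the caller because its definition depends on formulas not shown here.

```python
def background_expectation(t: int, n: int, k: int) -> float:
    """Expected occurrences of one fixed k-mer in t random sequences of length n.

    Assumption: bases are uniform and independent (probability 1/4 each).
    """
    return t * (n - k + 1) / 4 ** k


def choose_k(f_r, f_m, l: int, k_min: int = 5, eps: float = 1.0) -> int:
    """Return the k in [k_min, l] that maximizes f_m(k) / (f_r(k) + eps).

    Assumption: eps sits in the denominator so the ratio stays bounded
    when f_r(k) drops below 1.
    """
    return max(range(k_min, l + 1), key=lambda k: f_m(k) / (f_r(k) + eps))


# Usage with a made-up f_m; real values would come from the die body model.
t, n, l = 1000, 200, 15
k = choose_k(lambda k: background_expectation(t, n, k),
             lambda k: t * 0.9 ** k,  # hypothetical stand-in for f_m(k)
             l)
```

The trade-off the sketch exposes is the one the text describes: f_r(k) falls off quickly as k grows, while f_m(k) falls off more slowly, so the ratio peaks at an intermediate k.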
From the DNA sequence large data set D, several k-mers of length k are obtained.
Further, obtaining the first threshold includes:
obtaining the number of DNA sequences from the DNA sequence large data set;
obtaining the first threshold according to the second expected value and the number of DNA sequences.
Specifically, this embodiment does not take all k-mers of length k from the DNA sequence large data set D. Instead, it takes the k-mers whose number of occurrences in D is greater than or equal to the first threshold as high-frequency k-mers, and these form the first k-mer collection. As described above, f_m(k) is the expected number of occurrences in D of any k-mer from an arbitrary die body example; if the first threshold were set directly to f_m(k), multiple high-frequency k-mers corresponding to the same die body might be obtained. The first threshold is therefore designed as f_m(k) plus a variable proportional to the number of DNA sequences t, to avoid obtaining many redundant high-frequency k-mers. It is given by formula (8).
Further, with the first threshold obtained from formula (8), the k-mers in every DNA sequence whose number of occurrences is greater than or equal to the first threshold are taken as high-frequency k-mers, generating the first k-mer collection.
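The high-frequency filter can be illustrated with a short Python sketch. The function names are hypothetical, and the first threshold is passed in as a number because formula (8) is not reproduced in this text; the sketch only shows the counting and filtering step.

```python
from collections import Counter


def count_kmers(D, k: int) -> Counter:
    """Count every length-k substring over all sequences in D."""
    counts = Counter()
    for s in D:
        for j in range(len(s) - k + 1):
            counts[s[j:j + k]] += 1
    return counts


def high_frequency_kmers(D, k: int, threshold: float) -> set:
    """k-mers whose total number of occurrences in D is >= threshold."""
    return {x for x, c in count_kmers(D, k).items() if c >= threshold}


# Toy usage: "ACG" occurs three times in total, every other 3-mer once.
D = ["ACGACG", "TACGTT"]
first_kmers = high_frequency_kmers(D, k=3, threshold=2)
```

A single pass with a hash-based counter takes time linear in the total length of D, which suits the large data sets the method targets.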
Step 2.2: obtain the first l-mer collection according to the first k-mer collection. The first l-mer collection includes several first l-mers, each containing l characters.
Specifically, obtaining the first l-mer collection according to the first k-mer collection includes:
obtaining a k-mer from the first k-mer collection;
extending each occurrence of the k-mer in the DNA sequence large data set to obtain an extended first k-mer collection, in which each extended k-mer has length 2l-k;
removing redundancy from the extended first k-mer collection according to a second score model to obtain an extended second k-mer collection;
truncating the extended second k-mer collection to obtain a first l-mer;
obtaining the first l-mer collection from the first l-mers.
Further, the k-mers in the DNA sequence large data set are extended to obtain the extended first k-mer collection, in which each extended k-mer has length 2l-k.
Specifically, the implantation die body (l, d) is searched for through the first k-mer collection. A k-mer x is first taken from the first k-mer collection. Because the start position of x within the implantation die body (l, d) is unknown, this embodiment finds x in the DNA sequence large data set D and then extends it by l-k characters to the left and to the right, so that the extended k-mer x becomes a string of length 2l-k. Handled this way, the extended k-mer x covers the die body example of the implantation die body (l, d) in D.
For example, if s_i[j..j+k-1] is an exact occurrence of the k-mer x in the DNA sequence large data set D, then the corresponding extended k-mer x in D is s_i[j-l+k..j+l-1].
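The extension step can be sketched in Python as follows. The function name is hypothetical, and one detail is an assumption: occurrences that lie too close to either end of a sequence to be extended by l-k characters on both sides are simply skipped, since the text does not say how they are handled.

```python
def extended_occurrences(D, x: str, l: int):
    """For each exact occurrence s[j..j+k-1] of x, return s[j-l+k..j+l-1].

    Each result has length 2l-k. Occurrences without l-k characters of
    context on both sides are skipped (assumption).
    """
    k = len(x)
    flank = l - k
    out = []
    for s in D:
        j = s.find(x)
        while j != -1:
            lo, hi = j - flank, j + l  # inclusive end j+l-1 -> slice end j+l
            if lo >= 0 and hi <= len(s):
                out.append(s[lo:hi])
            j = s.find(x, j + 1)  # continue, allowing overlapping occurrences
    return out


# Toy usage with l = 6 and k = 3, so each extended k-mer has length 9.
D = ["AACGATTGCAGTT", "TTGAAAAAA"]
extended = extended_occurrences(D, "TTG", l=6)
```

Because the k-mer may start anywhere from the first to the last possible offset inside the die body example, a window of 2l-k characters centred on the k-mer is the smallest one that contains the example in every case.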
Further, every k-mer x in the DNA sequence large data set D is extended in this way to obtain the extended first k-mer collection.
Further, redundancy is removed from the extended first k-mer collection according to the second score model to obtain the extended second k-mer collection, in which each extended k-mer has length 2l-k.
Specifically, if an extended k-mer x in the extended first k-mer collection contains no die body example in the DNA sequence large data set D, it consists entirely of background sequence, and such an extended k-mer degrades the quality of the first l-mer collection. Therefore, before generating the first l-mer collection, this embodiment evaluates each extended k-mer x with the designed second score model score_i(y) to assess whether it consists of background sequence. As noted above, the first expected value f_r(k) is the expected number of occurrences in D of a k-mer from an arbitrary background sequence, so the second score model of this embodiment is designed on that basis and is given by formula (9).
As formula (9) shows, the smaller the score of the second score model score_i(y), the more likely the extended k-mer x consists of background sequence. The extended k-mers x with the smallest scores are therefore filtered out of the extended first k-mer collection, giving the extended second k-mer collection.
With the designed second score model, this embodiment filters out of the extended first k-mer collection the extended k-mers x that are likely to be background sequence, which reduces the computation of the subsequent implantation die body (l, d) search and shortens the running time of the APMS method.
Further, truncating the extended second k-mer collection to obtain the first l-mer includes:
obtaining an aligned sequence from the extended second k-mer collection;
truncating the aligned sequence according to a preset rule to obtain the first l-mer.
Specifically, after redundancy has been removed from the extended first k-mer collection in the DNA sequence large data set D, the remaining extended k-mers form the extended second k-mer collection. The extended k-mers in this collection are assembled into an aligned sequence align of length 2l-k, where r(align[i]) denotes the information content of the i-th column of align, and the alignment is then truncated according to the preset rule to obtain the first l-mer. The information content is computed with a position weight matrix (PWM), in which each column gives the proportions of the four characters A, C, G and T at the corresponding position of the extended k-mers.
The preset rule in this embodiment is as follows. The extended k-mers in the extended second k-mer collection are right-aligned to form the aligned sequence align. From the information content r(align[i]) of each column of align, a consensus sequence of length 2l-k is first obtained. The columns at the left and right ends of the consensus sequence are then compared repeatedly and the one with the smaller information content is removed, until a consensus sequence of length l remains. This consensus sequence of length l is the first l-mer.
For example, suppose in this embodiment that the length l of the implantation die body (l, d) is 6 and the k-mer length k is 3, and that the DNA sequence large data set contains 6 extended k-mers: {AGATTGCAG}, {CGATTGCAG}, {CGATTGCAC}, {CGCTTGCAC}, {CGCTTGCAG} and {CTATTGTAG}. The 6 extended k-mers are first right-aligned:
{AGATTGCAG,
CGATTGCAG,
CGATTGCAC,
CGCTTGCAC,
CGCTTGCAG,
CTATTGTAG}
to form the aligned sequence align. The information content of each column r(align[i]) of align is:
{A: 0.17, 0.00, 0.67, 0.17, 0.00, 0.17, 0.00, 1.00, 0.00
C: 0.83, 0.00, 0.33, 0.00, 0.00, 0.00, 0.83, 0.00, 0.33
G: 0.00, 0.83, 0.00, 0.00, 0.00, 0.66, 0.00, 0.00, 0.67
T: 0.00, 0.17, 0.00, 0.83, 1.00, 0.17, 0.17, 0.00, 0.00}
From the information content of each column r(align[i]), the consensus sequence {CGATTGCAG} is obtained. The proportions of the characters A, C, G and T in the end columns of the consensus sequence are then examined. In the leftmost column C has the largest proportion, so character C is selected on the left; in the rightmost column G has the largest proportion, so character G is selected on the right. The proportion of C in the leftmost column is greater than the proportion of G in the rightmost column, so the leftmost character C is retained and the rightmost column is deleted. Next, the leftmost column still selects C, while in the new rightmost column A has the largest proportion, so character A is selected on the right. The proportion of C is less than the proportion of A, so the rightmost character A is retained and the leftmost column is deleted. The process continues in the same way until the consensus sequence has been truncated to an l-mer of length l. That l-mer is {ATTGCA}, and it is the first l-mer.
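The trimming rule in this example can be sketched in Python. The sketch is an illustration under assumptions rather than the patented procedure: the function names are hypothetical, column proportions are recomputed directly from the six right-aligned sequences as listed, the "information content" of a column is taken to be the proportion of its most frequent character (which is how the worked example compares columns), and a tie between the two ends removes the left column, a case the text does not cover.

```python
def column_frequencies(aligned):
    """Per-column proportions of A, C, G, T for equal-length sequences."""
    width = len(aligned[0])
    freqs = []
    for i in range(width):
        column = [s[i] for s in aligned]
        freqs.append({c: column.count(c) / len(column) for c in "ACGT"})
    return freqs


def trim_consensus(aligned, l: int) -> str:
    """Build the consensus, then drop the weaker end column until length l."""
    freqs = column_frequencies(aligned)
    best = [max("ACGT", key=f.get) for f in freqs]  # consensus characters
    left, right = 0, len(best) - 1
    while right - left + 1 > l:
        if freqs[left][best[left]] > freqs[right][best[right]]:
            right -= 1  # left end is better conserved: delete the right column
        else:
            left += 1   # otherwise delete the left column (ties: assumption)
    return "".join(best[left:right + 1])


aligned = ["AGATTGCAG", "CGATTGCAG", "CGATTGCAC",
           "CGCTTGCAC", "CGCTTGCAG", "CTATTGTAG"]
first_lmer = trim_consensus(aligned, l=6)
```

Run on the six sequences of the example, the sketch deletes the right column once and the left column twice, which reproduces the first l-mer ATTGCA obtained in the text.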
Further, the k-mers in the first k-mer collection are traversed, and the first l-mer of each k-mer in the DNA sequence large data set is found; these first l-mers form the first l-mer collection.
Step 2.3: obtain the second l-mer collection according to the first l-mer collection. The second l-mer collection includes several second l-mers, each containing l characters.
Specifically, obtaining the second l-mer collection according to the first l-mer collection includes:
constructing a binomial tree for a first l-mer in the first l-mer collection;
computing a score for all nodes of the binomial tree according to the first score model, and taking the highest-scoring node as a second l-mer;
removing redundancy from the first k-mer collection according to the second l-mer to obtain a second k-mer collection;
processing the first l-mer collection according to the second k-mer collection to obtain the second l-mer collection.
Further, constructing a binomial tree for a first l-mer in the first l-mer collection includes:
choosing the first l-mer as the root node of the binomial tree;
generating layer i+1 of the binomial tree from layer i, layer by layer, and judging whether the number of nodes in layer i+1 is greater than a second threshold: if it is, the final nodes of layer i+1 are selected according to the first score model so that their number equals the second threshold; if it is less than or equal to the second threshold, the nodes of layer i+1 are kept unchanged; here 0 < i < d;
judging whether each node in layer i of the binomial tree is an implantation die body (l, d): if it is, the node is stored in a first array M; if it is not, the node is not stored in the first array M; here 0 < i < d;
taking, according to the scores of the nodes in the first array M, the highest-scoring node as the second l-mer.
Specifically, refer to Fig. 2, which is a schematic diagram of implantation die body search with a traditional binomial tree, provided by an embodiment of the present invention. As Fig. 2 shows, in the traditional construction the root node of the binomial tree is a first l-mer from the first l-mer collection, and the internal or leaf nodes in layer i are the nodes whose Hamming distance from the root's first l-mer is i, where 0 < i ≤ d; the depth of the binomial tree is d. Each layer of the binomial tree holds several extension nodes, which are d-neighbours of the root's first l-mer: each differs from the first l-mer at the positions marked on the path from the root node to that internal or leaf node. In this way, every node in the binomial tree represents a d-neighbour whose Hamming distance from the first l-mer is i (0 ≤ i ≤ d). Each extension node is an l-mer of length l.
In contrast, when this embodiment constructs the binomial tree, the root node is a first l-mer from the first l-mer set, and layer i+1 is then generated from layer i, layer by layer. For each new layer, this embodiment judges whether the number of nodes in layer i+1 is greater than the second threshold. If it is, the nodes of layer i+1 are selected according to the first score model so that the number of nodes in the layer equals the second threshold; if it is less than or equal to the second threshold, all nodes of layer i+1 are kept. Here 0 < i < d.
Specifically, let the second threshold be N_mm(i), the number of nodes kept in layer i (0 < i < d) of the binomial tree. To avoid losing extension nodes in each layer of the binomial tree, the node count of layer i is multiplied by a safety factor α (α ≥ 1) when N_mm(i) is calculated. In the implementation of the APMS method, α is set empirically, preferably to 2. N_mm(i) is designed as follows:
For example, suppose that when this embodiment constructs the binomial tree, the implantation die body (l, d) is known to have length l = 5 and Hamming distance d = 3. The root node of the binomial tree is the first l-mer. The first layer consists of the l-mers whose Hamming distance from the root's first l-mer is 1. There are 15 such nodes: the implantation die body (l, d) is an l-mer of length 5 and each position has 3 possible mutations, and this embodiment keeps every mutation of the root's first l-mer in the first layer, so the first layer has 5 × 3 = 15 nodes. The second layer is generated by extending the first-layer nodes, on the basis that those nodes are implantation die bodies (l, d); each extension node is at Hamming distance 1 from the node it extends, and formula (11) gives the number of second-layer nodes as C(3,2) × 2 = 6. Similarly, the third layer is generated by extending the second-layer nodes (6 nodes), on the basis that those nodes are implantation die bodies (l, d); each extension node is at Hamming distance 1 from the node it extends, and formula (11) gives the number of third-layer nodes as C(3,3) × 2 = 2. The binomial tree finally constructed therefore has the first l-mer as its root node, 15 nodes in the first layer, 6 nodes in the second layer, and 2 nodes in the third layer.
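The layer sizes in this example can be checked with a few lines of Python. This is only a sketch: the formula for N_mm(i) is not reproduced in this text, so the form α · C(d, i) used in the hypothetical helper `layer_cap` is an assumption inferred from the two values quoted above (6 and 2 with α = 2), and the function names are illustrative.

```python
from math import comb


def layer_one_size(l: int, alphabet_size: int = 4) -> int:
    """Number of l-mers at Hamming distance 1 from a given l-mer."""
    # Each of the l positions can mutate to (alphabet_size - 1) other bases.
    return l * (alphabet_size - 1)


def layer_cap(d: int, i: int, alpha: int = 2) -> int:
    """Assumed cap on the nodes kept in layer i: alpha * C(d, i).

    This form is inferred from the worked example only; it is not
    taken from a formula stated in the text.
    """
    return alpha * comb(d, i)


if __name__ == "__main__":
    print(layer_one_size(5))   # first layer for l = 5
    print(layer_cap(3, 2))     # second layer for d = 3, alpha = 2
    print(layer_cap(3, 3))     # third layer for d = 3, alpha = 2
```

With l = 5, d = 3 and α = 2 these calls give 15, 6 and 2, matching the layer sizes stated in the example.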
Further, the final nodes of layer i+1 of the binomial tree are obtained according to the first score model, so that the number of nodes in that layer equals the second threshold.
Specifically, under the qPMS model, this embodiment designs the first score model to assess the score of each node y of the constructed binomial tree. Here D'(y) is a set of qt DNA sequences selected from the DNA sequence large data set D for calculating the score of node y, and s is the l-mer in a given DNA sequence with the smallest Hamming distance from node y. In general, the higher the score of a node y of the binomial tree, the closer node y is to the implantation die body (l, d). The first score model of this embodiment is designed as follows:
According to formula (11), when the conventional method assesses the score of a node y in the binomial tree built from any first l-mer, it always first scores node y against all t DNA sequences in the DNA sequence large data set: in each DNA sequence it finds the l-mer with the smallest Hamming distance from the first l-mer and takes that l-mer's score as the score of the DNA sequence. It then takes the qt highest-scoring DNA sequences and adds their scores; the resulting sum is the score of node y. Each node y thus has a corresponding score_n(y), and the highest-scoring node y is chosen as the second l-mer.
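A minimal sketch of this conventional full-scan evaluation is given below. The score formula itself is not reproduced in this text, so the sketch uses the Hamming distance as a stand-in for the per-sequence score (smaller is better) and follows the definition of s given with the first score model, that is, the l-mer closest to node y in each sequence. The names `hamming`, `min_distance` and `closest_sequences_total` are hypothetical, and qt is taken as the integer part of q · t.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(seq: str, y: str) -> int:
    """Smallest Hamming distance between y and any l-mer of seq.

    Assumes len(seq) >= len(y). Every window of seq is visited,
    which is the full scan the conventional method performs.
    """
    l = len(y)
    return min(hamming(seq[p:p + l], y) for p in range(len(seq) - l + 1))


def closest_sequences_total(seqs: list[str], y: str, q: float) -> int:
    """Sum of the per-sequence minimum distances over the qt best sequences.

    Stand-in for score_n(y): the real score formula is not shown in
    the text, so the distance total is used here (lower is better).
    """
    qt = int(q * len(seqs))
    dists = sorted(min_distance(s, y) for s in seqs)
    return sum(dists[:qt])


if __name__ == "__main__":
    data = ["ACGTACGT", "TTTTTTTT", "ACGAACGA", "GGGGACGT"]
    print(closest_sequences_total(data, "ACGT", 0.5))
```

Every call rescans all t sequences in full, which is the cost the ordered-queue scheme of this embodiment is meant to reduce.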
However, the conventional method has a drawback: every time the score of a node y is calculated, the whole DNA sequence large data set has to be scanned again, so the computational cost is high. To solve this problem, this embodiment arranges all l-mers in each DNA sequence in ascending order of their Hamming distance from the first l-mer, obtaining a queue. With this ordering, an l-mer nearer the front of the queue is more likely to be the second l-mer finally obtained. Using such a queue reduces the cost of finding the second l-mer: in most cases only the first few l-mers in the queue need to be scanned to find the best-scoring l-mer in that DNA sequence. Here D'(y) in formula (1) is the set of qt DNA sequences chosen from the DNA sequence large data set D for calculating the score of node y; in this embodiment, the set D'(y) is expressed as:
Because all l-mers in each DNA sequence are arranged in ascending order of their Hamming distance from the first l-mer, the l-mer with the smallest Hamming distance from the first l-mer is at the front of each queue. The smallest-score l-mers obtained from every DNA sequence are in turn arranged in ascending order of Hamming distance, which gives a new queue; a row of the new queue is denoted C_i. In this embodiment, for a first l-mer m' and a d-neighbor y of m', given a row C_i and a position j (1 ≤ j ≤ |C_i|) in C_i, if d_H(C_i[j], m') − d_H(y, m') ≥ 0, then by the triangle inequality d_H(C_i[j], m') − d_H(y, m') is the smallest possible value of d_H(y, C_i[j]). Therefore, when the new queue is scanned to calculate scores, the scan of a row C_i can stop as soon as d_H(C_i[j], m') − d_H(y, m') ≥ dis(y, C_i[j]); the minimum Hamming distance of the current row C_i is then dis(y, C_i[j]). Substituting dis(y, C_i[j]) into formula (11) gives the score score_n(y) of node y on row C_i, and the scan of the next row C_{i+1} begins, until all rows of the new queue have been scanned. The highest of the per-row scores in the new queue is taken as the score score_n(y) of node y.
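The early-termination idea can be sketched as follows for a single DNA sequence. This is an illustration under stated assumptions rather than the patented procedure: the l-mers of one sequence are sorted once by their distance from the root l-mer m', and for each candidate y the scan stops as soon as the triangle-inequality lower bound d_H(C[j], m') − d_H(y, m') can no longer beat the best distance found so far. The function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def sorted_lmers(seq: str, root: str) -> list[tuple[int, str]]:
    """All l-mers of seq as (distance to root, l-mer), in ascending order."""
    l = len(root)
    lmers = [seq[p:p + l] for p in range(len(seq) - l + 1)]
    return sorted((hamming(s, root), s) for s in lmers)


def min_distance_pruned(queue: list[tuple[int, str]], root: str,
                        y: str) -> tuple[int, int]:
    """Minimum distance from y to any queued l-mer, with early exit.

    Returns (minimum distance, number of l-mers actually compared).
    """
    d_y_root = hamming(y, root)
    best = len(y) + 1          # larger than any possible distance
    scanned = 0
    for d_s_root, s in queue:
        # Triangle inequality: d(y, s) >= d(s, root) - d(y, root).
        # The queue is ascending in d(s, root), so once this bound
        # reaches the current best no later entry can improve on it.
        if d_s_root - d_y_root >= best:
            break
        scanned += 1
        best = min(best, hamming(s, y))
    return best, scanned


if __name__ == "__main__":
    seq = "ACGTTGCAACGGACGTTTAC"
    root = "ACGT"
    queue = sorted_lmers(seq, root)
    print(min_distance_pruned(queue, root, "ACGA"))
```

The queue is built once per sequence and root l-mer and reused for every node y of the tree, so the sorting cost is shared across all score evaluations.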
The score score_n(y) is calculated as above for every node of layer i+1 of the binomial tree, and the resulting scores are sorted. The nodes with the largest scores, as many as the second threshold allows, are selected as the final nodes of layer i+1 of the binomial tree, so the number of nodes in that layer equals the second threshold.
Using the first score model, this embodiment selects the high-scoring nodes in the binomial tree built from each first l-mer of the first l-mer set and generates extension nodes from them. Because a high-scoring node is more likely to be an implantation die body (l, d), the extension nodes are generated in the direction of the implantation die body (l, d), which reduces the amount of computation in the subsequent search for the implantation die body (l, d) and reduces the running time of the APMS method.
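The layer-by-layer generation with score-based pruning can be sketched as below. The sketch assumes a DNA alphabet of A, C, G, T, takes the scoring function as a caller-supplied callable (the first score model is not reproduced in this text), and uses hypothetical names `expand_layer` and `prune_layer`.

```python
from typing import Callable, Iterable

ALPHABET = "ACGT"


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def expand_layer(layer: Iterable[str], root: str) -> set[str]:
    """Generate layer i+1 from layer i.

    Each child mutates one position where the parent still agrees
    with the root, so every child is exactly one step further from
    the root than its parent.
    """
    children = set()
    for node in layer:
        for pos, base in enumerate(node):
            if base != root[pos]:
                continue  # position already mutated on this path
            for b in ALPHABET:
                if b != root[pos]:
                    children.add(node[:pos] + b + node[pos + 1:])
    return children


def prune_layer(children: set[str], score: Callable[[str], float],
                cap: int) -> list[str]:
    """Keep every node if within the cap, otherwise the cap best-scoring."""
    if len(children) <= cap:
        return sorted(children)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(children, key=lambda c: (-score(c), c))[:cap]


if __name__ == "__main__":
    root = "ACGTA"
    layer1 = expand_layer({root}, root)
    layer2 = expand_layer(layer1, root)
    # Toy score for illustration only: closeness to an arbitrary target.
    kept = prune_layer(layer2, lambda s: -hamming(s, "ACCCA"), cap=6)
    print(len(layer1), len(layer2), len(kept))
```

Storing children in a set removes duplicates that arise when two different parents reach the same l-mer.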
Further, each node of layer i of the binomial tree is judged to determine whether it is an implantation die body (l, d): if it is, the node is stored in the first array M; if it is not, the node is not stored in the first array M;
Specifically, unlike the conventional method, this embodiment does not search for the implantation die body (l, d) among all d-neighbor nodes of the binomial tree, but only among nodes that are similar to the implantation die body (l, d). To judge whether a node of layer i of the binomial tree is an implantation die body (l, d), the node is checked against the DNA sequence large data set: if at least qt DNA sequences each contain an l-mer whose Hamming distance from the node is less than or equal to d, the node is determined to be an implantation die body (l, d) and is stored in the first array M; otherwise the node is not an implantation die body (l, d) and is not stored in the first array M. Here a node of layer i and its extension node in layer i+1 are at Hamming distance 1, and 0 < i < d.
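The membership test described above can be sketched directly. The rounding of qt is an assumption (the sketch rounds q · t up), and the name `is_die_body` is hypothetical.

```python
from math import ceil


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_die_body(node: str, seqs: list[str], d: int, q: float) -> bool:
    """True if at least qt sequences contain an l-mer within distance d of node.

    qt is taken as ceil(q * t); how the original rounds is not stated.
    """
    l = len(node)
    needed = ceil(q * len(seqs))
    hits = 0
    for seq in seqs:
        windows = (seq[p:p + l] for p in range(len(seq) - l + 1))
        if any(hamming(w, node) <= d for w in windows):
            hits += 1
            if hits >= needed:
                return True  # enough supporting sequences found
    return hits >= needed


if __name__ == "__main__":
    data = ["ACGTACGT", "TTTTTTTT", "ACGAACGA", "GGGGACGT"]
    print(is_die_body("ACGT", data, d=1, q=0.75))
    print(is_die_body("ACGT", data, d=0, q=0.75))
```

In this toy data set three of the four sequences hold an l-mer within distance 1 of ACGT, but only two hold an exact copy, so the first call succeeds and the second does not.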
Further, according to the scores of the nodes in the first array M, the highest-scoring node is taken as the second l-mer.
Specifically, the nodes in the first array M are the set of nodes close to the implantation die body (l, d) that were selected for the first l-mer. The highest-scoring node among them is the most likely to be the implantation die body being searched for, so the highest-scoring node in the first array M is taken as the second l-mer.
Further, each first l-mer in the first l-mer set is traversed, a binomial tree model is constructed as described above to obtain a second l-mer, the second l-mer set is obtained from the second l-mers, and the final implantation die body (l, d) is obtained from the second l-mer set.
Specifically, a binomial tree model is constructed as described above for each first l-mer in the first l-mer set. The first score model is used to score the first array M of each binomial tree model rooted at a first l-mer, and the highest-scoring node in the first array M is selected as the second l-mer of that first l-mer. A second l-mer is thus obtained for each first l-mer in the first l-mer set, and these second l-mers constitute the second l-mer set. The second l-mers in the second l-mer set are then scored again by the first score model and sorted by score from high to low, and the re-sorted node set is output as the final implantation die bodies (l, d).
In summary, this embodiment searches for the implantation die body (l, d) with the binomial tree method, layer by layer, starting from the root's first l-mer. For the root, it first determines whether the root's first l-mer is itself an implantation die body (l, d), and takes all nodes at Hamming distance 1 from the root's first l-mer as the extension nodes of layer 1. For layer i (0 < i < d), it first selects the N_mm(i) highest-scoring nodes from the extension nodes of that layer as the final nodes of the layer, and then takes the extension nodes at Hamming distance 1 from these N_mm(i) selected nodes as the nodes of layer i+1. For layer d, it directly judges whether each node of the layer is an implantation die body (l, d). Every extension node of every layer is judged in this way: if the extension node is an implantation die body (l, d), it is stored in the first array M; otherwise it is not stored. If, during this search, the first array M of the binomial tree built from a first l-mer contains several implantation die bodies (l, d), the highest-scoring node in the first array M is selected as the second l-mer. A second l-mer is obtained for each first l-mer in the first l-mer set, and these second l-mers form the second l-mer set. The second l-mers in the second l-mer set are then sorted by score from high to low, and the re-sorted node set is output as the final implantation die bodies (l, d).
Further, performing de-redundancy processing on the first k-mer set according to the second l-mer to obtain the second k-mer set comprises:
Obtaining a fourth l-mer from the DNA sequence large data set;
Obtaining the third desired value between a k-mer of the second l-mer and a k-mer of the fourth l-mer;
Judging according to the third desired value whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third desired value, the k-mer in the first k-mer set is redundant and is deleted from the first k-mer set; otherwise the k-mer is retained in the first k-mer set; the second k-mer set is thereby obtained.
Specifically, the first k-mer set may contain redundant k-mers. A redundant k-mer is one that is a substring of the second l-mer at the same starting position, or one that overlaps the second l-mer over a length k' (k_min ≤ k' < k). On this basis, each time a first l-mer is obtained from a k-mer in the first k-mer set, this embodiment uses the previously generated second l-mer to decide whether that k-mer is redundant. If the k-mer is redundant, it is deleted from the first k-mer set; if it is not redundant, it is retained in the first k-mer set. The second k-mer set is thereby obtained.
Let the third desired value e(k) denote the expected Hamming distance between the k-mer at any starting position in an arbitrary die body instance and the k-mer at the same starting position in the implantation die body (l, d). This embodiment obtains a fourth l-mer from the DNA sequence large data set D; in the calculation of e(k), the fourth l-mer serves as the die body instance and the second l-mer serves as the implantation die body (l, d). e(l) is calculated from the total probability formula. Taking any mutated position between the fourth l-mer and the second l-mer, and assuming that the mutation falls at random on one of the l positions, the third desired value e(k) equals e(l) multiplied by k/l. The third desired value e(k) of this embodiment is designed as follows:
In this embodiment, a k-mer x in the first k-mer set is defined to be redundant if there is a k-mer z in the second l-mer such that d_H(z, x) ≤ e(k), that is, the Hamming distance between the k-mer x in the first k-mer set and the k-mer z in the second l-mer is less than or equal to the third desired value e(k). Such a k-mer is deleted from the first k-mer set and does not go through the implantation die body (l, d) search procedure described above; otherwise the k-mer is retained in the first k-mer set and goes through the search procedure. A k-mer x in the first k-mer set may also be defined to be redundant as follows: let pf(x, k') and sf(x, k') denote the prefix of length k' and the suffix of length k' of a string x, respectively; x is redundant if there is a k' with k_min ≤ k' < k such that d_H(pf(z, k'), sf(x, k')) ≤ e(k') or d_H(sf(z, k'), pf(x, k')) ≤ e(k').
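Both redundancy tests can be sketched in one function. Several points here are assumptions: the formula for e(k) is not reproduced in this text, so it is passed in as a callable; the overlap test is read as comparing the two ends of the second l-mer with the opposite ends of x, which is one interpretation of the role of z in the prefix and suffix condition; and the name `is_redundant` is hypothetical.

```python
from typing import Callable


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_redundant(x: str, second_lmer: str,
                 e: Callable[[int], float], k_min: int) -> bool:
    """Decide whether k-mer x is redundant with respect to second_lmer.

    e(k) is the caller-supplied third desired value; its actual
    formula is not shown in the text.
    """
    k = len(x)
    # Test 1: some k-mer z inside the second l-mer is within e(k) of x.
    for p in range(len(second_lmer) - k + 1):
        if hamming(second_lmer[p:p + k], x) <= e(k):
            return True
    # Test 2 (assumed reading): x overlaps one end of the second l-mer
    # over k' characters, with k_min <= k' < k.
    for kp in range(k_min, k):
        head, tail = second_lmer[:kp], second_lmer[-kp:]
        if hamming(head, x[-kp:]) <= e(kp):   # pf(z, k') vs sf(x, k')
            return True
        if hamming(tail, x[:kp]) <= e(kp):    # sf(z, k') vs pf(x, k')
            return True
    return False


if __name__ == "__main__":
    exact = lambda k: 0  # toy e(k): only exact matches count
    print(is_redundant("GTACG", "ACGTACGTAC", exact, k_min=3))
    print(is_redundant("TTTTT", "ACGTACGTAC", exact, k_min=3))
```

With the toy e(k) = 0, a k-mer is redundant only if it occurs exactly inside the second l-mer or overlaps one of its ends exactly; a realistic e(k) would tolerate some mismatches.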
In this embodiment, designing the third desired value e(k) and performing de-redundancy processing on the first k-mer set reduces the amount of computation in the subsequent search for the implantation die body (l, d) and reduces the running time of the APMS method.
Further, the first l-mer set is processed according to the second k-mer set to obtain the second l-mer set.
Specifically, the de-redundancy processing of the first k-mer set described above yields the second k-mer set, and the first k-mer set is updated with the second k-mer set. Because the redundant k-mers have been deleted from the first k-mer set to form the second k-mer set, those k-mers no longer need to be taken from the first k-mer set and turned into first l-mers. Each iteration of the APMS method of this embodiment therefore takes a k-mer from the first k-mer set, obtains a first l-mer from that k-mer, constructs a binomial tree from the first l-mer, obtains a second l-mer from the binomial tree, removes the redundant k-mers from the first k-mer set according to the second l-mer to obtain the second k-mer set, and updates the first k-mer set with the second k-mer set; it then takes the next k-mer from the updated first k-mer set, obtains a first l-mer from it, and repeats the process. For the first l-mer set, a binomial tree is constructed for each first l-mer, the score of every node in the binomial tree is calculated, and the highest-scoring node in the binomial tree is taken as the second l-mer corresponding to that first l-mer. Each first l-mer in the first l-mer set thus has a corresponding second l-mer, and these second l-mers form the second l-mer set.
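The iteration described above can be outlined as a small driver loop. This is a structural sketch only: the three callables `derive_lmer`, `best_tree_node` and `is_redundant` are hypothetical placeholders for the steps described in the text (k-mer to first l-mer, binomial tree search, redundancy test), not implementations of them.

```python
from typing import Callable


def search_loop(first_kmers: list[str],
                derive_lmer: Callable[[str], str],
                best_tree_node: Callable[[str], str],
                is_redundant: Callable[[str, str], bool]) -> list[str]:
    """Collect one second l-mer per surviving k-mer.

    After each second l-mer is found, k-mers that are redundant with
    respect to it are dropped from the pending list, so they never
    trigger a tree search of their own.
    """
    pending = list(first_kmers)      # working copy of the first k-mer set
    second_lmers = []
    while pending:
        kmer = pending.pop(0)
        first_lmer = derive_lmer(kmer)            # k-mer -> first l-mer
        second_lmer = best_tree_node(first_lmer)  # tree search result
        second_lmers.append(second_lmer)
        # De-redundancy step: this is the "second k-mer set" update.
        pending = [x for x in pending if not is_redundant(x, second_lmer)]
    return second_lmers


if __name__ == "__main__":
    # Toy stand-ins, for illustration only.
    result = search_loop(
        ["ACG", "ATT", "CGA", "CCC", "GGG"],
        derive_lmer=lambda k: k + "AA",
        best_tree_node=lambda m: m,
        is_redundant=lambda x, m: x[0] == m[0],
    )
    print(result)
```

With the toy stand-ins, the k-mers ATT and CCC are dropped as redundant, so only three tree searches run for five input k-mers.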
Step 3: the implantation die body (l, d) is determined from the second l-mer set according to the first score model.
Specifically, the second l-mers in the second l-mer set are sorted from high to low by their scores under the first score model, and the re-sorted second l-mer set is output to obtain the implantation die body (l, d).
Refer to Fig. 3, which is a schematic structural diagram of a DNA data set implantation die body search device provided by an embodiment of the present invention. Another embodiment of the present invention provides a DNA data set implantation die body search device, which includes:
A data acquisition module, configured to obtain the DNA sequence large data set and the implantation die body search parameters of the DNA sequence large data set;
A data processing module, configured to obtain the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters, obtain the first l-mer set according to the first k-mer set, and obtain the second l-mer set according to the first l-mer set;
A data determining module, configured to determine the implantation die body from the second l-mer set according to the first score model.
The DNA data set implantation die body search device provided by this embodiment of the present invention can execute the above method embodiment; its implementation principle and technical effect are similar and are not repeated here.
Yet another embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the following steps are performed:
Obtaining the DNA sequence large data set and the implantation die body search parameters of the DNA sequence large data set;
Obtaining the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters, obtaining the first l-mer set according to the first k-mer set, and obtaining the second l-mer set according to the first l-mer set;
Determining the implantation die body from the second l-mer set according to the first score model.
The computer-readable storage medium provided by this embodiment of the present invention can execute the above method embodiment; its implementation principle and technical effect are similar and are not repeated here.
To illustrate the advantages of the present invention, this embodiment verifies the APMS method on both simulated data and real data. The simulated data is mainly used to test the efficiency of the APMS method by comparing its running time with that of existing methods, and to verify that the APMS method can find the implantation die body (l, d). The real data is mainly used to verify the effectiveness of the APMS method, that is, whether the APMS method can efficiently find true die bodies in real-world biological data.
For the simulated data, three groups of simulated data sets are generated in this embodiment for comprehensive testing, and the APMS method is compared with existing methods on these three groups. The existing methods chosen for comparison are FMotif, PairMotifChIP and MEME-ChIP: FMotif is the most efficient exact PMS method for handling DNA sequence large data sets; PairMotifChIP is a recently proposed approximate PMS method for handling DNA sequence large data sets; MEME-ChIP is one of the best-known die body discovery methods.
This embodiment uses the performance coefficient mPC to measure the similarity between a predicted die body m_p and the implantation die body (l, d) m_k, where len_overlap(m_p, m_k) denotes the number of overlapping characters between the predicted die body m_p and the implantation die body (l, d) m_k. mPC is calculated as follows:
(1) The first group of simulated data sets is used for validation tests on data with different die bodies (l, d). In the DNA sequence large data set, the number of DNA sequences is t = 3000 and each DNA sequence has n = 200 characters. In the tests on the first group, the implantation die body (l, d) search proportion is q = 0.5, so the number of DNA sequences required is 3000 × 0.5 = 1500, and the conservation parameter is g = 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are then compared under different values of l and d.
Table 1. Comparison results on the first group of simulated data sets
In Table 1, time denotes the running time, s denotes seconds, m denotes minutes, h denotes hours, and N denotes that the running time exceeded 48 hours and no prediction could be made. As Table 1 shows, for the given t, n, q and g, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods under all tested values of l and d. When l and d are larger, FMotif in some cases runs for more than 48 hours without producing a prediction. The running times of PairMotifChIP and MEME-ChIP are relatively stable as l and d increase. The running time of the APMS method increases with l and d but remains on the order of seconds; it is faster than PairMotifChIP and faster still than MEME-ChIP.
(2) The second group of simulated data sets is used for validation tests on data with different die body signal strengths. In the DNA sequence large data set, the number of DNA sequences is t = 3000, each DNA sequence has n = 200 characters, and the implantation die body is (l, d) = (15, 5). In the tests on the second group, the APMS, FMotif, PairMotifChIP and MEME-ChIP methods are compared under different values of the implantation die body (l, d) search proportion q and the conservation parameter g. The die body signal strength depends on q and g: when q is small and g is large, the signal is weak; when q is large and g is small, the signal is strong.
Table 2. Comparison results on the second group of simulated data sets
In Table 2, time denotes the running time, s denotes seconds, m denotes minutes, h denotes hours, and N denotes that the running time exceeded 48 hours and no prediction could be made. As Table 2 shows, for the given t, n, l and d, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods under all tested values of q and g. When the die body signal is weaker, FMotif in some cases runs for more than 48 hours without producing a prediction. The running times of APMS, PairMotifChIP and MEME-ChIP are relatively stable; APMS is faster than PairMotifChIP and faster still than MEME-ChIP.
(3) The third group of simulated data sets is used for validation tests on DNA sequence large data sets of different scales. Each DNA sequence has n = 200 characters, the implantation die body is (l, d) = (15, 5), and in the tests on the third group the implantation die body (l, d) search proportion is q = 0.5 and the conservation parameter is g = 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are then compared under different values of the number of DNA sequences t.
Table 3. Comparison results on the third group of simulated data sets
In Table 3, time denotes the running time, s denotes seconds, m denotes minutes, h denotes hours, and N denotes that the running time exceeded 48 hours and no prediction could be made. As Table 3 shows, for the given n, q, g, l and d, the APMS method runs faster than the PairMotifChIP and MEME-ChIP methods under all tested values of t. When the DNA sequence large data set is larger, MEME-ChIP in some cases runs for more than 48 hours without producing a prediction, and the running time of PairMotifChIP grows at a higher rate than that of the APMS method. Because FMotif limits the number of DNA sequences it can process to at most 3000, FMotif does not take part in the comparison on the third group of data sets.
Tables 1, 2 and 3 show that in all cases the APMS method completes the prediction of the implantation die body (l, d) in the shortest time, orders of magnitude faster than the FMotif, PairMotifChIP and MEME-ChIP methods. For all methods, the value of the performance coefficient mPC is 1, which means that each of them finds the implantation die body (l, d) exactly. The main reason is that the three groups of simulated data sets contain ample die body information, so the implantation die body (l, d) can still be found exactly even when the die body signal is very weak.
Refer to Fig. 4, which is a schematic diagram of the comparison of the APMS, PairMotifChIP and MEME-ChIP methods provided by an embodiment of the present invention on simulated data with different numbers of DNA sequences. As the figure shows, the running time of the APMS method grows approximately linearly with the number of DNA sequences, the running time of PairMotifChIP grows approximately quadratically with the number of DNA sequences, and MEME-ChIP already runs for more than 48 hours without producing a prediction when the number of DNA sequences is 12000.
For the real data, this embodiment uses the mouse embryonic stem cell (mESC) ChIP-seq data, which is the data most widely used to verify the effectiveness of die body search methods. The mESC data includes 12 data sets (c-Myc, CTCF, Esrrb, Klf4, Nanog, n-Myc, Oct4, Smad1, Sox2, STAT3, Tcfcp2l1, Zfx), each named after the ChIP-ed transcription factor. When the APMS method searches for die bodies, the same implantation die body (l, d) search parameters are used for all 12 data sets: implantation die body (l, d) = (13, 4), implantation die body (l, d) search proportion q = 0.3, and conservation parameter g = 0.5. For each data set, the first 3000 DNA sequences are taken as the input of the APMS method.
Refer to Fig. 5, which is a schematic diagram of the experimental results on real data of the method, provided by an embodiment of the present invention, for efficiently searching for implantation die bodies in DNA sequence large data sets. For each data set, the figure shows the number of DNA sequences it contains, the running time, and the published die body and the predicted die body in sequence logo form; in each pair of sequence logos the published die body is on top and the predicted die body is below. Comparing the predicted die body with the published die body for each data set shows that the APMS method finds a predicted die body similar to the published die body on all 12 data sets, and that the running time on every data set is within 6 minutes.
It can be seen that the APMS method can be used to process real DNA sequence large data sets efficiently and effectively.
In summary, the APMS method processes DNA sequence large data sets efficiently and effectively on both simulated and real data sets. It not only finds the implantation die body (l, d) or the true die body successfully, but also runs much faster than existing implantation die body (l, d) search methods. On the simulated data sets, the running time of the APMS method grows linearly with the scale of the DNA sequence data set.
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or replacements can be made without departing from the concept of the present invention, and all of them shall be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A DNA data set implantation die body search method, characterized by comprising:
Obtaining a DNA sequence large data set and obtaining implantation die body search parameters of the DNA sequence large data set;
Obtaining a first k-mer set according to the DNA sequence large data set and the implantation die body search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;
Determining the implantation die body from the second l-mer set according to a first score model.
2. The method according to claim 1, characterized in that obtaining the first k-mer set according to the DNA sequence large data set and the implantation die body search parameters comprises:
Obtaining a length k, and obtaining several k-mers from the DNA sequence large data set according to the length k;
Obtaining a first threshold, and obtaining the first k-mer set according to the first threshold and the k-mers.
3. The method according to claim 2, characterized in that obtaining the length k comprises:
Obtaining a first desired value according to the DNA sequence large data set;
Obtaining a second desired value according to the DNA sequence large data set and the implantation die body search parameters;
Obtaining the length k according to the first desired value and the second desired value.
4. The method according to claim 3, characterized in that obtaining the first threshold comprises:
Obtaining the number of DNA sequences from the DNA sequence large data set;
Obtaining the first threshold according to the second desired value and the number of DNA sequences.
5. The method according to claim 4, characterized in that obtaining the first l-mer set according to the first k-mer set comprises:
obtaining k-mers from the first k-mer set;
performing extension processing on each of the k-mers in the large DNA sequence data set to obtain an extended first k-mer set;
performing de-redundancy processing on the extended first k-mer set according to a second score model to obtain an extended second k-mer set;
performing truncation processing on the extended second k-mer set to obtain first l-mers;
obtaining the first l-mer set from the first l-mers.
6. The method according to claim 5, characterized in that performing truncation processing on the extended second k-mer set to obtain the first l-mers comprises:
obtaining aligned sequences according to the extended second k-mer set;
performing truncation processing on the aligned sequences according to preset rules to obtain the first l-mers.
7. The method according to claim 6, characterized in that obtaining the second l-mer set according to the first l-mer set comprises:
constructing binomial trees for the first l-mers in the first l-mer set;
computing a score for every node of the constructed binomial trees according to the first score model, and taking the highest-scoring node as a second l-mer;
performing de-redundancy processing on the first k-mer set according to the second l-mer to obtain a second k-mer set;
processing the first l-mer set according to the second k-mer set to obtain the second l-mer set.
8. The method according to claim 7, characterized in that performing de-redundancy processing on the first k-mer set according to the second l-mer to obtain the second k-mer set comprises:
obtaining a fourth l-mer from the large DNA sequence data set;
obtaining a third expected value between the k-mers of the second l-mer and the k-mers of the fourth l-mer;
judging, according to the third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance d between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer in the first k-mer set is redundant and is deleted from the first k-mer set to obtain the second k-mer set; otherwise the k-mer is retained in the first k-mer set to obtain the second k-mer set.
9. An implanted motif search device for DNA data sets, characterized in that the device comprises:
a data acquisition module, configured to obtain the large DNA sequence data set and to obtain the implanted motif search parameters of the large DNA sequence data set;
a data processing module, configured to obtain the first k-mer set according to the large DNA sequence data set and the implanted motif search parameters, to obtain the first l-mer set according to the first k-mer set, and to obtain the second l-mer set according to the first l-mer set;
a data determination module, configured to determine the implanted motif from the second l-mer set according to the first score model.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
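The k-mer filtering of claim 2 and the Hamming-distance redundancy test of claim 8 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`first_kmer_set`, `remove_redundant`) and the parameters `threshold` and `max_dist` are hypothetical. The claims do not state what the first threshold is compared against, so the sketch assumes it is the number of sequences that contain a given k-mer. The third expected value is assumed to be computed elsewhere and passed in as `max_dist`.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def kmers(seq: str, k: int):
    """Yield every length-k substring (k-mer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def first_kmer_set(sequences, k, threshold):
    """Claim 2 sketch: keep k-mers found in at least `threshold` sequences.

    ASSUMPTION: the claims do not define how the first threshold is applied;
    counting the sequences that contain a k-mer is one plausible reading.
    """
    support = defaultdict(set)  # k-mer -> indices of sequences containing it
    for idx, seq in enumerate(sequences):
        for kmer in kmers(seq, k):
            support[kmer].add(idx)
    return {kmer for kmer, seqs in support.items() if len(seqs) >= threshold}


def remove_redundant(kmer_set, l_mer, k, max_dist):
    """Claim 8 sketch: drop k-mers within `max_dist` mismatches of any k-mer of `l_mer`.

    ASSUMPTION: `max_dist` stands in for the third expected value.
    """
    reference = set(kmers(l_mer, k))
    return {
        kmer for kmer in kmer_set
        if all(hamming(kmer, ref) > max_dist for ref in reference)
    }
```

For example, with the sequences `"ACGTACGT"`, `"TTACGTAA"` and `"GGACGTCC"`, `k = 4` and a threshold of 3, only `ACGT` occurs in every sequence and is kept. Calling `remove_redundant` on a candidate set with the l-mer `"ACGTA"` and `max_dist = 1` then discards every candidate that lies within one mismatch of `ACGT` or `CGTA`.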
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910181475.8A CN110059228B (en) | 2019-03-11 | 2019-03-11 | DNA data set implantation motif searching method and device and storage medium thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910181475.8A CN110059228B (en) | 2019-03-11 | 2019-03-11 | DNA data set implantation motif searching method and device and storage medium thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059228A (en) | 2019-07-26 |
CN110059228B (en) | 2021-11-30 |
Family
ID=67316070
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910181475.8A (Active, CN110059228B (en)) | 2019-03-11 | 2019-03-11 | DNA data set implantation motif searching method and device and storage medium thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059228B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102651030A (en) * | 2012-04-09 | 2012-08-29 | 华中科技大学 | Social network association searching method based on graphics processing unit (GPU) multiple sequence alignment algorithm |
CN103425900A (en) * | 2012-05-21 | 2013-12-04 | 上海聚类生物科技有限公司 | Statistical-significance-based system capable of quickly identifying genome transcription factor binding sites |
CN103514381A (en) * | 2013-07-22 | 2014-01-15 | 湖南大学 | Protein biological network motif identification method integrating topological attributes and functions |
CN103995988A (en) * | 2014-05-30 | 2014-08-20 | 周家锐 | High-throughput DNA sequencing mass fraction lossless compression system and method |
US20170293612A1 (en) * | 2014-09-26 | 2017-10-12 | British Telecommunications Public Limited Company | Efficient pattern matching |
US20180253536A1 (en) * | 2017-03-01 | 2018-09-06 | Seven Bridges Genomics, Inc. | Watermarking for data security in bioinformatic sequence analysis |
CN107729762A (en) * | 2017-08-31 | 2018-02-23 | 徐州医科大学 | A kind of DNA based on difference secret protection model closes frequent motif discovery method |
CN108664807A (en) * | 2018-04-03 | 2018-10-16 | 徐州医科大学 | Method based on the difference privacy DNA motif discoveries that stochastical sampling and die body are compressed |
Non-Patent Citations (2)
Title |
---|
Faisal Bin Ashraf et al.: "RPPMD (Randomly projected possible motif discovery): An efficient bucketing method for finding DNA planted Motif", 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE) * |
张懿璞 (Zhang Yipu): "A new cluster-refinement algorithm for DNA motif discovery" (一种新的DNA模体发现聚类求精算法), Journal of Xidian University (《西安电子科技大学学报》) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111933215A (en) * | 2020-06-08 | 2020-11-13 | 西安电子科技大学 | Transcription factor binding site searching method, system, storage medium and terminal |
CN111933215B (en) * | 2020-06-08 | 2024-04-05 | 西安电子科技大学 | Transcription factor binding site searching method, system, storage medium and terminal |
Also Published As
Publication number | Publication date |
---|---|
CN110059228B (en) | 2021-11-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||