CN106339609A - Heuristic mining method of optimal comparing sequence mode of free interval constraint - Google Patents
- Publication number
- CN106339609A (CN201610831506.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- spacing constraint
- pattern
- candidate
- gene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a heuristic mining method for optimal contrast sequence patterns with free spacing constraints. The method comprises: step S1, inputting a positive example sequence set, a negative example sequence set and the desired number of contrast sequence patterns to mine; step S2, randomly generating a preset number of genotype candidate pattern codings; step S3, obtaining the candidate contrast sequence pattern corresponding to each genotype candidate pattern coding; step S4, calculating the contrast of each candidate contrast sequence pattern; step S5, judging whether the current genotype candidate pattern codings meet the method termination condition and, if so, taking the k candidate contrast sequence patterns with the best contrast as the final mining result, otherwise executing step S6; step S6, selecting among the current genotype candidate pattern codings; and step S7, forming new genotype candidate pattern codings and returning to step S3. With this method, results are not lost through improper parameter settings when the user has no prior knowledge.
Description
Technical field
The present invention relates to the technical field of data mining, and in particular to a heuristic mining method for optimal contrast sequence patterns with free spacing constraints.
Background technology
Sequential pattern mining is an important data mining task with a wide range of applications. For example, power companies analyze historical electricity consumption data to improve the accuracy of load forecasting. Likewise, health and disease control departments analyze spatio-temporal surveillance data on the spread of infectious diseases, hoping to discover how outbreaks cluster in time and place and thereby provide a reference for prevention and control. At the same time, sequential pattern mining has attracted the attention of many researchers, and different types of sequence patterns have been proposed in succession, such as frequent sequential patterns, closed sequential patterns, periodic sequential patterns and partial-order sequential patterns.
The goal of contrast sequence pattern mining is to find contrast sequence patterns that are frequent in the positive example sequence set (the support of the sequence pattern is greater than a specified threshold) and infrequent in the negative example sequence set (the support of the sequence pattern is less than a specified threshold). Contrast sequence patterns capture the differences between sequence sets of different classes and characterize the samples of each class, so they are applicable to sequence data analysis in many fields. For example, in medicine, analyzing the DNA sequences of positive and negative tumors with contrast sequence patterns can improve the precision of clinical diagnosis; in commerce, contrasting the shopping patterns of customers in different age groups can make promotional campaigns better targeted.
The concept of a spacing constraint is widely used in sequence mining; its purpose is to make sequence pattern matching more flexible and general. A spacing constraint is an interval determined by two non-negative integers, which give the minimum and maximum number of elements allowed between two adjacent elements of a sequence pattern when it is matched in a sequence. In existing contrast sequence pattern mining methods, the spacing constraint must be set by the user. Practice shows that without sufficient prior knowledge it is difficult for the user to set an appropriate spacing constraint. When the spacing constraint is inappropriate, many useful sequence patterns are lost, while exhaustively trying all possible spacing constraints makes the algorithm run too long to be practical.
Summary of the invention
The problem to be solved by the present invention is that, without sufficient prior knowledge, a user-set spacing constraint may be inappropriate and cause useful sequence patterns to be lost, while exhaustively trying all possible spacing constraints makes the execution time of the algorithm too long.
The present invention is achieved through the following technical solutions:
A heuristic mining method for optimal contrast sequence patterns with free spacing constraints, comprising:
Step S1: input a positive example sequence set, a negative example sequence set and the desired number of contrast sequence patterns to mine.
Step S2: randomly generate a predetermined number of genotype candidate pattern codings. Each genotype candidate pattern coding comprises at least one gene of fixed length, and each gene comprises a head and a tail. The head includes randomly generated spacing constraints, and the tail is drawn from the character set of the input positive and negative example sequence sets. The predetermined number is greater than the desired number of contrast sequence patterns.
Step S3: decode each genotype candidate pattern coding to obtain its corresponding candidate contrast sequence pattern.
Step S4: calculate the contrast of each candidate contrast sequence pattern using the input positive and negative example sequence sets.
Step S5: judge whether the current genotype candidate pattern codings meet the method termination condition. If so, the k candidate contrast sequence patterns with the best contrast are the final mining result; otherwise execute step S6. Here k is the desired number of contrast sequence patterns.
Step S6: select among the current genotype candidate pattern codings by roulette wheel selection, according to the contrast of each candidate contrast sequence pattern.
Step S7: apply predefined genetic operations to some of the selected genotype candidate pattern codings to form new genotype candidate pattern codings, and go to step S3.
The contrast sequence pattern mining method provided by the present invention does not require the user to preset a spacing constraint; instead it automatically computes the most suitable spacing constraints for each candidate pattern, which avoids losing useful sequence patterns. To avoid the situation in which a full search, with its high computational cost, cannot produce a solution within a reasonable running time, the present invention adopts evolutionary computation. Evolutionary computation is a class of heuristic search and optimization algorithms that optimizes candidate solutions mainly through three operations: selection, evaluation and variation. Evolutionary computation is robust and adapts reasonably well to the various data sets to be mined. To apply the heuristic search mechanism of evolutionary computation to contrast sequence pattern mining, the present invention proposes a new genotype candidate pattern coding. Spacing constraints are generated randomly, both between different genes and within each gene, and each spacing constraint is allowed to differ from the others. The spacing constraints can therefore be updated continually during evolution, so that the contrast of the candidate contrast sequence patterns evolves toward larger values. This overcomes both the problem that a single user-set spacing constraint may fail to find the optimal solution and the problem that exhaustively trying all possible spacing constraints leads to unreasonable running times. Further, the present invention selects among the current genotype candidate pattern codings by roulette wheel selection, so that candidate contrast sequence patterns with larger contrast are more likely to be selected and the result obtained is more accurate.
Optionally, a finite symbol set $\Sigma$ is defined, and any symbol in $\Sigma$ is called an item. An ordered sequence of items from $\Sigma$ is called a sequence, written $s = \langle e_1, e_2, \ldots, e_i, \ldots, e_m \rangle$, where $e_i$ ($1 \le i \le m$) is called an element. $|s|$ denotes the length of sequence $s$, that is, the number of elements in $s$. For the $i$-th element $s_{[i]}$ and the $j$-th element $s_{[j]}$ of sequence $s$ ($1 \le i \le j \le |s|$), $\mathrm{gap}(s,i,j)$ denotes the number of elements between $s_{[i]}$ and $s_{[j]}$, that is, $\mathrm{gap}(s,i,j) = j - i - 1$.
Optionally, for any two sequences $s'$ and $s''$ that satisfy condition 1 and condition 2, $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ is called an occurrence of sequence $s''$ in sequence $s'$, denoted $s'' \subseteq s'$. Condition 1: $|s'| \ge |s''|$, that is, the length of $s'$ is not less than the length of $s''$. Condition 2: there exists a set of integers $1 \le k_1 \le k_2 \le \ldots \le k_{|s''|} \le |s'|$ such that $s'_{[k_i]} = s''_{[i]}$ holds for all $1 \le i \le |s''|$.
Optionally, a spacing constraint $\gamma$ is an interval $[\gamma.min, \gamma.max]$ with $\gamma.min \le \gamma.max$, where $\gamma.min$ is the minimum number of intervening elements allowed by the constraint and $\gamma.max$ is the maximum; both are greater than or equal to 0. An ordered sequence of several spacing constraints, which may be identical or different, is called a spacing constraint sequence $\gamma$, of the form $\gamma = \langle \gamma_1, \gamma_2, \ldots, \gamma_h \rangle$, where $h$ is the number of spacing constraints. For each sequence $p$, the length of its spacing constraint sequence is $\|\gamma\| = \|p\| - 1$.
Optionally, under a spacing constraint sequence $\gamma$, the support of sequence $p$ in sequence set $d$ is denoted $\mathrm{sup}((p,\gamma),d)$:
$$\mathrm{sup}((p,\gamma),d) = \frac{|\{s \in d \mid p \subseteq_{\gamma} s\}|}{|d|}$$
where $s$ is a sequence, $|d|$ is the number of sequences in $d$, and $p \subseteq_{\gamma} s$ means that $p$ is a subsequence of $s$ satisfying the spacing constraint sequence $\gamma$. Under $\gamma$, the contrast of $p$ between the positive example sequence set $d^+$ and the negative example sequence set $d^-$ is denoted $\mathrm{cr}((p,\gamma),d^+,d^-)$:
$$\mathrm{cr}((p,\gamma),d^+,d^-) = \mathrm{sup}((p,\gamma),d^+) - \mathrm{sup}((p,\gamma),d^-)$$
Optionally, the head consists only of randomly generated spacing constraints, or the head consists of randomly generated spacing constraints together with characters from the character set of the input positive and negative example sequence sets.
Optionally, the length of the head is preset and the length of the tail is obtained from t = h × (n − 1) + 1, where t is the length of the tail, h is the length of the head, and n is the maximum number of operands required by a spacing constraint.
Optionally, decoding each genotype candidate pattern coding to obtain its corresponding candidate contrast sequence pattern includes: building the binary tree corresponding to each gene from top to bottom and from left to right, with the nodes of the binary tree corresponding in order to the elements of the gene, and finishing the tree when all of its leaf nodes are characters; taking a new root node as the parent of the root of the first binary tree and the root of the second binary tree to form the first updated binary tree, taking a new root node as the parent of the root of the first updated binary tree and the root of the third binary tree to form the second updated binary tree, and so on, until a new root node is taken as the parent of the root of the (x−2)-th updated binary tree and the root of the x-th binary tree to form the expression tree, where each new root node is a randomly generated spacing constraint and x is the number of genes in the genotype candidate pattern coding; and traversing the expression tree in in-order to produce the candidate contrast sequence pattern.
The present invention uses gene expression programming, a recent development in evolutionary computation, to mine optimal contrast sequence patterns with free spacing constraints. Compared with earlier genetic algorithms and genetic programming, gene expression programming draws on the idea of gene expression: even when individuals are encoded as genotype candidate pattern codings of the same length, decoding can produce expressions of different lengths, which in the present invention are candidate contrast sequence patterns with free spacing constraints.
Optionally, the predefined genetic operations include mutation, insertion and recombination.
The mutation operation includes: mutating a spacing constraint in a gene into a character or into another spacing constraint; and mutating a character in a gene into a spacing constraint or into another character.
The insertion operation includes: randomly selecting y1 consecutive elements in a gene and inserting them at any position in the head other than before its first element, then deleting the last y1 elements of the original head to keep the head length unchanged, where y1 ≥ 1; randomly selecting y2 consecutive elements in a gene that begin with a spacing constraint and inserting them before the first element of the head, then deleting the last y2 elements of the original head to keep the head length unchanged, where y2 ≥ 1; and moving any gene other than the first in the genotype candidate pattern coding to a position before the first gene.
The recombination operation includes: exchanging one element at the same position in two genotype candidate pattern codings; exchanging at least two elements at the same positions in two genotype candidate pattern codings; and exchanging the genes at the same position in two genotype candidate pattern codings.
Optionally, the method termination condition is the execution time of the method, the number of executions of the method, or the stability of the obtained results.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The heuristic mining method for optimal contrast sequence patterns with free spacing constraints provided by the present invention automatically obtains the sequence patterns with the best contrast, given only the desired number of contrast sequence patterns to mine. The user does not need to set spacing constraints; the most suitable spacing constraints are computed automatically for the candidate contrast sequence patterns, which avoids losing results through incorrect parameter settings when the user has no prior knowledge. At the same time, by relying on the heuristic search mechanism of evolutionary computation, the method overcomes the long running time and impracticality of exhaustive search, and therefore helps promote the practical application of contrast sequence pattern mining.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the embodiments of the present invention and form a part of this application; they do not limit the embodiments of the present invention. In the drawings:
Fig. 1 is an input-output schematic diagram of the heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to an embodiment of the present invention;
Fig. 2 is a flowchart of the heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a genotype candidate pattern coding according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of obtaining a candidate contrast sequence pattern according to an embodiment of the present invention;
Figs. 5a to 5d are schematic diagrams of the mutation operation according to an embodiment of the present invention;
Figs. 6a to 6c are schematic diagrams of the insertion operation according to an embodiment of the present invention;
Figs. 7a to 7c are schematic diagrams of the recombination operation according to an embodiment of the present invention.
Detailed description of the embodiments
As described in the background, existing contrast sequence pattern mining methods require the user to preset the spacing constraint. Without sufficient prior knowledge, however, it is difficult for the user to set a suitable spacing constraint, and useful patterns may then not be found. The present invention provides a heuristic mining method for optimal contrast sequence patterns with free spacing constraints that only requires the user to give the desired number k of contrast sequence patterns to mine. The user does not need to preset spacing constraints, because the method automatically computes the most suitable spacing constraints for the candidate contrast sequence patterns. At the same time, by relying on the heuristic search mechanism of evolutionary computation, the method overcomes the long running time and impracticality of exhaustive search.
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to an embodiment and the accompanying drawings. The exemplary embodiment and its description serve only to explain the present invention and do not limit it.
Embodiment
First, the definitions relevant to contrast sequence pattern mining are given. Given a finite symbol set $\Sigma$, which we call the alphabet, any symbol in the alphabet is called an item. An ordered sequence of items from $\Sigma$ is called a sequence, written $s = \langle e_1, e_2, \ldots, e_m \rangle$, where each $e_i \in \Sigma$ ($1 \le i \le m$) is called an element. We use $|s|$ to denote the length of sequence $s$, that is, the number of elements in $s$, and $s_{[i]}$ to denote the $i$-th element of $s$ ($1 \le i \le |s|$). For two elements $s_{[i]}$ and $s_{[j]}$ of $s$ ($1 \le i \le j \le |s|$), $\mathrm{gap}(s,i,j)$ denotes the number of elements between them in $s$, that is, $\mathrm{gap}(s,i,j) = j - i - 1$.
For any two sequences $s'$ and $s''$ that satisfy the following conditions:
Condition 1: $|s'| \ge |s''|$, that is, the length of $s'$ is not less than the length of $s''$;
Condition 2: there exists a set of integers $1 \le k_1 \le k_2 \le \ldots \le k_{|s''|} \le |s'|$ such that $s'_{[k_i]} = s''_{[i]}$ holds for all $1 \le i \le |s''|$;
we call $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ an occurrence of $s''$ in $s'$. We also say that $s'$ is a supersequence of $s''$, or that $s''$ is a subsequence of $s'$, denoted $s'' \subseteq s'$.
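The gap and occurrence definitions above can be made concrete with a small Python sketch. The function names `gap` and `occurrences` are illustrative and do not come from the patent. Positions are 1-based as in the text, and the sketch assumes strictly increasing positions, so that each element of the longer sequence is used at most once.

```python
from typing import Iterator, Sequence, Tuple


def gap(s: Sequence[str], i: int, j: int) -> int:
    """Number of elements of s lying between positions i and j (1-based)."""
    return j - i - 1


def occurrences(sub: Sequence[str], sup: Sequence[str],
                start: int = 1) -> Iterator[Tuple[int, ...]]:
    """Yield every index tuple <k_1, ..., k_|sub|> with sup[k_i] == sub[i].

    Assumption: positions are strictly increasing (k_1 < k_2 < ...).
    """
    if not sub:
        yield ()
        return
    for k in range(start, len(sup) + 1):
        if sup[k - 1] == sub[0]:
            # Match the rest of `sub` to the right of position k.
            for rest in occurrences(sub[1:], sup, k + 1):
                yield (k,) + rest


if __name__ == "__main__":
    for occ in occurrences("ac", "abcac"):
        print(occ, "gap =", gap("abcac", occ[0], occ[1]))
```

Enumerating all occurrences is exponential in the worst case; it is shown here only to illustrate the definition, not as a practical matching routine.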
A spacing constraint $\gamma$ is defined as an interval $[\gamma.min, \gamma.max]$ with $\gamma.min \le \gamma.max$, where $\gamma.min$ ($\gamma.min \ge 0$) and $\gamma.max$ ($\gamma.max \ge 0$) are the minimum and maximum number of intervening elements allowed by the constraint. An ordered sequence of several spacing constraints, which may be identical or different, is called a spacing constraint sequence $\gamma$, of the form $\gamma = \langle \gamma_1, \gamma_2, \ldots, \gamma_h \rangle$, where $h$ is the number of spacing constraints; for each sequence $p$, the length of its spacing constraint sequence is $\|\gamma\| = \|p\| - 1$. Given two sequences $s'$ and $s''$, let $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ be an occurrence of $s''$ in $s'$. If $\gamma_i.min \le \mathrm{gap}(s', k_i, k_{i+1}) \le \gamma_i.max$ holds for all $1 \le i \le |s''| - 1$, we say that $s''$ is a subsequence of $s'$ satisfying the spacing constraint sequence $\gamma$, written $s'' \subseteq_{\gamma} s'$.
Given a set of sequences $d$, also called a sequence set, the support of sequence $p$ in $d$ under spacing constraint sequence $\gamma$ is denoted $\mathrm{sup}((p,\gamma),d)$. It is the ratio of the number of sequences in $d$ that are supersequences of $p$ under $\gamma$ to the number of sequences in $d$, that is:
$$\mathrm{sup}((p,\gamma),d) = \frac{|\{s \in d \mid p \subseteq_{\gamma} s\}|}{|d|}$$
where $s$ is a sequence, $|d|$ is the number of sequences in $d$, and $p \subseteq_{\gamma} s$ means that $p$ is a subsequence of $s$ satisfying the spacing constraint sequence $\gamma$.
Under a spacing constraint sequence $\gamma$, given two sequence sets, the positive example sequence set $d^+$ and the negative example sequence set $d^-$, the contrast of sequence $p$ between $d^+$ and $d^-$ is denoted $\mathrm{cr}((p,\gamma),d^+,d^-)$. It is the difference between the supports of the sequence in the two sequence sets, that is:
$$\mathrm{cr}((p,\gamma),d^+,d^-) = \mathrm{sup}((p,\gamma),d^+) - \mathrm{sup}((p,\gamma),d^-)$$
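As a minimal sketch of how support and contrast could be computed from these definitions, the following Python code checks spacing-constrained matches and then takes the difference of the two supports. The names `matches`, `support` and `contrast`, the tuple representation of a spacing constraint, and the toy sequence sets are assumptions made for illustration; the patent does not prescribe a matching algorithm.

```python
from typing import List, Sequence, Tuple

Gap = Tuple[int, int]  # (min, max) number of elements allowed in between


def matches(p: Sequence[str], gammas: Sequence[Gap], s: Sequence[str]) -> bool:
    """True if p is a subsequence of s that satisfies the constraints."""
    if len(gammas) != len(p) - 1:
        raise ValueError("need exactly len(p) - 1 spacing constraints")
    # 0-based positions in s where the prefix of p matched so far can end.
    ends = {k for k, e in enumerate(s) if e == p[0]}
    for item, (lo, hi) in zip(p[1:], gammas):
        ends = {
            j
            for k in ends
            # gap = j - k - 1 must lie in [lo, hi]
            for j in range(k + 1 + lo, min(k + 1 + hi, len(s) - 1) + 1)
            if s[j] == item
        }
        if not ends:
            return False
    return bool(ends)


def support(p, gammas, d: List[Sequence[str]]) -> float:
    """Fraction of sequences in d that contain p under the constraints."""
    return sum(matches(p, gammas, s) for s in d) / len(d)


def contrast(p, gammas, d_pos, d_neg) -> float:
    """Support in the positive set minus support in the negative set."""
    return support(p, gammas, d_pos) - support(p, gammas, d_neg)


if __name__ == "__main__":
    d_pos = ["abcc", "acb", "bacc"]   # toy data, not from the patent
    d_neg = ["abc", "cab", "ccc"]
    print(contrast("ac", [(0, 1)], d_pos, d_neg))
```

Tracking the set of reachable end positions avoids enumerating every occurrence, which keeps the check polynomial in the length of the sequence.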
Fig. 1 is an input-output schematic diagram of the heuristic mining method for optimal contrast sequence patterns with free spacing constraints of this embodiment. Given a positive example sequence set $d^+$, a negative example sequence set $d^-$ and a number k of sequence patterns, the goal is to find the k contrast sequence patterns with spacing constraints whose contrast between $d^+$ and $d^-$ is best. Fig. 2 is a flowchart of the method of this embodiment. The method includes:
Step S1: input the positive example sequence set, the negative example sequence set and the desired number of contrast sequence patterns to mine.
Step S2: randomly generate a predetermined number of genotype candidate pattern codings. Specifically, the predetermined number is greater than the desired number of contrast sequence patterns; each genotype candidate pattern coding includes at least one gene of fixed length, and each gene includes a head and a tail. The head includes randomly generated spacing constraints: it may consist only of randomly generated spacing constraints, or of randomly generated spacing constraints together with characters from the character set of the input positive and negative example sequence sets. The tail is drawn from the character set of the input positive and negative example sequence sets. Further, the length of the head is preset according to actual needs, and the length of the tail is obtained from t = h × (n − 1) + 1, where t is the length of the tail, h is the length of the head, and n is the maximum number of operands required by a spacing constraint.
Taking a head length of 3 and a maximum of 2 operands per spacing constraint as an example, Fig. 3 is a schematic diagram of a genotype candidate pattern coding of this embodiment. The coding includes two genes, gene 1 and gene 2. The head of gene 1 consists only of randomly generated spacing constraints, namely [0,3], [1,2] and [2,5]; the tail of gene 1 is drawn from the character set of the input positive and negative example sequence sets and consists of the characters a, c, c and g. The head of gene 2 consists of randomly generated spacing constraints together with a character from the character set of the input positive and negative example sequence sets, namely the spacing constraints [3,4] and [2,4] and the character g; the tail of gene 2 is drawn from the same character set and consists of the characters t, a, t and t.
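The coding of step S2 can be sketched in Python as follows. This is only an illustration under stated assumptions: a spacing constraint is a `(min, max)` tuple, a character is a one-letter string, `max_gap` is a hypothetical upper bound on the randomly drawn constraints, and `p_gap` is a hypothetical probability that a head position holds a spacing constraint rather than a character. None of these names or parameters appear in the patent.

```python
import random
from typing import List, Tuple, Union

Gap = Tuple[int, int]
Element = Union[Gap, str]


def tail_length(h: int, n: int = 2) -> int:
    """t = h * (n - 1) + 1; with h = 3 and n = 2 the tail has 4 elements."""
    return h * (n - 1) + 1


def random_gap(max_gap: int, rng: random.Random) -> Gap:
    """Draw a spacing constraint [lo, hi] with 0 <= lo <= hi <= max_gap."""
    lo = rng.randint(0, max_gap)
    return (lo, rng.randint(lo, max_gap))


def random_gene(alphabet: str, h: int, max_gap: int,
                rng: random.Random, p_gap: float = 0.8) -> List[Element]:
    """Head: spacing constraints or characters. Tail: characters only."""
    head: List[Element] = [
        random_gap(max_gap, rng) if rng.random() < p_gap else rng.choice(alphabet)
        for _ in range(h)
    ]
    tail: List[Element] = [rng.choice(alphabet) for _ in range(tail_length(h))]
    return head + tail


def random_coding(alphabet: str, n_genes: int, h: int, max_gap: int,
                  rng: random.Random) -> List[List[Element]]:
    """A genotype candidate pattern coding is a list of fixed-length genes."""
    return [random_gene(alphabet, h, max_gap, rng) for _ in range(n_genes)]


if __name__ == "__main__":
    rng = random.Random(0)
    for gene in random_coding("acgt", n_genes=2, h=3, max_gap=5, rng=rng):
        print(gene)
```

With a head length of 3 and two operands per spacing constraint, every gene has 3 + 4 = 7 elements, which matches the two genes described for Fig. 3.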
Step S3: decode each genotype candidate pattern coding to obtain its corresponding candidate contrast sequence pattern. Specifically, the binary tree corresponding to each gene is built from top to bottom and from left to right, with the nodes of the binary tree corresponding in order to the elements of the gene; the tree is complete when all of its leaf nodes are characters. A new root node is then taken as the parent of the root of the first binary tree and the root of the second binary tree to form the first updated binary tree; a new root node is taken as the parent of the root of the first updated binary tree and the root of the third binary tree to form the second updated binary tree; and so on, until a new root node is taken as the parent of the root of the (x−2)-th updated binary tree and the root of the x-th binary tree to form the expression tree. Each new root node is a randomly generated spacing constraint, and x is the number of genes in the genotype candidate pattern coding. Finally, the expression tree is traversed in in-order to produce the candidate contrast sequence pattern.
Here, the first binary tree is the binary tree corresponding to the first gene in the genotype candidate pattern coding, the second binary tree is the binary tree corresponding to the second gene, and so on, and the x-th binary tree is the binary tree corresponding to the x-th gene. As a special case, if a binary tree has only a root node, the root node may also be a character. For a genotype candidate pattern coding with only one gene, the binary tree corresponding to that gene is the expression tree.
Fig. 4 is a schematic diagram of decoding the two-gene genotype candidate pattern coding shown in Fig. 3 to obtain a candidate contrast sequence pattern. Take building the binary tree for gene 2 as an example. Its root node corresponds to the first element of gene 2, which is a spacing constraint, so a second layer is built. The first node of the second layer corresponds to the second element of gene 2, and the second node of the second layer corresponds to the third element of gene 2. The second element of gene 2 is a spacing constraint, so a third layer is built. The first node of the third layer corresponds to the fourth element of gene 2, and the second node of the third layer corresponds to the fifth element of gene 2. The third, fourth and fifth elements of gene 2 are characters, and a character node has no child nodes; all leaf nodes of the binary tree are therefore characters and the tree is complete.
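The decoding of step S3 can be sketched as below, using the two genes described for Fig. 3. The sketch assumes that a spacing constraint is a `(min, max)` tuple with exactly two children and that a character is a leaf. The linking constraint `(0, 2)` that joins the two gene trees is a made-up value, since in the method it is generated randomly; the function names are likewise illustrative.

```python
from collections import deque
from typing import List, Tuple


def gene_to_tree(gene: list) -> list:
    """Build the gene's binary tree level by level, left to right.

    A node is [element, left, right]; characters are leaves.
    """
    it = iter(gene)
    root = [next(it), None, None]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if isinstance(node[0], tuple):       # spacing constraint: two children
            node[1] = [next(it), None, None]
            node[2] = [next(it), None, None]
            queue.extend((node[1], node[2]))
    return root                               # unused tail elements are ignored


def in_order(node) -> list:
    """Left subtree, node, right subtree."""
    if node is None:
        return []
    return in_order(node[1]) + [node[0]] + in_order(node[2])


def decode(coding: List[list], links: List[Tuple[int, int]]):
    """Join the gene trees under linking constraints and read the pattern."""
    tree = gene_to_tree(coding[0])
    for link, gene in zip(links, coding[1:]):
        tree = [link, tree, gene_to_tree(gene)]
    seq = in_order(tree)
    # In-order traversal alternates character, constraint, character, ...
    return "".join(seq[0::2]), list(seq[1::2])


if __name__ == "__main__":
    gene1 = [(0, 3), (1, 2), (2, 5), "a", "c", "c", "g"]
    gene2 = [(3, 4), (2, 4), "g", "t", "a", "t", "t"]
    print(decode([gene1, gene2], links=[(0, 2)]))
```

For gene 2 the traversal reads t, [2,4], a, [3,4], g, so only five of its seven elements are expressed. This is how codings of equal length can decode to patterns of different lengths.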
Step S4: calculate the contrast of each candidate contrast sequence pattern using the input positive and negative example sequence sets. Following the definition of contrast, the contrast of each candidate contrast sequence pattern is calculated by the formula $\mathrm{cr}((p,\gamma),d^+,d^-) = \mathrm{sup}((p,\gamma),d^+) - \mathrm{sup}((p,\gamma),d^-)$.
Step S5: judge whether the current genotype candidate pattern codings meet the method termination condition. The termination condition can be configured according to actual needs; it may be the execution time of the method, the number of executions of the method, the stability of the obtained results, and so on. For example, if the termination condition is an execution time of 5 minutes, the algorithm stops automatically after running for 5 minutes: the current genotype candidate pattern codings then meet the termination condition, and the k candidate contrast sequence patterns with the best contrast are the final mining result. If the algorithm has run for less than 5 minutes, the current genotype candidate pattern codings do not meet the termination condition and step S6 is executed. Here k is the number of sequence patterns.
Step S6: select among the current genotype candidate pattern codings by roulette wheel selection, according to the contrast of each candidate contrast sequence pattern. Roulette wheel selection is a common random selection method. Each individual's fitness is converted proportionally into a selection probability, and a disk is divided into sectors in proportion to these probabilities; each time the disk is spun, the individual whose sector the pointer rests on when the disk stops is selected. Clearly, the larger an individual's probability, the larger its sector on the disk and the greater its chance of being selected. When roulette wheel selection is applied to the current genotype candidate pattern codings, candidate contrast sequence patterns with larger contrast are more likely to be selected. Specifically, a certain number of genotype candidate pattern codings are randomly chosen each time from all current genotype candidate pattern codings, and the coding with the best contrast is picked from among them; this is repeated until the predetermined number of genotype candidate pattern codings has been picked.
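A fitness-proportional wheel of the kind described above might look like the following Python sketch. It covers only the proportional wheel, not the pick-the-best-of-a-random-subset procedure in the last sentence. Because contrast is a difference of supports and can be negative, the sketch shifts all values so that every weight is positive; that shift, the function name `roulette_select` and the toy values are assumptions, not part of the patent.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roulette_select(population: Sequence[T], fitness: Sequence[float],
                    n: int, rng: random.Random) -> List[T]:
    """Draw n individuals with probability proportional to shifted fitness."""
    low = min(fitness)
    # Assumption: shift so the worst individual keeps a tiny positive sector.
    weights = [f - low + 1e-9 for f in fitness]
    total = sum(weights)
    chosen: List[T] = []
    for _ in range(n):
        r = rng.uniform(0.0, total)      # where the pointer stops
        acc = 0.0
        for individual, w in zip(population, weights):
            acc += w
            if r <= acc:
                chosen.append(individual)
                break
        else:
            # Guard against floating-point round-off at the very end.
            chosen.append(population[-1])
    return chosen


if __name__ == "__main__":
    rng = random.Random(42)
    picks = roulette_select(["A", "B", "C"], [0.9, 0.1, -0.2], 10, rng)
    print(picks)
```

Individuals are drawn with replacement, so a coding with high contrast can be selected several times in one round.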
Step S7: apply predefined genetic operations to some of the selected genotype candidate pattern codings to form new genotype candidate pattern codings, and go to step S3. The genetic operations include mutation, insertion and recombination.
Specifically, the mutation operation includes: mutating a spacing constraint in a gene into a character or into another spacing constraint; and mutating a character in a gene into a spacing constraint or into another character. Mutation can occur at any position in a gene, but the structure of the gene must not change. That is, a character in the tail can only mutate into another character and not into a spacing constraint, whereas an element of the head can mutate into either a character or a spacing constraint. Still taking the genotype candidate pattern coding shown in Fig. 3 as an example, Figs. 5a to 5d are schematic diagrams of the mutation operation in this embodiment. Fig. 5a mutates the first element of the head of gene 1 from a spacing constraint into another spacing constraint; Fig. 5b mutates the second element of the head of gene 1 from a spacing constraint into a character; Fig. 5c mutates the second element of the tail of gene 1 from a character into another character; Fig. 5d mutates the third element of the head of gene 2 from a character into a spacing constraint.
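The structure-preserving rule for mutation can be sketched as follows. The function `mutate`, the 50% chance of producing a spacing constraint in the head, and the bound `max_gap` are illustrative assumptions; the patent fixes only which element types are allowed in the head and in the tail.

```python
import random
from typing import List, Tuple, Union

Element = Union[Tuple[int, int], str]


def mutate(gene: List[Element], h: int, alphabet: str,
           max_gap: int, rng: random.Random) -> List[Element]:
    """Return a copy of `gene` with one randomly chosen element mutated.

    Head positions (index < h) may become a character or a spacing
    constraint; tail positions may only become another character.
    Assumes the alphabet has at least two characters.
    """
    mutated = list(gene)
    pos = rng.randrange(len(mutated))
    if pos < h and rng.random() < 0.5:
        lo = rng.randint(0, max_gap)
        # May occasionally redraw the same constraint as before.
        mutated[pos] = (lo, rng.randint(lo, max_gap))
    else:
        old = mutated[pos]
        mutated[pos] = rng.choice([c for c in alphabet if c != old])
    return mutated


if __name__ == "__main__":
    rng = random.Random(1)
    gene1 = [(0, 3), (1, 2), (2, 5), "a", "c", "c", "g"]
    print(mutate(gene1, h=3, alphabet="acgt", max_gap=5, rng=rng))
```

Keeping the tail free of spacing constraints guarantees that every mutated gene can still be decoded into a complete binary tree.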
The insertion-string operation includes: randomly selecting y1 consecutive elements in a gene and inserting them at any position in the head other than before the first element of the head, then deleting the last y1 elements of the original head so that the head length stays constant, where y1 ≥ 1; randomly selecting y2 consecutive elements in a gene that begin with a spacing constraint and inserting them before the first element of the head, then deleting the last y2 elements of the original head so that the head length stays constant, where y2 ≥ 1; and moving any gene other than the first gene of a genotype candidate pattern coding to the position before the first gene. Still taking the genotype candidate pattern coding shown in Fig. 3 as an example, Fig. 6a to Fig. 6c are schematic diagrams of the insertion-string operation in this embodiment. In Fig. 6a, the fourth and fifth elements of gene 1 are inserted between the first and second elements of gene 1, and the last two elements of the head of gene 1 are deleted; in Fig. 6b, the third and fourth elements of gene 1 are inserted before the first element of gene 1, and the last two elements of the head of gene 1 are deleted; in Fig. 6c, gene 2 is moved before gene 1, so that the original gene 2 becomes the new gene 1 and the original gene 1 becomes the new gene 2.
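The three insertion-string variants can be sketched in Python as follows. This is a minimal sketch under assumptions, not the embodiment's implementation: a gene is a list of characters (`str`) and spacing constraints (`(min, max)` tuples), a coding is a list of genes, and the function names and the explicit `start`, `length`, `target` and `index` arguments are hypothetical (in the method they would be chosen at random).

```python
def insert_in_head(gene, head_len, start, length, target):
    """Copy gene[start:start+length] into the head at index `target`.

    `target` must not be 0 (insertion before the first head element is the
    job of insert_at_root). The head is truncated back to `head_len`.
    """
    if not 1 <= target < head_len:
        raise ValueError("target must lie inside the head, after its first element")
    segment = gene[start:start + length]
    head, tail = gene[:head_len], gene[head_len:]
    new_head = (head[:target] + segment + head[target:])[:head_len]
    return new_head + tail


def insert_at_root(gene, head_len, start, length):
    """Copy a segment that starts with a spacing constraint to the front of the head."""
    if not isinstance(gene[start], tuple):
        raise ValueError("segment must begin with a spacing constraint")
    segment = gene[start:start + length]
    head, tail = gene[:head_len], gene[head_len:]
    return (segment + head)[:head_len] + tail


def move_gene_to_front(coding, index):
    """Move the gene at `index` (any gene except the first) before the first gene."""
    if not 1 <= index < len(coding):
        raise ValueError("index must refer to a gene other than the first")
    return [coding[index]] + coding[:index] + coding[index + 1:]
```

In all three functions the tail is left untouched and the head keeps its length, which is what keeps the gene structure unchanged.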
The recombination operation includes: exchanging the elements at one and the same position of two genotype candidate pattern codings; exchanging at least two elements at the same positions of two genotype candidate pattern codings; and exchanging the genes at the same position of two genotype candidate pattern codings. Taking the genotype candidate pattern coding shown in Fig. 3 as original genotype candidate pattern coding 1 and another genotype candidate pattern coding as original genotype candidate pattern coding 2, Fig. 7a to Fig. 7c are schematic diagrams of the recombination operation in this embodiment. In Fig. 7a, the third element of gene 1, the sixth element of gene 1 and the fourth element of gene 2 are exchanged; in Fig. 7b, the third through sixth elements of gene 1 are exchanged; in Fig. 7c, the whole of gene 2 is exchanged.
New genotype candidate pattern codings are formed by the genetic operations, and steps S3 to S5 are repeated on the new genotype candidate pattern codings until the final mining result is obtained.
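The overall loop of steps S3 to S7 might be organised as in the following Python sketch. Several points are assumptions of the sketch rather than statements of the method: the termination condition is a fixed number of generations, the contrast is shifted by 1 so that roulette-wheel weights are non-negative, and `contrast_of` (decode a coding and score it) and `vary` (apply a genetic operation, or return the coding unchanged) are hypothetical callables supplied by the caller.

```python
import random


def roulette_select(population, scores, rng):
    """Step S6: draw as many codings as the population holds, with
    probability proportional to the shifted contrast."""
    # Contrast lies in [-1, 1]; the shift and epsilon keep every weight positive.
    weights = [score + 1.0 + 1e-9 for score in scores]
    return rng.choices(population, weights=weights, k=len(population))


def mine_top_k(population, contrast_of, vary, k, max_generations, rng):
    """Run steps S3 to S7 and return the k codings with the best contrast."""
    generation = 0
    while True:
        scores = [contrast_of(code) for code in population]      # steps S3 and S4
        if generation >= max_generations:                        # step S5
            ranked = sorted(zip(scores, population),
                            key=lambda pair: pair[0], reverse=True)
            return [code for _, code in ranked[:k]]
        selected = roulette_select(population, scores, rng)      # step S6
        population = [vary(code, rng) for code in selected]      # step S7
        generation += 1
```

A caller would typically pass a `vary` function that picks one of the mutation, insertion-string or recombination operations with some probability and otherwise returns the coding unchanged.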
The specific embodiments described above explain the purpose, technical solution and beneficial effects of the present invention in further detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A heuristic mining method for optimal contrast sequence patterns with free spacing constraints, characterized by comprising:
Step S1: inputting a positive example sequence set, a negative example sequence set and the number of contrast sequence patterns expected to be mined;
Step S2: randomly generating a predetermined number of genotype candidate pattern codings, each genotype candidate pattern coding comprising at least one gene of fixed length, the gene comprising a head and a tail, the head comprising a set of randomly generated spacing constraints, the tail coming from the character set of the input positive example sequence set and negative example sequence set, and the predetermined number being greater than the number of contrast sequence patterns expected to be mined;
Step S3: decoding each genotype candidate pattern coding to obtain the candidate contrast sequence pattern corresponding to that genotype candidate pattern coding;
Step S4: calculating the contrast of each candidate contrast sequence pattern using the input positive example sequence set and negative example sequence set;
Step S5: judging whether the current genotype candidate pattern codings satisfy the termination condition of the method: if so, the k candidate contrast sequence patterns with the best contrast are the final mining result; otherwise, step S6 is executed, where k is the number of contrast sequence patterns expected to be mined;
Step S6: selecting among the current genotype candidate pattern codings by roulette-wheel selection according to the contrast of each candidate contrast sequence pattern;
Step S7: applying predefined genetic operations to some of the selected genotype candidate pattern codings to form new genotype candidate pattern codings, and returning to step S3.
2. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that a finite symbol set Σ is defined, any symbol in Σ is called an item, and an ordered sequence composed of items of Σ is called a sequence, written s = <e1, e2, …, ei, …, em>, where each ei is called an element and 1 ≤ i ≤ m; |s| denotes the length of sequence s, i.e. the number of elements contained in s;
for the i-th element s[i] and the j-th element s[j] of sequence s (1 ≤ i ≤ j ≤ |s|), gap(s, i, j) denotes the number of elements lying between s[i] and s[j], i.e. gap(s, i, j) = j − i − 1.
3. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 2, characterized in that, for any two sequences s' and s'', if condition 1 and condition 2 are satisfied, then <k1, k2, …, k|s''|> is called an occurrence of sequence s'' in sequence s', denoted [symbol omitted in source], wherein
condition 1 is: |s'| ≥ |s''|, i.e. the length of sequence s' is not less than the length of sequence s'';
condition 2 is: there exists a set of numbers 1 ≤ k1 ≤ k2 ≤ … ≤ k|s''| ≤ |s'| such that s'[ki] = s''[i] holds for every 1 ≤ i ≤ |s''|.
4. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 3, characterized in that a spacing constraint γ is an interval [γ.min, γ.max] with γ.min ≤ γ.max, where γ.min is the minimum number of intervening elements allowed by the spacing constraint, γ.max is the maximum number of intervening elements allowed by the spacing constraint, and γ.min and γ.max are both greater than or equal to 0;
an ordered sequence composed of several spacing constraints, which may be the same or different, is called a spacing constraint sequence γ, of the form γ = <γ1, γ2, …, γh>, where h is the number of spacing constraints;
for each sequence p, the length of its spacing constraint sequence is ||γ|| = ||p|| − 1.
5. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 4, characterized in that, under spacing constraint sequence γ, the support of sequence p in sequence set d is denoted sup((p, γ), d):
sup((p, γ), d) = |{s ∈ d | p is a subsequence of s satisfying γ}| / |d|
where s is a sequence, |d| is the number of sequences in sequence set d, and the relation that sequence p is a subsequence of sequence s satisfying spacing constraint sequence γ is denoted [symbol omitted in source];
under spacing constraint sequence γ, the contrast of sequence p between positive example sequence set d+ and negative example sequence set d− is denoted cr((p, γ), d+, d−):
cr((p, γ), d+, d−) = sup((p, γ), d+) − sup((p, γ), d−).
6. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that the head is a set of randomly generated spacing constraints, or the head consists of a set of randomly generated spacing constraints together with characters from the character set of the input positive example sequence set and negative example sequence set.
7. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that the length of the head is preset and the length of the tail is obtained from t = h × (n − 1) + 1, where t is the length of the tail, h is the length of the head, and n is the maximum number of operands required by a spacing constraint.
8. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 7, characterized in that decoding each genotype candidate pattern coding to obtain the candidate contrast sequence pattern corresponding to that genotype candidate pattern coding comprises:
building the binary tree corresponding to each gene from top to bottom and from left to right, each node of the binary tree corresponding in order to one element of the gene, the tree being complete when the leaf nodes of the binary tree are all characters;
taking a new root node as the parent of the root node of the first binary tree and the root node of the second binary tree to form the first updated binary tree, taking a new root node as the parent of the root node of the first updated binary tree and the root node of the third binary tree to form the second updated binary tree, and so on, until a new root node is taken as the parent of the root node of the (x−2)-th updated binary tree and the root node of the x-th binary tree to form the expression tree, where each new root node is a randomly generated spacing constraint and x is the number of genes in the genotype candidate pattern coding;
traversing the expression tree in in-order to produce the candidate contrast sequence pattern.
9. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 7, characterized in that the predefined genetic operations include a mutation operation, an insertion-string operation and a recombination operation;
the mutation operation includes: mutating a spacing constraint in a gene into a character or into another spacing constraint; and mutating a character in a gene into a spacing constraint or into another character;
the insertion-string operation includes: randomly selecting y1 consecutive elements in a gene and inserting them at any position in the head other than before the first element of the head, then deleting the last y1 elements of the original head so that the head length stays constant, where y1 ≥ 1; randomly selecting y2 consecutive elements in a gene that begin with a spacing constraint and inserting them before the first element of the head, then deleting the last y2 elements of the original head so that the head length stays constant, where y2 ≥ 1; and moving any gene other than the first gene of a genotype candidate pattern coding to the position before the first gene;
the recombination operation includes: exchanging the elements at one and the same position of two genotype candidate pattern codings; exchanging at least two elements at the same positions of two genotype candidate pattern codings; and exchanging the genes at the same position of two genotype candidate pattern codings.
10. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that the termination condition of the method is the execution time of the method, the number of executions of the method, or the stability of the qualifying results.
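The definitions of gap, spacing constraint, support and contrast used in claims 2 to 5 can be made concrete with a small Python sketch. It assumes that a pattern is a string or list of characters, that a spacing constraint sequence is a list of `(min, max)` tuples with one entry fewer than the pattern, and that matching positions are strictly increasing; the function names `matches`, `support` and `contrast` are hypothetical.

```python
def matches(pattern, gaps, sequence):
    """Return True if `pattern` occurs in `sequence` with the number of
    elements between consecutive matched positions inside each (min, max)."""
    if len(gaps) != len(pattern) - 1:
        raise ValueError("need exactly one spacing constraint per adjacent pair")
    # Positions where the first pattern element can be matched.
    positions = {i for i, ch in enumerate(sequence) if ch == pattern[0]}
    for ch, (lo, hi) in zip(pattern[1:], gaps):
        # From each reachable position i, try every j with lo <= j - i - 1 <= hi.
        positions = {
            j
            for i in positions
            for j in range(i + 1 + lo, min(i + 1 + hi, len(sequence) - 1) + 1)
            if sequence[j] == ch
        }
        if not positions:
            return False
    return bool(positions)


def support(pattern, gaps, dataset):
    """Fraction of sequences in `dataset` that contain the constrained pattern."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    return sum(matches(pattern, gaps, s) for s in dataset) / len(dataset)


def contrast(pattern, gaps, positives, negatives):
    """Support in the positive set minus support in the negative set."""
    return support(pattern, gaps, positives) - support(pattern, gaps, negatives)
```

For instance, the pattern `"ab"` with the single spacing constraint `(1, 2)` matches `"axb"` and `"ayyb"` but not `"ab"`, because the latter leaves no element between `a` and `b`.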
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610831506.6A CN106339609A (en) | 2016-09-19 | 2016-09-19 | Heuristic mining method of optimal comparing sequence mode of free interval constraint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106339609A true CN106339609A (en) | 2017-01-18 |
Family
ID=57838934
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610831506.6A Pending CN106339609A (en) | 2016-09-19 | 2016-09-19 | Heuristic mining method of optimal comparing sequence mode of free interval constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106339609A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107016354A (en) * | 2017-03-16 | 2017-08-04 | 中南大学 | The feature mode extracting method and its system of aluminium electrolysis anode current sequence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104331466A (en) * | 2014-10-31 | 2015-02-04 | 南京邮电大学 | Space-time proximity search-based mobile trace sequence mode quick mining method |
CN104408290A (en) * | 2014-10-30 | 2015-03-11 | 西北工业大学 | Inclusion and deductive analysis-based precise sequence rule mining method |
CN104537025A (en) * | 2014-12-19 | 2015-04-22 | 北京邮电大学 | Frequent sequence mining method |
CN105046107A (en) * | 2015-08-28 | 2015-11-11 | 东北大学 | Restrictive motif discovering method |
CN105095613A (en) * | 2014-04-16 | 2015-11-25 | 华为技术有限公司 | Method and device for prediction based on sequential data |
Non-Patent Citations (7)
Title |
---|
CHAO GAO等: "Mining Top-k Distinguishing Sequential Patterns with Flexible Gap Constraints", 《WEB-AGE INFORMATION MANAGEMENT》 * |
XIAONAN JI等: "Mining minimal distinguishing subsequence", 《KNOWLEDGE AND INFORMATION SYSTEMS》 * |
唐常杰等: "基于转基因GEP的公式发现", 《计算机应用》 * |
杨艳梅等: "基于二叉树编码遗传算法的SOA服务选择", 《计算机应用》 * |
王慧锋等: "免预设间隔约束的对比序列模式高效挖掘", 《计算机学报》 * |
王艳春: "基因表达式编程算法及其应用综述", 《计算机软件与应用》 * |
龚文引,蔡之华,杨鸣著: "《智能算法在高光谱遥感数据处理中的应用》", 30 November 2014, 中国地质大学出版社 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107016354A (en) * | 2017-03-16 | 2017-08-04 | 中南大学 | The feature mode extracting method and its system of aluminium electrolysis anode current sequence |
CN107016354B (en) * | 2017-03-16 | 2020-07-31 | 中南大学 | Method and system for extracting characteristic pattern of aluminum electrolysis anode current sequence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170118 |