CN106339609A - Heuristic mining method of optimal comparing sequence mode of free interval constraint - Google Patents

Heuristic mining method of optimal comparing sequence mode of free interval constraint

Info

Publication number
CN106339609A
Authority
CN
China
Prior art keywords
sequence
spacing constraint
pattern
candidate
gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610831506.6A
Other languages
Chinese (zh)
Inventor
段磊
高超
杨皓
王慧锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201610831506.6A
Publication of CN106339609A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention discloses a heuristic mining method for optimal contrast sequence patterns with free spacing constraints. The method comprises the following steps: step S1, inputting a positive example sequence set, a negative example sequence set, and the expected number of contrast sequence patterns to mine; step S2, randomly generating a preset number of genotype candidate pattern encodings; step S3, obtaining the candidate contrast sequence pattern corresponding to each genotype candidate pattern encoding; step S4, calculating the contrast of each candidate contrast sequence pattern; step S5, judging whether the current genotype candidate pattern encodings meet the termination condition of the method; if so, taking the k candidate contrast sequence patterns with the best contrast as the final mining result; otherwise, executing step S6; step S6, selecting among the current genotype candidate pattern encodings; and step S7, forming new genotype candidate pattern encodings and returning to step S3. With the contrast sequence pattern mining method provided by the invention, results are no longer lost because a user without prior knowledge sets parameters improperly.

Description

Heuristic mining method for optimal contrast sequence patterns with free spacing constraints
Technical field
The present invention relates to the technical field of data mining, and in particular to a heuristic mining method for optimal contrast sequence patterns with free spacing constraints.
Background
Sequential pattern mining is an important data mining task with a wide range of applications. For example, an electric utility can analyze historical electricity consumption data to improve the accuracy of load forecasting. Likewise, a public health and disease control department can analyze spatio-temporal surveillance data on the spread of infectious diseases in the hope of discovering how outbreaks cluster in time and place, and thereby provide a reference for prevention and control. Sequential pattern mining has also attracted the attention of many researchers, and different types of sequence patterns have been proposed in turn, such as frequent sequential patterns, closed sequential patterns, periodic sequential patterns, and partial-order sequential patterns.
The goal of contrast sequential pattern mining is to find contrast sequence patterns that are frequent in the positive example sequence set (the support of the sequence pattern is greater than a specified threshold) and infrequent in the negative example sequence set (the support of the sequence pattern is less than a specified threshold). Contrast sequence patterns capture the differences between sequence sets of different classes and identify the characteristics of each class of samples, so they are applicable to sequence data analysis in many fields. For example, in medicine, analyzing the DNA sequences of positive and negative tumors with contrast sequence patterns can improve the precision of clinical diagnosis; in commerce, contrasting the shopping patterns of customers in different age groups can make promotional campaigns better targeted.
The concept of a spacing constraint is widely used in sequence mining; its purpose is to make sequence pattern matching more flexible and general. A spacing constraint is an interval determined by two non-negative integers, which give the minimum and maximum number of elements allowed between two adjacent elements of the sequence pattern when they are matched in a sequence. In existing contrast sequential pattern mining methods, the spacing constraint must be set by the user. Practice shows that, without sufficient prior knowledge, it is difficult for a user to set an appropriate spacing constraint. With an inappropriate spacing constraint, many useful sequence patterns are lost, while exhaustively trying all possible spacing constraints makes the algorithm run so long that it is no longer practical.
Summary of the invention
The problem to be solved by the present invention is that a user without sufficient prior knowledge sets an inappropriate spacing constraint and thereby loses useful sequence patterns, while exhaustively trying all possible spacing constraints makes the algorithm execution time too long.
The present invention is achieved through the following technical solution:
A heuristic mining method for optimal contrast sequence patterns with free spacing constraints, comprising:
Step S1: input a positive example sequence set, a negative example sequence set, and the expected number of contrast sequence patterns to mine.
Step S2: randomly generate a predetermined number of genotype candidate pattern encodings. Each genotype candidate pattern encoding includes at least one gene of fixed length; each gene includes a head and a tail; the head includes a randomly generated set of spacing constraints; the tail is drawn from the character set of the input positive and negative example sequence sets; and the predetermined number is greater than the expected number of contrast sequence patterns to mine.
Step S3: decode each genotype candidate pattern encoding to obtain its corresponding candidate contrast sequence pattern.
Step S4: calculate the contrast of each candidate contrast sequence pattern against the input positive and negative example sequence sets.
Step S5: judge whether the current genotype candidate pattern encodings meet the termination condition of the method. If so, the k candidate contrast sequence patterns with the best contrast are the final mining result; otherwise execute step S6. Here k is the expected number of contrast sequence patterns to mine.
Step S6: select among the current genotype candidate pattern encodings by roulette wheel selection, according to the contrast of each candidate contrast sequence pattern.
Step S7: apply predefined genetic operations to some of the selected genotype candidate pattern encodings to form new genotype candidate pattern encodings, and go to step S3.
The contrast sequential pattern mining method provided by the present invention does not require the user to set spacing constraints in advance. Instead, it automatically computes the most suitable spacing constraints for each candidate pattern, which avoids losing useful sequence patterns. Because an exhaustive search is computationally expensive and may fail to produce a solution within a reasonable running time, the present invention introduces evolutionary computation. Evolutionary computation is a class of heuristic search and optimization algorithms that optimize candidate solutions mainly through three operations: selection, evaluation, and variation. It is robust and adapts reasonably well to the various data sets to be mined. To apply the heuristic search mechanism of evolutionary computation to contrast sequential pattern mining, the present invention proposes a new genotype candidate pattern encoding. The spacing constraints between different genes and within each gene are generated randomly, and the individual spacing constraints are allowed to differ. The spacing constraints can therefore be updated continually during evolution, so that the contrast of the candidate contrast sequence patterns evolves toward larger values. This overcomes both the failure to find the optimal solution when the user sets a single spacing constraint and the unreasonable running time of exhaustively trying all possible spacing constraints. Further, the present invention selects among the current genotype candidate pattern encodings by roulette wheel selection, so a candidate contrast sequence pattern with a larger contrast has a larger probability of being selected, and the result obtained is more accurate.
Optionally, a finite symbol set $\Sigma$ is defined. Each symbol in $\Sigma$ is called an item, and an ordered sequence of items from $\Sigma$ is called a sequence, written $s = \langle e_1, e_2, \ldots, e_i, \ldots, e_m \rangle$, where each $e_i$ ($1 \le i \le m$) is called an element. $|s|$ denotes the length of sequence $s$, that is, the number of elements it contains. For the $i$-th element $s_{[i]}$ and the $j$-th element $s_{[j]}$ of $s$ ($1 \le i \le j \le |s|$), $gap(s, i, j)$ denotes the number of elements between $s_{[i]}$ and $s_{[j]}$, that is, $gap(s, i, j) = j - i - 1$.
Optionally, for any two sequences $s'$ and $s''$ that satisfy condition 1 and condition 2, $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ is called an occurrence of $s''$ in $s'$, written $s'' \subseteq s'$. Condition 1: $|s'| \ge |s''|$, that is, the length of $s'$ is not less than the length of $s''$. Condition 2: there exist integers $1 \le k_1 < k_2 < \cdots < k_{|s''|} \le |s'|$ such that $s'_{[k_i]} = s''_{[i]}$ holds for every $1 \le i \le |s''|$.
Optionally, a spacing constraint $\gamma$ is an interval $[\gamma.min, \gamma.max]$ with $\gamma.min \le \gamma.max$, where $\gamma.min$ is the minimum and $\gamma.max$ the maximum number of intervening elements allowed by the constraint, and both are greater than or equal to 0. An ordered sequence of spacing constraints, which may be different or identical, is called a spacing constraint sequence and has the form $\gamma = \langle \gamma_1, \gamma_2, \ldots, \gamma_h \rangle$, where $h$ is the number of spacing constraints. For each sequence $p$, the length of its spacing constraint sequence is $\|\gamma\| = \|p\| - 1$.
Optionally, under a spacing constraint sequence $\gamma$, the support of sequence $p$ in a sequence set $d$ is denoted $sup((p, \gamma), d)$: $sup((p, \gamma), d) = |\{ s \in d \mid p \subseteq_\gamma s \}| / |d|$, where $s$ is a sequence, $|d|$ is the number of sequences in $d$, and $p \subseteq_\gamma s$ means that $p$ is a subsequence of $s$ that satisfies the spacing constraint sequence $\gamma$. Under the spacing constraint sequence $\gamma$, the contrast of sequence $p$ between the positive example sequence set $d^+$ and the negative example sequence set $d^-$ is denoted $cr((p, \gamma), d^+, d^-)$: $cr((p, \gamma), d^+, d^-) = sup((p, \gamma), d^+) - sup((p, \gamma), d^-)$.
Optionally, the head is a randomly generated set of spacing constraints, or the head consists of a randomly generated set of spacing constraints together with characters from the character set of the input positive and negative example sequence sets.
Optionally, the length of the head is preset, and the length of the tail is obtained from t = h × (n-1) + 1, where t is the length of the tail, h is the length of the head, and n is the maximum number of operands required by a spacing constraint.
Optionally, decoding each genotype candidate pattern encoding to obtain its corresponding candidate contrast sequence pattern includes: building the binary tree of each gene from top to bottom and from left to right, where the nodes of the binary tree correspond in order to the elements of the gene, and construction is complete once every leaf node of the binary tree is a character; using a new root node as the parent of the root of the first binary tree and the root of the second binary tree to form the first updated binary tree, using a new root node as the parent of the root of the first updated binary tree and the root of the third binary tree to form the second updated binary tree, and so on, until a new root node is used as the parent of the root of the (x-2)-th updated binary tree and the root of the x-th binary tree to form the expression tree, where each new root node is a randomly generated spacing constraint and x is the number of genes in the genotype candidate pattern encoding; and traversing the expression tree in order (left subtree, node, right subtree) to produce the candidate contrast sequence pattern.
The present invention builds on gene expression programming, a form of evolutionary computation, to mine optimal contrast sequence patterns with free spacing constraints. Compared with earlier genetic algorithms and genetic programming, gene expression programming draws on knowledge of gene expression: even when individuals are encoded as genotype candidate pattern encodings of the same length, decoding can yield expressions of different lengths, which in the present invention are candidate contrast sequence patterns with free spacing constraints.
Optionally, the predefined genetic operations include a mutation operation, an insertion operation, and a recombination operation.
The mutation operation includes: mutating a spacing constraint in a gene into a character or into another spacing constraint; and mutating a character in a gene into a spacing constraint or into another character.
The insertion operation includes: randomly selecting y1 consecutive elements in a gene and inserting them at any position in the head other than before its first element, then deleting the last y1 elements of the original head to keep the head length unchanged, where y1 ≥ 1; randomly selecting y2 consecutive elements in a gene that start with a spacing constraint and inserting them before the first element of the head, then deleting the last y2 elements of the original head to keep the head length unchanged, where y2 ≥ 1; and moving any gene other than the first gene of a genotype candidate pattern encoding to the position before the first gene.
The recombination operation includes: exchanging the elements at the same position of two genotype candidate pattern encodings; exchanging at least two elements at the same positions of two genotype candidate pattern encodings; and exchanging the genes at the same position of two genotype candidate pattern encodings.
Optionally, the termination condition of the method is the execution time of the method, the number of iterations executed, or the stability of the result.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
Given only the expected number of contrast sequence patterns to mine, the heuristic mining method for optimal contrast sequence patterns with free spacing constraints provided by the present invention automatically obtains the sequence patterns with the best contrast. The user does not need to set spacing constraints; the method automatically computes the most suitable spacing constraints for the candidate contrast sequence patterns, so results are no longer lost because a user without prior knowledge sets parameters incorrectly. At the same time, by drawing on the heuristic search mechanism of evolutionary computation, the method overcomes the long running time and impracticality of exhaustive search, and therefore helps promote the practical application of contrast sequential pattern mining.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the embodiments of the present invention and constitute a part of this application; they do not limit the embodiments of the present invention. In the drawings:
Fig. 1 is an input-output schematic diagram of the heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to the embodiment of the present invention;
Fig. 2 is a flow diagram of the heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to the embodiment of the present invention;
Fig. 3 is a schematic diagram of a genotype candidate pattern encoding according to the embodiment of the present invention;
Fig. 4 is a schematic diagram of obtaining a candidate contrast sequence pattern according to the embodiment of the present invention;
Fig. 5a to Fig. 5d are schematic diagrams of the mutation operation according to the embodiment of the present invention;
Fig. 6a to Fig. 6c are schematic diagrams of the insertion operation according to the embodiment of the present invention;
Fig. 7a to Fig. 7c are schematic diagrams of the recombination operation according to the embodiment of the present invention.
Detailed description
As described in the background, existing contrast sequential pattern mining methods require the user to set the spacing constraint in advance. Without sufficient prior knowledge, however, it is difficult for the user to set a suitable spacing constraint, and useful patterns then cannot be found. The present invention provides a heuristic mining method for optimal contrast sequence patterns with free spacing constraints in which the user only needs to give the expected number k of contrast sequence patterns to mine. The method does not require the user to set spacing constraints in advance; it automatically computes the most suitable spacing constraints for the candidate contrast sequence patterns. At the same time, by drawing on the heuristic search mechanism of evolutionary computation, it overcomes the long running time and impracticality of exhaustive search.
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to an embodiment and the accompanying drawings. The exemplary embodiment and its description serve only to explain the present invention and do not limit it.
Embodiment
We first give the definitions related to contrast sequential pattern mining. Given a finite symbol set $\Sigma$, which we call the alphabet, each symbol in the alphabet is called an item. An ordered sequence of items from $\Sigma$ is called a sequence, written $s = \langle e_1, e_2, \ldots, e_m \rangle$, where each $e_i \in \Sigma$ ($1 \le i \le m$) is called an element. We use $|s|$ to denote the length of sequence $s$, that is, the number of elements it contains, and $s_{[i]}$ to denote the $i$-th element of $s$ ($1 \le i \le |s|$). For two elements $s_{[i]}$ and $s_{[j]}$ of $s$ ($1 \le i \le j \le |s|$), we use $gap(s, i, j)$ to denote the number of elements between them in $s$, that is, $gap(s, i, j) = j - i - 1$.
For any two sequences $s'$ and $s''$ that satisfy the following conditions:
Condition 1: $|s'| \ge |s''|$, that is, the length of $s'$ is not less than the length of $s''$;
Condition 2: there exist integers $1 \le k_1 < k_2 < \cdots < k_{|s''|} \le |s'|$ such that $s'_{[k_i]} = s''_{[i]}$ holds for every $1 \le i \le |s''|$;
we call $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ an occurrence of $s''$ in $s'$. Equivalently, $s'$ is a supersequence of $s''$, or $s''$ is a subsequence of $s'$, written $s'' \subseteq s'$.
A spacing constraint $\gamma$ is defined as an interval $[\gamma.min, \gamma.max]$ with $\gamma.min \le \gamma.max$, where $\gamma.min$ ($\gamma.min \ge 0$) and $\gamma.max$ ($\gamma.max \ge 0$) are the minimum and maximum number of intervening elements allowed by the constraint. An ordered sequence of spacing constraints, which may be different or identical, is called a spacing constraint sequence and has the form $\gamma = \langle \gamma_1, \gamma_2, \ldots, \gamma_h \rangle$, where $h$ is the number of spacing constraints; for each sequence $p$, the length of its spacing constraint sequence is $\|\gamma\| = \|p\| - 1$. Given two sequences $s'$ and $s''$, let $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ be an occurrence of $s''$ in $s'$. If $\gamma_i.min \le gap(s', k_i, k_{i+1}) \le \gamma_i.max$ holds for every $1 \le i < |s''|$, we say that $s''$ is a subsequence of $s'$ that satisfies the spacing constraint sequence $\gamma$, written $s'' \subseteq_\gamma s'$.
Given a set of sequences $d$, also called sequence set $d$, the support of sequence $p$ in $d$ under the spacing constraint sequence $\gamma$ is denoted $sup((p, \gamma), d)$. It is the ratio of the number of sequences in $d$ that are supersequences of $p$ under $\gamma$ to the total number of sequences in $d$, that is:
$sup((p, \gamma), d) = |\{ s \in d \mid p \subseteq_\gamma s \}| / |d|$,
where $s$ is a sequence, $|d|$ is the number of sequences in $d$, and $p \subseteq_\gamma s$ means that $p$ is a subsequence of $s$ that satisfies the spacing constraint sequence $\gamma$.
Given two sequence sets, a positive example sequence set $d^+$ and a negative example sequence set $d^-$, the contrast of sequence $p$ between $d^+$ and $d^-$ under the spacing constraint sequence $\gamma$ is denoted $cr((p, \gamma), d^+, d^-)$. It is the difference between the supports of the sequence in the two sets, that is:
$cr((p, \gamma), d^+, d^-) = sup((p, \gamma), d^+) - sup((p, \gamma), d^-)$.
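The definitions of gap, support, and contrast above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: sequences are assumed to be plain strings, a spacing constraint is assumed to be a `(min, max)` tuple, and the names `matches`, `support`, and `contrast` are hypothetical. The matcher uses simple backtracking, which may be slow on long sequences.

```python
from typing import Sequence, Tuple

Gap = Tuple[int, int]  # assumed representation of a spacing constraint: (min, max)


def matches(pattern: str, gaps: Sequence[Gap], s: str) -> bool:
    """True if `pattern` occurs in `s` with the i-th spacing constraint satisfied."""
    assert pattern and len(gaps) == len(pattern) - 1

    def extend(i: int, pos: int) -> bool:
        # pattern[i] is matched at index pos of s; try to place pattern[i + 1].
        if i == len(pattern) - 1:
            return True
        lo, hi = gaps[i]
        # gap = nxt - pos - 1 must lie in [lo, hi]
        for nxt in range(pos + 1 + lo, min(pos + 1 + hi, len(s) - 1) + 1):
            if s[nxt] == pattern[i + 1] and extend(i + 1, nxt):
                return True
        return False

    return any(s[k] == pattern[0] and extend(0, k) for k in range(len(s)))


def support(pattern: str, gaps: Sequence[Gap], dataset: Sequence[str]) -> float:
    """Fraction of sequences in `dataset` that contain `pattern` under `gaps`."""
    return sum(matches(pattern, gaps, s) for s in dataset) / len(dataset)


def contrast(pattern: str, gaps: Sequence[Gap],
             positives: Sequence[str], negatives: Sequence[str]) -> float:
    """Support in the positive set minus support in the negative set."""
    return support(pattern, gaps, positives) - support(pattern, gaps, negatives)
```

For instance, with the made-up sets `["acgt", "aacg", "tttt"]` (positive) and `["agct", "cccc"]` (negative), the pattern `"ag"` with spacing constraint `(1, 1)` matches two of the three positive sequences and none of the negative ones.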
Fig. 1 is an input-output schematic diagram of the heuristic mining method for optimal contrast sequence patterns with free spacing constraints of this embodiment. Given a positive example sequence set $d^+$, a negative example sequence set $d^-$, and a number k of sequence patterns, our goal is to find the k contrast sequence patterns with spacing constraints whose contrast between $d^+$ and $d^-$ is the best. Fig. 2 is a flow diagram of the method of this embodiment, which includes:
Step S1: input the positive example sequence set, the negative example sequence set, and the expected number of contrast sequence patterns to mine.
Step S2: randomly generate a predetermined number of genotype candidate pattern encodings. Specifically, the predetermined number is greater than the expected number of contrast sequence patterns to mine; each genotype candidate pattern encoding includes at least one gene of fixed length, and each gene includes a head and a tail. The head includes a randomly generated set of spacing constraints: it may consist only of randomly generated spacing constraints, or of randomly generated spacing constraints together with characters from the character set of the input positive and negative example sequence sets. The tail is drawn from the character set of the input positive and negative example sequence sets. Further, the length of the head is preset according to actual requirements, and the length of the tail is obtained from t = h × (n-1) + 1, where t is the length of the tail, h is the length of the head, and n is the maximum number of operands required by a spacing constraint.
Taking a head length of 3 and a maximum of 2 operands per spacing constraint as an example, Fig. 3 is a schematic diagram of a genotype candidate pattern encoding of this embodiment. The encoding includes two genes, gene 1 and gene 2. The head of gene 1 is a randomly generated set of spacing constraints, consisting of spacing constraint [0,3], spacing constraint [1,2], and spacing constraint [2,5]; the tail of gene 1 is drawn from the character set of the input positive and negative example sequence sets and consists of the characters a, c, c, and g. The head of gene 2 consists of randomly generated spacing constraints together with a character from the character set of the input positive and negative example sequence sets, namely spacing constraint [3,4], spacing constraint [2,4], and character g; the tail of gene 2 is drawn from the character set of the input positive and negative example sequence sets and consists of the characters t, a, t, and t.
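The head and tail layout of a gene and the tail-length formula can be sketched as follows. This is an illustration under stated assumptions: a gene is a Python list whose elements are either `(min, max)` tuples (spacing constraints) or single characters, and the helper names and the parameters `max_gap` and `p_gap` are hypothetical, not taken from the patent.

```python
import random
from typing import List, Tuple, Union

Gap = Tuple[int, int]
Element = Union[Gap, str]


def tail_length(head_len: int, max_arity: int = 2) -> int:
    """t = h * (n - 1) + 1, with n the number of operands of a spacing constraint."""
    return head_len * (max_arity - 1) + 1


def random_gap(rng: random.Random, max_gap: int) -> Gap:
    lo = rng.randint(0, max_gap)
    return (lo, rng.randint(lo, max_gap))  # guarantees min <= max


def random_gene(rng: random.Random, alphabet: str, head_len: int,
                max_gap: int = 5, p_gap: float = 0.7) -> List[Element]:
    """Head: spacing constraints or characters; tail: characters only."""
    head = [random_gap(rng, max_gap) if rng.random() < p_gap else rng.choice(alphabet)
            for _ in range(head_len)]
    tail = [rng.choice(alphabet) for _ in range(tail_length(head_len))]
    return head + tail


# The two genes of the Fig. 3 example, written in the assumed representation.
GENE_1 = [(0, 3), (1, 2), (2, 5), "a", "c", "c", "g"]
GENE_2 = [(3, 4), (2, 4), "g", "t", "a", "t", "t"]
```

With a head length of 3 and two operands per constraint, `tail_length(3, 2)` gives 4, which agrees with the four tail characters of each gene in the Fig. 3 example.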
Step S3: decode each genotype candidate pattern encoding to obtain its corresponding candidate contrast sequence pattern. Specifically, the binary tree of each gene is built from top to bottom and from left to right, with the nodes of the binary tree corresponding in order to the elements of the gene; construction is complete once every leaf node of the binary tree is a character. A new root node is then used as the parent of the root of the first binary tree and the root of the second binary tree to form the first updated binary tree; a new root node is used as the parent of the root of the first updated binary tree and the root of the third binary tree to form the second updated binary tree; and so on, until a new root node is used as the parent of the root of the (x-2)-th updated binary tree and the root of the x-th binary tree to form the expression tree. Each new root node is a randomly generated spacing constraint, and x is the number of genes in the genotype candidate pattern encoding. Traversing the expression tree in order (left subtree, node, right subtree) produces the candidate contrast sequence pattern.
Here the first binary tree is the binary tree of the first gene in the genotype candidate pattern encoding, the second binary tree is the binary tree of the second gene, and so on, and the x-th binary tree is the binary tree of the x-th gene. As a special case, if a binary tree has only a root node, that root node may also be a character. For a genotype candidate pattern encoding with only one gene, the binary tree of that gene is the expression tree. Fig. 4 is a schematic diagram of decoding the two-gene genotype candidate pattern encoding shown in Fig. 3 to obtain a candidate contrast sequence pattern. Taking the binary tree of gene 2 as an example: its root node corresponds to the first element of gene 2, which is a spacing constraint, so a second layer is built. The first node of the second layer corresponds to the second element of gene 2 and the second node of the second layer corresponds to the third element of gene 2; the second element of gene 2 is a spacing constraint, so a third layer is built. The first node of the third layer corresponds to the fourth element of gene 2 and the second node of the third layer corresponds to the fifth element of gene 2. The third, fourth, and fifth elements of gene 2 are characters, and a character node has no children, so all leaf nodes of the binary tree are characters and construction is complete.
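The decoding just described can be sketched in Python. The sketch assumes the same representation as before (tuples for spacing constraints, characters for leaves); the class `Node`, the functions `build_tree`, `in_order`, and `decode`, and the linking constraint `(1, 3)` used in the usage note are hypothetical, since the text does not state which constraint joins the two trees in Fig. 4.

```python
from collections import deque
from typing import List, Optional, Sequence, Tuple, Union

Element = Union[Tuple[int, int], str]


class Node:
    def __init__(self, value: Element) -> None:
        self.value = value
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def build_tree(gene: Sequence[Element]) -> Node:
    """Fill the tree level by level; constraints take two children, characters none."""
    elements = iter(gene)
    root = Node(next(elements))
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if isinstance(node.value, tuple):  # spacing constraint: needs two operands
            node.left, node.right = Node(next(elements)), Node(next(elements))
            queue.extend([node.left, node.right])
    return root  # unused tail elements are simply ignored


def in_order(node: Optional[Node]) -> List[Element]:
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


def decode(genes: Sequence[Sequence[Element]],
           links: Sequence[Tuple[int, int]]) -> List[Element]:
    """Join the gene trees left to right under new constraint roots, then traverse."""
    root = build_tree(genes[0])
    for gene, link in zip(genes[1:], links):
        parent = Node(link)
        parent.left, parent.right = root, build_tree(gene)
        root = parent
    return in_order(root)
```

Applied to the two genes of Fig. 3 with the assumed linking constraint, `decode([gene_1, gene_2], [(1, 3)])` yields the characters a, c, c, g, t, a, g interleaved with six spacing constraints. The last two tail characters of gene 2 are not expressed, which is how equal-length genes can decode to patterns of different lengths.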
Step S4: calculate the contrast of each candidate contrast sequence pattern against the input positive and negative example sequence sets. Following the definition of contrast, the contrast of each candidate contrast sequence pattern is calculated with the formula $cr((p, \gamma), d^+, d^-) = sup((p, \gamma), d^+) - sup((p, \gamma), d^-)$.
Step S5: judge whether the current genotype candidate pattern encodings meet the termination condition of the method. The termination condition can be configured according to actual requirements; it can be the execution time of the method, the number of iterations executed, the stability of the result, and so on. For example, if the termination condition is set to an execution time of 5 minutes, the algorithm stops automatically after running for 5 minutes; that is, the current genotype candidate pattern encodings meet the termination condition, and the k candidate contrast sequence patterns with the best contrast are the final mining result. If the algorithm has run for less than 5 minutes, the current genotype candidate pattern encodings do not meet the termination condition, and step S6 is executed. Here k is the number of sequence patterns.
Step S6: select among the current genotype candidate pattern encodings by roulette wheel selection, according to the contrast of each candidate contrast sequence pattern. Roulette wheel selection is a commonly used random selection method that converts individual fitness proportionally into a selection probability: a disk is divided into sectors in proportion to the share of each individual, the disk is spun, and the individual whose sector the pointer indicates when the disk stops is the one selected. Clearly, the larger an individual's probability, the larger its area on the disk and the greater its chance of being selected. When roulette wheel selection is applied to the current genotype candidate pattern encodings, a candidate contrast sequence pattern with a large contrast has a larger probability of being selected. Specifically, each time a certain number of genotype candidate pattern encodings is chosen at random from all current encodings, and the encoding with the best contrast among them is picked out; this is repeated until the predetermined number of genotype candidate pattern encodings has been picked.
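The proportional wheel itself can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: contrast can be negative (it ranges over [-1, 1]), so negative values are clamped to zero and a tiny constant is added to keep the wheel non-empty; and the function name `roulette_select` is hypothetical. The sketch shows only the wheel, not the repeated sample-and-pick-best procedure described at the end of the step.

```python
import bisect
import itertools
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roulette_select(population: Sequence[T], contrasts: Sequence[float],
                    n: int, rng: random.Random) -> List[T]:
    """Draw n individuals with probability proportional to (clamped) contrast."""
    # Assumption: clamp negative contrast to 0 and add a small floor so that
    # every individual keeps a non-zero sector on the wheel.
    weights = [max(c, 0.0) + 1e-9 for c in contrasts]
    cumulative = list(itertools.accumulate(weights))  # sector boundaries
    total = cumulative[-1]
    picks = []
    for _ in range(n):
        pointer = rng.random() * total                 # where the wheel stops
        idx = bisect.bisect_right(cumulative, pointer)  # sector containing the pointer
        picks.append(population[min(idx, len(population) - 1)])
    return picks
```

The standard library call `random.choices(population, weights=weights, k=n)` performs the same weighted draw; the explicit version is shown only to make the wheel visible.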
Step S7: apply predefined genetic operations to some of the selected genotype candidate pattern encodings to form new genotype candidate pattern encodings, and go to step S3. Further, the genetic operations include a mutation operation, an insertion operation, and a recombination operation.
Specifically, the mutation operation includes: mutating a spacing constraint in a gene into a character or into another spacing constraint; and mutating a character in a gene into a spacing constraint or into another character. Mutation can occur at any position in a gene, but the structure of the gene must not change. That is, a character in the tail can only mutate into another character and cannot mutate into a spacing constraint, whereas an element in the head can mutate into either a character or a spacing constraint. Again taking the genotype candidate pattern encoding shown in Fig. 3 as an example, Fig. 5a to Fig. 5d are schematic diagrams of the mutation operation in this embodiment. Fig. 5a mutates the first element of the head of gene 1 from a spacing constraint into another spacing constraint; Fig. 5b mutates the second element of the head of gene 1 from a spacing constraint into a character; Fig. 5c mutates the second element of the tail of gene 1 from a character into another character; Fig. 5d mutates the third element of the head of gene 2 from a character into a spacing constraint.
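A structure-preserving point mutation could look like the following sketch. It assumes genes are lists of `(min, max)` tuples and characters; the function name `mutate` and the parameters `max_gap` and `p_gap` are hypothetical, and the replacement value is drawn at random, so it may occasionally equal the original element.

```python
import random
from typing import List, Tuple, Union

Element = Union[Tuple[int, int], str]


def mutate(gene: List[Element], head_len: int, alphabet: str,
           rng: random.Random, max_gap: int = 5, p_gap: float = 0.5) -> List[Element]:
    """Replace one randomly chosen element without changing the gene structure."""
    new_gene = list(gene)
    pos = rng.randrange(len(gene))
    if pos < head_len and rng.random() < p_gap:
        # Head positions may become a (new) spacing constraint ...
        lo = rng.randint(0, max_gap)
        new_gene[pos] = (lo, rng.randint(lo, max_gap))
    else:
        # ... or a character; tail positions may only become a character.
        new_gene[pos] = rng.choice(alphabet)
    return new_gene
```

Because tail positions never receive a spacing constraint, the decoded tree of a mutated gene can always be completed with the elements that follow.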
The insert-string operation includes: randomly selecting y1 consecutive elements in a gene and inserting them at any position in the head other than before the first element of the head, then deleting the last y1 elements of the original head to keep the head length unchanged, where y1 ≥ 1; randomly selecting y2 consecutive elements in a gene that begin with a spacing constraint and inserting them before the first element of the head, then deleting the last y2 elements of the original head to keep the head length unchanged, where y2 ≥ 1; and moving any gene of the genotype candidate pattern coding other than the first gene to before the first gene. Still taking the genotype candidate pattern coding shown in Fig. 3 as an example, Figs. 6a to 6c are schematic diagrams of the insert-string operation in this embodiment. Fig. 6a inserts the 4th and 5th elements of gene 1 between the 1st and 2nd elements of gene 1 and deletes the last two elements of the head of gene 1; Fig. 6b inserts the 3rd and 4th elements of gene 1 before the 1st element of gene 1 and deletes the last two elements of the head of gene 1; Fig. 6c moves gene 2 to before gene 1, so that the original gene 2 becomes the new gene 1 and the original gene 1 becomes the new gene 2.
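The three insert-string variants can be sketched as below, under the assumption that a gene is a Python list with a head of `head_len` elements (at least 2) and that spacing constraints are `(min, max)` tuples. All function names are hypothetical.

```python
import random

def insert_segment(gene, head_len, start, length, target):
    """Copy gene[start:start+length] into the head at index `target`.

    The head is then cut back to `head_len`, so the tail is untouched.
    """
    segment = gene[start:start + length]
    head, tail = gene[:head_len], gene[head_len:]
    new_head = (head[:target] + segment + head[target:])[:head_len]
    return new_head + tail

def random_insert(gene, head_len, rng=random):
    """First variant: insert anywhere in the head except before element 1."""
    length = rng.randint(1, head_len - 1)
    start = rng.randrange(len(gene) - length + 1)
    target = rng.randint(1, head_len - 1)  # never index 0
    return insert_segment(gene, head_len, start, length, target)

def move_gene_to_front(coding, index):
    """Third variant: move gene `index` (index >= 1) before the first gene."""
    return [coding[index]] + coding[:index] + coding[index + 1:]
```

The second variant corresponds to calling `insert_segment` with `target=0` and a `start` that points at a spacing constraint, so that the new first element of the head is still a spacing constraint.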
The recombination operation includes: exchanging the element at the same position in two genotype candidate pattern codings; exchanging at least two elements at the same positions in two genotype candidate pattern codings; and exchanging the gene at the same position in two genotype candidate pattern codings. Taking the genotype candidate pattern coding shown in Fig. 3 as original genotype candidate pattern coding 1 and another genotype candidate pattern coding as original genotype candidate pattern coding 2, Figs. 7a to 7c are schematic diagrams of the recombination operation in this embodiment. Fig. 7a exchanges the 3rd element of gene 1, the 6th element of gene 1 and the 4th element of gene 2; Fig. 7b exchanges the 3rd through 6th elements of gene 1; Fig. 7c exchanges the whole of gene 2.
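The three recombination variants can be sketched as follows, assuming a coding is a list of genes and each gene is a list of elements. The function names are hypothetical, and positions are 0-based here while the figures count from 1.

```python
import copy

def swap_elements(coding_a, coding_b, positions):
    """Exchange the elements at each (gene_index, element_index) position."""
    a, b = copy.deepcopy(coding_a), copy.deepcopy(coding_b)
    for g, e in positions:
        a[g][e], b[g][e] = b[g][e], a[g][e]
    return a, b

def swap_range(coding_a, coding_b, gene_index, start, stop):
    """Exchange elements start..stop-1 of one gene (two or more elements)."""
    positions = [(gene_index, e) for e in range(start, stop)]
    return swap_elements(coding_a, coding_b, positions)

def swap_gene(coding_a, coding_b, gene_index):
    """Exchange the whole gene at `gene_index`."""
    a, b = copy.deepcopy(coding_a), copy.deepcopy(coding_b)
    a[gene_index], b[gene_index] = b[gene_index], a[gene_index]
    return a, b
```

Because elements are only exchanged between identical positions, a head element is always replaced by a head element and a tail element by a tail element, provided both codings use the same head length.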
The genetic operations form new genotype candidate pattern codings; steps s3 to s5 are then repeated on the new genotype candidate pattern codings until the final mining result is obtained.
The specific embodiments described above further explain the purpose, technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A heuristic mining method for optimal contrast sequence patterns with free spacing constraints, characterized by comprising:
Step s1: inputting a positive example sequence set, a negative example sequence set, and the number of contrast sequence patterns expected to be mined;
Step s2: randomly generating a predetermined number of genotype candidate pattern codings, wherein each genotype candidate pattern coding comprises at least one gene of fixed length, each gene comprises a head and a tail, the head comprises a randomly generated set of spacing constraints, the tail is drawn from the character set of the input positive example sequence set and negative example sequence set, and the predetermined number is greater than the number of contrast sequence patterns expected to be mined;
Step s3: decoding each genotype candidate pattern coding to obtain the candidate contrast sequence pattern corresponding to each genotype candidate pattern coding;
Step s4: calculating the contrast of each candidate contrast sequence pattern from the input positive example sequence set and negative example sequence set;
Step s5: judging whether the current genotype candidate pattern codings meet the method termination condition; if so, the k candidate contrast sequence patterns with the best contrast are the final mining result, otherwise step s6 is executed, wherein k is the number of contrast sequence patterns expected to be mined;
Step s6: selecting among the current genotype candidate pattern codings by roulette wheel selection, according to the contrast of each candidate contrast sequence pattern;
Step s7: applying predefined genetic operations to some of the selected genotype candidate pattern codings to form new genotype candidate pattern codings, and going to step s3.
2. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that a finite symbol set $\Sigma$ is defined, any symbol in $\Sigma$ is called an item, and an ordered sequence of items from $\Sigma$ is called a sequence, denoted $s = \langle e_1, e_2, \ldots, e_i, \ldots, e_m \rangle$, wherein $e_i$ is called an element and $1 \le i \le m$; $|s|$ denotes the length of sequence $s$, i.e. the number of elements contained in $s$;
for the $i$-th element $s_{[i]}$ and the $j$-th element $s_{[j]}$ of sequence $s$ ($1 \le i \le j \le |s|$), $gap(s, i, j)$ denotes the number of elements between $s_{[i]}$ and $s_{[j]}$, i.e. $gap(s, i, j) = j - i - 1$.
3. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 2, characterized in that, for any two sequences $s'$ and $s''$ that satisfy condition 1 and condition 2, $\langle k_1, k_2, \ldots, k_{|s''|} \rangle$ is called an occurrence of sequence $s''$ in sequence $s'$, denoted $s'' \subseteq s'$, wherein
condition 1 is: $|s'| \ge |s''|$, i.e. the length of sequence $s'$ is not less than the length of sequence $s''$;
condition 2 is: there exists a set of numbers $1 \le k_1 \le k_2 \le \ldots \le k_{|s''|} \le |s'|$ such that $s''_{[i]} = s'_{[k_i]}$ holds for all $1 \le i \le |s''|$.
4. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 3, characterized in that a spacing constraint $\gamma$ is an interval $[\gamma.min, \gamma.max]$ with $\gamma.min \le \gamma.max$, wherein $\gamma.min$ denotes the minimum number of intervening elements allowed by the spacing constraint, $\gamma.max$ denotes the maximum number of intervening elements allowed by the spacing constraint, and $\gamma.min$ and $\gamma.max$ are both greater than or equal to 0;
an ordered sequence composed of multiple spacing constraints, which may be different or identical, is called a spacing constraint sequence $\gamma$, of the form $\gamma = \langle \gamma_1, \gamma_2, \ldots, \gamma_h \rangle$, wherein $h$ is the number of spacing constraints;
for each sequence $p$, the length of its spacing constraint sequence is $||\gamma|| = ||p|| - 1$.
5. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 4, characterized in that, under a spacing constraint sequence $\gamma$, the support of sequence $p$ in a sequence set $d$ is denoted $sup((p, \gamma), d)$:
$$sup((p, \gamma), d) = \frac{|\{ s \in d \mid p \subseteq_{\gamma} s \}|}{|d|},$$
wherein $s$ is a sequence, $|d|$ denotes the number of sequences in the sequence set $d$, and $p \subseteq_{\gamma} s$ denotes that sequence $p$ is a subsequence of sequence $s$ that satisfies the spacing constraint sequence $\gamma$;
under the spacing constraint sequence $\gamma$, the contrast of sequence $p$ between the positive example sequence set $d_+$ and the negative example sequence set $d_-$ is denoted $cr((p, \gamma), d_+, d_-)$:
$$cr((p, \gamma), d_+, d_-) = sup((p, \gamma), d_+) - sup((p, \gamma), d_-).$$
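The support and contrast defined in this claim can be sketched as follows. This is an illustration under stated assumptions: sequences are Python strings or lists, each spacing constraint is a `(min, max)` tuple, the pattern is non-empty, and the names `matches`, `support` and `contrast` are hypothetical. The matching step keeps the set of positions at which the pattern prefix can end, which is one straightforward way to test the spacing constraints and is not necessarily the procedure used in the embodiment.

```python
def matches(pattern, gaps, seq):
    """True if `pattern` occurs in `seq` while satisfying every spacing constraint."""
    assert len(gaps) == len(pattern) - 1
    # Positions where the first pattern element can be matched.
    ends = {i for i, x in enumerate(seq) if x == pattern[0]}
    for item, (low, high) in zip(pattern[1:], gaps):
        # j - i - 1 is gap(s, i, j): the number of elements between i and j.
        ends = {j for j, x in enumerate(seq)
                if x == item and any(low <= j - i - 1 <= high for i in ends)}
        if not ends:
            return False
    return bool(ends)

def support(pattern, gaps, dataset):
    """Fraction of sequences in `dataset` that contain the constrained pattern."""
    return sum(matches(pattern, gaps, s) for s in dataset) / len(dataset)

def contrast(pattern, gaps, positives, negatives):
    """Support in the positive set minus support in the negative set."""
    return support(pattern, gaps, positives) - support(pattern, gaps, negatives)
```

For instance, the pattern `"ab"` with the single constraint `(0, 1)` is matched by `"ab"` and `"acb"` but not by `"accb"`, where two elements separate `a` and `b`.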
6. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that the head is a randomly generated set of spacing constraints, or the head is composed of a randomly generated set of spacing constraints together with characters from the character set of the input positive example sequence set and negative example sequence set.
7. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that the length of the head is preset and the length of the tail is obtained from $t = h \times (n - 1) + 1$, wherein $t$ is the length of the tail, $h$ is the length of the head, and $n$ is the maximum number of operands required by a spacing constraint.
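As a worked check of the tail-length formula, suppose each spacing constraint takes $n = 2$ operands (an assumption consistent with the binary trees of claim 8, not a value stated in this claim). In the worst case all $h$ head elements are spacing constraints, which together need $h \cdot n$ child slots; $h - 1$ of those slots are filled by the non-root head elements, so the tail must supply the rest:

```latex
t = h \cdot n - (h - 1) = h \times (n - 1) + 1,
\qquad \text{e.g. } h = 3,\ n = 2 \;\Rightarrow\; t = 3 \times 1 + 1 = 4 .
```

With this tail length, every gene can be decoded into a complete tree whatever the head contains.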
8. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 7, characterized in that decoding each genotype candidate pattern coding to obtain the candidate contrast sequence pattern corresponding to each genotype candidate pattern coding comprises:
building the binary tree corresponding to each gene from top to bottom and from left to right, wherein the nodes of the binary tree correspond in turn to the elements of the gene, and construction of the tree is complete when all leaf nodes of the binary tree are characters;
using a new root node as the parent of the root node of the first binary tree and the root node of the second binary tree to form a first updated binary tree, using a new root node as the parent of the root node of the first updated binary tree and the root node of the third binary tree to form a second updated binary tree, and so on, until a new root node is used as the parent of the root node of the (x-2)-th updated binary tree and the root node of the x-th binary tree to form the expression tree, wherein each new root node is a randomly generated spacing constraint and x is the number of genes in the genotype candidate pattern coding;
traversing the expression tree in in-order to produce the candidate contrast sequence pattern.
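The decoding described in this claim can be sketched as follows, assuming a gene is a list in which a spacing constraint is a `(min, max)` tuple taking two operands and a character is a string. The names `decode_gene` and `decode_coding`, the dictionary-based tree nodes, and the `links` parameter holding the new root nodes are hypothetical.

```python
from collections import deque

def decode_gene(gene):
    """Build the tree level by level, then read it back in in-order."""
    root = {"value": gene[0], "children": []}
    queue = deque([root])
    pos = 1  # next unused gene element
    while queue:
        node = queue.popleft()
        if isinstance(node["value"], tuple):  # spacing constraint: two children
            for _ in range(2):
                child = {"value": gene[pos], "children": []}
                pos += 1
                node["children"].append(child)
                queue.append(child)

    def in_order(node):
        if not node["children"]:
            return [node["value"]]
        left, right = node["children"]
        return in_order(left) + [node["value"]] + in_order(right)

    return in_order(root)

def decode_coding(genes, links):
    """Join the gene trees with the linking constraints; return (pattern, gaps)."""
    flat = decode_gene(genes[0])
    for link, gene in zip(links, genes[1:]):
        # In-order traversal of a new root: left subtree, root, right subtree.
        flat = flat + [link] + decode_gene(gene)
    return flat[0::2], flat[1::2]  # characters, spacing constraints
```

Elements of a gene that are never reached while the tree is built are simply ignored; they act as spare material for later genetic operations.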
9. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 7, characterized in that the predefined genetic operations include a mutation operation, an insert-string operation and a recombination operation;
the mutation operation includes: mutating a spacing constraint in a gene into a character or into another spacing constraint; and mutating a character in a gene into a spacing constraint or into another character;
the insert-string operation includes: randomly selecting y1 consecutive elements in a gene and inserting them at any position in the head other than before the first element of the head, then deleting the last y1 elements of the original head to keep the head length unchanged, wherein y1 ≥ 1; randomly selecting y2 consecutive elements in a gene that begin with a spacing constraint and inserting them before the first element of the head, then deleting the last y2 elements of the original head to keep the head length unchanged, wherein y2 ≥ 1; and moving any gene of the genotype candidate pattern coding other than the first gene to before the first gene;
the recombination operation includes: exchanging the element at the same position in two genotype candidate pattern codings; exchanging at least two elements at the same positions in two genotype candidate pattern codings; and exchanging the gene at the same position in two genotype candidate pattern codings.
10. The heuristic mining method for optimal contrast sequence patterns with free spacing constraints according to claim 1, characterized in that the method termination condition is the execution time of the method, the number of times the method has been executed, or the stability of the qualifying results.
CN201610831506.6A 2016-09-19 2016-09-19 Heuristic mining method of optimal comparing sequence mode of free interval constraint Pending CN106339609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610831506.6A CN106339609A (en) 2016-09-19 2016-09-19 Heuristic mining method of optimal comparing sequence mode of free interval constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610831506.6A CN106339609A (en) 2016-09-19 2016-09-19 Heuristic mining method of optimal comparing sequence mode of free interval constraint

Publications (1)

Publication Number Publication Date
CN106339609A true CN106339609A (en) 2017-01-18

Family

ID=57838934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610831506.6A Pending CN106339609A (en) 2016-09-19 2016-09-19 Heuristic mining method of optimal comparing sequence mode of free interval constraint

Country Status (1)

Country Link
CN (1) CN106339609A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20170118