CN109033341A - A concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints - Google Patents
- Publication number
- CN109033341A (application number CN201810811661.0A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- mode
- concurrent
- constraint
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints, comprising: inputting the data sets and parameters in a prescribed format; scanning the data sets to generate the set of candidate elements together with the position information of every element in it; traversing all candidate patterns along a set-enumeration tree to find the k patterns with the most significant contrast; and outputting those k patterns to a specified location. The invention introduces the top-k concept into gap-constrained contrast sequential pattern mining. Top-k contrast sequential pattern mining with gap constraints aims to find the k contrast sequential patterns whose support changes most significantly between two data sets. The method avoids missing useful patterns because of an ill-chosen threshold; the user only has to set the number of desired patterns, which makes it considerably easier to use than earlier methods; and it improves the interpretability of the mining results.
Description
Technical field
The present invention relates to sequence data mining within the field of computer data mining, and more particularly to a concurrency-based contrast sequential pattern mining algorithm that handles gap-constrained pattern mining and replaces specific support thresholds with the top-k concept.
Background art
Since Agrawal and Srikant proposed sequential pattern mining, it has become an important data mining task and has attracted the attention of many researchers. Many kinds of sequential patterns have been proposed in succession, such as frequent sequential patterns, contrast sequential patterns, closed patterns, partial-order patterns and periodic patterns. Sequential patterns are widely applied in real life. For example, disease control departments can mine how infectious diseases spread over time series, and the mining results can be used to discover the spatio-temporal clustering and outbreak regularities of infectious diseases, providing a reference for prevention and control. Bioscientists can analyze DNA and protein sequences to find the root causes of diseases and develop new drugs. Power companies can analyze historical electricity consumption data to improve the accuracy of power load forecasting.
The widely used concept of a gap constraint makes pattern matching more flexible in sequence mining. A gap constraint is an interval determined by two non-negative integers that give the minimum and maximum number of elements allowed between two adjacent elements of a sequential pattern. For example, let the gap constraint be [1,3] and the sequential pattern be P = at. If P matches a sequence S, then S contains an element a and an element t such that a precedes t and at least 1 and at most 3 elements lie between them.
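The [1,3] example can be checked mechanically. The following is a minimal Python sketch, not the patent's implementation: the function name `matches` and the representation of sequences as strings are assumptions made for illustration. It tracks the positions at which the part of the pattern matched so far can end, and extends them only to positions that leave between `lo` and `hi` elements in between.

```python
def matches(pattern, sequence, gap):
    """Return True if `pattern` occurs in `sequence` under gap constraint [lo, hi]."""
    if not pattern:
        return True
    lo, hi = gap
    # Positions where the first pattern element can be matched.
    current = {i for i, x in enumerate(sequence) if x == pattern[0]}
    for element in pattern[1:]:
        nxt = set()
        for p in current:
            # Leave between lo and hi elements between position p and position q.
            last = min(p + hi + 1, len(sequence) - 1)
            for q in range(p + lo + 1, last + 1):
                if sequence[q] == element:
                    nxt.add(q)
        current = nxt
        if not current:
            return False
    return bool(current)
```

With gap constraint [1,3], `at` matches `acgt` (two elements lie between a and t), but it matches neither `at` itself (no element in between) nor `acccct` (four elements in between).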
In earlier work on contrast sequential pattern mining, the user has to set a positive support threshold α, a negative support threshold β and a gap constraint γ. The goal is to mine, under the gap constraint, the minimal patterns whose positive support is greater than or equal to α and whose negative support is less than or equal to β. Such mining algorithms have two problems: (a) it is hard for the user to set suitable support thresholds, and with unsuitable thresholds the mined patterns may not meet the user's expectations; (b) pruning with the minimality constraint reduces the search space but also cuts away some useful patterns.
Summary of the invention
In view of the above drawbacks of the prior art, the object of the invention is to provide a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints. Dividing the work into concurrent tasks can greatly improve the efficiency of contrast sequential pattern mining, and the task-division principle makes the algorithm relatively easy to implement on the Hadoop platform, which further improves its efficiency and applicability.
The object of the invention is achieved by the following technical solution: a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints, comprising:
S1: inputting the data sets and parameters in a prescribed format;
S2: scanning the data sets to generate the set of candidate elements and the position information of every element in it;
S3: traversing all candidate patterns along the set-enumeration tree to find the k patterns with the most significant contrast;
S4: outputting the k patterns with the most significant contrast to a specified location.
Further, the data sets and parameters input in step S1 comprise: a) a positive data set; b) a negative data set; c) a gap constraint; d) the value k.
Further, step S2 specifically comprises:
S211: scanning the positive data set among the input data sets;
S212: for the input sequence data set, traversing every sequence in order and, for each position of a sequence, using the value of the element at that position, the number of the sequence it belongs to and its position within that sequence to update the information held for that element in the element set;
S213: if the element is not yet in the set, adding it to the set and then storing the sequence number and the position within the sequence in the information storage structure of that element;
S214: if the element is already in the set, only storing the sequence number and the position within the sequence in the information storage structure of that element.
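Steps S211 to S214 amount to building an inverted index from each element to the places where it occurs. A minimal sketch, assuming sequences are strings and that the information storage structure of an element is a list of (sequence number, position) pairs; both choices and the name `scan_positive` are illustrative assumptions, not details given in the patent:

```python
def scan_positive(dataset):
    """Build element -> [(sequence_number, position), ...] from the positive data set."""
    index = {}
    for seq_no, sequence in enumerate(dataset):       # S212: every sequence, in order
        for pos, element in enumerate(sequence):
            if element not in index:                  # S213: new element joins the set
                index[element] = []
            index[element].append((seq_no, pos))      # S213/S214: record the location
    return index
```

Keeping every occurrence, rather than only a count, lets later steps test gap constraints between element positions without rescanning the raw sequences.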
Further, step S2 also comprises:
S221: scanning the negative data set among the input data sets;
S222: for the input sequence data set, traversing every sequence in order and, for each position of a sequence, using the value of the element at that position, the number of the sequence it belongs to and its position within that sequence to update the information held for that element in the element set;
S223: if the element is not in the set, discarding it;
S224: if the element is already in the set, only storing the sequence number and the position within the sequence in the information storage structure of that element.
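The negative scan differs from the positive one only in S223: elements that were not registered during the positive scan are skipped. A minimal sketch under the same illustrative assumptions (string sequences, lists of (sequence number, position) pairs); `scan_negative` and its `candidates` parameter are hypothetical names:

```python
def scan_negative(dataset, candidates):
    """Record negative-set locations only for elements already in the candidate set."""
    index = {element: [] for element in candidates}
    for seq_no, sequence in enumerate(dataset):
        for pos, element in enumerate(sequence):
            if element in index:                      # S224: known element, record it
                index[element].append((seq_no, pos))
            # S223: unknown elements are simply skipped
    return index
```

A candidate that never occurs in the negative data set keeps an empty list, which is what one would expect for an element that can only raise a pattern's contrast.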
Further, step S21 comprises: scanning the sequences with MapReduce, i.e. distributing the sequences to multiple slave nodes for scanning and finally aggregating the candidate-element result sets produced by the nodes to obtain the final element set.
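Since each sequence can be indexed without looking at any other sequence, the scan splits into a map step (one sequence in, one partial index out) and a reduce step (merging partial indexes). The sketch below imitates that split in plain Python rather than on Hadoop; `map_sequence`, `merge_indexes` and `scan` are hypothetical names, and a real MapReduce job would run the map calls on separate nodes.

```python
from collections import defaultdict
from functools import reduce

def map_sequence(item):
    """Map step: index one (sequence_number, sequence) pair on its own."""
    seq_no, sequence = item
    partial = defaultdict(list)
    for pos, element in enumerate(sequence):
        partial[element].append((seq_no, pos))
    return partial

def merge_indexes(left, right):
    """Reduce step: fold one partial index into the accumulated index."""
    for element, locations in right.items():
        left[element].extend(locations)
    return left

def scan(dataset):
    partials = map(map_sequence, enumerate(dataset))
    return dict(reduce(merge_indexes, partials, defaultdict(list)))
```

Because the map step shares no state between sequences, the built-in `map` could be swapped for a process pool or a cluster job without changing the reduce step.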
Further, step S3 also comprises pruning strategies, specifically:
S311: pruning strategy 1: for any element e ∈ Σ, if e occurs in D- and does not occur in D+, e is removed from Σ;
S312: pruning strategy 2: during top-k contrast sequential pattern mining, let the current result set be R. When |R| = k, i.e. R already holds k patterns, let CRk be the minimum contrast among these k patterns. If the support of a candidate pattern P in the positive data set is less than CRk, i.e. Sup(P, D+) < CRk, the node corresponding to P and all of its descendants are cut off;
S313: pruning strategy 3: during top-k contrast sequential pattern mining, when |R| = k and the minimum contrast among these k patterns is CRk, then for any element e ∈ Σ with Sup(e, D+) < CRk, all nodes of the set-enumeration tree that contain e are cut off;
In the above pruning strategies, Σ is the alphabet, i.e. a finite set of elements; e is an element, e ∈ Σ; D+ is the positive data set; D- is the negative data set; Sup(P, D) is the support of pattern P in data set D under gap constraint γ; CR(P, D+, D-) is the contrast of pattern P between data sets D+ and D- under gap constraint γ.
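The three strategies reduce to a few comparisons once CRk is known. The sketch below is illustrative only: all function names are hypothetical, the result set R is assumed to be a list of (contrast, pattern) pairs, supports are assumed to be precomputed numbers, and k is assumed to be at least 1.

```python
def prune_alphabet(alphabet, pos_index):
    """Strategy 1: drop elements that never occur in the positive data set."""
    return [e for e in alphabet if pos_index.get(e)]

def threshold(results, k):
    """CRk: the smallest contrast in R once |R| = k, otherwise None (no pruning yet)."""
    return min(c for c, _ in results) if len(results) == k else None

def prune_pattern(pos_support, cr_k):
    """Strategy 2: cut P and all its descendants when Sup(P, D+) < CRk."""
    return cr_k is not None and pos_support < cr_k

def prune_elements(alphabet, elem_pos_support, cr_k):
    """Strategy 3: keep only elements e with Sup(e, D+) >= CRk."""
    if cr_k is None:
        return list(alphabet)
    return [e for e in alphabet if elem_pos_support[e] >= cr_k]
```

Strategies 2 and 3 compare a positive support against a contrast value, which suggests that a pattern's contrast is bounded above by its support in D+; the sketch relies on that reading and does nothing until R is full.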
Further, step S3 also comprises splitting the set-enumeration tree, with the following steps:
S321: splitting the set-enumeration tree, whose root is the empty node, into n parts;
S322: distributing the parts in turn to n instances for traversal search, each instance being independently responsible for its own subtrees.
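One way to realize S321 and S322 is to treat each first-level subtree as a unit of work and hand the units to the currently least-loaded instance. The sketch below is an assumption-laden illustration: `assign_subtrees` is a hypothetical name, and the per-subtree workload estimates in `costs` are invented inputs, since the text does not say how workload is measured. With equal estimates it degenerates to round-robin distribution.

```python
import heapq

def assign_subtrees(costs, n):
    """costs: subtree root -> estimated workload. Returns n lists of subtree roots."""
    loads = [(0, i) for i in range(n)]        # (current load, instance id)
    heapq.heapify(loads)
    buckets = [[] for _ in range(n)]
    # Place the heaviest subtrees first, each on the least-loaded instance so far.
    for root, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(loads)
        buckets[i].append(root)
        heapq.heappush(loads, (load + cost, i))
    return buckets
```

Because every subtree goes to exactly one instance, no candidate pattern is examined twice.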
Further, step S3 also comprises a subtree traversal process:
S331: each instance traverses its assigned subtree in depth-first (or breadth-first) order, reading the global result set during the traversal in order to apply pruning strategies 2 and 3;
S332: when a pattern is found in the subtree that satisfies the definition globally, the global result set is updated;
S333: the search continues until the subtree has been fully searched, and the global result set is updated once more before the instance terminates.
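The traversal of a single subtree can be sketched as follows. This is a single-process illustration, not the patent's implementation, and it rests on several labeled assumptions: support is taken to be the fraction of sequences in which the pattern matches, contrast is taken to be Sup(P, D+) minus Sup(P, D-), patterns grow by appending one element, `max_len` is an artificial bound that keeps the example finite, and `results` is a plain list standing in for the global result set (a concurrent version would need locking).

```python
def occurs(pattern, seq, gap):
    """True if the non-empty `pattern` matches `seq` under gap constraint [lo, hi]."""
    lo, hi = gap
    ends = {i for i, x in enumerate(seq) if x == pattern[0]}
    for e in pattern[1:]:
        ends = {q for p in ends
                for q in range(p + lo + 1, min(p + hi + 1, len(seq) - 1) + 1)
                if seq[q] == e}
    return bool(ends)

def support(pattern, dataset, gap):
    """Assumed definition: fraction of sequences in which the pattern matches."""
    return sum(occurs(pattern, s, gap) for s in dataset) / len(dataset)

def mine_subtree(root, alphabet, pos, neg, gap, k, results, max_len=3):
    """Depth-first search of the subtree rooted at `root`; `results` plays the role of R."""
    stack = [root]
    while stack:
        p = stack.pop()
        sp = support(p, pos, gap)
        # S331: read the global result set; CRk exists only once R holds k patterns.
        cr_k = results[-1][0] if len(results) == k else None
        if cr_k is not None and sp < cr_k:        # pruning strategy 2
            continue
        contrast = sp - support(p, neg, gap)      # assumed contrast definition
        if cr_k is None or contrast > cr_k:       # S332: update the global result set
            results.append((contrast, p))
            results.sort(reverse=True)            # best first, so results[-1] is CRk
            del results[k:]
        if len(p) < max_len:
            stack.extend(p + e for e in reversed(alphabet))
    return results
```

Skipping a pruned node with `continue` also skips its extensions, which is how the sketch cuts off all descendants in one step.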
Further, the data set scanning is a scan of the sequences implemented with MapReduce, i.e. the sequences are distributed to multiple slave nodes for scanning, and the candidate-element result sets produced by the nodes are finally aggregated to obtain the final element set.
By adopting the above technical solution, the invention has the following advantages:
(1) The invention introduces the top-k concept into gap-constrained contrast sequential pattern mining. Top-k contrast sequential pattern mining with gap constraints aims to find the k contrast sequential patterns whose support changes most significantly between two data sets. The method avoids missing useful patterns because of an ill-chosen threshold; the user only has to set the number of desired patterns, which makes it considerably easier to use than earlier methods; and it improves the interpretability of the mining results.
(2) The invention decomposes the algorithm's work into multiple parallel parts and hands them to multiple threads (or multiple Hadoop nodes) to be processed simultaneously, thereby accelerating the algorithm.
(3) Dividing the work into concurrent tasks greatly improves the efficiency of contrast sequential pattern mining, and the task-division principle makes the algorithm relatively easy to implement on the Hadoop platform, further improving its efficiency and applicability. Existing contrast sequential pattern mining algorithms require the user to set a positive and a negative support threshold. Without sufficient prior knowledge it is hard to set appropriate thresholds, so useful patterns may be missed. The invention introduces the top-k concept into contrast sequential pattern mining, so the user does not have to set support thresholds, which makes the algorithm easier to use and its results easier to interpret. In addition, several pruning strategies and a heuristic strategy are designed to accelerate the algorithm.
Other advantages, objects and features of the invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art upon examination of what follows, or may be learned from the practice of the invention.
Description of the drawings
The drawings of the invention are as follows:
Fig. 1 is a flow diagram of the concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints.
Fig. 2 is a schematic diagram of the framework of the concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints.
Fig. 3 is a schematic diagram of the storage structure for element support.
Fig. 4 is a schematic diagram of the set-enumeration tree.
Fig. 5 is a schematic diagram of concurrent task distribution.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Embodiment 1: as shown in Figs. 1 to 5, a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints comprises:
S1: inputting the data sets and parameters in a prescribed format;
S2: scanning the data sets to generate the set of candidate elements and the position information of every element in it;
S3: traversing all candidate patterns along the set-enumeration tree to find the k patterns with the most significant contrast;
S4: outputting the k patterns with the most significant contrast to a specified location.
The data sets and parameters input in step S1 comprise: a) a positive data set; b) a negative data set; c) a gap constraint; d) the value k.
Step S2 specifically comprises:
S211: scanning the positive data set among the input data sets;
S212: for the input sequence data set, traversing every sequence in order and, for each position of a sequence, using the value of the element at that position, the number of the sequence it belongs to and its position within that sequence to update the information held for that element in the element set;
S213: if the element is not yet in the set, adding it to the set and then storing the sequence number and the position within the sequence in the information storage structure of that element;
S214: if the element is already in the set, only storing the sequence number and the position within the sequence in the information storage structure of that element.
Step S2 also comprises:
S221: scanning the negative data set among the input data sets;
S222: for the input sequence data set, traversing every sequence in order and, for each position of a sequence, using the value of the element at that position, the number of the sequence it belongs to and its position within that sequence to update the information held for that element in the element set;
S223: if the element is not in the set, discarding it;
S224: if the element is already in the set, only storing the sequence number and the position within the sequence in the information storage structure of that element.
Step S21 comprises: scanning the sequences with MapReduce, i.e. distributing the sequences to multiple slave nodes for scanning and finally aggregating the candidate-element result sets produced by the nodes to obtain the final element set.
Step S3 also comprises pruning strategies, specifically:
S311: pruning strategy 1: for any element e ∈ Σ, if e occurs in D- and does not occur in D+, e is removed from Σ;
S312: pruning strategy 2: during top-k contrast sequential pattern mining, let the current result set be R. When |R| = k, i.e. R already holds k patterns, let CRk be the minimum contrast among these k patterns. If the support of a candidate pattern P in the positive data set is less than CRk, i.e. Sup(P, D+) < CRk, the node corresponding to P and all of its descendants are cut off;
S313: pruning strategy 3: during top-k contrast sequential pattern mining, when |R| = k and the minimum contrast among these k patterns is CRk, then for any element e ∈ Σ with Sup(e, D+) < CRk, all nodes of the set-enumeration tree that contain e are cut off;
In the above pruning strategies, Σ is the alphabet, i.e. a finite set of elements; e is an element, e ∈ Σ; D+ is the positive data set; D- is the negative data set; Sup(P, D) is the support of pattern P in data set D under gap constraint γ; CR(P, D+, D-) is the contrast of pattern P between data sets D+ and D- under gap constraint γ.
Step S3 also comprises splitting the set-enumeration tree, with the following steps:
S321: splitting the set-enumeration tree, whose root is the empty node, into n parts;
S322: distributing the parts in turn to n instances for traversal search, each instance being independently responsible for its own subtrees.
Step S3 also comprises a subtree traversal process:
S331: each instance traverses its assigned subtree in depth-first (or breadth-first) order, reading the global result set during the traversal in order to apply pruning strategies 2 and 3;
S332: when a pattern is found in the subtree that satisfies the definition globally, the global result set is updated;
S333: the search continues until the subtree has been fully searched, and the global result set is updated once more before the instance terminates.
The data set scanning is a scan of the sequences implemented with MapReduce, i.e. the sequences are distributed to multiple slave nodes for scanning, and the candidate-element result sets produced by the nodes are finally aggregated to obtain the final element set.
Embodiment 2: this embodiment is the preferred embodiment. As shown in Figs. 1 to 5, in a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints, given the input positive data set (D+), negative data set (D-), gap constraint γ and the number k of most significant contrast patterns that the user wants to mine, the invention decomposes the work into multiple parallel parts, executes these tasks concurrently, and finally obtains the k contrast sequential patterns that satisfy the gap constraint and whose support changes most significantly from the positive data set to the negative data set.
To mine top-k contrast sequential patterns with gap constraints efficiently and effectively, the invention has to address four aspects: validity, applicability, efficiency, and scalability to high-dimensional element sets.
Validity means mining all patterns that satisfy the definition without omission. What must be guaranteed is that any candidate pattern whose contrast is among the k largest of all contrast patterns is certain to be in the result set, and that every pattern in the result set satisfies the definition.
Applicability: sequence data may itself contain noise, and the data at some points of a sequence may be wrong. If patterns were searched strictly following the order of the sequence, some potential patterns could be missed: these patterns may satisfy the definition, but because the sequence data contains noise they cannot be obtained from it directly by strict matching. The invention therefore has to introduce additional constraints to improve the applicability of the mined patterns.
Efficiency: to guarantee the validity of its results, the algorithm has to traverse the full search space. For larger data sets the number of candidate patterns to enumerate becomes huge, which can push the running time beyond an acceptable range. The invention therefore has to improve the running efficiency of the algorithm.
Scalability to high-dimensional element sets is similar to efficiency: the set-enumeration tree traverses the whole search space, and as the dimension of the element set grows, the number of element combinations grows exponentially, so the running time becomes unacceptable when the element set is high-dimensional. The invention therefore has to improve the running efficiency of the algorithm in this case as well.
To guarantee validity, the invention searches and checks candidate patterns by means of a set-enumeration tree: the algorithm enumerates the candidate elements of the set in a fixed order, with the pattern length growing from 1, until the whole space has been searched. This search can be cast as the traversal of a tree formed by the combinations of the set's elements. The two most common tree traversals are depth-first and breadth-first: 1) depth-first traversal saves memory; 2) breadth-first traversal can speed up the algorithm.
Which traversal to use at run time can be chosen according to the size of the data set, the time requirements and the respective characteristics of depth-first and breadth-first traversal, so as to meet the user's needs.
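The two traversal orders differ only in which end of the frontier is consumed next. The sketch below is a hedged illustration: patterns are built by string concatenation, `enumerate_patterns` is a hypothetical name, and `max_len` is an artificial bound added so that the example terminates; in the algorithm described here the search space is bounded by pruning instead.

```python
from collections import deque

def enumerate_patterns(alphabet, max_len, depth_first=True):
    """Yield every pattern over `alphabet` up to `max_len`, starting from the empty root."""
    frontier = deque([""])
    while frontier:
        # A stack gives depth-first order, a queue gives breadth-first order.
        pattern = frontier.pop() if depth_first else frontier.popleft()
        if pattern:
            yield pattern
        if len(pattern) < max_len:
            children = [pattern + e for e in alphabet]
            # Reversed for the stack so that children are visited in alphabet order.
            frontier.extend(reversed(children) if depth_first else children)
```

For the alphabet `ab` and `max_len=2`, depth-first order yields a, aa, ab, b, ba, bb, while breadth-first order yields a, b, aa, ab, ba, bb. The depth-first frontier holds only the unexplored siblings along the current path, which is the memory saving mentioned above.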
To guarantee applicability, i.e. to let the mining algorithm find patterns on noisy data sets as well, the invention introduces the gap constraint. The gap constraint makes pattern matching more flexible. A gap constraint is an interval determined by two non-negative integers that give the minimum and maximum number of elements allowed between two adjacent elements of a sequential pattern. Introducing it extends pattern matching from substrings of the sequence to subsequences. Only gaps between the minimum and the maximum are considered, rather than all subsequences, because on many data sets gaps above a certain value or below a certain value are not meaningful, and any gap constraint can be expressed with a minimum and a maximum.
To guarantee efficiency, the invention uses three pruning strategies and one heuristic acceleration strategy, which greatly reduce the number of candidate patterns the algorithm has to traverse. To explain the pruning strategies more clearly, the commonly used symbols are listed in Table 1.
Table 1: Definitions of the symbols used in the invention
| Symbol | Meaning |
|---|---|
| Σ | Alphabet, i.e. a finite set of elements |
| e | An element, e ∈ Σ |
| D+ | Positive data set |
| D- | Negative data set |
| Sup(P, D) | Support of pattern P in data set D under gap constraint γ |
| CR(P, D+, D-) | Contrast of pattern P between data sets D+ and D- under gap constraint γ |
1) Pruning strategies:
A) Pruning strategy 1: for any element e ∈ Σ, if e occurs in D- and does not occur in D+, e is removed from Σ.
B) Pruning strategy 2: during top-k contrast sequential pattern mining, let the current result set be R. When |R| = k, i.e. R already holds k patterns, let CRk be the minimum contrast among these k patterns. If the support of a candidate pattern P in the positive data set is less than CRk, i.e. Sup(P, D+) < CRk, the node corresponding to P and all of its descendants are cut off.
C) Pruning strategy 3: during top-k contrast sequential pattern mining, when |R| = k and the minimum contrast among these k patterns is CRk, then for any element e ∈ Σ with Sup(e, D+) < CRk, all nodes of the set-enumeration tree that contain e are cut off.
2) Heuristic strategy:
In practice it was found that, of the three pruning strategies above, pruning strategy 1 can be applied before enumeration starts. Pruning strategy 2 is generally the most effective, but like pruning strategy 3 it only takes effect after k contrast sequential patterns have been found. Finding k contrast sequential patterns with contrast as large as possible as early as possible therefore speeds up the whole enumeration and avoids unnecessary attempts.
How can k contrast sequential patterns with contrast as large as possible be found early? Given the properties of the set-enumeration tree, a natural idea is to change the enumeration order so that elements more likely to produce contrast sequential patterns are enumerated first. To this end the elements of Σ are sorted: (a) by contrast in descending order; (b) by support in descending order. Traversing the set-enumeration tree in the sorted order is expected to find high-contrast sequential patterns sooner than traversing it in the original order, so that pruning strategies 2 and 3 take effect earlier and the running time of the algorithm decreases.
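The text does not say how criteria (a) and (b) are combined. One possible reading, used in the sketch below, is to sort by contrast first and to break ties by positive support. The name `order_elements`, the shape of the `stats` mapping, and the contrast formula (positive support minus negative support) are all assumptions made for illustration.

```python
def order_elements(stats):
    """stats: element -> (positive_support, negative_support). Most promising first."""
    def key(element):
        pos, neg = stats[element]
        return (pos - neg, pos)          # (assumed contrast, positive support)
    return sorted(stats, key=key, reverse=True)
```

Feeding the enumeration with this order costs a single sort over Σ and leaves the set of explored patterns unchanged; only the moment at which strategies 2 and 3 start to apply moves.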
To guarantee scalability to high-dimensional element sets, the invention uses multiple instances to improve the running efficiency and reduce the running time of the algorithm. Using multiple instances raises two questions: 1) how tasks are allocated, and 2) how result sets are aggregated.
1) How tasks are allocated: to avoid repeating a considerable amount of computation, the invention splits the set-enumeration tree into multiple independent, disjoint subtrees and distributes them to n concurrent instances according to their likely workload. This greatly improves efficiency and keeps the workload of the instances roughly equal, avoiding the situation in which some instances finish quickly while others need a long time, so that the overall running time falls short of expectations. The instances may be threads, processes, or even separate servers.
2) How result sets are aggregated: results can be aggregated in essentially two ways, or in a combination of the two:
A) Aggregate every time a result is produced. The result set is then globally synchronized, so multiple instances contend to write to it, which causes some blocking and lowers the parallel efficiency of the algorithm. However, this mode can use the optimization strategies proposed above, so although each instance may block, the space it has to search is greatly reduced.
B) Aggregate only after each instance has found its own top-k contrast sequential patterns. The result set of each running instance is then independent and is aggregated only after the instance has finished. Instances never block on writing the result set, but pruning strategies 2 and 3 cannot be used in this mode, so the running time of each instance increases considerably.
C) Combine the two strategies above: aggregate once after every u patterns have been produced. This lowers the probability of blocking and also shortens the running time of each instance, striking a balance between the two strategies above, but choosing the value of u requires prior knowledge.
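The batch size u can be seen as a dial between the first two modes: u = 1 aggregates after every pattern, and a very large u aggregates only when an instance finishes. The sketch below uses Python threads and a lock as stand-ins for instances and the synchronized result set; `SharedTopK`, `worker` and the `found` iterable of (contrast, pattern) pairs are hypothetical names, and the actual pattern search is left out.

```python
import heapq
import threading

class SharedTopK:
    """Global result set: keeps the k (contrast, pattern) pairs with the largest contrast."""
    def __init__(self, k):
        self.k = k
        self._heap = []                    # min-heap, smallest contrast at index 0
        self._lock = threading.Lock()

    def merge(self, batch):
        with self._lock:                   # one lock acquisition per batch
            for item in batch:
                if len(self._heap) < self.k:
                    heapq.heappush(self._heap, item)
                elif item[0] > self._heap[0][0]:
                    heapq.heapreplace(self._heap, item)

    def snapshot(self):
        with self._lock:
            return sorted(self._heap, reverse=True)

def worker(found, shared, u):
    """Push locally found (contrast, pattern) pairs to `shared` every `u` patterns."""
    batch = []
    for item in found:
        batch.append(item)
        if len(batch) >= u:
            shared.merge(batch)
            batch = []
    if batch:                              # final flush before the instance terminates
        shared.merge(batch)
```

A larger u means fewer lock acquisitions but a staler global threshold for pruning, which mirrors the trade-off described in C).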
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings of the embodiments.
Fig. 2 is a schematic diagram of the framework of a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints. The framework consists of four main parts:
1) Input: input the data sets and parameters in the prescribed format;
2) Generate candidate elements: scan the data sets to generate the set of candidate elements and the position information of every element in it;
3) Generate candidate patterns: traverse all possible candidate patterns along the set-enumeration tree to find the k patterns with the most significant contrast;
4) Output: output the k patterns with the most significant contrast to the specified location.
Fig. 1 is the detailed flow diagram of the invention. The flow of the framework is as follows:
1) Input information, which consists of four parts:
A) the positive data set;
B) the negative data set;
C) the gap constraint;
D) the value k.
2) Scan the positive data set: the present invention traverses the input sequences in order. For the element at each position of a sequence, it uses the element's value, the number of the sequence, and the position within that sequence to update the information stored for that element in the element set. If the element is not yet in the set, it is added, and the sequence number and position are stored in the element's information storage structure; if the element already exists, only the sequence number and position are added to that structure. Note that the scans of different sequences are mutually independent: each sequence produces its result information on its own and does not conflict with any other sequence. The present invention therefore uses MapReduce to scan the sequences: the sequences are distributed to multiple slave nodes for scanning, and the candidate-element result sets produced by those nodes are finally aggregated into the final element set.
3) Pruning Strategy 1: according to Pruning Strategy 1, if an element occurs only in the negative data set, it cannot be part of any result of the algorithm. Equivalently, when scanning the negative data set, only the elements that occur in the positive data set need to be considered.
4) Scan the negative data set: this is essentially the same as scanning the positive data set, with one difference. The sequences are traversed in order, and for the element at each position of a sequence, the element's value, the number of the sequence, and the position within that sequence are used to update the information stored for that element in the element set. If the element is not in the set, it is discarded; if the element already exists, only the sequence number and position are added to the element's information storage structure. As before, MapReduce is used to speed up the scan and reduce the running time of the algorithm.
5) Heuristic Strategy 1: the generated candidate element set is sorted according to Heuristic Strategy 1 (only the enumeration order changes). In the new enumeration order, the earlier an element appears, the more likely it is to generate a desired pattern, and the later it appears, the less likely. To find the k patterns with the most significant contrast sooner, the instances traverse the set-enumeration tree in this new order rather than in the original order.
6) Divide the set-enumeration tree: the set-enumeration tree whose root node is empty is divided into n parts, which are distributed to n instances in one pass; each instance independently traverses and searches the subtrees it is responsible for. The general method is to perform a breadth-first traversal down to level l, which yields m subtrees, sort these m subtrees in the lexicographic order induced by the Heuristic Strategy, and then assign the i-th subtree to the instance numbered i % n.
7) Subtree traversal: each instance traverses its assigned subtrees in depth-first (or breadth-first) order. During the traversal it reads the global result set in order to apply Pruning Strategies 2 and 3, which speeds up the algorithm. When a pattern that satisfies the definition globally is found in a subtree, the global result set is updated. The search then continues until the subtree has been searched completely, and the global result set is updated once more before the instance terminates.
8) Output the result: output the top-k contrast sequential patterns with gap constraints as required.
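The subtree traversal of step 7 with Pruning Strategy 2 can be sketched as below. Several points are assumptions made only for illustration and are not confirmed by this section: the gap constraint is taken as a pair `(min_gap, max_gap)` counting the elements allowed between two adjacent pattern elements, support is the fraction of sequences that match, contrast is the positive support minus the negative support, and patterns and sequences are plain strings. The function names are hypothetical, and the zero-support stop is added only so that the search terminates.

```python
import heapq


def matches(pattern, seq, min_gap, max_gap):
    """True if pattern occurs in seq with min_gap..max_gap elements between neighbours."""
    def extend(i, pos):
        if i == len(pattern):
            return True
        lo, hi = pos + 1 + min_gap, pos + 1 + max_gap
        return any(seq[q] == pattern[i] and extend(i + 1, q)
                   for q in range(lo, min(hi, len(seq) - 1) + 1))
    return any(seq[q] == pattern[0] and extend(1, q) for q in range(len(seq)))


def support(pattern, dataset, gap):
    """Fraction of sequences in dataset that contain pattern under the gap constraint."""
    return sum(matches(pattern, s, *gap) for s in dataset) / len(dataset)


def mine_subtree(root, elements, pos, neg, gap, k, top=None):
    """Depth-first search below a non-empty root; top is a min-heap of (contrast, pattern)."""
    top = [] if top is None else top
    stack = [root]
    while stack:
        p = stack.pop()
        sup_pos = support(p, pos, gap)
        cr_k = top[0][0] if len(top) == k else None
        # Pruning Strategy 2, plus a zero-support stop (an added assumption).
        if sup_pos == 0 or (cr_k is not None and sup_pos < cr_k):
            continue
        contrast = sup_pos - support(p, neg, gap)
        if len(top) < k:
            heapq.heappush(top, (contrast, p))
        elif (contrast, p) > top[0]:
            heapq.heapreplace(top, (contrast, p))
        # Push children in reverse so the first element in enumeration order is visited first.
        stack.extend(p + e for e in reversed(elements))
    return sorted(top, reverse=True)
```

Under the assumed definition, the contrast of a pattern never exceeds its positive support, and extending a pattern never increases its support, so skipping a node whose positive support is below the current k-th contrast cannot discard a result. In a concurrent setting, `top` would be replaced by the shared global result set.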
Fig. 3 is a schematic diagram of the element occurrence information storage structure provided by an embodiment of the present invention. The present invention uses a set to store the values of all elements, implemented as a hash table in which the key is the element's value and the value is the structure holding that element's occurrence information. The occurrence information of each element uses a list to store the numbers of the sequences in which it occurs, and each sequence in turn has a container that stores the specific positions at which the element occurs in that sequence. For the same element, the occurrence information for the positive data set and for the negative data set is stored separately.
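One possible in-memory layout for this storage structure is sketched below. The names `Occurrences` and `record` are hypothetical, and a Python `dict` stands in for both the hash table and the per-sequence list; the figure itself may organize the containers differently.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrences:
    """Occurrence information of one element: sequence number -> positions."""
    positive: dict = field(default_factory=dict)   # occurrences in the positive data set
    negative: dict = field(default_factory=dict)   # occurrences in the negative data set


def record(table, element, seq_id, pos, is_positive):
    """Add one occurrence of element to the hash table keyed by element value."""
    info = table.setdefault(element, Occurrences())
    target = info.positive if is_positive else info.negative
    target.setdefault(seq_id, []).append(pos)
```

For example, after `record(table, "a", 0, 2, True)` and `record(table, "a", 0, 5, True)`, the entry `table["a"].positive` maps sequence 0 to positions 2 and 5, while `table["a"].negative` stays empty.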
Fig. 4 is a schematic diagram of the set-enumeration tree provided by an embodiment of the present invention. For the candidate element set Σ = {e1, e2, e3, ..., eD}, the algorithm can build the set-enumeration tree shown in the figure. The root node is empty, and the next level consists of all patterns of length 1 formed from the candidate element set Σ, giving D child nodes. Any node p likewise has D child nodes, whose values are {pe1, pe2, pe3, ..., peD}.
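The child relation of the set-enumeration tree can be written in a few lines. This is only a sketch: representing a pattern as a tuple of elements and the helper names `children` and `count_nodes` are assumptions for illustration.

```python
def children(pattern, elements):
    """Children of node p are p followed by each candidate element, in enumeration order."""
    return [pattern + (e,) for e in elements]


def count_nodes(d, depth):
    """Number of nodes from the empty root down to the given depth when every node has d children."""
    return sum(d ** level for level in range(depth + 1))
```

Because every node has D children, the number of nodes grows as D to the power of the depth, which is why the pruning strategies matter for keeping the search tractable.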
Fig. 5 is a schematic diagram of concurrent task distribution provided by an embodiment of the present invention. In the illustration, tasks are divided using all patterns of length 1 as subtree roots; if n > D, patterns of length 2, 3, or up to l can be used as roots for task distribution. Because elements ranked earlier are more likely to produce high-contrast patterns when they are the first element of a pattern, the algorithm does not assign them all to the same running instance; instead it distributes the subtrees in an interleaved manner, so that every instance is likely to generate high-contrast patterns.
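The interleaved distribution can be sketched as a round-robin assignment of subtree roots. The function names are hypothetical, and `elements` is assumed to be already sorted by the heuristic order; `itertools.product` is used because it yields tuples in lexicographic order with respect to the order of its input.

```python
from itertools import product


def level_nodes(elements, level):
    """All subtree roots of the given length, in the order induced by elements."""
    return [tuple(p) for p in product(elements, repeat=level)]


def assign_subtrees(elements, level, n):
    """Assign the i-th subtree root to instance i % n."""
    tasks = [[] for _ in range(n)]
    for i, root in enumerate(level_nodes(elements, level)):
        tasks[i % n].append(root)
    return tasks
```

With three elements, length-1 roots, and two instances, instance 0 receives the roots for the first and third elements and instance 1 receives the root for the second, so the most promising roots are spread across instances rather than concentrated in one.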
The present invention uses the Hadoop platform to mine contrast sequential patterns under a gap constraint and a top-k constraint. The algorithm comprises two submodules: generating candidate elements and generating candidate patterns. The MapReduce framework of the Hadoop platform is used to generate the candidate elements and their occurrence position information, and Pruning Strategy 1 is used to discard unnecessary elements in the negative data set. The gap constraint specifies the allowed interval, in the original data set, between two adjacent elements of a pattern; the top-k constraint specifies the kind of patterns that are finally returned; Pruning Strategies 2 and 3 and Heuristic Strategy 1 are used to speed up the algorithm; and the MapReduce framework of the Hadoop platform is used to distribute the tasks to improve efficiency.
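The candidate-element submodule can be imitated in a single process as a map step per sequence followed by a reduce step. This sketch does not use the Hadoop API; the function names, the 0-based positions, and the use of strings as sequences are assumptions for illustration only.

```python
from collections import defaultdict


def map_sequence(seq_id, sequence):
    """Map: emit (element, sequence number, position) records for one sequence."""
    return [(e, seq_id, pos) for pos, e in enumerate(sequence)]


def reduce_records(records, allowed=None):
    """Reduce: build element -> {sequence number: [positions]}, optionally filtered."""
    index = defaultdict(dict)
    for e, seq_id, pos in records:
        if allowed is not None and e not in allowed:
            continue  # Pruning Strategy 1: element never occurs in the positive data set
        index[e].setdefault(seq_id, []).append(pos)
    return dict(index)


def generate_candidate_elements(positive, negative):
    """Scan the positive data set first, then keep only its elements in the negative scan."""
    pos_records = [r for i, s in enumerate(positive) for r in map_sequence(i, s)]
    pos_index = reduce_records(pos_records)
    neg_records = [r for i, s in enumerate(negative) for r in map_sequence(i, s)]
    neg_index = reduce_records(neg_records, allowed=pos_index.keys())
    return pos_index, neg_index
```

Each call to `map_sequence` depends on one sequence only, which is the independence property that allows the scan to be spread over several nodes before the results are aggregated.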
The present invention provides a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints. Concurrent task division can greatly improve the efficiency of contrast sequential pattern mining, and the task division principle makes the algorithm relatively easy to implement on the Hadoop platform, which further improves its efficiency and broadens its applicability. Existing contrast sequential pattern mining algorithms require the user to set a positive support threshold and a negative support threshold. Without sufficient prior knowledge, it is difficult for the user to set appropriate support thresholds, so some useful patterns may be missed. The present invention introduces the concept of top-k on the basis of earlier contrast sequential pattern mining, so the user does not need to set support thresholds; this makes the mining algorithm easier to use and its results easier to interpret. In addition, several pruning strategies and a heuristic strategy are designed to speed up the algorithm.
It should be understood that the parts not elaborated in this specification belong to the prior art. Finally, the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent replacements may be made to the technical solution of the present invention without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.
Claims (9)
1. A concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints, characterized in that the algorithm comprises:
S1: inputting the data sets and parameters in the prescribed format;
S2: scanning the data sets to produce the set of candidate elements and the position information of all of its elements;
S3: traversing all candidate patterns according to the set-enumeration tree and finding the k patterns with the most significant contrast;
S4: outputting the k patterns with the most significant contrast to the specified location.
2. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 1, characterized in that the data sets and parameters input in step S1 comprise: a) the positive data set; b) the negative data set; c) the gap constraint; d) the value of k.
3. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 1, characterized in that step S2 specifically comprises:
S211: scanning the positive data set among the data sets;
S212: for the input sequence data set, traversing the sequences in order, and, for the element at each position of a sequence, using the element's value, the number of the sequence, and the position within that sequence to update the information stored for that element in the element set;
S213: if the element does not exist, adding the element to the set, and then storing the sequence number and the position within the sequence in the information storage structure corresponding to the element;
S214: if the element already exists, storing only the sequence number and the position within the sequence in the information storage structure corresponding to the element.
4. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 1, characterized in that step S2 further comprises:
S221: scanning the negative data set among the data sets;
S222: for the input sequence data set, traversing the sequences in order, and, for the element at each position of a sequence, using the element's value, the number of the sequence, and the position within that sequence to update the information stored for that element in the element set;
S223: if the element does not exist, discarding the element;
S224: if the element already exists, storing only the sequence number and the position within the sequence in the information storage structure corresponding to the element.
5. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 3, characterized in that step S21 comprises: using MapReduce to scan the sequences, that is, distributing the sequences to multiple slave nodes for scanning, and finally aggregating the candidate-element result sets produced by the multiple nodes to obtain the final element set.
6. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 1, characterized in that step S3 further comprises pruning strategies, specifically:
S311: Pruning Strategy 1: for any element e ∈ Σ, if e occurs in D- and does not occur in D+, e is removed from Σ;
S312: Pruning Strategy 2: during top-k contrast sequential pattern mining, let the current result set be R; when |R| = k, that is, the set R contains k patterns, and the minimum contrast among these k patterns is CRk, if the support of a candidate pattern P in the positive data set is less than CRk, i.e. Sup(P, D+) < CRk, the node corresponding to P and all of its descendant nodes are cut off;
S313: Pruning Strategy 3: during top-k contrast sequential pattern mining, when |R| = k and the minimum contrast among these k patterns is CRk, for any element e ∈ Σ with Sup(e, D+) < CRk, all nodes in the set-enumeration tree that contain the element e are cut off;
In the above pruning strategies, Σ is the alphabet, i.e. a finite set of elements; e ∈ Σ is an element; D+ is the positive data set; D- is the negative data set; Sup(P, D) is the support of pattern P in data set D under gap constraint γ; CR(P, D+, D-) is the contrast of pattern P between data sets D+ and D- under gap constraint γ.
7. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 1, characterized in that step S3 further comprises dividing the set-enumeration tree, with the following specific steps:
S321: dividing the set-enumeration tree whose root node is empty into n parts;
S322: distributing the parts to n instances in one pass, each instance independently traversing and searching the subtrees it is responsible for.
8. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 1, characterized in that step S3 further comprises a subtree traversal process:
S331: each instance traverses its assigned subtrees in depth-first (or breadth-first) order, reading the global result set during the traversal in order to apply Pruning Strategies 2 and 3;
S332: when a pattern that satisfies the definition globally is found in a subtree, updating the global result set;
S333: continuing the search until the subtree search is complete, and updating the global result set once more before terminating.
9. The concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints according to claim 3 or 4, characterized in that the scanning of the data sets is implemented by using MapReduce to scan the sequences, that is, the sequences are distributed to multiple slave nodes for scanning, and the candidate-element result sets produced by the multiple nodes are finally aggregated to obtain the final element set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810811661.0A CN109033341A (en) | 2018-07-23 | 2018-07-23 | A kind of Top-k comparison Sequential Pattern Mining Algorithm based on concurrent spaced constraint |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810811661.0A CN109033341A (en) | 2018-07-23 | 2018-07-23 | A kind of Top-k comparison Sequential Pattern Mining Algorithm based on concurrent spaced constraint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033341A true CN109033341A (en) | 2018-12-18 |
Family
ID=64644215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810811661.0A Pending CN109033341A (en) | 2018-07-23 | 2018-07-23 | A kind of Top-k comparison Sequential Pattern Mining Algorithm based on concurrent spaced constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033341A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112765469A (en) * | 2021-01-25 | 2021-05-07 | 东北大学 | Method for mining representative sequence mode from Web click stream data |
CN112765469B (en) * | 2021-01-25 | 2023-10-27 | 东北大学 | Method for mining representative sequence mode from Web click stream data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hu et al. | The hierarchical fair competition (hfc) framework for sustainable evolutionary algorithms | |
Lin et al. | High utility pattern mining using the maximal itemset property and lexicographic tree structures | |
Durand et al. | Gridformation: Towards self-driven online data partitioning using reinforcement learning | |
Abdelhalim et al. | A new method for learning decision trees from rules | |
CN108416381B (en) | Multi-density clustering method for three-dimensional point set | |
CN111651613B (en) | Knowledge graph embedding-based dynamic recommendation method and system | |
CN112734051A (en) | Evolutionary ensemble learning method for classification problem | |
Zhu et al. | A shuffled cellular evolutionary grey wolf optimizer for flexible job shop scheduling problem with tree-structure job precedence constraints | |
CN109033341A (en) | A kind of Top-k comparison Sequential Pattern Mining Algorithm based on concurrent spaced constraint | |
Darlington et al. | Parallel induction algorithms for data mining | |
US6148303A (en) | Regression tree generation method and apparatus therefor | |
Lin et al. | Linguistic frequent pattern mining using a compressed structure | |
Elmasry et al. | Multipartite priority queues | |
Al-Hegami | Classical and incremental classification in data mining process | |
Huy | Constraint propagation in flexible manufacturing | |
CN115168601A (en) | Visual analysis system and method for time sequence knowledge graph | |
Yesantharao | Parallel Batch-Dynamic kd-trees | |
CN113902003A (en) | MITree-based multidimensional time series online motif discovery method | |
Benmouna et al. | New method for Bayesian network learning | |
CN114490799A (en) | Method and device for mining frequent subgraphs of single graph | |
Lin et al. | Referential hierarchical clustering algorithm based upon principal component analysis and genetic algorithm | |
Chen et al. | Fast and efficient operations on Parallel Priority Queues: Preliminary version | |
Yesantharao et al. | Parallel Batch-Dynamic $ k $ d-Trees | |
Gawrychowski et al. | Dispersion on trees | |
Yang et al. | IMBT--A Binary Tree for Efficient Support Counting of Incremental Data Mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |