CN109033341A

CN109033341A - A concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints

Info

Publication number: CN109033341A
Application number: CN201810811661.0A
Authority: CN
Inventors: 李刚; 邹波; 尹心; 侯兴哲; 周全; 胡晓锐; 吴彬; 周艳玲; 籍勇亮; 张羽
Original assignee: Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd; State Grid Corp of China SGCC; State Grid Chongqing Electric Power Co Ltd
Current assignee: Electric Power Research Institute of State Grid Chongqing Electric Power Co Ltd; State Grid Corp of China SGCC; State Grid Chongqing Electric Power Co Ltd
Priority date: 2018-07-23
Filing date: 2018-07-23
Publication date: 2018-12-18

Abstract

The invention discloses a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints. It comprises: inputting the datasets and parameters in a prescribed format; scanning the datasets to produce the set of candidate elements together with the position information of every element; traversing all candidate patterns with a set enumeration tree to find the k patterns with the most significant contrast; and outputting those k patterns to a specified location. The invention introduces the concept of top-k into gap-constrained contrast sequential pattern mining. Top-k contrast sequential pattern mining with gap constraints aims to find the k contrast sequential patterns whose support changes most significantly between two datasets. The method avoids missing useful patterns because of an inappropriate threshold; the user only needs to set the desired number of patterns, which makes the method considerably easier to use than previous ones; and the interpretability of the mining results is improved.

Description

A concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints

Technical field

The present invention relates to sequence data mining within the field of computer data mining, and more particularly to a concurrency-based contrast sequential pattern mining algorithm that supports gap constraints and replaces specific support thresholds with the top-k concept.

Background art

Since Agrawal and Srikant proposed sequential pattern mining, it has become an important data mining task and has attracted the attention of many researchers. Many kinds of sequential patterns have been proposed in succession, such as frequent sequential patterns, contrast sequential patterns, closed patterns, partial-order patterns, and periodic patterns. Sequential patterns have a wide range of real-world applications. For example, disease control authorities can mine the patterns of infectious disease transmission over time, and the mining results can be used to discover the spatio-temporal clustering and outbreak rules of infectious diseases, providing a reference for prevention and control. Bioscientists can find the root causes of diseases and develop new drugs by analyzing DNA and protein sequences. Electric power companies can improve the accuracy of load forecasting by analyzing historical electricity consumption data.

Gap constraints are widely used in sequence mining and make pattern matching more flexible. A gap constraint is an interval determined by two non-negative integers that give the minimum and maximum number of elements allowed between two adjacent elements of a sequential pattern. For example, let the gap constraint be [1,3] and the sequential pattern be P = at. If P matches a sequence S, then S contains an occurrence of a followed by an occurrence of t such that at least 1 and at most 3 elements lie between them.
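The matching rule for a gap (spacing) constraint can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `matches`, the representation of sequences as strings of single-character elements, and the tuple `(min_gap, max_gap)` are assumptions made for the example.

```python
def matches(pattern, seq, gap):
    """Return True if `pattern` occurs in `seq` such that every pair of
    adjacent pattern elements is separated by min_gap..max_gap elements."""
    min_gap, max_gap = gap
    if not pattern:
        return True
    # Positions in seq where the pattern prefix matched so far can end.
    ends = {i for i, x in enumerate(seq) if x == pattern[0]}
    for symbol in pattern[1:]:
        next_ends = set()
        for p in ends:
            # A gap of g elements puts the next element at index p + g + 1.
            first = p + min_gap + 1
            last = min(p + max_gap + 1, len(seq) - 1)
            for q in range(first, last + 1):
                if seq[q] == symbol:
                    next_ends.add(q)
        ends = next_ends
        if not ends:
            return False
    return bool(ends)
```

With the gap constraint [1,3], `matches("at", "acgt", (1, 3))` holds because two elements separate a and t, whereas `matches("at", "at", (1, 3))` does not because a and t are adjacent.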

In previous work on contrast sequential pattern mining, the user is required to set a positive support threshold α, a negative support threshold β, and a gap constraint γ. The goal is to mine the minimal patterns whose support under the gap constraint is at least α in the positive examples and at most β in the negative examples. Such mining algorithms have two problems: (a) it is difficult for the user to set suitable support thresholds, and with inappropriate thresholds the mined patterns may not meet the user's expectations; (b) pruning with the minimality constraint reduces the search space but also cuts away some useful patterns.

Summary of the invention

In view of the above drawbacks of the prior art, the object of the invention is to provide a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints. Concurrent task division can greatly improve the efficiency of contrast sequential pattern mining, and the task division principle makes the algorithm relatively easy to implement on the Hadoop platform, which further improves its efficiency and applicability.

The object of the invention is achieved by the following technical solution: a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints, comprising:

S1: inputting the datasets and parameters in a prescribed format;

S2: scanning the datasets to produce the set of candidate elements and the position information of every element;

S3: traversing all candidate patterns with a set enumeration tree to find the k patterns with the most significant contrast;

S4: outputting the k patterns with the most significant contrast to a specified location.

Further, the data and parameters input in step S1 include: a) a positive dataset; b) a negative dataset; c) a gap constraint; d) the value of k.

Further, step S2 specifically includes:

S211: scanning the positive dataset;

S212: for the input sequence dataset, traversing each sequence in order and, for the element at each position, using its value, the number of the sequence it belongs to, and its position in that sequence to update the information of the corresponding element in the element set;

S213: if the element is not yet in the set, putting it into the set and storing the sequence number and the position in the sequence in the information storage structure of that element;

S214: if the element is already in the set, only storing the sequence number and the position in the sequence in the information storage structure of that element.

Further, step S2 also includes:

S221: scanning the negative dataset;

S222: for the input sequence dataset, traversing each sequence in order and, for the element at each position, using its value, the number of the sequence it belongs to, and its position in that sequence to update the information of the corresponding element in the element set;

S223: if the element is not in the set, discarding it;

S224: if the element is already in the set, only storing the sequence number and the position in the sequence in the information storage structure of that element.
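Steps S211 to S224 can be sketched as follows. The sketch runs in a single process and leaves out the MapReduce distribution; the function names and the nested-dictionary layout (element, then sequence number, then list of positions) are assumptions chosen for illustration.

```python
from collections import defaultdict

def scan_positive(dataset):
    """Build element -> {sequence number -> [positions]} for the positive dataset."""
    index = defaultdict(dict)
    for seq_no, seq in enumerate(dataset):
        for pos, element in enumerate(seq):
            # S213/S214: the entry is created the first time the element is
            # seen; afterwards only the sequence number and position are added.
            index[element].setdefault(seq_no, []).append(pos)
    return dict(index)

def scan_negative(dataset, candidates):
    """Same scan for the negative dataset, keeping only known candidates."""
    index = {}
    for seq_no, seq in enumerate(dataset):
        for pos, element in enumerate(seq):
            if element not in candidates:
                continue  # S223: the element never occurred in D+, discard it
            index.setdefault(element, {}).setdefault(seq_no, []).append(pos)
    return index
```

Keeping the positive and negative indexes in separate dictionaries mirrors the statement that occurrence information for the two datasets is stored separately.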

Further, step S21 includes: scanning the sequences with MapReduce, that is, distributing the sequences to multiple slave nodes for scanning and finally aggregating the candidate element result sets produced by the nodes to obtain the final element set.

Further, step S3 also includes pruning strategies, specifically:

S311: Pruning strategy 1: for any element e ∈ Σ, if e ∈ D- and e ∉ D+, remove e from Σ;

S312: Pruning strategy 2: during top-k contrast sequential pattern mining, let the current result set be R. When |R| = k, that is, R contains k patterns, let CRk be the minimum contrast among these k patterns. If the support of a candidate pattern P in the positive dataset is less than CRk, i.e. Sup(P, D+) < CRk, cut off the node corresponding to P and all of its descendants;

S313: Pruning strategy 3: during top-k contrast sequential pattern mining, when |R| = k and CRk is the minimum contrast among these k patterns, for any element e ∈ Σ with Sup(e, D+) < CRk, cut off all nodes of the set enumeration tree that contain e;

In the above pruning strategies, Σ is the alphabet, i.e. a finite set of elements; e ∈ Σ is an element; D+ is the positive dataset; D- is the negative dataset; Sup(P, D) is the support of pattern P in dataset D under gap constraint γ; CR(P, D+, D-) is the contrast of pattern P between datasets D+ and D- under gap constraint γ.
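The pruning tests can be sketched with a min-heap that holds the current result set R, so that CRk is always at the top of the heap. The class `TopK` and the helpers `prune_alphabet` and `can_prune` are hypothetical names for this example. The sketch also assumes that a pattern's contrast never exceeds its positive support (as would hold if contrast were Sup(P, D+) minus Sup(P, D-)); the text does not spell out the contrast formula, so this is an assumption.

```python
import heapq

class TopK:
    """Result set R, kept as a min-heap of (contrast, pattern) pairs."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []

    def threshold(self):
        """Return CRk once |R| = k, otherwise None."""
        return self._heap[0][0] if len(self._heap) == self.k else None

    def offer(self, contrast, pattern):
        """Insert the pattern if R is not full or it beats the current minimum."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (contrast, pattern))
        elif contrast > self._heap[0][0]:
            heapq.heapreplace(self._heap, (contrast, pattern))

def prune_alphabet(alphabet, positive_elements):
    """Strategy 1: drop elements that never occur in the positive dataset."""
    return [e for e in alphabet if e in positive_elements]

def can_prune(sup_pos, topk):
    """Strategies 2 and 3: prune when |R| = k and Sup(., D+) < CRk."""
    cr_k = topk.threshold()
    return cr_k is not None and sup_pos < cr_k
```

Calling `can_prune` with the positive support of a pattern corresponds to strategy 2, and calling it with the positive support of a single element corresponds to strategy 3.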

Further, step S3 also includes splitting the set enumeration tree, with the following specific steps:

S321: dividing the set enumeration tree, whose root node is empty, into n parts;

S322: distributing the parts in turn to n instances, each instance independently traversing and searching its own subtrees.

Further, step S3 also includes a subtree traversal process:

S331: each instance traverses its assigned subtrees in depth-first (or breadth-first) order, reading the global result set during the traversal in order to apply pruning strategies 2 and 3;

S332: when a pattern that satisfies the definition globally is found in a subtree, updating the global result set;

S333: continuing the search until the subtree search is complete, and updating the global result set once more before terminating.

Further, the dataset scan is performed with MapReduce, that is, the sequences are distributed to multiple slave nodes for scanning, and the candidate element result sets produced by the nodes are finally aggregated to obtain the final element set.

By adopting the above technical solution, the invention has the following advantages:

(1) The invention introduces the concept of top-k into gap-constrained contrast sequential pattern mining. Top-k contrast sequential pattern mining with gap constraints aims to find the k contrast sequential patterns whose support changes most significantly between two datasets. The method avoids missing useful patterns because of an inappropriate threshold; the user only needs to set the desired number of patterns, which makes it considerably easier to use than previous methods; and the interpretability of the mining results is improved.

(2) The invention decomposes the concrete task of the algorithm into multiple parallel parts and hands them to multiple threads (or multiple nodes in Hadoop) for simultaneous processing, thereby accelerating the algorithm.

(3) Concurrent task division can greatly improve the efficiency of contrast sequential pattern mining, and the task division principle makes the algorithm relatively easy to implement on the Hadoop platform, which further improves its efficiency and applicability. Existing contrast sequential pattern mining algorithms require the user to set a positive support threshold and a negative support threshold. Without sufficient prior knowledge it is difficult for the user to set appropriate thresholds, so useful patterns may be missed. The invention introduces the concept of top-k into contrast sequential pattern mining, so the user does not need to set support thresholds; the mining algorithm is easier to use and its results are easier to interpret. In addition, several pruning strategies and a heuristic strategy are designed to accelerate the algorithm.

Other advantages, objects, and features of the invention will be set forth to some extent in the description that follows, and to some extent will be apparent to those skilled in the art upon examination of the following, or may be learned from practice of the invention.

Brief description of the drawings

The drawings of the invention are as follows:

Fig. 1 is a flow diagram of the concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints.

Fig. 2 is a schematic diagram of the framework of the concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints.

Fig. 3 is a schematic diagram of the storage structure for element support information.

Fig. 4 is a schematic diagram of the set enumeration tree.

Fig. 5 is a schematic diagram of concurrent task allocation.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and embodiments.

Embodiment 1: as shown in Figs. 1 to 5, a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints comprises:

S1: inputting the datasets and parameters in a prescribed format;

S4: outputting the k patterns with the most significant contrast to a specified location.

The data and parameters input in step S1 include: a) a positive dataset; b) a negative dataset; c) a gap constraint; d) the value of k.

Step S2 specifically includes:

S211: scanning the positive dataset;

Step S2 also includes:

S221: scanning the negative dataset;

S223: if the element is not in the set, discarding it;

Step S21 includes: scanning the sequences with MapReduce, that is, distributing the sequences to multiple slave nodes for scanning and finally aggregating the candidate element result sets produced by the nodes to obtain the final element set.

Step S3 also includes pruning strategies, specifically:

Step S3 also includes splitting the set enumeration tree, with the following specific steps:

Step S3 also includes a subtree traversal process:

The dataset scan is performed with MapReduce, that is, the sequences are distributed to multiple slave nodes for scanning, and the candidate element result sets produced by the nodes are finally aggregated to obtain the final element set.

Embodiment 2: this embodiment is the preferred embodiment. As shown in Figs. 1 to 5, in a concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints, the input consists of a positive dataset (D+), a negative dataset (D-), a gap constraint γ, and the number k of most significant contrast patterns that the user wants to mine. The invention decomposes the task into multiple parallel parts, executes them concurrently, and finally obtains the k contrast sequential patterns that satisfy the gap constraint and whose support changes most significantly from the positive dataset to the negative dataset.

To mine top-k contrast sequential patterns with gap constraints efficiently and effectively, the invention needs to address four aspects: validity, applicability, efficiency, and scalability to high-dimensional element sets.

Validity means that all patterns meeting the definition are mined and none is omitted. It must be guaranteed that any candidate pattern whose contrast is among the k largest of all contrast patterns is in the result set, and that every pattern in the result set satisfies the definition.

Applicability refers to the following. Sequence data may contain noise, and the data at certain points of a sequence may be wrong. If patterns are searched strictly in sequence order, some potential patterns may be missed: they may satisfy the definition, but because of the noise in the sequence data they cannot be mined directly by strict matching. The invention therefore needs to introduce additional constraints to improve the applicability of the mined patterns.

Efficiency refers to the following. To guarantee the validity of the result, the algorithm needs to traverse the entire search space. For larger datasets the number of candidate patterns to be enumerated is huge, which can push the running time beyond an acceptable range. The invention therefore needs to improve the running efficiency of the algorithm.

Scalability to high-dimensional element sets is similar to efficiency. The set enumeration tree traverses the entire search space, and as the dimension of the element set increases, the number of element combinations grows exponentially, so the running time becomes unacceptable when the element set is high-dimensional. The invention therefore needs to improve the running efficiency of the algorithm in this case as well.

To guarantee validity, the invention searches and checks candidate patterns with a set enumeration tree. The algorithm enumerates the candidate elements of the set in a fixed order, with the pattern length growing from 1 until the whole space has been searched. This search can be converted into the traversal of a tree formed by the combinations of the set elements. The two most common tree traversals are depth-first and breadth-first: 1) depth-first traversal saves memory; 2) breadth-first traversal can speed up the algorithm.

Which traversal to use at run time can be chosen according to the size of the dataset, the time requirements, and the respective characteristics of depth-first and breadth-first traversal, so as to meet the user's needs.

To guarantee applicability, that is, to allow the mining algorithm to find patterns on noisy datasets as well, the invention introduces gap constraints. Gap constraints make pattern matching more flexible. A gap constraint is an interval determined by two non-negative integers that give the minimum and maximum number of elements allowed between two adjacent elements of a sequential pattern. With gap constraints, pattern matching extends from substrings of the original sequence to subsequences. Only gaps between the minimum and the maximum are considered, rather than all subsequences, because on many datasets gaps above or below certain values are not meaningful, and any gap constraint can be expressed with a minimum and a maximum.

To guarantee efficiency, the invention uses three pruning strategies and one heuristic acceleration strategy, which greatly reduce the number of candidate patterns the algorithm has to traverse. To better explain the pruning strategies, the common symbols are listed in Table 1.

Table 1: Definitions of the common symbols used in the invention

Symbol	Meaning
Σ	Alphabet, i.e. a finite set of elements
e	Element, e ∈ Σ
D+	Positive dataset
D-	Negative dataset
Sup(P, D)	Support of pattern P in dataset D under gap constraint γ
CR(P, D+, D-)	Contrast of pattern P between datasets D+ and D- under gap constraint γ
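As a rough illustration of Sup and CR, the sketch below translates a pattern and a gap constraint into a regular expression. Two points are assumptions rather than statements from the text: support is taken to be the fraction of sequences containing at least one gap-constrained occurrence of the pattern, and contrast is taken to be the difference Sup(P, D+) minus Sup(P, D-). Sequences are assumed to be strings of single-character elements, and the function names are hypothetical.

```python
import re

def support(pattern, dataset, gap):
    """Fraction of sequences in `dataset` that contain `pattern` under `gap`."""
    min_gap, max_gap = gap
    # Between two adjacent pattern elements, allow min_gap..max_gap arbitrary elements.
    filler = ".{%d,%d}" % (min_gap, max_gap)
    regex = re.compile(filler.join(re.escape(ch) for ch in pattern), re.DOTALL)
    return sum(1 for seq in dataset if regex.search(seq)) / len(dataset)

def contrast(pattern, d_pos, d_neg, gap):
    """Assumed definition: positive support minus negative support."""
    return support(pattern, d_pos, gap) - support(pattern, d_neg, gap)
```

For example, with D+ = ["acgt", "aact", "ttta", "agt"], D- = ["at", "tcga", "accccct", "cg"], and the gap constraint [1,3], the pattern at matches three of the four positive sequences and none of the negative ones.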

1) Pruning strategies:

a) Pruning strategy 1: for any element e ∈ Σ, if e ∈ D- and e ∉ D+, remove e from Σ.

b) Pruning strategy 2: during top-k contrast sequential pattern mining, let the current result set be R. When |R| = k, that is, R contains k patterns, let CRk be the minimum contrast among these k patterns. If the support of a candidate pattern P in the positive dataset is less than CRk, i.e. Sup(P, D+) < CRk, cut off the node corresponding to P and all of its descendants.

c) Pruning strategy 3: during top-k contrast sequential pattern mining, when |R| = k and CRk is the minimum contrast among these k patterns, for any element e ∈ Σ with Sup(e, D+) < CRk, cut off all nodes of the set enumeration tree that contain e.

2) Heuristic strategy:

In practice it was found that, of the three pruning strategies above, pruning strategy 1 can be applied before the enumeration process. Pruning strategy 2 is generally the most effective, but both it and pruning strategy 3 only take effect after k contrast sequential patterns have been found. Finding k contrast sequential patterns with the largest possible contrast as early as possible therefore speeds up the whole enumeration and reduces unnecessary attempts.

How can k contrast sequential patterns with the largest possible contrast be found as early as possible? From the properties of the set enumeration tree, a natural idea is to change its enumeration order so that elements more likely to produce contrast sequential patterns are enumerated first. To this end, the elements of Σ are sorted: (a) by contrast in descending order; (b) by support in descending order. Traversing the set enumeration tree in the sorted order is expected to find high-contrast patterns sooner than traversing it in the original order, so that pruning strategies 2 and 3 take effect earlier and the running time of the algorithm is reduced.
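The reordering can be sketched as a single sort. The sketch assumes that criterion (a) is the primary key and criterion (b) breaks ties, and that the contrast of an element is its positive support minus its negative support; the text states neither explicitly. The function name and the input layout are hypothetical.

```python
def order_elements(stats):
    """Return the elements sorted by contrast, then positive support, both descending.

    `stats` maps each element to a pair (sup_pos, sup_neg).
    """
    def key(element):
        sup_pos, sup_neg = stats[element]
        return (-(sup_pos - sup_neg), -sup_pos)
    return sorted(stats, key=key)

# Example: b and c tie on contrast (0.5), so c's higher positive support puts it first.
example = {"a": (0.5, 0.5), "b": (0.75, 0.25), "c": (1.0, 0.5), "d": (0.25, 0.0)}
ordered = order_elements(example)
```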

To guarantee scalability to high-dimensional element sets, the invention uses multiple instances to speed up the algorithm and reduce its running time. Using multiple instances raises two questions: 1) how tasks are allocated and 2) how result sets are aggregated.

1) Task allocation: to avoid a large amount of repeated computation, the invention splits the set enumeration tree into multiple independent, disjoint subtrees and distributes them to n concurrent instances according to their likely computation cost. This greatly improves efficiency and keeps the workload of the instances roughly equal, avoiding the situation where some instances finish quickly while others need a long time, so that the overall running time falls short of expectations. The instances can be threads, processes, or even multiple servers.

2) Result aggregation: results can be aggregated in essentially two ways, or in a combination of the two:

a) Aggregate every time a result is generated. Because the result set is globally synchronized in this mode, multiple instances contend to write to it, which causes some blocking and lowers parallel efficiency. However, this mode can use the optimization strategies proposed above, so although each instance may block, the space it has to search is greatly reduced.

b) Aggregate only after each instance has found its own top-k contrast sequential patterns. The result set of each running instance is then independent and is aggregated only after the instance has produced its results. The instances do not block one another when writing results, but pruning strategies 2 and 3 cannot be used in this mode, so the running time of each instance increases substantially.

c) Combine the two strategies above and aggregate once after every u patterns are generated. This lowers the probability of blocking and also reduces the running time of each instance, striking a balance between the two strategies, but choosing the value of u requires prior knowledge.
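The three aggregation modes can be sketched with threads and one lock-protected result set. The parameter `u` is the batch size from mode c); setting `u = 1` behaves like mode a), and a `u` larger than any instance's output behaves like mode b). The class and function names are hypothetical, each instance's search is replaced by a precomputed list of (contrast, pattern) pairs, and pruning is left out to keep the sketch short.

```python
import heapq
import threading

class SharedTopK:
    """Global result set shared by all instances, guarded by a lock."""

    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap of (contrast, pattern)
        self.lock = threading.Lock()

    def merge(self, items):
        with self.lock:  # instances may block here while another one writes
            for item in items:
                if len(self.heap) < self.k:
                    heapq.heappush(self.heap, item)
                elif item > self.heap[0]:
                    heapq.heapreplace(self.heap, item)

def worker(found, shared, u):
    """Buffer results locally and merge them into the shared set every u patterns."""
    buffer = []
    for item in found:
        buffer.append(item)
        if len(buffer) >= u:
            shared.merge(buffer)
            buffer = []
    shared.merge(buffer)  # final update before the instance terminates

def run(per_instance_results, k, u):
    shared = SharedTopK(k)
    threads = [threading.Thread(target=worker, args=(found, shared, u))
               for found in per_instance_results]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(shared.heap, reverse=True)
```

A smaller `u` makes the shared minimum contrast available to other instances sooner at the cost of more lock contention, which is the trade-off described above.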

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings.

Fig. 2 is a schematic diagram of the framework of the concurrency-based top-k contrast sequential pattern mining algorithm with gap constraints. The framework consists of four main parts:

1) Input: input the datasets and parameters in the prescribed format;

2) Candidate element generation: scan the datasets to produce the set of candidate elements and the position information of every element;

3) Candidate pattern generation: traverse all possible candidate patterns with the set enumeration tree and find the k patterns with the most significant contrast;

4) Output: output the k patterns with the most significant contrast to the specified location.

Fig. 1 is the detailed flow diagram of the invention. The flow is as follows:

1) Input information, which consists of four parts:

a) the positive dataset;

b) the negative dataset;

c) the gap constraint;

d) the value of k.

2) Scan the positive dataset: for the input sequence dataset, the invention traverses each sequence in order and, for the element at each position, uses its value, the number of the sequence it belongs to, and its position in that sequence to update the information of the corresponding element in the element set. If the element is not yet in the set, it is put into the set and the sequence number and position are stored in the information storage structure of that element. If the element is already in the set, only the sequence number and position are stored in the information storage structure of that element. Note that the scans of different sequences are independent of one another: each sequence produces its result information on its own and does not conflict with the other sequences. The invention therefore scans the sequences with MapReduce, that is, the sequences are distributed to multiple slave nodes for scanning, and the candidate element result sets produced by the nodes are finally aggregated to obtain the final element set.

3) Pruning strategy 1: according to pruning strategy 1, an element that occurs only in the negative dataset cannot be part of the algorithm's result. Equivalently, when scanning the negative dataset, only the elements that occur in the positive dataset need to be considered.

4) Scan the negative dataset: this is essentially the same as scanning the positive dataset, with one difference. Each sequence is traversed in order and, for the element at each position, its value, the number of the sequence it belongs to, and its position in that sequence are used to update the information of the corresponding element in the element set. If the element is not in the set, it is discarded. If the element is already in the set, only the sequence number and position are stored in the information storage structure of that element. As before, MapReduce is used to accelerate the scan, improving efficiency and reducing running time.

5) Heuristic strategy 1: the generated candidate element set is sorted according to heuristic strategy 1 (only the enumeration order changes). The earlier an element appears in the new enumeration order, the more likely it is to produce a desired pattern; the later it appears, the less likely. To produce k patterns with the most significant possible contrast sooner, the algorithm has the instances traverse the set enumeration tree in this order rather than in the original order.

6) Split the set enumeration tree: the set enumeration tree, whose root node is empty, is divided into n parts that are distributed in turn to n instances, so that each instance can independently traverse and search its own subtrees. The general method is to traverse the first l levels breadth-first, obtaining m subtrees, sort these m subtrees in the lexicographic order induced by the heuristic strategy, and assign the i-th subtree to the instance numbered i % n.

7) Traverse the subtrees: each instance traverses its assigned subtrees in depth-first (or breadth-first) order, reading the global result set during the traversal in order to apply pruning strategies 2 and 3 and speed up the algorithm. When a pattern that satisfies the definition globally is found in a subtree, the global result set is updated. The search then continues until the subtree search is complete, and the global result set is updated once more before the instance terminates.

8) Output the result: the top-k contrast sequential patterns with gap constraints are output as required.

Fig. 3 is a schematic diagram of the storage structure for element occurrence information provided by an embodiment of the invention. The invention uses a set to store the values of all elements, implemented as a hash table in which the key is the value of the element and the value is the structure holding the element's occurrence information. The occurrence information of each element stores, in a list, the numbers of the sequences in which it occurs, and each sequence in turn has a container that stores the positions at which the element occurs in that sequence. For the same element, the occurrence information for the positive dataset and for the negative dataset is stored separately.

Fig. 4 is a schematic diagram of the set enumeration tree provided by an embodiment of the invention. For the candidate element set Σ = {e1, e2, e3, ..., eD}, the algorithm builds the set enumeration tree shown in the figure. The root node is empty; the next level consists of all patterns of length 1 formed from Σ, giving D child nodes. Any node p has D child nodes, with values {pe1, pe2, pe3, ..., peD}.
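The tree structure can be sketched as a generator that visits nodes depth-first. Patterns are represented as tuples of elements, and the bound `max_len` is a hypothetical parameter added only so that the example terminates; in the algorithm described here the search is bounded by the pruning strategies rather than by a fixed length.

```python
def children(pattern, alphabet):
    """A node p has one child p + e for every element e of the alphabet."""
    return [pattern + (e,) for e in alphabet]

def depth_first(alphabet, max_len):
    """Yield every non-empty pattern up to max_len in depth-first order."""
    stack = [()]  # the root node is the empty pattern
    while stack:
        node = stack.pop()
        if node:
            yield node
        if len(node) < max_len:
            # Push children in reverse so they are visited in alphabet order.
            stack.extend(reversed(children(node, alphabet)))
```

Replacing the stack with a first-in, first-out queue (for example `collections.deque` with `popleft`) and dropping the `reversed` call gives the breadth-first variant.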

Fig. 5 is a schematic diagram of concurrent task allocation provided by an embodiment of the invention. The figure divides tasks using all patterns of length 1 as subtree roots; if n > D, patterns of length 2, 3, or up to l can be used as subtree roots instead. Because subtrees headed by elements that rank higher in the sort order are more likely to produce patterns with significant contrast, the algorithm should not assign them all to the same running instance. Instead, it allocates the subtrees in an interleaved manner so that every instance may produce patterns with significant contrast.
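The interleaved allocation can be sketched as a round-robin assignment in which the subtree with index i goes to instance i % n. The function names are hypothetical, and `ordered_elements` is assumed to be the element list already sorted by the heuristic strategy.

```python
from itertools import product

def subtree_roots(ordered_elements, depth):
    """All patterns of length `depth`, in the lexicographic order induced by the sort."""
    return list(product(ordered_elements, repeat=depth))

def assign_round_robin(roots, n):
    """Assign the subtree with index i to instance i % n."""
    buckets = [[] for _ in range(n)]
    for i, root in enumerate(roots):
        buckets[i % n].append(root)
    return buckets

# Three elements and four instances: length-1 roots are too few, so use length 2.
roots = subtree_roots(["c", "b", "a"], 2)
buckets = assign_round_robin(roots, 4)
```

Because neighbouring roots in the sorted order land on different instances, the most promising subtrees are spread across the instances instead of being concentrated on one.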

The present invention uses the Hadoop platform to mine contrast sequential patterns under a gap constraint and a top-k constraint. The algorithm comprises two submodules: generating candidate elements, and generating candidate patterns. The MapReduce framework of the Hadoop platform is used to produce the candidate elements together with their occurrence position information, and Pruning strategy 1 is used to remove unnecessary elements from the negative data set. The gap constraint specifies the interval, in the original data set, between two adjacent elements of a pattern, and the top-k constraint specifies which patterns are finally reported. Pruning strategies 2 and 3 and heuristic acceleration strategy 1 are used to speed up the algorithm, and the MapReduce framework of the Hadoop platform is used to distribute the tasks and improve efficiency.
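The role of the gap constraint can be illustrated with a small sketch. It assumes that the constraint is an interval [min_gap, max_gap] on the number of elements lying between two adjacent matched elements, that support is the fraction of sequences containing at least one match, and that contrast is the difference of the two supports (which is consistent with Pruning strategy 2). The function names are hypothetical, and the code runs on in-memory lists rather than on Hadoop.

```python
def matches(pattern, sequence, min_gap, max_gap):
    """True if `pattern` occurs in `sequence` with every gap in [min_gap, max_gap]."""
    if not pattern:
        return True
    # Positions where the prefix matched so far can end.
    ends = {i for i, x in enumerate(sequence) if x == pattern[0]}
    for element in pattern[1:]:
        ends = {
            j
            for i in ends
            for j in range(i + 1 + min_gap, min(i + 2 + max_gap, len(sequence)))
            if sequence[j] == element
        }
        if not ends:
            return False
    return bool(ends)


def support(pattern, dataset, min_gap, max_gap):
    """Fraction of sequences in `dataset` that contain `pattern`."""
    return sum(matches(pattern, s, min_gap, max_gap) for s in dataset) / len(dataset)


def contrast(pattern, positive, negative, min_gap, max_gap):
    """Assumed contrast: support in the positive set minus support in the negative set."""
    return (support(pattern, positive, min_gap, max_gap)
            - support(pattern, negative, min_gap, max_gap))
```

Tracking the set of possible end positions, rather than only the first match, avoids missing occurrences where an early match of one element violates the gap constraint but a later one satisfies it.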

The present invention provides a top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints. Concurrency-based task division can greatly improve the efficiency of contrast sequential pattern mining, and the way the tasks are divided makes the algorithm relatively easy to implement on the Hadoop platform, which further improves its efficiency and widens its applicability. Existing contrast sequential pattern mining algorithms require the user to set a positive support threshold and a negative support threshold. Without sufficient prior knowledge, it is difficult for the user to set appropriate support thresholds, so some useful patterns may be missed. The present invention introduces the concept of top-k on the basis of earlier contrast sequential pattern mining, so the user does not need to set support thresholds; this makes the mining algorithm easier to use and its results easier to interpret. In addition, several pruning strategies and a heuristic strategy are designed to accelerate the algorithm.

It should be understood that the parts not elaborated in this specification belong to the prior art. Finally, it should be noted that the above embodiments serve only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent replacements may be made to the technical solution of the present invention without departing from its purpose and scope, and that all such modifications are intended to fall within the scope of the claims of the present invention.

Claims

1. A Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints, characterized in that the algorithm comprises:

S1: inputting the data sets and parameters in the prescribed format;

S4: outputting the k patterns with the most significant contrast to the specified location.

2. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 1, characterized in that the data sets and parameters input in step S1 include: a) a positive data set; b) a negative data set; c) a gap constraint; d) the value of k.

3. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 1, characterized in that step S2 specifically comprises:

S211: scanning the positive data set among the input data sets;

S212: for the input sequence data set, traversing each sequence in sequence order, and then, according to the value of the element at each position of the sequence, the number of the sequence in which it lies, and its position in that sequence, updating the information corresponding to that element in the element set;

S213: if the element does not yet exist, putting the element into the set, and then storing the sequence number and the position in the sequence in the information storage structure corresponding to the element;

S214: if the element already exists, only storing the sequence number and the position in the sequence in the information storage structure corresponding to the element.

4. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 1, characterized in that step S2 further comprises:

S221: scanning the negative data set among the input data sets;

S222: for the input sequence data set, traversing each sequence in sequence order, and then, according to the value of the element at each position of the sequence, the number of the sequence in which it lies, and its position in that sequence, updating the information corresponding to that element in the element set;

S223: if the element does not exist, discarding the element;

S224: if the element already exists, only storing the sequence number and the position in the sequence in the information storage structure corresponding to the element.

5. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 3, characterized in that step S21 comprises: scanning the sequences by means of MapReduce, that is, distributing the sequences to multiple slave nodes for scanning, and finally aggregating the candidate element result sets produced by the multiple nodes to obtain the final element set.

6. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 1, characterized in that step S3 further comprises pruning strategies, specifically:

S312: Pruning strategy 2: during top-k contrast sequential pattern mining, let the current result set be R; when |R| = k, that is, when the set R contains k patterns, and the minimum contrast among these k patterns is CRk, if the support of a candidate pattern P in the positive data set is less than CRk, i.e. Sup(P, D+) < CRk, the node corresponding to P and all of its descendant nodes are cut off;

S313: Pruning strategy 3: during top-k contrast sequential pattern mining, when |R| = k and the minimum contrast among these k patterns is CRk, for any element e ∈ Σ with Sup(e, D+) < CRk, all nodes in the set-enumeration tree that contain the element e are cut off;

In the above pruning strategies, Σ is the alphabet, i.e. a finite set of elements; element e ∈ Σ; D+ is the positive data set; D- is the negative data set; Sup(P, D) is the support of pattern P in data set D under gap constraint γ; CR(P, D+, D-) is the contrast of pattern P between data sets D+ and D- under gap constraint γ.

7. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 1, characterized in that step S3 further comprises partitioning the set-enumeration tree, with the following specific steps:

8. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 1, characterized in that step S3 further comprises a subtree traversal process:

S331: each instance traverses the subtrees assigned to it in depth-first (or breadth-first) order, and during the traversal reads the global result set in order to apply Pruning strategies 2 and 3;

9. The Top-k contrast sequential pattern mining algorithm based on concurrency and gap constraints according to claim 3 or 4, characterized in that the data set is scanned by means of MapReduce, that is, the sequences are distributed to multiple slave nodes for scanning, and the candidate element result sets produced by the multiple nodes are finally aggregated to obtain the final element set.