CN108170799A

CN108170799A - A frequent sequence mining method for massive data

Info

Publication number: CN108170799A
Application number: CN201711457785.5A
Authority: CN
Inventors: 王宏志; 秦谦
Original assignee: Jiangsu Mingtong Tech Co Ltd
Current assignee: Jiangsu Mingtong Tech Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2018-06-15

Abstract

The invention discloses a frequent sequence mining method for massive data. First, the user inputs time series data, the frequency of each item is calculated, and the set of frequent items is constructed. Second, for all frequent items, ω-equivalent partitions are constructed in the Map phase. Then, each constructed partition is mined independently in the Reduce phase to obtain frequent sequences. Finally, all frequent sequence sets are merged, and duplicate sequences are filtered out before output. The invention provides a method for partitioning the input database that can effectively improve the efficiency of the algorithm; in the mining phase, any existing, established mining algorithm may be used, so the invention is easy to implement.

Description

A frequent sequence mining method for massive data

Technical field

The present invention relates to a frequent sequence mining method for massive data and belongs to the technical field of data processing.

Background art

When the concept of sequential pattern mining first appeared, improved algorithms based on Apriori were proposed, such as AprioriSome, AprioriAll, and DynamicSome. Later, building on the Apriori idea, the GSP algorithm was proposed. It introduces time constraints, sliding time windows, and user-defined taxonomies for sequences, so that the mined frequent sequences are more meaningful in practice. The MFS and PSP algorithms followed, both of which improve the execution efficiency of GSP. All of these improved algorithms derive from the idea of the Apriori algorithm. However, Apriori has an inherent drawback: it must scan the database multiple times, which is very time-consuming for massive data, and it generates a large number of candidates. If the support threshold is low or the frequent patterns are long, this problem becomes very hard to handle.

M. Zaki et al. proposed SPADE, a sequential pattern mining method based on a vertical storage format. Its basic idea is to first convert the input sequence database into equivalence classes and then mine frequent sequential patterns with simple join operations, using ideas from lattice theory. Its advantage is that, compared with the Apriori family of algorithms, the number of database scans is greatly reduced: the whole mining process requires only 3 scans. SPADE also has drawbacks: converting a database from horizontal to vertical format requires additional storage space and computation time, and the algorithm still uses breadth-first traversal, which incurs a high cost for candidate generation.

In recent years, J. Han, J. Pei et al. proposed FreeSpan, a projection-based frequent pattern growth algorithm, which was later improved into the PrefixSpan algorithm with substantially better performance. The advantage of FreeSpan is that it greatly reduces the number of candidate sequences, and hence the cost of generating them, while still finding all frequent sequential patterns. The algorithm also has drawbacks: it can generate a large number of projected databases. Consider the special case in which a pattern appears in every sequence of the input database; the projected database for that pattern is then no smaller than the original database. In addition, because a subsequence of length K may grow at any position, every possible combination must be considered when searching for candidate sequences of length K+1, which adds considerable time cost.

Multidimensional sequential pattern mining extracts the information in multidimensional data that is interesting and meaningful to the user; it takes additional dimensions into account on top of ordinary sequential pattern mining. For example, in data on consumer spending habits, the consumer's gender, age, occupation, and similar information make up a multidimensional sequential pattern. Such patterns carry more valuable information and have higher application value. Several multidimensional sequential pattern mining algorithms exist, such as UniSeq, Seq-Dim, and Dim-Seq. The main idea of UniSeq is to embed the multidimensional information of the database into each sequence, forming a new extended sequence database, and then to run PrefixSpan on this extended database to obtain multidimensional frequent sequential patterns.

Frequent sequence mining is the basis of a range of important data mining tasks. In text mining, for example, frequent sequences are used to build statistical language models for machine translation, information retrieval, information extraction, and spam detection, and word associations can also be used for relation extraction. In web usage mining and session analysis, frequent sequences describe common or typical user behavior (for example, frequent sequences in web access logs). In these cases, and even in some simple applications, the data to be mined is huge and contains sequences numbering in the hundreds of millions. For example, Microsoft provides access to n-gram data based on hundreds of billions of web pages, and Google has published a corpus of more than 1 billion n-grams. In such settings, a frequent sequence mining algorithm that can handle massive data becomes increasingly important. With existing methods, when a single data set is this large, the computational overhead and memory usage remain very high.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the defects of the prior art by providing a frequent sequence mining method for massive data that can effectively improve the efficiency of the algorithm.

To solve the above technical problem, the present invention provides a frequent sequence mining method for massive data, comprising the following steps:

1) the user inputs time series data; basic statistics of the data are obtained, the frequency of each ω ∈ Σ is calculated, and the set F_σ,0,1(D) of frequent items is constructed, where ω denotes a subsequence of the input, Σ is the universal set of all input time series, D denotes the input time series database, the subscript σ denotes the support threshold, 0 is the gap threshold, and 1 is the length threshold; here "frequent" means that, for σ > 0, a sequence S is (σ, γ)-frequent if f_γ(S, D) ≥ σ, where f_γ(S, D) denotes the frequency of sequence S;

2) for all frequent items in Σ, the ω-equivalent partition P_ω is constructed in Map;

3) each partition P_ω constructed in step 2) is mined independently in Reduce to obtain F_σ,γ,λ(P_ω), where P_ω is the partition whose central item is ω, and F_σ,γ,λ(P_ω) is the set of all (σ, γ)-frequent sequences in P_ω whose length does not exceed λ;

4) the sets F_σ,γ,λ(P_ω) obtained in step 3) for each frequent item are merged, and duplicate sequences are filtered out to obtain the final output.

In the aforementioned step 1), the basic statistics of the data include the average length of the time series data, the maximum length, the total number of sequences, the total number of items, the number of distinct items, and the total number of bytes.

The aforementioned step 1) is completed by a single MapReduce job.

In the aforementioned step 1), an integer identifier is declared for each item, and sequences are represented entirely as arrays of integer identifiers; first, the integer identifiers are ordered by descending item frequency, and then the items are compressed into integers using variable-byte encoding.

In the aforementioned step 2), the ω-equivalent partition is constructed as follows:

2-1) checking, by minimality, whether the input time series is relevant to the central item; if it is not relevant, let P_ω(T) = ∅; if it is relevant, performing a backward scan of the input time series to obtain the right distances of all indexes;

2-2) then performing a forward scan, during which the following are carried out simultaneously:

(a) computing the left distances;

(b) performing unreachability reduction;

(c) replacing irrelevant items with blanks;

(d) performing prefix/suffix reduction and blank reduction;

(e) splitting the input sequence into several subsequences at runs of γ+1 blanks (blank separation), these subsequences forming the final output P_ω.

Before partitioning, the set F_σ,0,1(D), whose items are arranged in descending order of frequency, is first scanned; adjacent items are placed in one group until the sum of their frequencies exceeds a set value m, and every item is traversed in this way to complete the grouping. Then one separate partition is constructed for each group.

The partition P_ω is mined using the PrefixSpan algorithm.

Beneficial effects achieved by the present invention:

(1) the present invention is the first distributed algorithm that supports gap constraints;

(2) the present invention provides a method for partitioning the input database that can effectively improve the efficiency of the algorithm;

(3) the present invention compresses the intermediate sequences it generates, which can substantially reduce the time cost of the algorithm; items are divided into groups instead of generating one projected database for each central item, which improves efficiency; in the mining phase, any existing, established mining algorithm may be used.

Description of the drawings

Fig. 1 is a schematic diagram of the MapReduce model;

Fig. 2 is an example of processing with the MapReduce programming model used by the present invention.

Detailed description of the embodiments

The invention is further described below. The following embodiments only serve to illustrate the technical solution of the present invention more clearly and are not intended to limit its scope of protection.

The present invention uses the MapReduce programming model, which consists of a Map function and a Reduce function. The Map function maps a set of key-value pairs to a new set of key-value pairs, and the Reduce function ensures that all mapped key-value pairs sharing the same key are processed together as one group. The basic idea is shown in Fig. 1.

MapReduce can process massive data sets. The user specifies a Map function that processes key-value pairs and generates a series of intermediate key-value pairs; the Reduce function then merges the values of all intermediate key-value pairs that have the same key. Fig. 2 shows an example problem handled with the MapReduce programming model: counting the words in a document. The Map function first counts the occurrences of each word in each chunk of the document, and the Reduce function then sums the per-chunk counts of each word.
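The word-count example can be sketched in a few lines of plain Python. This is a single-process illustration of the Map, shuffle, and Reduce steps only, not a distributed job; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and do not come from the patent or from any MapReduce framework.

```python
from collections import defaultdict


def map_fn(chunk):
    """Map: emit an intermediate (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def reduce_fn(key, values):
    """Reduce: merge all values that share the same key by summing them."""
    return key, sum(values)


def run_mapreduce(chunks):
    """Simulate the shuffle step in memory: group intermediate values by key."""
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = run_mapreduce(["a b a", "b c"])
print(counts)
```

In a real deployment the grouping by key is done by the framework between the Map and Reduce phases; here a dictionary stands in for it.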

The terms and notation used in the present invention are as follows:

A sequence database D = {S₁, …, S_|D|} is a multiset of input sequences. A sequence is an ordered list of items, and every item is contained in the universal set Σ = {ω₁, …, ω_|Σ|}. S = s₁s₂…s_|S| denotes a sequence of length |S| with s_i ∈ Σ (1 ≤ i ≤ |S|). Σ⁺ denotes the set of all nonempty sequences formed from the elements of Σ.

In general, the symbol T denotes an input sequence of the input database, and the symbol S denotes an arbitrary sequence.

The variable γ ≥ 0 denotes the maximum gap. If S is a subsequence of T in which consecutive items of S are separated in T by gaps of at most γ items, then S is called a γ-subsequence of T, written S ⊆_γ T. Standard n-gram mining corresponds to γ = 0. Formally, with n = |S|, S ⊆_γ T if and only if there exist indexes i_1 < … < i_n such that (1) S_k = T_{i_k} for 1 ≤ k ≤ n, and (2) i_{k+1} - i_k - 1 ≤ γ for 1 ≤ k < n. For example, if T = abcd, S₁ = acd, and S₂ = bc, then S₁ ⊆_1 T (but S₁ is not a 0-subsequence of T) and S₂ ⊆_0 T.

The γ-support of a sequence S in database D, written Sup_γ(S, D), is the following multiset:

Sup_γ(S, D) = { T ∈ D : S is a γ-subsequence of T }

f_γ(S, D) = |Sup_γ(S, D)| denotes the frequency of sequence S. This notion of frequency corresponds to document frequency in text mining: it counts the number of input sequences in which S occurs, rather than the total number of occurrences of S. For σ > 0, a sequence S is (σ, γ)-frequent if f_γ(S, D) ≥ σ.
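The definitions of γ-subsequence and frequency above can be checked with a small sketch. It assumes sequences are Python strings or lists whose elements are items; the function names `is_gamma_subsequence` and `frequency` are illustrative only. The check tracks every position at which a prefix of S can end, because a greedy leftmost match is not sufficient under a gap constraint.

```python
def is_gamma_subsequence(s, t, gamma):
    """Return True if s occurs in t with at most `gamma` items between
    consecutive matched positions (i_{k+1} - i_k - 1 <= gamma)."""
    if not s:
        return True  # assumption: the empty sequence matches trivially
    # All positions in t where the first item of s can be matched.
    ends = {i for i, item in enumerate(t) if item == s[0]}
    for item in s[1:]:
        # Extend every partial match by one item within the allowed gap.
        ends = {
            j
            for i in ends
            for j in range(i + 1, min(i + gamma + 2, len(t)))
            if t[j] == item
        }
        if not ends:
            return False
    return True


def frequency(s, database, gamma):
    """f_gamma(S, D): number of input sequences that contain s as a
    gamma-subsequence (document frequency, not occurrence count)."""
    return sum(1 for t in database if is_gamma_subsequence(s, t, gamma))


D = ["acb", "dacbd", "bd"]
print(frequency("cb", D, 0), frequency("ab", D, 0), frequency("ab", D, 1))
```

With the example from the text, `is_gamma_subsequence("acd", "abcd", 1)` holds while the same call with `gamma=0` does not.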

The MG-FSM (frequent sequence mining) algorithm of the present invention has three phases: 1. the preprocessing phase, which obtains basic statistics of the data; 2. the partition phase, which constructs ω-equivalent partitions for all frequent items in Σ; 3. the mining phase, which mines each partition constructed in the second phase independently. Any established frequent sequence mining algorithm can be used at this point. Mining each partition produces output, and these outputs are finally filtered to obtain the final result.

The phases are described in detail below:

(1) Preprocessing phase:

The user inputs time series data, and the average length, maximum length, total number of sequences, total number of items, number of distinct items, and total number of bytes of the time series data are obtained.

This phase calculates the frequency of each ω ∈ Σ and constructs the set F_σ,0,1(D) of frequent items, commonly called the f-list. In the present invention, an item refers to a subsequence of the input. In F_σ,0,1(D), the subscript σ denotes the support threshold, 0 is the gap threshold, and 1 is the length threshold. This process can be completed by a single MapReduce job (by running a variant of the WordCount algorithm that ignores repeated occurrences of an item within one input sequence). From its result, the set of frequent sequences of length 1 can be output directly. For sequences of length greater than 1, the f-list is used to define an order relation "<" on the set Σ:
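A minimal in-memory sketch of the f-list computation follows. It assumes each input sequence is a string or list of items and counts each item at most once per sequence, as the WordCount variant described above does. The function name `build_f_list` and the alphabetical tie-break between equally frequent items are assumptions made for this illustration.

```python
from collections import Counter


def build_f_list(database, sigma):
    """Return [(item, frequency), ...] for all items whose document
    frequency is at least sigma, sorted by descending frequency."""
    counts = Counter()
    for sequence in database:
        # set() ignores repeated occurrences within one input sequence.
        counts.update(set(sequence))
    frequent = [(item, n) for item, n in counts.items() if n >= sigma]
    # Assumption: ties in frequency are broken by the item itself.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


f_list = build_f_list(["ab", "ac", "aab", "d"], sigma=2)
print(f_list)
```

Here item `a` occurs twice in `"aab"` but contributes only one to its frequency.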

ω < ω′ if f₀(ω, D) > f₀(ω′, D),

where f₀(ω, D) denotes the frequency of item ω; that is, "small" items have higher frequency and "large" items have lower frequency.

Write S ≤ ω if ω′ ≤ ω for all ω′ ∈ S. The set of all sequences that contain ω and whose other items are all no larger than ω is denoted

Σ⁺_{≤ω} = { S ∈ Σ⁺ : ω ∈ S, S ≤ ω }.

Finally, the central item of a sequence S is p(S) = min_{ω∈S}(S ≤ ω), i.e., the largest item in S. Note that

p(S) = ω if and only if S ∈ Σ⁺_{≤ω}.

For example, for S = abc, S ≤ c and p(S) = c.
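The order on items and the central item p(S) can be expressed with a rank table. In this sketch a smaller rank means a more frequent ("smaller") item; `make_rank`, `leq`, and `central_item` are hypothetical helper names, and the alphabetical order a < b < c < d used in the patent's examples is taken as the item order.

```python
def make_rank(items_by_descending_frequency):
    """Map each item to its position in the f-list (0 = most frequent)."""
    return {item: r for r, item in enumerate(items_by_descending_frequency)}


def leq(sequence, omega, rank):
    """S <= omega: every item of S is no larger than omega."""
    return all(rank[item] <= rank[omega] for item in sequence)


def central_item(sequence, rank):
    """p(S): the largest (least frequent) item of S under the item order."""
    return max(sequence, key=rank.__getitem__)


rank = make_rank("abcd")  # assumed order a < b < c < d
print(central_item("abc", rank), leq("abc", "c", rank), leq("abc", "b", rank))
```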

The present invention declares an integer identifier for each item and represents sequences entirely as arrays of integer identifiers. The arrays can be compressed with variable-byte encoding. Compression here means representing items as integers, in a manner similar to Huffman coding. To make compression more effective, integer identifiers are assigned in descending order of item frequency. In addition, irrelevant items (which can be regarded as noise) are replaced with a blank (identifier -1), and runs of consecutive blanks are represented using the idea of run-length encoding (for example, two consecutive blanks are represented by the identifier -2).
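The encoding described above might look as follows. This is a sketch under stated assumptions: identifiers start at 1, blanks are represented by `None` before run-length encoding, and the variable-byte scheme is the common one with 7 payload bits per byte and the high bit as a continuation flag. The patent does not specify the byte layout, and encoding the negative blank identifiers as bytes (for example with a sign-folding step) is left out. All function names are illustrative.

```python
def assign_ids(items_by_descending_frequency):
    """Frequent items get small identifiers, which encode to fewer bytes."""
    return {item: i for i, item in enumerate(items_by_descending_frequency, start=1)}


def varbyte_encode(n):
    """Encode one non-negative integer, least significant 7 bits first."""
    if n < 0:
        raise ValueError("this sketch only encodes non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varbyte_decode(data):
    """Decode a concatenation of variable-byte encoded integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values


def encode_blanks(ids):
    """Run-length encode blanks (None): a run of k blanks becomes -k."""
    out = []
    for x in ids:
        if x is None:
            if out and out[-1] < 0:
                out[-1] -= 1  # extend the current run of blanks
            else:
                out.append(-1)
        else:
            out.append(x)
    return out


encoded = b"".join(varbyte_encode(n) for n in [1, 300, 5])
print(varbyte_decode(encoded), encode_blanks([1, None, None, 3, None]))
```

Identifiers below 128 take a single byte, which is why the most frequent items are given the smallest identifiers.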

For all examples in the present invention, items are assumed to be ordered alphabetically: a < b < c < …

(2) Partition phase (Map):

The partition phase and the mining phase are executed in a single MapReduce job. The partitions P_ω(T) are constructed in the Map part: for each distinct item ω in an input sequence T ∈ D, a small sequence database P_ω(T) is constructed, and its sequences are output as key-value pairs. P_ω(T) is required to be (σ, γ, λ)-equivalent to T, where σ is the support threshold, γ is the gap threshold, and λ is the length threshold.

For now, assume P_ω(T) = {T}. The focus of the present invention lies in the construction of the partition P_ω(T).

The partition P_ω(T) is obtained from an input sequence T as follows:

First, check by minimality whether the input sequence is relevant to the central item;

If it is not relevant, let P_ω(T) = ∅;

If it is relevant, perform a backward scan of the input sequence to obtain the right distances of all indexes, and then perform a forward scan, during which the following are carried out simultaneously:

(1) compute the left distances;

(2) perform unreachability reduction;

(3) replace irrelevant items with blanks;

(4) perform prefix/suffix reduction and blank reduction;

(5) split the input sequence into several subsequences at runs of γ+1 blanks (blank separation); these subsequences form the final output P_ω(T).
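The following is a deliberately simplified sketch of the rewriting, not the full procedure listed above. It covers only step (3), blank reduction, and trimming of leading and trailing blanks; the distance computations, unreachability reduction, and the split of step (5) are omitted. It assumes items are single characters, that `"_"` is not an item, that ω is a frequent item present in `rank`, and that an item is irrelevant for ω when it is infrequent or larger than ω. The function name `construct_partition` is hypothetical.

```python
from itertools import groupby

BLANK = "_"  # assumption: this symbol never occurs as an item


def construct_partition(t, omega, rank, gamma):
    """Simplified P_omega(T): returns a list with at most one rewritten sequence."""
    if omega not in t:
        return []  # T is not relevant to the central item
    # Step (3): replace irrelevant items with blanks.
    w = [x if x in rank and rank[x] <= rank[omega] else BLANK for x in t]
    # Blank reduction: a run longer than gamma + 1 blanks separates items
    # just as well when shortened to gamma + 1 blanks.
    out = []
    for is_blank, run in groupby(w, key=lambda x: x == BLANK):
        run = list(run)
        out.extend(run[: gamma + 1] if is_blank else run)
    # Trim blanks at both ends; they cannot take part in any sequence.
    while out and out[0] == BLANK:
        out.pop(0)
    while out and out[-1] == BLANK:
        out.pop()
    # A sequence with fewer than two items can only yield S = omega itself.
    if sum(x != BLANK for x in out) < 2:
        return []
    return ["".join(out)]


rank = {"a": 0, "b": 1, "c": 2, "d": 3}  # assumed order a < b < c < d
print(construct_partition("bcaddbd", "c", rank, 1))
```

Even this reduced form shortens the sequences sent to the reducers, because every item larger than the central item collapses into a short run of blanks.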

The present invention does not construct a separate partition for each distinct central item. Instead, it constructs one partition for several central items combined, so that each partition contains roughly m or more sequences. This is called grouping. Grouping is done by scanning the f-list, whose items are arranged in descending order of frequency, and placing adjacent items in one group until the sum of their frequencies exceeds m. Once every item has been traversed in this way, the grouping is complete.
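The grouping scan can be sketched as below, assuming the f-list is given as `(item, frequency)` pairs already sorted by descending frequency. The function name `group_items` is hypothetical, and keeping a final group whose frequency sum has not yet exceeded m is an assumption, since the text does not say how leftover items are handled.

```python
def group_items(f_list, m):
    """Place adjacent f-list items in one group until their frequency
    sum exceeds m, then start a new group."""
    groups, current, total = [], [], 0
    for item, freq in f_list:
        current.append(item)
        total += freq
        if total > m:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)  # assumption: leftover items form a last group
    return groups


groups = group_items([("a", 5), ("b", 4), ("c", 2), ("d", 2), ("e", 1)], m=4)
print(groups)
```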

(3) Mining phase (Reduce):

The input to the mining phase is P_ω, the result of the partition phase. Any general FSM algorithm can now be applied to P_ω to obtain F_σ,γ,λ(P_ω), where P_ω is the partition whose central item is ω, and F_σ,γ,λ(P_ω) is the set of all (σ, γ)-frequent sequences in P_ω whose length does not exceed λ. The present invention uses the PrefixSpan algorithm. PrefixSpan treats the leading segment of a sequence as a prefix, projects the input database onto the prefix, mines the frequent items of the projected database, extends the prefix with them, and continues mining until all frequent sequences are found. Both its time efficiency and its space efficiency are much better than those of Apriori-like algorithms.
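The prefix-projection idea can be illustrated with a compact sketch. Note the assumptions: this is the classic PrefixSpan scheme for sequences of single items without a gap constraint (gaps of any length are allowed), so it shows the projection mechanism rather than the gap-constrained variant a (σ, γ, λ) miner would need. The function name `prefixspan` and its parameters are illustrative.

```python
from collections import defaultdict


def prefixspan(database, sigma, max_len):
    """Return {pattern (tuple of items): frequency} for all patterns with
    frequency >= sigma and length <= max_len, ignoring gap constraints."""
    results = {}

    def mine(prefix, projected):
        # projected: (sequence index, start of the suffix after the prefix)
        supporters = defaultdict(set)
        for sid, start in projected:
            for item in database[sid][start:]:
                supporters[item].add(sid)
        for item, sids in sorted(supporters.items()):
            if len(sids) < sigma:
                continue
            pattern = prefix + (item,)
            results[pattern] = len(sids)
            if len(pattern) < max_len:
                # Project on the first occurrence of the item in each suffix.
                new_projected = []
                for sid, start in projected:
                    try:
                        pos = database[sid].index(item, start)
                    except ValueError:
                        continue  # this sequence does not extend the prefix
                    new_projected.append((sid, pos + 1))
                mine(pattern, new_projected)

    mine((), [(sid, 0) for sid in range(len(database))])
    return results


patterns = prefixspan(["abc", "acb", "ab"], sigma=2, max_len=2)
print(patterns)
```

Each recursive call only looks at the suffixes that follow the current prefix, which is why no candidate sequences have to be generated and tested against the whole database.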

Finally, F_σ,γ,λ(P_ω) is obtained for each frequent item. The union of these sets contains all frequent sequences, but with duplicates, so the last step is simply to filter out the duplicate sequences.

Under the assumption P_ω(T) = {T}, every input sequence that contains a sequence S with ω ∈ S also contains ω and therefore belongs to P_ω, so f_γ(S, P_ω) = f_γ(S, D) holds for all sequences S with ω ∈ S. The algorithm is therefore clearly correct.

An example follows. Assume the input database D = {acb, dacbd, dacbddbca, bd, bcaddbd, addcd} and the central item c.

Let P_c(T) = {T} if c ∈ T,

and P_c(T) = ∅ otherwise. This clearly gives:

P_c = {acb, dacbd, dacbddbca, bcaddbd, addcd}

With this kind of partitioning, P_c is large, which leads to high communication cost. In addition, a frequent sequence mining algorithm run on such a P_c generates a large number of sequences in the mining phase that are eventually filtered out and are therefore useless. For example, F_1,1,3(P_c) contains the sequences da, dab, add, and so on, all of which are discarded when the results are filtered at the end of the mining phase. From the point of view of efficiency, this extra computation is wasted. The definition of ω-equivalence is therefore introduced; ω-equivalent partitions greatly reduce computation cost and communication cost.
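The waste described above can be made concrete with a short sketch. It assumes the naive partition P_c(T) = {T}, single-character items, and σ = 1, so that every γ-subsequence of a partition member is frequent. The helper `gamma_subsequences` and the variable names are hypothetical.

```python
def gamma_subsequences(t, gamma, max_len):
    """Enumerate all distinct gamma-subsequences of t with length <= max_len."""
    found = set()

    def extend(seq, last):
        found.add(seq)
        if len(seq) == max_len:
            return
        # The next matched position may skip at most gamma items.
        for j in range(last + 1, min(last + gamma + 2, len(t))):
            extend(seq + t[j], j)

    for i in range(len(t)):
        extend(t[i], i)
    return found


D = ["acb", "dacbd", "dacbddbca", "bd", "bcaddbd", "addcd"]
P_c = [t for t in D if "c" in t]  # naive partition for central item c

# With sigma = 1, gamma = 1, lambda = 3 every enumerated sequence is frequent.
mined = set().union(*(gamma_subsequences(t, 1, 3) for t in P_c))
# Sequences without c can never have central item c, so they are discarded.
wasted = {s for s in mined if "c" not in s}
print(sorted(wasted))
```

Sequences that contain c but whose largest item is d are discarded by the same filter; the set `wasted` only collects the most obvious cases.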

Finally, note that the frequent sequences mined by the present invention need not be contiguous: a gap threshold can be set so that non-contiguous frequent sequences whose gaps do not exceed this threshold are mined. This feature also extends to databases that contain redundant data or data-loss errors, which can still be mined as long as the length of consecutive redundant data does not exceed the gap threshold that has been set.

The procedure of the MG-FSM algorithm is as follows:

Input: sequence database D, σ, γ, λ, f-list F_σ,0,1(D)

Output: all sequences S that satisfy the conditions discussed in the first section, together with their frequencies

Map(T):
    for all distinct ω ∈ T such that ω ∈ F_σ,0,1(D) do
        Construct a sequence database P_ω(T) that is (ω,γ,λ)-equivalent to {T}
        For each S ∈ P_ω(T), output (ω, S)
    end for

Reduce(ω, P_ω):
    F_σ,γ,λ(P_ω) ← FSM_σ,γ,λ(P_ω)
    for all S ∈ F_σ,γ,λ(P_ω) do
        if p(S) = ω and S ≠ ω then
            Output (S, f_γ(S, P_ω))
        end if
    end for
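The procedure above can be simulated in memory as follows. This is a minimal sketch under several assumptions: the naive partition P_ω(T) = {T} stands in for the ω-equivalent rewriting, a brute-force enumeration stands in for the FSM step, items are single characters, and ties in item frequency are broken alphabetically. The names `mg_fsm` and `gamma_subsequences` are hypothetical.

```python
from collections import Counter, defaultdict


def gamma_subsequences(t, gamma, max_len):
    """All distinct gamma-subsequences of t with length <= max_len."""
    found = set()

    def extend(seq, last):
        found.add(seq)
        if len(seq) < max_len:
            for j in range(last + 1, min(last + gamma + 2, len(t))):
                extend(seq + t[j], j)

    for i in range(len(t)):
        extend(t[i], i)
    return found


def mg_fsm(database, sigma, gamma, max_len):
    # Preprocessing: f-list and item order (rank 0 = most frequent item).
    counts = Counter()
    for t in database:
        counts.update(set(t))
    frequent = sorted((x for x in counts if counts[x] >= sigma),
                      key=lambda x: (-counts[x], x))
    rank = {x: r for r, x in enumerate(frequent)}

    # Map: emit (omega, S) for every frequent item omega in T.
    partitions = defaultdict(list)
    for t in database:
        for omega in set(t):
            if omega in rank:
                partitions[omega].append(t)  # naive P_omega(T) = {T}

    # Reduce: mine each partition, keep only sequences with central item omega.
    output = {}
    for omega, p_omega in partitions.items():
        freq = Counter()
        for t in p_omega:
            freq.update(gamma_subsequences(t, gamma, max_len))
        for s, f in freq.items():
            # len(s) >= 2 implements the condition S != omega.
            if f >= sigma and len(s) >= 2 and max(s, key=rank.__getitem__) == omega:
                output[s] = f
    return output


result = mg_fsm(["abc", "abc", "acb"], sigma=2, gamma=0, max_len=3)
print(result)
```

Because every sequence has exactly one central item, each frequent sequence is emitted by exactly one reducer, so the merged output contains no duplicates.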

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and variations without departing from the technical principles of the present invention, and these improvements and variations should also be regarded as falling within the scope of protection of the present invention.

Claims

1. A frequent sequence mining method for massive data, characterized by comprising the following steps:

1) the user inputs time series data; basic statistics of the data are obtained, the frequency of each ω ∈ Σ is calculated, and the set F_σ,0,1(D) of frequent items is constructed, where ω denotes a subsequence of the input, Σ is the universal set of all input time series, D denotes the input time series database, the subscript σ denotes the support threshold, 0 is the gap threshold, and 1 is the length threshold; here "frequent" means that, for σ > 0, a sequence S is (σ, γ)-frequent if f_γ(S, D) ≥ σ, where f_γ(S, D) denotes the frequency of sequence S;

2) for all frequent items in Σ, the ω-equivalent partition P_ω is constructed in Map;

3) each partition P_ω constructed in step 2) is mined independently in Reduce to obtain F_σ,γ,λ(P_ω), where P_ω is the partition whose central item is ω, and F_σ,γ,λ(P_ω) is the set of all (σ, γ)-frequent sequences in P_ω whose length does not exceed λ;

4) the sets F_σ,γ,λ(P_ω) obtained in step 3) for each frequent item are merged, and duplicate sequences are filtered out to obtain the final output.

2. The frequent sequence mining method for massive data according to claim 1, characterized in that, in step 1), the basic statistics of the data include the average length of the time series data, the maximum length, the total number of sequences, the total number of items, the number of distinct items, and the total number of bytes.

3. The frequent sequence mining method for massive data according to claim 1, characterized in that step 1) is completed by a single MapReduce job.

4. The frequent sequence mining method for massive data according to claim 1, characterized in that, in step 1), an integer identifier is declared for each item, and sequences are represented entirely as arrays of integer identifiers; first, the integer identifiers are ordered by descending item frequency, and then the items are compressed into integers using variable-byte encoding.

5. The frequent sequence mining method for massive data according to claim 1, characterized in that, in step 2), the ω-equivalent partition is constructed as follows:

2-1) checking, by minimality, whether the input time series is relevant to the central item; if it is not relevant, let P_ω(T) = ∅; if it is relevant, performing a backward scan of the input time series to obtain the right distances of all indexes;

2-2) then performing a forward scan, during which the following are carried out simultaneously:

(a) computing the left distances;

(b) performing unreachability reduction;

(c) replacing irrelevant items with blanks;

(d) performing prefix/suffix reduction and blank reduction;

(e) splitting the input sequence into several subsequences at runs of γ+1 blanks (blank separation), these subsequences forming the final output P_ω.

6. The frequent sequence mining method for massive data according to claim 5, characterized in that, before partitioning, the set F_σ,0,1(D), whose items are arranged in descending order of frequency, is first scanned; adjacent items are placed in one group until the sum of their frequencies exceeds a set value m, and every item is traversed in this way to complete the grouping; then one separate partition is constructed for each group.

7. The frequent sequence mining method for massive data according to claim 5, characterized in that the partition P_ω is mined using the PrefixSpan algorithm.