CN104537025A - Frequent sequence mining method - Google Patents

Frequent sequence mining method

Info

Publication number
CN104537025A
Authority
CN
China
Prior art keywords
sequence
frequent
candidate
node
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410802280.8A
Other languages
Chinese (zh)
Other versions
CN104537025B (en)
Inventor
苏森
程祥
许胜之
双锴
徐鹏
王玉龙
张忠宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201410802280.8A
Publication of CN104537025A
Application granted
Publication of CN104537025B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of data privacy and data mining and discloses a frequent sequence mining method. The method includes a first step of calculating the maximum constraint length l_max of sequences from a raw database and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i stands for the maximum support of sequences of length i; and a second step of searching the raw database for frequent sequences according to l_max and β = {β_1, ..., β_i, ..., β_n}, using a sampling-based candidate set pruning technique, under the differential privacy protection paradigm. The frequent sequence mining method (PFS²), which performs candidate set pruning based on sampling while satisfying differential privacy, achieves high utility of the mining results while providing differential privacy protection.

Description

Frequent sequence mining method
Technical field
The present invention relates to the technical field of data privacy and data mining, and in particular to a frequent sequence mining method.
Background technology
Frequent sequence mining is a basic problem in data mining and has a wide range of applications in numerous areas. Frequent sequence mining can be described as follows: given a sequence database, where each sequence is an ordered list of items and can be regarded as the record of a single user, consider two sequences S = s_1 s_2 ... s_{|S|} and T = t_1 t_2 ... t_{|T|}. If there exist integers w_1 < w_2 < ... < w_{|S|} such that t_{w_1} = s_1, t_{w_2} = s_2, ..., t_{w_{|S|}} = s_{|S|}, then T is said to contain S. The support of a sequence is the number of sequences in the database that contain it. When the support of a sequence is not less than a given threshold, the sequence is called a frequent sequence. Given a sequence database and a threshold, frequent sequence mining is the task of finding all frequent sequences occurring in the database.
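As an illustration of the containment and support definitions above, the following minimal Python sketch checks whether a sequence T contains a sequence S and counts supports in a toy database (the helper names contains and support are illustrative and not part of the patent):

```python
from typing import Sequence

def contains(T: Sequence[str], S: Sequence[str]) -> bool:
    """True if T contains S, i.e. S is an ordered (not necessarily
    contiguous) subsequence of T."""
    it = iter(T)
    return all(item in it for item in S)  # each 'in' consumes 'it' left to right

def support(db: list, S: Sequence[str]) -> int:
    """Number of sequences in the database that contain S."""
    return sum(1 for T in db if contains(T, S))

db = [list("abcbbce"), list("abce"), list("bbd")]
print(support(db, list("ab")))  # 2: contained by the first two sequences
print(support(db, list("bb")))  # 2: contained by the first and third sequences
```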
If the sequences in the database contain sensitive information, directly publishing the frequent sequences can leak user privacy. How to protect user privacy during frequent sequence mining has therefore received wide attention from both academia and industry. The differential privacy paradigm [1] provides a feasible scheme for the privacy problems that arise in data analysis: by adding noise, it offers user privacy protection with a theoretical guarantee.
At present, research on protecting frequent pattern mining with differential privacy can be divided into three classes according to the type of the mined object, namely frequent sequence mining, frequent itemset mining and frequent graph mining. For frequent sequence mining, document [2] proposes a two-stage differentially private algorithm for mining frequent consecutive item sequences; it first uses a prefix tree to search for candidate frequent sequences, and then uses a database transformation technique to improve the supports of the candidate sequences. Documents [3][4] address the problem of publishing sequence databases under differential privacy: document [3] proposes a prefix-tree-based differentially private sequence database publishing algorithm, while document [4] uses a variable-length n-gram model to extract the necessary information from the sequence database and uses a parse tree to reduce the amount of added noise. For frequent itemset mining, document [5] proposes the PrivBasis algorithm, which satisfies differential privacy and mines top-k frequent itemsets. Document [6] finds that limiting the length of transactions can effectively improve the trade-off between data utility and privacy protection; using a truncation method, it designs an Apriori-based frequent itemset mining algorithm that satisfies differential privacy. For frequent graph mining, document [7] incorporates the frequent subgraph mining process and privacy protection into a Markov Chain Monte Carlo framework and proposes a new differentially private frequent subgraph mining algorithm.
However, the above methods all have shortcomings in the utility of the mining results and in the level of privacy protection, which hinders the application of differential privacy techniques in frequent pattern mining.
The relevant literature of the background art is listed below:
[1] C. Dwork, "Differential privacy," in ICALP, 2006.
[2] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 2002.
[3] R. Chen, B. C. M. Fung, and B. C. Desai, "Differentially private transit data publication: A case study on the Montreal transportation system," in KDD, 2012.
[4] R. Chen, G. Acs, and C. Castelluccia, "Differentially private sequential data publication via variable-length n-grams," in CCS, 2012.
[5] N. Li, W. Qardaji, D. Su, and J. Cao, "PrivBasis: frequent itemset mining with differential privacy," in VLDB, 2012.
[6] C. Zeng, J. F. Naughton, and J.-Y. Cai, "On differentially private frequent itemset mining," in VLDB, 2012.
[7] E. Shen and T. Yu, "Mining frequent graph patterns with differential privacy," in KDD, 2013.
[8] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in EDBT, 1996.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is: how to provide high utility of the mining results while satisfying differential privacy protection.
(2) Technical solution
To solve the above technical problem, the invention provides a frequent sequence mining method, comprising the steps of:
S1: calculating the maximum constraint length l_max of sequences from a raw database, and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i represents the maximum support of sequences of length i;
S2: according to the l_max and β = {β_1, ..., β_i, ..., β_n}, searching the raw database for frequent sequences based on a sampling-based candidate set pruning technique, under the condition of satisfying the differential privacy protection paradigm.
Wherein, in the step S1, calculating the maximum constraint length of sequences specifically comprises:
obtaining the maximum length l_1 of the sequences in the raw database;
calculating l_2 such that (Σ_{i=1}^{l_2} α_i) / (Σ_{j=1}^{n} α_j) is not less than a predetermined value, where α_i and α_j represent the numbers of sequences of length i and j, respectively, in the raw database;
calculating the maximum constraint length of the sequences as l_max = min{l_1, l_2}.
Wherein, the predetermined value is 85%.
Wherein, when calculating l_2, noise is added to each α_i.
Wherein, the step S1 further comprises adding noise to each β_i.
Wherein, the step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length L_f: let L_f be the integer y such that β_y is the smallest value in β that is greater than θ;
S2.2: randomly dividing the raw database into L_f mutually disjoint databases as sample databases, which form a set dbSet, each database containing |D|/L_f sequences, where |D| denotes the number of sequences in the database;
S2.3: generating candidate frequent sequences: when mining frequent 1-sequences, the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used for mining the frequent k-sequences;
S2.4: in the sample databases, for sequences whose length exceeds the maximum constraint length l_max, applying a sequence shrinking method to limit their length, and at the same time relaxing the user-specified threshold for judging whether a sequence in a sample database is frequent;
S2.5: computing the noisy supports of the candidate sequences in the raw database, and outputting the candidate sequences whose noisy support is greater than the θ as frequent sequences.
Wherein, the sequence shrinking method specifically comprises the following steps:
Step 1, outlier deletion: an item in a sequence that is not contained in any candidate sequence is called an outlier, and such items are deleted;
Step 2, continuous pattern compression: in a sequence, a pattern that may occur continuously is called a continuous pattern, and continuous pattern compression is carried out in the following manner:
let p^k denote that pattern p occurs k times continuously, let T_1|T_2|T_3 denote the sequence obtained by concatenating the sequences T_1, T_2 and T_3, and let Contain_k(T) denote the set of candidate frequent k-sequences contained in T; then for the sequences T_1|p^j|T_2 and T_1|p^k|T_2 (where j > k), Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2);
Step 3, sequence reconstruction: if the sequence length still cannot meet the maximum length constraint after the above Step 1 and Step 2, sequence reconstruction is carried out in the following manner:
a candidate sequence tree CS-tree is formed, in which the common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences, each node in the tree is labeled with an item, and each node is associated with the sequence formed by the items on the path from the root to that node; for a CS-tree built from a group of candidate frequent k-sequences, the nodes on the k-th level are associated with the candidate frequent k-sequences and are called c-nodes, the nodes on the (k-1)-th level are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences; the sequence reconstruction process is as follows:
a) using the candidate frequent k-sequences C_k to build a CS-tree, denoted CT; for a sequence S, finding the candidate frequent k-sequences C'_k contained in S, and generating a new CS-tree from these candidate frequent k-sequences, denoted subCT, which is a subtree of CT; at the same time building an empty sequence S';
b) choosing a candidate sequence from C'_k and appending it to S': for any candidate sequence cs in C'_k, the set of its (k-1)-subsequences may contain the generating sequences of some other candidate sequences in C'_k, each such generating sequence corresponds to a g-node, each g-node has a group of child nodes, and the sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score; the candidate sequence with the largest c-score is chosen, and if several candidate sequences have the same c-score, the one containing the most distinct items is chosen;
c) removing from subCT the c-node corresponding to the sequence chosen in step b);
d) using subCT to find the generating sequences contained in the set of (k-1)-subsequences of S', where these generating sequences correspond to a group of g-nodes; if some nodes in this group of g-nodes have child nodes, which are c-nodes, each of these child nodes corresponds to a labeled item and the labeled items of different child nodes may be identical, then the item corresponding to the most c-nodes is chosen and appended to S', and all of its corresponding c-nodes are removed from subCT; if none of the nodes in this group of g-nodes has child nodes, then a candidate sequence that is not already contained by S' is chosen from C'_k according to step b) and appended to S'; items are continuously appended to S' until it meets the maximum constraint length requirement.
Wherein, the threshold relaxation method specifically comprises:
for a given sample database D_s and a group of candidate frequent k-sequences, assuming a candidate frequent k-sequence t_k whose true support equals the specified threshold θ; then computing the cumulative distribution function F_k of the noisy support of t_k in D_s; finally setting F_k(θ') = ζ, computing the corresponding θ', and taking the sequences whose support in the sample database is greater than θ' as potential frequent sequences.
(3) Beneficial effects
The frequent sequence mining method of the present invention (PFS²), which performs sampling-based candidate set pruning while satisfying differential privacy, can provide high utility of the mining results while satisfying differential privacy protection.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a frequent sequence mining method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a CS-tree;
Fig. 3(a)-(f) are comparison diagrams of the F-score and the relative error (RE) of the method of the present invention (i.e. the PFS² algorithm), the Prefix algorithm and the n-gram algorithm under different thresholds;
Fig. 4(a)-(d) are comparison diagrams of the F-score and the relative error (RE) of the method of the present invention (i.e. the PFS² algorithm), the Prefix algorithm and the n-gram algorithm under different privacy parameters;
Fig. 5(a)-(d) show the impact of the sequence shrinking method and the threshold relaxation method on the performance of the PFS² algorithm.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, but not to limit its scope.
The frequent sequence mining method of the embodiment of the present invention is a frequent sequence mining method that performs sampling-based candidate set pruning while satisfying differential privacy, and specifically comprises:
Step S1: calculating the maximum constraint length l_max of sequences from the raw database, and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i represents the maximum support of sequences of length i;
Step S2: using the sampling-based candidate set pruning technique to search for frequent sequences under the condition of satisfying the differential privacy protection paradigm. In the candidate sequence pruning process, the present embodiment employs a sequence shrinking method and a threshold relaxation method.
The implementation of step S1 and step S2 is described in detail below. Step S1 specifically comprises:
Step S1.1: calculating the maximum constraint length. Given a database, l_max is determined heuristically by setting l_max = min{l_1, l_2}, where l_1 denotes the maximum length of the sequences in the database, which determines the maximum error brought by the noise added during support computation, and l_2 is calculated from the database. The calculation of l_2 can be described as follows: first, let α = {α_1, ..., α_n}, where α_i is the number of input sequences of length i in the raw database; then l_2 is calculated such that (Σ_{i=1}^{l_2} α_i) / (Σ_{j=1}^{n} α_j) is not less than 85%. Because the calculation of α involves data privacy, appropriate noise is added to each α_i. The noise is added according to the Laplace mechanism, that is, a random number obeying the Laplace distribution is added to the exact result.
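A minimal Python sketch of step S1.1 under the assumptions above (the 85% ratio is computed over noisy length counts; the Laplace scale is left to the caller, and the helper names are illustrative, not taken from the patent):

```python
import numpy as np

def max_constraint_length(db, scale, ratio=0.85, seed=0):
    """Heuristically compute l_max = min{l_1, l_2}: l_1 is the maximum
    sequence length, l_2 the smallest length such that the (noisy)
    fraction of sequences of length <= l_2 is at least `ratio`."""
    rng = np.random.default_rng(seed)
    lengths = [len(s) for s in db]
    l1 = max(lengths)
    alpha = np.zeros(l1 + 1)                 # alpha[i] = number of sequences of length i
    for L in lengths:
        alpha[L] += 1
    noisy = np.maximum(alpha + rng.laplace(0.0, scale, alpha.shape), 0)
    l2 = int(np.searchsorted(np.cumsum(noisy), ratio * noisy.sum()))
    return min(l1, max(l2, 1))
```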
Step S1.2: calculating β = {β_1, ..., β_i, ..., β_n}, where β_i is the maximum support among all sequences of length i. β will be used in step S2 to estimate the maximal frequent sequence length L_f. Since computing β exactly is computationally infeasible, in the present embodiment the non-private frequent sequence mining algorithm GSP [8] is used to obtain β. Because β is an intrinsic property of the database and may leak data privacy, appropriate noise is added to each β_i.
In this way, the required maximum constraint length l_max of the sequences and β are extracted in step S1. Step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length L_f: let L_f be the integer y such that β_y is the smallest value in β that is greater than θ.
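A small sketch of S2.1, reading the noisy β as a list indexed by length (the helper name is illustrative, not from the patent):

```python
def estimate_Lf(beta, theta):
    """L_f is the length y whose beta_y is the smallest value in beta
    that is still greater than theta (index 0 of beta is unused)."""
    above = [(b, y) for y, b in enumerate(beta) if y >= 1 and b > theta]
    return min(above)[1] if above else 1  # fall back to length 1

print(estimate_Lf([0.0, 0.9, 0.6, 0.3, 0.1], theta=0.2))  # 3
```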
S2.2: randomly dividing the raw database into L_f mutually disjoint databases as sample databases, which together form the set dbSet, each database containing approximately |D|/L_f sequences, where |D| denotes the number of sequences in the database;
S2.3: generating candidate frequent sequences. When mining frequent 1-sequences (the expression x-sequence denotes a sequence of length x), the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used for mining the frequent k-sequences.
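The patent does not spell out the join step, so the following sketch shows one conventional GSP-style way of generating candidate k-sequences from frequent (k-1)-sequences; it is a common reading of the downward closure property, not necessarily the exact procedure used here:

```python
from itertools import product

def gen_candidates(freq_prev, k, items):
    """Candidate k-sequences from frequent (k-1)-sequences (tuples of items)."""
    if k == 1:
        return {(it,) for it in items}
    cands = set()
    for a, b in product(freq_prev, repeat=2):
        # join: if a without its first item equals b without its last item,
        # then a extended by b's last item is a candidate k-sequence
        if a[1:] == b[:-1]:
            cands.add(a + (b[-1],))
    # downward closure pruning: every (k-1)-subsequence must itself be frequent
    return {c for c in cands
            if all(c[:i] + c[i + 1:] in freq_prev for i in range(len(c)))}

f2 = {("a", "b"), ("b", "c"), ("a", "c")}
print(gen_candidates(f2, 3, set()))  # {('a', 'b', 'c')}
```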
S2.4: given the sample databases, pruning the candidate set with the sampling-based candidate set pruning algorithm proposed by the present invention (comprising the sequence shrinking method and the threshold relaxation method). Specifically, in the sample databases, for sequences whose length exceeds the maximum constraint length l_max, the sequence shrinking method is applied to limit their length; meanwhile, the user-specified threshold is relaxed for judging whether a sequence in a sample database is frequent.
S2.5: computing the noisy supports of the candidate sequences in the raw database, and outputting the candidate sequences whose noisy support is greater than θ as frequent sequences.
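A sketch of S2.5 using the Laplace mechanism mentioned earlier; it repeats the support check from the first sketch so that it is self-contained. In practice the noise scale would be derived from the privacy budget and the sensitivity, which are not fixed here, so the parameters are illustrative:

```python
import numpy as np

def _support(db, S):
    """True support of S in db (S is an ordered subsequence of a database sequence)."""
    def contains(T, S):
        it = iter(T)
        return all(item in it for item in S)
    return sum(1 for T in db if contains(T, S))

def noisy_supports(db, candidates, scale, seed=0):
    """Add Laplace noise to each candidate's true support in the raw database."""
    rng = np.random.default_rng(seed)
    return {c: _support(db, c) + rng.laplace(0.0, scale) for c in candidates}

def output_frequent(db, candidates, theta_count, scale):
    """Return the candidates whose noisy support exceeds the threshold."""
    return [c for c, s in noisy_supports(db, candidates, scale).items()
            if s > theta_count]
```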
The sequence shrinking method and the threshold relaxation method used in this step are introduced below. The sequence shrinking method specifically comprises:
In a sample database, the support of a sequence in that sample database, i.e. its local support, is used to estimate whether it is a potential frequent sequence. Due to the privacy requirement, noise must be added to the local support of each sequence. To make the estimation more accurate, the added noise should be as small as possible. In differentially private frequent itemset mining, it has been found that limiting the length of transactions can effectively reduce the amount of added noise. However, because sequences and itemsets differ in nature, the method proposed there is no longer applicable to frequent sequence mining. For this reason, the present invention proposes the sequence shrinking method.
For convenience, consider an example: given a sequence A = abcbbce and four candidate frequent 2-sequences ab, be, bb and ae, it can be seen that the sequences A_1 = abbbe and A_2 = abbe contain the same candidate frequent 2-sequences as the sequence A.
The sequence shrinking method mainly comprises the following steps:
Step 1: outlier deletion. An item in a sequence that is not contained in any candidate sequence is called an outlier, such as the item c in the above example. Since outliers do not affect the candidate sequences, deleting them causes no information loss. In this way A is converted into A_1.
Step 2: continuous pattern compression. In a sequence, a certain pattern may occur continuously; such a pattern is called a continuous pattern. The following theorem guarantees that compressing continuous patterns to a certain extent does not damage the frequency information.
Theorem: let p^k denote that pattern p occurs k times continuously, let T_1|T_2|T_3 denote the sequence obtained by concatenating the sequences T_1, T_2 and T_3, and let Contain_k(T) denote the set of candidate frequent k-sequences contained in T. Then, for the sequences T_1|p^j|T_2 and T_1|p^k|T_2 (where j > k), Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2). Continuous patterns can therefore be compressed.
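A compact sketch of Step 1 and Step 2 on the example above; the compression handles only runs of a single repeated item, capping them at k occurrences, which is a simplification of the general theorem (the names are illustrative):

```python
def delete_outliers(seq, candidates):
    """Step 1: drop items that appear in no candidate sequence."""
    kept = {item for c in candidates for item in c}
    return [x for x in seq if x in kept]

def compress_runs(seq, k):
    """Step 2 (simplified): cap each run of one repeated item at k
    occurrences, which preserves the contained candidate k-sequences."""
    out = []
    for x in seq:
        run = 0
        for y in reversed(out):
            if y != x:
                break
            run += 1
        if run < k:
            out.append(x)
    return out

cands = [("a", "b"), ("b", "e"), ("b", "b"), ("a", "e")]
A1 = delete_outliers(list("abcbbce"), cands)  # ['a', 'b', 'b', 'b', 'e']
A2 = compress_runs(A1, k=2)                   # ['a', 'b', 'b', 'e']
```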
Step 3: sequence reconstruction. If the sequence length still cannot meet the maximum length constraint after Step 1 and Step 2 above, some items in the sequence must be removed further. Removing them at random would cause a large amount of frequency information loss and thus misjudgment of potential frequent sequences, while using enumeration to find the sequence that meets the maximum length constraint and shares the largest number of candidate sequences with the original sequence is computationally infeasible. Therefore, the present invention proposes a novel sequence reconstruction method.
First, the definition of the candidate sequence tree, i.e. the CS-tree, is given. The common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences; each node in the tree is labeled with an item, and each node is associated with the sequence formed by the items on the path from the root to that node. Fig. 2 shows a CS-tree formed by the candidate frequent 3-sequence set {abc, bcd, bda, bdb}.
For a CS-tree formed from a group of candidate frequent k-sequences, the nodes on the k-th level are associated with the candidate frequent k-sequences and are called c-nodes; similarly, the nodes on the (k-1)-th level are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences. The sequence reconstruction process is described in detail below:
a) The candidate frequent k-sequences C_k are used to build a CS-tree, denoted CT. For a sequence S, the candidate frequent k-sequences contained in S, denoted C'_k, are found, and a new CS-tree is generated from these candidate frequent k-sequences, denoted subCT. It can be seen that subCT is a subtree of CT. At the same time, an empty sequence S' is built.
b) A candidate sequence is chosen from C'_k and appended to S'. For any candidate sequence cs in C'_k, the set of its (k-1)-subsequences may contain the generating sequences of some other candidate sequences in C'_k; each such generating sequence corresponds to a g-node, and each g-node has a group of child nodes. The sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score. The candidate sequence with the largest c-score is chosen; if several candidate sequences have the same c-score, the one containing the most distinct items is chosen.
c) The c-node corresponding to the sequence chosen in step b) is removed from subCT.
d) subCT is used to find the generating sequences contained in the set of (k-1)-subsequences of S'; these generating sequences correspond to a group of g-nodes. If some nodes in this group of g-nodes have child nodes, which are c-nodes, each of these child nodes corresponds to a labeled item, and the labeled items of different child nodes may be identical; the item corresponding to the most c-nodes is chosen and appended to S', and all of its corresponding c-nodes are removed from subCT. If none of the nodes in this group of g-nodes has child nodes, a candidate sequence that is not already contained by S' is chosen from C'_k according to step b) and appended to S'.
In this way, items are continuously appended to S' until it meets the maximum constraint length requirement. With this method, sequence reconstruction can be carried out efficiently while making the number of candidate sequences shared by S' and S as large as possible. The threshold relaxation method specifically comprises:
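To make the CS-tree concrete, the following sketch builds the tree of Fig. 2 and computes the c-score used in step b); it illustrates the data structure only, not the full reconstruction loop, and the class and function names are illustrative:

```python
class CSNode:
    """One node of a CS-tree; children are keyed by item label."""
    def __init__(self, item=None):
        self.item = item
        self.children = {}

def build_cs_tree(candidates):
    """Insert each candidate sequence so that common prefixes share a branch."""
    root = CSNode()
    for cand in candidates:
        node = root
        for item in cand:
            node = node.children.setdefault(item, CSNode(item))
    return root

def c_score(cs, root):
    """Sum of the child counts of the g-nodes reached by the distinct
    (k-1)-subsequences of cs that exist as prefixes (generating sequences)
    in the tree."""
    score = 0
    for sub in {cs[:i] + cs[i + 1:] for i in range(len(cs))}:
        node = root
        for item in sub:
            node = node.children.get(item)
            if node is None:
                break
        else:                          # reached a g-node at depth k-1
            score += len(node.children)
    return score

tree = build_cs_tree([tuple("abc"), tuple("bcd"), tuple("bda"), tuple("bdb")])
print(c_score(tuple("bda"), tree))  # 2: the g-node for prefix bd has children a and b
```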
Because the sequences in a sample database are randomly drawn from the raw database, many frequent sequences may become no longer frequent in the sample database. If the user-specified threshold were still used to judge which sequences are potential frequent sequences, the estimation would be erroneous. To address this problem, the present invention proposes the threshold relaxation method. Its process can be described as follows: for a given sample database D_s and a group of candidate frequent k-sequences, first assume a candidate frequent k-sequence t_k whose true support equals the specified threshold θ; then compute the cumulative distribution function F_k of the noisy support of t_k in D_s. Denote the noisy support of t_k in D_s by z; z is the sum of two random variables x (the true support of t_k in D_s) and the Laplace noise y, where x follows the Gaussian distribution Normal(μ, σ²) and y follows the Laplace distribution Laplace(λ, b). The cumulative distribution function of z is therefore the convolution of the distributions of x and y:
F_k(z) = P(x + y ≤ z) = ∫ Φ((z - t - μ)/σ) · (1/(2b)) exp(-|t - λ|/b) dt,
where Φ is the standard Gaussian cumulative distribution function and the integral runs over all real t; this expression has a closed form in terms of the complementary error function erfc. Here μ and σ² are the parameters of the Gaussian distribution, where μ is the expectation and σ² is the variance; λ and b are the parameters of the Laplace distribution, where λ is the location parameter and b is the scale parameter. erfc is the complementary error function, whose expression is
erfc(x) = (2/√π) ∫_x^∞ exp(-p²) dp,
where p is the integration variable. Finally, set F_k(θ') = ζ (the expression of F_k(θ') is the above formula with z replaced by θ'), compute the corresponding θ', and use it to estimate which candidate sequences in the sample database are potential frequent sequences, i.e. the sequences whose support in the sample database is greater than θ' are potential frequent sequences. Experiments show that setting ζ to 0.3 usually gives good results, and theoretical analysis shows that the method satisfies the requirement of the differential privacy paradigm.
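A numerical sketch of the threshold relaxation step, solving F_k(θ') = ζ by evaluating the Gaussian-Laplace convolution CDF with quadrature (scipy-based; the concrete values of μ, σ and the Laplace scale b below are illustrative, not prescribed by the patent):

```python
import numpy as np
from scipy import integrate, optimize, stats

def cdf_noisy_support(z, mu, sigma, loc, b):
    """CDF of z = x + y with x ~ Normal(mu, sigma^2) and y ~ Laplace(loc, b),
    computed as the convolution integral E_y[ Phi((z - y - mu) / sigma) ]."""
    integrand = lambda t: (stats.norm.cdf((z - t - mu) / sigma)
                           * np.exp(-abs(t - loc) / b) / (2 * b))
    val, _ = integrate.quad(integrand, loc - 40 * b, loc + 40 * b)
    return val

def relaxed_threshold(theta, zeta, sigma, b, loc=0.0):
    """Solve F_k(theta') = zeta for a candidate whose true support has mean theta."""
    f = lambda tp: cdf_noisy_support(tp, theta, sigma, loc, b) - zeta
    span = 20 * (sigma + b)
    return optimize.brentq(f, theta - span, theta + span)

# e.g. theta = 100 (a support count), sigma = 5, Laplace scale b = 4, zeta = 0.3
print(relaxed_threshold(100.0, 0.3, 5.0, 4.0))  # a value somewhat below 100
```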
In a sample database, for sequences whose length exceeds the maximum constraint length l_max, the sequence shrinking method is applied to limit their length, thereby reducing the sensitivity, which in turn reduces the Laplace noise that must be added and ultimately improves the utility of the supports. When the sample databases are used to pre-judge whether a candidate sequence is frequent, some frequent sequences may be misjudged as infrequent; to reduce this error, the threshold is lowered to a certain degree, and this is called the threshold relaxation method. That is, the sequence shrinking method and the threshold relaxation method act at different stages: the sequence shrinking method reduces the noise added when supports are counted in the sample databases, while the threshold relaxation method reduces the number of frequent sequences misjudged as infrequent.
Formal analysis shows that the frequent sequence mining algorithm of the present invention (PFS²), which performs sampling-based candidate set pruning while satisfying differential privacy, can provide high utility of the mining results while satisfying differential privacy protection.
By comparison with the algorithm proposed in document [3] (Prefix) and the algorithm proposed in document [4] (n-gram), it can be determined that the proposed PFS² algorithm has an obvious advantage in the utility of the mining results and in the level of privacy protection. To better illustrate the advantage of the algorithm of the present invention, the widely used metrics F-score [6] and relative error (RE) [5] are adopted to compare the PFS² algorithm with the Prefix algorithm and the n-gram algorithm. The F-score measures the utility of the generated frequent sequences, and the RE measures the error of their supports relative to the exact supports.
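For reference, the following sketch shows one common way to compute these two metrics over a mined set versus the exact set of frequent sequences; the aggregation used for RE (here the median relative error of the supports) is an assumption, since explicit formulas are not given here:

```python
def f_score(mined: set, exact: set) -> float:
    """Harmonic mean of precision and recall of the mined frequent sequences."""
    tp = len(mined & exact)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(mined), tp / len(exact)
    return 2 * precision * recall / (precision + recall)

def relative_error(noisy_sup: dict, exact_sup: dict) -> float:
    """Median relative error of the noisy supports w.r.t. the exact supports
    (the aggregation is an assumption; the cited papers may differ)."""
    errs = sorted(abs(noisy_sup[s] - exact_sup[s]) / exact_sup[s]
                  for s in exact_sup if s in noisy_sup)
    return errs[len(errs) // 2] if errs else 0.0
```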
When using Prefix and n-gram to find frequent sequences under privacy, a general approach is adopted: they are first run on the original database to generate an anonymized database, and then the non-private frequent sequence mining algorithm GSP is run on the anonymized database.
The concrete experimental settings are as follows.
Three real data sets are used. Because the data in the data set House_Power are time series, the values are discretized and a sequence is constructed from every 50 samples. The specific characteristics of the data in the three databases are shown in Table 1 below:
Table 1. Data characteristics of the databases

Database      Number of sequences   Number of items   Maximum length   Average length
MSNBC         989818                17                14795            4.75
BIBLE         36369                 19305             100              21.64
House_Power   40986                 21                50               50
All algorithms are implemented in the Java language. The test environment is an Intel Core 2 Duo E8400 CPU (3.0 GHz) with 4 GB of RAM.
The performance of the PFS² algorithm is described below by analyzing the experimental data.
First, the performance of the PFS² algorithm, the Prefix algorithm and the n-gram algorithm is compared under different thresholds. Because the number of items in BIBLE is too large for the Prefix algorithm to build a prefix tree, the performance of the Prefix algorithm on BIBLE is not shown. The experimental results are shown in Fig. 3(a)-(f).
As can be seen from Fig. 3, the performance of the PFS² algorithm is clearly better than that of the Prefix algorithm and the n-gram algorithm. The results can be interpreted as follows. To meet the maximum constraint length requirement, the Prefix and n-gram algorithms directly delete the items exceeding the limit from each input sequence, causing a large amount of frequency information loss. In contrast, the PFS² algorithm employs the sequence shrinking method, which effectively preserves the frequency information in each input sequence, so it can significantly improve the utility of the frequent sequences.
Then, the data sets MSNBC (θ = 0.015) and House_Power (θ = 0.34) are used to compare the performance of the PFS² algorithm, the Prefix algorithm and the n-gram algorithm under different privacy parameters. The experimental results are shown in Fig. 4(a)-(d).
As can be seen from Fig. 4, under the same level of privacy parameters, the performance of the PFS² algorithm is always better than that of the Prefix algorithm and the n-gram algorithm. At the same time, the three algorithms exhibit a consistent characteristic: as ε increases, the quality of the frequent sequences improves. Here ε refers to the privacy parameter in ε-differential privacy, which represents the strength of differential privacy protection: the larger ε is, the weaker the privacy protection, the less noise needs to be added, and therefore the higher the quality of the frequent sequences.
Finally, the databases BIBLE and House_Power are used to measure the impact of the sequence shrinking method and the threshold relaxation method on the performance of the PFS² algorithm. In the present embodiment, RR denotes deleting items from sequences at random and not relaxing the user-specified threshold in the sample databases, and SR denotes using the sequence shrinking method but not the threshold relaxation method.
As can be seen from Fig. 5(a)-(d), RR, which uses neither the sequence shrinking method nor the threshold relaxation method, cannot produce reasonable results. Meanwhile, the sequence shrinking method and the threshold relaxation method effectively improve the F-score performance of the PFS² algorithm; in terms of RE, although the performance decreases slightly after these two methods are employed, it still remains good. This is because, by using the sequence shrinking method and the threshold relaxation method, more true frequent sequences are retained in the sample databases, which slightly increases the amount of noise that must be added for each candidate sequence.
From the above extensive experiments, it can be determined that the proposed PFS² algorithm has an obvious advantage in the utility of the mining results and in the level of privacy protection.
The above embodiments are only intended to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; therefore, all equivalent technical solutions also fall within the scope of the present invention, and the patent protection scope of the present invention shall be defined by the claims.

Claims (8)

1. A frequent sequence mining method, characterized by comprising the steps of:
S1: calculating the maximum constraint length l_max of sequences from a raw database, and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i represents the maximum support of sequences of length i;
S2: according to the l_max and β = {β_1, ..., β_i, ..., β_n}, searching the raw database for frequent sequences based on a sampling-based candidate set pruning technique, under the condition of satisfying the differential privacy protection paradigm.
2. The frequent sequence mining method according to claim 1, characterized in that, in the step S1, calculating the maximum constraint length of sequences specifically comprises:
obtaining the maximum length l_1 of the sequences in the raw database;
calculating l_2 such that (Σ_{i=1}^{l_2} α_i) / (Σ_{j=1}^{n} α_j) is not less than a predetermined value, where α_i and α_j represent the numbers of sequences of length i and j, respectively, in the raw database;
calculating the maximum constraint length of the sequences as l_max = min{l_1, l_2}.
3. The frequent sequence mining method according to claim 2, characterized in that the predetermined value is 85%.
4. The frequent sequence mining method according to claim 2, characterized in that, when calculating l_2, noise is added to each α_i.
5. The frequent sequence mining method according to claim 2, characterized in that the step S1 further comprises adding noise to each β_i.
6. The frequent sequence mining method according to any one of claims 1 to 5, characterized in that the step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length L_f: let L_f be the integer y such that β_y is the smallest value in β that is greater than θ;
S2.2: randomly dividing the raw database into L_f mutually disjoint databases as sample databases, which form a set dbSet, each database containing |D|/L_f sequences, where |D| denotes the number of sequences in the database;
S2.3: generating candidate frequent sequences: when mining frequent 1-sequences, the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used for mining the frequent k-sequences;
S2.4: in the sample databases, for sequences whose length exceeds the maximum constraint length l_max, applying a sequence shrinking method to limit their length, and at the same time relaxing the user-specified threshold for judging whether a sequence in a sample database is frequent;
S2.5: computing the noisy supports of the candidate sequences in the raw database, and outputting the candidate sequences whose noisy support is greater than the threshold θ as frequent sequences.
7. The frequent sequence mining method according to claim 6, characterized in that the sequence shrinking method specifically comprises the following steps:
Step 1, outlier deletion: an item in a sequence that is not contained in any candidate sequence is called an outlier, and such items are deleted;
Step 2, continuous pattern compression: in a sequence, a pattern that may occur continuously is called a continuous pattern, and continuous pattern compression is carried out in the following manner:
let p^k denote that pattern p occurs k times continuously, let T_1|T_2|T_3 denote the sequence obtained by concatenating the sequences T_1, T_2 and T_3, and let Contain_k(T) denote the set of candidate frequent k-sequences contained in T; then for the sequences T_1|p^j|T_2 and T_1|p^k|T_2 (where j > k), Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2);
Step 3, sequence reconstruction: if the sequence length still cannot meet the maximum length constraint after the above Step 1 and Step 2, sequence reconstruction is carried out in the following manner:
a candidate sequence tree CS-tree is formed, in which the common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences, each node in the tree is labeled with an item, and each node is associated with the sequence formed by the items on the path from the root to that node; for a CS-tree built from a group of candidate frequent k-sequences, the nodes on the k-th level are associated with the candidate frequent k-sequences and are called c-nodes, the nodes on the (k-1)-th level are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences; the sequence reconstruction process is as follows:
a) using the candidate frequent k-sequences C_k to build a CS-tree, denoted CT; for a sequence S, finding the candidate frequent k-sequences C'_k contained in S, and generating a new CS-tree from these candidate frequent k-sequences, denoted subCT, which is a subtree of CT; at the same time building an empty sequence S';
b) choosing a candidate sequence from C'_k and appending it to S': for any candidate sequence cs in C'_k, the set of its (k-1)-subsequences may contain the generating sequences of some other candidate sequences in C'_k, each such generating sequence corresponds to a g-node, each g-node has a group of child nodes, and the sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score; the candidate sequence with the largest c-score is chosen, and if several candidate sequences have the same c-score, the one containing the most distinct items is chosen;
c) removing from subCT the c-node corresponding to the sequence chosen in step b);
d) using subCT to find the generating sequences contained in the set of (k-1)-subsequences of S', where these generating sequences correspond to a group of g-nodes; if some nodes in this group of g-nodes have child nodes, which are c-nodes, each of these child nodes corresponds to a labeled item and the labeled items of different child nodes may be identical, then the item corresponding to the most c-nodes is chosen and appended to S', and all of its corresponding c-nodes are removed from subCT; if none of the nodes in this group of g-nodes has child nodes, then a candidate sequence that is not already contained by S' is chosen from C'_k according to step b) and appended to S'; items are continuously appended to S' until it meets the maximum constraint length requirement.
8. The frequent sequence mining method according to claim 6, characterized in that the threshold relaxation method specifically comprises:
for a given sample database D_s and a group of candidate frequent k-sequences, assuming a candidate frequent k-sequence t_k whose true support equals the specified threshold θ; then computing the cumulative distribution function F_k of the noisy support of t_k in D_s; finally setting F_k(θ') = ζ, computing the corresponding θ', and taking the sequences whose support in the sample database is greater than θ' as potential frequent sequences.
CN201410802280.8A 2014-12-19 2014-12-19 Frequent sequence mining method Active CN104537025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410802280.8A CN104537025B (en) 2014-12-19 2014-12-19 Frequent sequence mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410802280.8A CN104537025B (en) 2014-12-19 2014-12-19 Frequent sequence mining method

Publications (2)

Publication Number Publication Date
CN104537025A true CN104537025A (en) 2015-04-22
CN104537025B CN104537025B (en) 2017-10-10

Family

ID=52852553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410802280.8A Active CN104537025B (en) 2014-12-19 2014-12-19 Frequent sequence mining method

Country Status (1)

Country Link
CN (1) CN104537025B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339609A (en) * 2016-09-19 2017-01-18 四川大学 Heuristic mining method of optimal comparing sequence mode of free interval constraint
CN106682514A (en) * 2016-12-15 2017-05-17 哈尔滨工程大学 System call sequence characteristic mode set generation method based on subgraph mining
CN107491557A (en) * 2017-09-06 2017-12-19 徐州医科大学 A kind of TopN collaborative filtering recommending methods based on difference privacy
CN107729762A (en) * 2017-08-31 2018-02-23 徐州医科大学 A kind of DNA based on difference secret protection model closes frequent motif discovery method
CN107844540A (en) * 2017-10-25 2018-03-27 电子科技大学 A kind of time series method for digging for electric power data
CN108280366A (en) * 2018-01-17 2018-07-13 上海理工大学 A kind of batch linear query method based on difference privacy
CN109409128A (en) * 2018-10-30 2019-03-01 南京邮电大学 A kind of Mining Frequent Itemsets towards difference secret protection
CN109861858A (en) * 2019-01-28 2019-06-07 北京大学 Wrong investigation method of the micro services system root because of node
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN110490000A (en) * 2019-08-23 2019-11-22 广西师范大学 The difference method for secret protection that Frequent tree mining excavates in more diagram datas
CN112884614A (en) * 2019-11-29 2021-06-01 北京金山云网络技术有限公司 Frequent sequence based route recommendation method and device and electronic equipment
US11055492B2 (en) 2018-06-02 2021-07-06 Apple Inc. Privatized apriori algorithm for sequential data discovery
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN115859132A (en) * 2023-02-27 2023-03-28 广州帝隆科技股份有限公司 Big data risk management and control method and system based on neural network model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240582A1 (en) * 2004-04-27 2005-10-27 Nokia Corporation Processing data in a computerised system
CN101561854A (en) * 2009-05-22 2009-10-21 江苏大学 Private data guard method in sequential mode mining
CN101931570A (en) * 2010-02-08 2010-12-29 中国航天科技集团公司第七一○研究所 Method for reconstructing network attack path based on frequent pattern-growth algorithm
CN102254034A (en) * 2011-08-08 2011-11-23 浙江鸿程计算机系统有限公司 Online analytical processing (OLAP) query log mining and recommending method based on efficient mining of frequent closed sequences (BIDE)
CN103150311A (en) * 2011-12-07 2013-06-12 微软公司 Frequent object mining method based on data partitioning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240582A1 (en) * 2004-04-27 2005-10-27 Nokia Corporation Processing data in a computerised system
CN101561854A (en) * 2009-05-22 2009-10-21 江苏大学 Private data guard method in sequential mode mining
CN101931570A (en) * 2010-02-08 2010-12-29 中国航天科技集团公司第七一○研究所 Method for reconstructing network attack path based on frequent pattern-growth algorithm
CN102254034A (en) * 2011-08-08 2011-11-23 浙江鸿程计算机系统有限公司 Online analytical processing (OLAP) query log mining and recommending method based on efficient mining of frequent closed sequences (BIDE)
CN103150311A (en) * 2011-12-07 2013-06-12 微软公司 Frequent object mining method based on data partitioning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
丁丽萍 et al., "A survey of differential privacy protection for frequent pattern mining", Journal on Communications *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339609A (en) * 2016-09-19 2017-01-18 四川大学 Heuristic mining method of optimal comparing sequence mode of free interval constraint
CN106682514A (en) * 2016-12-15 2017-05-17 哈尔滨工程大学 System call sequence characteristic mode set generation method based on subgraph mining
CN106682514B (en) * 2016-12-15 2020-07-28 哈尔滨工程大学 System calling sequence feature pattern set generation method based on subgraph mining
CN107729762A (en) * 2017-08-31 2018-02-23 徐州医科大学 A kind of DNA based on difference secret protection model closes frequent motif discovery method
CN107491557A (en) * 2017-09-06 2017-12-19 徐州医科大学 A kind of TopN collaborative filtering recommending methods based on difference privacy
CN107844540A (en) * 2017-10-25 2018-03-27 电子科技大学 A kind of time series method for digging for electric power data
CN108280366A (en) * 2018-01-17 2018-07-13 上海理工大学 A kind of batch linear query method based on difference privacy
US11055492B2 (en) 2018-06-02 2021-07-06 Apple Inc. Privatized apriori algorithm for sequential data discovery
CN109409128A (en) * 2018-10-30 2019-03-01 南京邮电大学 A kind of Mining Frequent Itemsets towards difference secret protection
CN109409128B (en) * 2018-10-30 2022-05-17 南京邮电大学 Differential privacy protection-oriented frequent item set mining method
CN109861858B (en) * 2019-01-28 2020-06-26 北京大学 Error checking method for root cause node of micro-service system
CN109861858A (en) * 2019-01-28 2019-06-07 北京大学 Wrong investigation method of the micro services system root because of node
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN110471957B (en) * 2019-08-16 2021-10-26 安徽大学 Localized differential privacy protection frequent item set mining method based on frequent pattern tree
CN110490000A (en) * 2019-08-23 2019-11-22 广西师范大学 The difference method for secret protection that Frequent tree mining excavates in more diagram datas
CN112884614A (en) * 2019-11-29 2021-06-01 北京金山云网络技术有限公司 Frequent sequence based route recommendation method and device and electronic equipment
CN112884614B (en) * 2019-11-29 2024-05-14 北京金山云网络技术有限公司 Route recommendation method and device based on frequent sequences and electronic equipment
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN113282686B (en) * 2021-06-03 2023-11-07 光大科技有限公司 Association rule determining method and device for unbalanced sample
CN115859132A (en) * 2023-02-27 2023-03-28 广州帝隆科技股份有限公司 Big data risk management and control method and system based on neural network model

Also Published As

Publication number Publication date
CN104537025B (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN104537025A (en) Frequent sequence mining method
CN102289507B (en) Method for mining data flow weighted frequent mode based on sliding window
CN105808696B (en) It is a kind of based on global and local feature across line social network user matching process
CN103955542B (en) Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method
CN104182527B (en) Association rule mining method and its system between Sino-British text word based on partial order item collection
CN104216874A (en) Chinese interword weighing positive and negative mode excavation method and system based on relevant coefficients
Yu et al. Random walk with restart over dynamic graphs
CN104462184A (en) Large-scale data abnormity recognition method based on bidirectional sampling combination
CN105550189A (en) Ontology-based intelligent retrieval system for information security event
Lavor et al. Bayesian spatio‐temporal reconstruction reveals rapid diversification and Pleistocene range expansion in the widespread columnar cactus Pilosocereus
Xu et al. Differentially private frequent sequence mining via sampling-based candidate pruning
CN104317794B (en) Chinese Feature Words association mode method for digging and its system based on dynamic item weights
CN103020283B (en) A kind of semantic retrieving method of the dynamic restructuring based on background knowledge
CN103440308B (en) A kind of digital thesis search method based on form concept analysis
Tatti Probably the best itemsets
Cheng et al. Differentially private maximal frequent sequence mining
Liu et al. Randomized perturbation for privacy-preserving social network data publishing
Hong et al. Hiding sensitive itemsets by inserting dummy transactions
CN108664548B (en) Network access behavior characteristic group dynamic mining method and system under degradation condition
CN105740907A (en) Local community mining method
Yi-Yang et al. Data mining and analysis of our agriculture based on the decision tree
CN104182386A (en) Word pair relation similarity calculation method
CN103077181B (en) Method for automatically generating approximate functional dependency rule
Hamedanian et al. An efficient prefix tree for incremental frequent pattern mining
CN106776607A (en) Search engine operation behavior treating method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant