CN105740245A

CN105740245A - Frequent item set mining method

Info

Publication number: CN105740245A
Application number: CN201410746488.2A
Authority: CN
Inventors: 程祥; 苏森; 许胜之; 徐鹏; 双锴; 王玉龙; 张忠宝
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2014-12-08
Filing date: 2014-12-08
Publication date: 2016-07-06

Abstract

The invention relates to the technical field of data mining and data privacy, and discloses a frequent item set mining method. The method comprises the following steps: S1: splitting each transaction in an original database whose length is greater than a length limit into a plurality of subtransactions, so that the length of every transaction in the split database is less than or equal to the length limit; and S2: according to a support threshold specified in advance, mining the frequent item sets in the split database by using a support estimation method and a dynamic descent method. The method provides higher mining efficiency and better utility of the mining results while satisfying differential privacy.

Description

Frequent item set mining method

Technical field

The present invention relates to the technical field of data mining and data privacy, and in particular to a frequent item set mining method.

Background technology

Frequent item set mining is a fundamental problem in data mining and is widely applied in many fields. It can be described as follows: given a transaction database in which each transaction corresponds to the personal record of one user, a transaction is a set of items. For a given item set (a set of items), its support is the number of transactions that contain it. When the support of an item set is not less than a given threshold, the item set is called a frequent item set. Given a transaction database and a threshold, frequent item set mining finds all frequent item sets that occur in the database.
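The definitions of support and frequent item set can be illustrated with a minimal brute-force sketch. The function names and the toy database are hypothetical and serve only to make the definitions concrete; this is not the mining method of the invention.

```python
from itertools import combinations

def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in database if items <= set(transaction))

def frequent_itemsets(database, threshold):
    """Brute-force enumeration: {item set: support} for all frequent item sets."""
    all_items = sorted(set().union(*database))
    result = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = support(candidate, database)
            if count >= threshold:  # "not less than the threshold"
                result[frozenset(candidate)] = count
    return result

# Hypothetical toy database: each transaction is one user's record.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = frequent_itemsets(toy_db, threshold=3)
```

With a threshold of 3, every single item and every pair is frequent in this toy database, while {a, b, c} (support 2) is not.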

In frequent item set mining, the FP-growth algorithm [2] is widely used. FP-growth is a depth-first traversal algorithm that uses an FP-tree, a special kind of prefix tree, to speed up the whole mining process. With the FP-tree, FP-growth needs to scan the database only twice, which greatly improves mining efficiency.
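The two database scans can be sketched as follows. This is a simplified illustration of FP-tree construction only (the recursive mining step of FP-growth is omitted); the class and function names are hypothetical.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(database, threshold):
    # First scan: count the support of every single item.
    counts = Counter(item for transaction in database for item in set(transaction))
    frequent = {item: c for item, c in counts.items() if c >= threshold}
    root = FPNode(None)
    header = {item: [] for item in frequent}  # item -> tree nodes holding that item
    # Second scan: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken by item name).
    for transaction in database:
        ordered = sorted((i for i in set(transaction) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header
```

Transactions that share a prefix of frequent items share a path in the tree, which is what makes the structure compact.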

If the transactions (personal records) in the transaction database are sensitive, directly publishing the frequent item sets can leak users' personal records. How to protect user privacy during frequent item set mining has attracted increasing attention. Differential privacy [3] offers a feasible way to address privacy problems in data analysis. Unlike k-anonymity [4] and l-diversity [5], differential privacy protects user privacy with theoretical guarantees by adding noise.

Existing research has applied differential privacy to frequent pattern mining. The algorithms in [6] and [7] publish transaction databases under differential privacy to protect user privacy, and the published databases can then be used for frequent pattern mining. Specifically, [6] proposes a database publishing algorithm guided by a context-free tree structure; the algorithm partitions the database top-down and finally publishes a synthetic database for frequent item set mining. [7] provides a differentially private transaction database publishing algorithm for the scenario of incrementally changing data. [8] proposes PrivBasis, a differentially private algorithm for mining the top-k frequent item sets. [9] finds that limiting the length of transactions improves the trade-off between data utility and privacy protection; using truncation, it designs a differentially private frequent item set mining algorithm based on the Apriori algorithm. However, all of these algorithms fall short in the utility of the mining results and in mining efficiency, which hinders the application of differential privacy to frequent pattern mining.

The references are as follows:

[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in VLDB, 1994.

[2] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in SIGMOD, 2000.

[3] C. Dwork, “Differential privacy,” in ICALP, 2006.

[4] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 2002.

[5] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” in ICDE, 2006.

[6] R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong, “Publishing set-valued data via differential privacy,” in VLDB, 2011.

[7] X. Zhang, X. Meng, and R. Chen, “Differentially private set-valued data release against incremental updates,” in DASFAA, 2013.

[8] N. Li, W. Qardaji, D. Su, and J. Cao, “Privbasis: frequent itemset mining with differential privacy,” in VLDB, 2012.

[9] C. Zeng, J. F. Naughton, and J.-Y. Cai, “On differentially private frequent itemset mining,” in VLDB, 2012.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is how to improve the utility of the mining results.

(2) Technical solution

To solve the above technical problem, the present invention provides a frequent item set mining method, comprising:

S1: splitting each transaction in the original database whose length is greater than the length limit into multiple subtransactions, so that the length of every transaction in the split database is not greater than the length limit;

S2: according to a pre-specified support threshold, mining the frequent item sets in the split database by using the support estimation method and the dynamic descent method.

Wherein, step S1 specifically comprises:

S1.1: based on the original database, construct an undirected weighted graph in which each vertex represents an item in the database; when the item set formed by the items of two vertices occurs in the same transaction, the two vertices are connected by an edge, and the weight of the edge is the support, in the transaction database, of the item set formed by the items of the two vertices;

S1.2: use the Louvain algorithm to find communities in the undirected weighted graph, and use the intermediate results output during the iterations of the Louvain algorithm to construct a tree structure, the CR-tree; in the CR-tree, the nodes of each level represent the communities found in one iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities represented by its child nodes, and the correlation between two items is represented by the length of the shortest path between the leaf nodes containing those items;

S1.3: use the generated CR-tree to split the transactions in the original database whose length is greater than the length limit, producing the split database; the transaction, the transaction length, the maximum transaction length limit and the constructed CR-tree are denoted t, p, m and T respectively, and the splitting procedure is as follows:

S1.3.1: calculate the number q of subtransactions after splitting from p and m, namely q = ⌈p/m⌉;

S1.3.2: set the result set R of splitting transaction t to empty;

S1.3.3: construct the i-th subtransaction t_i through the following steps a) to e):

a) from the leaf nodes of the CR-tree, select the nodes that contain items of transaction t, and remove from these nodes the items that are not in t; the resulting nodes form the set N_l;

b) from N_l, select the node n_l that contains the most items, and add the items of n_l to t_i;

c) sort the remaining nodes of N_l from largest to smallest by their distance to n_l in T; nodes with the same correlation are sorted from largest to smallest by capacity;

d) traverse the remaining nodes of N_l in order; if the sum of a node's capacity and the capacity of n_l is not greater than m, remove the node from N_l and put its items into t_i;

e) store t_i in the result set R;

S1.3.4: repeat step S1.3.3 q times;

S1.3.5: if nodes still remain in N_l, randomly put each item of each remaining node into a subtransaction in R whose length is less than m;

S1.3.6: return the result set R.

Wherein, after splitting, each subtransaction t_i is assigned the weight 1/q.

Wherein, in step S2, the support estimation method specifically comprises:

S2.1: let X be an item set of length i in the result set R, let its support in the original database be ω, its support in the split database be ω', and its support in the split database after noise is added be \tilde{ω};

S2.2: estimate ω' from \tilde{ω}; by Bayes' rule:

\Pr(ω' \mid \tilde{ω}) = \frac{\Pr(\tilde{ω} \mid ω') \cdot \Pr(ω')}{\Pr(\tilde{ω})};

If ω' follows a uniform prior distribution, its conditional probability distribution satisfies:

\Pr(ω' \mid \tilde{ω}) \propto e^{-ϵ |ω' - \tilde{ω}|};

S2.3: estimate the support ω of X in the original database; let the maximum transaction length limit be m, and let t be a transaction of length p that contains X, with p > m; t is split into q = ⌈p/m⌉ subsets, each of length no greater than m; suppose further that q' = ⌊p/m⌋ subsets have length m and the remaining subset has length less than m, namely a = p − q'm; after t is split in this way, the probability that X is contained in one subtransaction is:

β_{p} = \begin{cases} \frac{q' C_{p-i}^{m-i}}{C_{p}^{m}} & \text{if } a < i \\ \frac{q' C_{p-i}^{m-i}}{C_{p}^{m}} + \frac{C_{p-i}^{a-i}}{C_{p}^{a}} & \text{if } a \geq i \end{cases};

Let α_k denote the number of transactions that contain X and have length k, and let n be the maximum transaction length; the expectation of ω' is then calculated.

Denote the last bracketed factor on the right-hand side of this expectation as ratio(i).

The average support of X in the original database is estimated as:

avg(ω') = \frac{ω'}{ratio(i)};

Using the ρ-lower quantile, the maximum support of X in the original database is estimated as:

\max(ω') = \begin{cases} \frac{ω' - \ln ρ + \sqrt{\ln^{2} ρ - 2 ω' \ln ρ}}{ratio(i)} & \text{if } \ln ρ \leq 2ω' \\ avg(ω') & \text{if } \ln ρ > 2ω' \end{cases};

From the noisy support \tilde{ω} of X in the transformed database, its maximum support and average support in the original database can be estimated respectively as:

\max\_supp(\tilde{ω}) = \int_{ω' = \tilde{ω} - 5}^{ω' = \tilde{ω} + 5} \Pr(ω' \mid \tilde{ω}) \max(ω') \, dω'

avg\_supp(\tilde{ω}) = \int_{ω' = \tilde{ω} - 5}^{ω' = \tilde{ω} + 5} \Pr(ω' \mid \tilde{ω}) \, avg(ω') \, dω';

If the average support of item set X is greater than the support threshold, X is a frequent item set; for a frequent item set, if the maximum support of X is greater than the given support threshold, X is used to generate the candidate set of frequent item sets.

Wherein, in step S2, the dynamic descent method specifically comprises:

S2.4: assign an initial value to the upper bound on the number of support queries for item sets of length i;

Assuming that the candidate set of frequent item sets contains s frequent items, define an array that stores, for each item set length, the count of item sets whose support is computed; the i-th entry is the count for item sets of length i and is given an initial value.

S2.5: during mining, dynamically reduce the count of computed item sets of length i, specifically:

Suppose the conditional FP-tree of item set β is currently being mined, and the item order in the header table of this conditional FP-tree is {i₁, ..., i_k, ..., i_n}. The k-th element i_k of the header table forms, together with β, a new item set Y = β ∪ i_k. Denote S₁ = {i₁, ..., i_{k-1}}, and let S₂ be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of S₂, the item set X = Y ∪ j is infrequent; by the downward closure property of frequent patterns, every item set formed by X and an arbitrary subset of {S₁ − j} is therefore also infrequent. The corresponding reduction is:

r_{p} = Σ_{u = 1}^{q} C_{| S_{1} | - u}^{p - | Y | - 1}

where q = min{|S₂|, |S₁| − (p − |Y| − 1)}, p is the length of the transaction, and q is the number of subtransactions after splitting;

S2.6: take the updated item set count as the sensitivity and, using the Laplace mechanism with the ratio of this sensitivity to the privacy budget as the scale of the Laplace distribution, add noise to the support.

(3) Beneficial effects

The present invention provides higher mining efficiency and better utility of the mining results while satisfying differential privacy.

Brief description of the drawings

Fig. 1 is a flow chart of the frequent item set mining method of the present invention;

Fig. 2 is a schematic diagram of a given transaction database in the present invention;

Fig. 3 is the undirected weighted graph constructed from Fig. 2;

Fig. 4 is the CR-tree constructed from Fig. 3;

Fig. 5 shows the F-score and RE measurement results of PFP, TT and PB on Pumsb;

Fig. 6 shows the F-score and RE measurement results of PFP, TT and PB on Accidents;

Fig. 7 shows the F-score and RE measurement results of PFP, TT and PB on POS;

Fig. 8 shows the F-score and RE measurement results of PFP, TT and PB on Retail;

Fig. 9 shows the running time of PFP, FP, TT and PB on the different data sets.

Detailed description of the invention

The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.

The flow of the frequent item set mining method of this embodiment is shown in Fig. 1 and comprises the following steps:

Step S1: preprocess the original database. Using the smart splitting method, split each transaction in the original database whose length exceeds the specified length limit into multiple subtransactions, so that the length of every transaction in the transformed database does not exceed the specified length limit.

Step S2: mine the frequent item sets while satisfying differential privacy. According to the threshold specified by the user, mine the frequent item sets in the transformed database. During mining, the support estimation method is used to reduce the information loss caused by transaction splitting; at the same time, the dynamic descent method is used to reduce the noise added during mining, which improves the utility of the results.

The smart splitting method used in preprocessing, the support estimation method and the dynamic descent method are described in detail below.

For step S1, to improve the utility of the mining results while keeping a high level of privacy protection, a new transformation is applied to the original database, namely transaction splitting. When the length of a record exceeds the length limit, the transaction is split into multiple subtransactions, each of which satisfies the maximum length constraint. However, simply splitting the transactions of the original database loses a large amount of information, so that some frequent item sets are no longer frequent, which harms the utility of the mining results. In addition, splitting the database lowers the privacy protection level of a differentially private mining algorithm. To solve these problems, the present invention proposes the smart splitting method. The specific steps are as follows:

Step 1: based on the original database, construct an undirected weighted graph in which each vertex represents an item in the database. When the item set formed by the items of two vertices occurs in the same transaction, the two vertices are connected by an edge. The weight of the edge is the support, in the transaction database, of the item set formed by the items of the two vertices. Given the transaction database of Fig. 2, the undirected weighted graph constructed by this process is shown in Fig. 3.
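The graph construction of step 1 can be sketched as follows. The dictionary-of-edges representation and the function name are illustrative choices, not taken from the patent.

```python
from collections import Counter
from itertools import combinations

def build_item_graph(database):
    """Return {(u, v): weight} with u < v, where weight is the support of {u, v}."""
    edges = Counter()
    for transaction in database:
        # Every pair of items that co-occurs in a transaction adds 1 to its edge.
        for u, v in combinations(sorted(set(transaction)), 2):
            edges[(u, v)] += 1
    return dict(edges)

# Hypothetical toy database.
graph = build_item_graph([{"a", "b", "c"}, {"a", "b"}, {"c", "d"}])
```

Pairs of items that never co-occur, such as a and d here, have no edge.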

Step 2: use the Louvain algorithm to find communities in the constructed undirected weighted graph. The Louvain algorithm is iterative, and the intermediate results output during its iterations are used to construct a tree structure, the CR-tree. In the CR-tree, the nodes of each level represent the communities found in one iteration, the height of the tree represents the number of iterations, and a parent node represents the new community formed by merging the communities represented by its child nodes. The correlation between two items is represented by the length of the shortest path between the leaf nodes containing those items. For example, given the undirected weighted graph of Fig. 3, the CR-tree constructed with the Louvain algorithm is shown in Fig. 4.

Step 3: use the generated CR-tree to split the transactions in the original database whose length is greater than the length limit, producing the split database. The transaction, the transaction length, the maximum transaction length limit and the constructed CR-tree are denoted t, p, m and T respectively. The splitting procedure is as follows:
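The correlation measure, the shortest path length between two leaves of the CR-tree, can be sketched as below. The CR-tree is assumed here to be stored as a child-to-parent map; that representation and the node names are hypothetical, and the Louvain iterations that would produce the tree are not shown.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def leaf_distance(leaf_a, leaf_b, parent):
    """Number of edges on the path between two nodes of the tree."""
    depth_from_a = {n: d for d, n in enumerate(path_to_root(leaf_a, parent))}
    for d, n in enumerate(path_to_root(leaf_b, parent)):
        if n in depth_from_a:  # first common ancestor
            return d + depth_from_a[n]
    raise ValueError("nodes are not in the same tree")

# Hypothetical CR-tree: leaves L1 and L2 were merged into community C1,
# leaf L3 became C2, and C1 and C2 were merged into the root.
cr_parent = {"L1": "C1", "L2": "C1", "L3": "C2", "C1": "root", "C2": "root"}
```

A shorter path means the communities were merged earlier, so their items are considered more strongly correlated.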

1) calculate the number q of subtransactions after splitting from p and m, namely q = ⌈p/m⌉;

2) set the result set R of splitting transaction t to empty;

3) construct the i-th subtransaction t_i:

a) from the leaf nodes of the CR-tree, select the nodes that contain elements (that is, items) of transaction t, and remove from these nodes the items that are not in t. The resulting nodes form the set N_l;

b) from N_l, select the node n_l that contains the most items, and add the items of n_l to t_i;

c) sort the remaining nodes of N_l from largest to smallest by their distance to n_l in T. Nodes with the same correlation are sorted from largest to smallest by capacity;

d) traverse the remaining nodes of N_l in order; if the sum of a node's capacity and the capacity of n_l is not greater than m, remove the node from N_l and put its items into t_i;

e) store t_i in the result set R;

4) repeat step 3) q times;

5) if nodes still remain in N_l, randomly put each item of each remaining node into a subtransaction in R whose length is less than m;

6) finally, return the result set R.

Through steps 1 to 3 above, the original database is transformed into a new database in which the length of every transaction satisfies the maximum length constraint.
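One possible reading of the splitting procedure is sketched below. Several points are assumptions of this sketch rather than statements of the patent: CR-tree leaves are passed in as a list of item sets that together cover all items of t, `distance` is a caller-supplied function on leaf indices, candidate leaves are tried closest first, the capacity check uses the current size of t_i, and a leaf holding more than m items of t is cut at m with the rest left for later rounds.

```python
import math
import random

def split_transaction(t, m, leaves, distance, rng=None):
    """Split transaction t (a set of sortable items) into subtransactions of length <= m."""
    rng = rng or random.Random(0)
    t = set(t)
    if len(t) <= m:
        return [t]
    q = math.ceil(len(t) / m)
    # Step a): restrict every leaf to the items of t and drop empty leaves.
    remaining = {idx: set(leaf) & t for idx, leaf in enumerate(leaves) if set(leaf) & t}
    result = []
    for _ in range(q):
        # Step b): start from the leaf holding the most items of t.
        seed = max(remaining, key=lambda idx: (len(remaining[idx]), -idx))
        seed_items = sorted(remaining.pop(seed))
        sub = set(seed_items[:m])
        if seed_items[m:]:
            remaining[seed] = set(seed_items[m:])
        # Step c): order the other leaves (closest first, larger first on ties).
        order = sorted((idx for idx in remaining if idx != seed),
                       key=lambda idx: (distance(seed, idx), -len(remaining[idx])))
        # Step d): add whole leaves while the length limit allows it.
        for idx in order:
            if len(sub) + len(remaining[idx]) <= m:
                sub |= remaining.pop(idx)
        result.append(sub)  # step e)
    # Step 5): place leftover items at random into subtransactions with room.
    for item in [i for items in remaining.values() for i in items]:
        rng.choice([s for s in result if len(s) < m]).add(item)
    return result
```

Because p ≤ q·m, a subtransaction with free room always exists while leftover items remain, so the final loop cannot fail.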

Analysis shows that if one transaction is split into k subtransactions, a frequent item set mining algorithm that satisfies ε-differential privacy on the transformed transaction database guarantees only kε-differential privacy with respect to the original database. To address this, the present invention proposes the weighted splitting operation, defined as follows: when the maximum transaction length is m and the length of a transaction t is greater than m, the splitting operation f (that is, steps 1 to 3 above) splits t into t₁, ..., t_k such that |t_i| ≤ m, and assigns a weight w_i to each t_i. If f meets the two conditions placed on the subtransactions t_i and the weights w_i, f is called a weighted splitting operation. It has been proved theoretically that when a weighted splitting operation is used to transform the original transaction database, a frequent item set mining algorithm that satisfies ε-differential privacy on the transformed database also guarantees ε-differential privacy with respect to the original database.
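As a small illustration of weighting, the sketch below assigns each of the q subtransactions of one transaction the weight 1/q and counts support as a sum of weights. The function names are hypothetical, and the sketch shows only the uniform 1/q weighting.

```python
def assign_weights(subtransactions):
    """Give each of the q subtransactions of one original transaction the weight 1/q."""
    q = len(subtransactions)
    return [(set(sub), 1.0 / q) for sub in subtransactions]

def weighted_support(itemset, weighted_db):
    """Support in the split database: sum of the weights of subtransactions containing the item set."""
    items = set(itemset)
    return sum(weight for sub, weight in weighted_db if items <= sub)

# Hypothetical example: one transaction {a, b, c, d} split into two halves.
weighted_db = assign_weights([{"a", "b"}, {"c", "d"}])
```

Here the item a has weighted support 0.5 instead of 1, and the pair {a, c} drops to 0 because the split separated its items; this is the information loss that the support estimation method is meant to compensate.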

Further, because of the transaction splitting in step S1, when a subtransaction t₁ is added to the database its support increases by 1/q rather than by 1. As a result, the support of t₁ computed in the transformed database is lower than the true value, and some subsets of t are lost, which causes a certain loss of information. To compensate for this loss, step S2 uses a support estimation method. The method has two main steps: first, from the noisy support of an item set obtained during mining (noise is usually added with the Laplace mechanism), estimate its exact support in the split database; then, from that estimated support in the split database, estimate its support in the original database. The specific calculation is as follows:

Let X be an item set of length i, let its support in the original database be ω, its support in the split database be ω', and its support in the split database after noise is added be \tilde{ω}.

First, estimate ω' from \tilde{ω}. By Bayes' rule:

\Pr(ω' \mid \tilde{ω}) = \frac{\Pr(\tilde{ω} \mid ω') \cdot \Pr(ω')}{\Pr(\tilde{ω})}

Assuming that ω' follows a uniform prior distribution, its conditional probability distribution satisfies:

\Pr(ω' \mid \tilde{ω}) \propto e^{-ϵ |ω' - \tilde{ω}|}

Then, estimate the support ω of X in the original database.

Let the maximum transaction length limit be m, and let t be a transaction of length p that contains X, with p > m. To improve the utility of the mining results while ensuring a high level of privacy protection, t is split into q = ⌈p/m⌉ subsets, each of length no greater than m. Suppose further that q' = ⌊p/m⌋ subsets have length m and the remaining subset may be shorter than m, with length a = p − q'm. After t is split in this way, the probability that X is contained in one subtransaction is:

β_{p} = \begin{cases} \frac{q' C_{p-i}^{m-i}}{C_{p}^{m}} & \text{if } a < i \\ \frac{q' C_{p-i}^{m-i}}{C_{p}^{m}} + \frac{C_{p-i}^{a-i}}{C_{p}^{a}} & \text{if } a \geq i \end{cases};
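The probability β_p can be evaluated directly with binomial coefficients. The sketch below assumes that the symbol in the case conditions is the remainder length a = p − q'm and that C_{n}^{k} denotes "n choose k"; the function name is hypothetical.

```python
from math import comb

def beta(p, m, i):
    """Probability that an item set of length i stays inside one subtransaction
    when a transaction of length p > m is split into parts of length at most m."""
    q_full = p // m           # q': number of subsets of length exactly m
    a = p - q_full * m        # length of the remaining, shorter subset
    prob = q_full * comb(p - i, m - i) / comb(p, m) if i <= m else 0.0
    if a >= i:
        prob += comb(p - i, a - i) / comb(p, a)
    return prob
```

For example, with p = 4 and m = 2, a single item (i = 1) always survives, while a pair (i = 2) stays together with probability 1/3, which matches counting the three ways of pairing up four items.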

Let α_k denote the number of transactions that contain X and have length k, and let n be the maximum transaction length; the expectation of ω' can then be calculated.

For convenience, the last bracketed factor on the right-hand side of that expectation is abbreviated as ratio(i).

The average support of X in the original database can then be estimated as:

avg(ω') = \frac{ω'}{ratio(i)}

Using the ρ-lower quantile, the maximum support of X in the original database can be estimated as:

\max(ω') = \begin{cases} \frac{ω' - \ln ρ + \sqrt{\ln^{2} ρ - 2 ω' \ln ρ}}{ratio(i)} & \text{if } \ln ρ \leq 2ω' \\ avg(ω') & \text{if } \ln ρ > 2ω' \end{cases}
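The two estimators can be written down directly from the formulas. In this sketch the value of ratio(i) is simply passed in as a number, since its defining expression is not reproduced here; the function names and the sample values are hypothetical.

```python
import math

def avg_support(omega_split, ratio):
    """avg(ω'): estimated average support in the original database."""
    return omega_split / ratio

def max_support(omega_split, ratio, rho):
    """max(ω'): estimated maximum support in the original database."""
    log_rho = math.log(rho)
    if log_rho <= 2 * omega_split:
        root = math.sqrt(log_rho ** 2 - 2 * omega_split * log_rho)
        return (omega_split - log_rho + root) / ratio
    return avg_support(omega_split, ratio)
```

For 0 < ρ < 1 the logarithm is negative, so the first branch adds a positive margin on top of ω' and the maximum estimate is never below the average estimate.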

Based on the above analysis, from the noisy support \tilde{ω} of X in the transformed database, its maximum support and average support in the original database can be estimated respectively as:

\max\_supp(\tilde{ω}) = \int_{ω' = \tilde{ω} - 5}^{ω' = \tilde{ω} + 5} \Pr(ω' \mid \tilde{ω}) \max(ω') \, dω'

avg\_supp(\tilde{ω}) = \int_{ω' = \tilde{ω} - 5}^{ω' = \tilde{ω} + 5} \Pr(ω' \mid \tilde{ω}) \, avg(ω') \, dω'
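Both integrals have the same shape, so a single numerical routine can approximate them. The sketch below uses the midpoint rule and normalises the posterior weights exp(−ε|ω' − ω̃|) over the integration window; the normalisation, the step count and the function name are assumptions of this sketch.

```python
import math

def posterior_expectation(noisy, epsilon, g, half_width=5.0, steps=1000):
    """Approximate the integral of Pr(w | noisy) * g(w) over [noisy - 5, noisy + 5]."""
    h = 2 * half_width / steps
    points = [noisy - half_width + (k + 0.5) * h for k in range(steps)]
    weights = [math.exp(-epsilon * abs(x - noisy)) for x in points]
    total = sum(weights)
    return sum(w * g(x) for w, x in zip(weights, points)) / total

# Hypothetical usage with ratio(i) = 0.5, so avg(w) = w / 0.5.
estimate = posterior_expectation(30.0, 1.0, lambda w: w / 0.5)
```

Because the posterior is symmetric around the noisy support, any linear function such as avg(ω') is estimated at its value at ω̃; the maximum-support estimate differs because max(ω') is not linear.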

Further, after the support of an item set in the split database has been computed, an appropriate amount of noise is added as differential privacy requires. The noise magnitude is proportional to the sensitivity of the computation. For item sets of length i, the sensitivity of the support computation equals the number of item sets of length i whose support is computed during mining. Because FP-growth is a depth-first traversal algorithm, it is difficult to count exactly the number of item sets of length i during mining. Modifying FP-growth so that item sets of the same length are generated together would incur a large storage overhead. To solve this, the present invention proposes a lightweight dynamic descent method. Its core idea is to use the downward closure property of frequent patterns to dynamically reduce the count of computed item sets of length i, thereby reducing the sensitivity of computing item sets of length i. The process is as follows:

Step 1: assign an initial value to the upper bound on the number of support queries for item sets of length i;

Assuming the database contains s frequent items, define an array that stores, for each item set length, the count of item sets whose support is computed. The i-th entry is the count for item sets of length i and is given an initial value.

Step 2: during mining, dynamically reduce the count of computed item sets of length i;

Suppose the conditional FP-tree of item set β is currently being mined (the conditional FP-tree of β is the FP-tree built from the small database formed by all transactions that contain β). The item order in the header table of this conditional FP-tree is {i₁, ..., i_k, ..., i_n}. The k-th element i_k of the header table forms, together with β, a new item set Y = β ∪ i_k. Denote S₁ = {i₁, ..., i_{k-1}}. In addition, let S₂ be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of S₂, the item set X = Y ∪ j is infrequent. By the downward closure property of frequent patterns, every item set formed by X and an arbitrary subset of {S₁ − j} is therefore also infrequent. The corresponding reduction is:

r_{p} = Σ_{u = 1}^{q} C_{| S_{1} | - u}^{p - | Y | - 1}

where q = min{|S₂|, |S₁| − (p − |Y| − 1)}, p is the length of the transaction, and q is the number of subtransactions after splitting.

Step 3: take the updated item set count as the sensitivity and, using the Laplace mechanism with the ratio of this sensitivity to the privacy budget as the scale of the Laplace distribution, add noise to the support.
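The reduction r_p is a short sum of binomial coefficients. The sketch below only evaluates the formula for given values of p, |Y|, |S₁| and |S₂|, reading C_{n}^{k} as "n choose k" and taking q from the min expression; the function and parameter names are hypothetical.

```python
from math import comb

def reduction(p, size_y, size_s1, size_s2):
    """Evaluate r_p = sum_{u=1}^{q} C(|S1| - u, p - |Y| - 1)."""
    k = p - size_y - 1                 # lower index of the binomial coefficient
    if k < 0:
        return 0
    q = min(size_s2, size_s1 - k)      # upper limit of the sum
    return sum(comb(size_s1 - u, k) for u in range(1, q + 1))
```

For instance, with |S₁| = 4, |S₂| = 2, |Y| = 1 and p = 3, the sum is C(3, 1) + C(2, 1) = 5.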
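The noise step can be sketched with the standard library alone. Laplace noise is drawn here as the difference of two independent exponential variables, a standard identity; the function names are hypothetical, and ε stands for the privacy budget.

```python
import random

def laplace_noise(scale, rng):
    """One sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_support(true_support, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale = sensitivity / epsilon to a support value."""
    rng = rng or random.Random()
    return true_support + laplace_noise(sensitivity / epsilon, rng)
```

Lowering the sensitivity through the dynamic descent method shrinks the scale, so the same privacy budget yields less noisy supports.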

As the above analysis shows, the dynamic descent method involves only simple additions and multiplications and introduces no complex computational overhead. By continually lowering the upper bound on the number of item sets computed, it reduces the amount of noise added during mining and thus improves the utility of the mining results while satisfying differential privacy. In other words, the dynamic descent method continually adjusts the number of candidate frequent item sets so that only a small amount of noise needs to be added after the support of an item set in the split database is computed, which improves data utility.

Formal analysis shows that the differentially private frequent item set mining method based on transaction splitting (PFP) of the present invention provides higher mining efficiency and better utility of the mining results while satisfying differential privacy.

Comparison with the algorithm proposed in [9] (TT) and the algorithm proposed in [8] (PB) shows that the proposed PFP algorithm has clear advantages in the utility of the mining results and in running efficiency. To better illustrate these advantages, PFP is compared with TT and PB in two respects: utility of the mining results and running time. For utility, the correctness of the generated frequent item sets is first measured with the F-score, computed as follows:

F\text{-}score = 2 \times \frac{precision \times recall}{precision + recall}

where precision = |U_p ∩ U_C| / |U_p|, recall = |U_p ∩ U_C| / |U_C|, U_p is the set of frequent item sets generated by the private algorithm, and U_C is the set of true frequent item sets.
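The F-score can be computed from the two collections of item sets as follows. The function name and the sample sets are hypothetical.

```python
def f_score(private_sets, true_sets):
    """F-score of the item sets reported by a private algorithm against the true ones."""
    private_sets, true_sets = set(private_sets), set(true_sets)
    hits = len(private_sets & true_sets)
    if hits == 0:
        return 0.0
    precision = hits / len(private_sets)
    recall = hits / len(true_sets)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: three item sets reported, all correct, one true set missed.
reported = {frozenset("a"), frozenset("b"), frozenset("ab")}
truth = {frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")}
score = f_score(reported, truth)
```

In this example precision is 1 and recall is 0.75, giving an F-score of 6/7.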

In addition, to measure the error of the published supports of the frequent item sets relative to their true supports, the relative error (RE) is used, computed as follows:

RE = \mathrm{median}_{X} \frac{|\sup'(x) - \sup(x)|}{\sup(x)}

where X denotes all generated frequent item sets, sup(x) is the true support of item set x, and sup'(x) is the noisy support of item set x.
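The RE metric is the median of the per-item-set relative errors. A minimal sketch, with hypothetical names and sample values:

```python
from statistics import median

def relative_error(noisy_supports, true_supports):
    """Median over generated item sets of |sup'(x) - sup(x)| / sup(x)."""
    return median(abs(noisy_supports[x] - true_supports[x]) / true_supports[x]
                  for x in noisy_supports)

# Hypothetical supports for three generated item sets.
re_value = relative_error({"a": 110, "b": 45, "c": 20}, {"a": 100, "b": 50, "c": 25})
```

The per-item-set errors here are 0.1, 0.1 and 0.2, so the median is 0.1; using the median keeps a few badly perturbed item sets from dominating the metric.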

The experimental setup is as follows. First, four real data sets are used: Accidents contains traffic accident data; Pumsb-star is census data from PUMS (Public Use Microdata Sample); POS contains data from the points of sale of an electronics retailer; Retail contains market basket data from an anonymous Belgian retail store. The first two are dense data sets and the last two are sparse data sets. Second, all algorithms are implemented in Java. Finally, the test environment is an Intel Core 2 Duo E8400 CPU (3.0 GHz) with 4 GB of RAM.

The performance of the PFP algorithm is described below by analyzing the experimental data.

Utility of the mining results:

On the four data sets, the F-score and RE of PFP, TT and PB are measured for different thresholds. Because PB mines the top-k frequent item sets, it cannot be compared with PFP directly; the following scenario is therefore considered: for a given threshold, the user chooses k. The experimental results are shown in Figs. 5 to 9.

Figs. 5, 6, 7 and 8 show that on all four data sets PFP achieves better results than TT. The explanation is as follows. Compared with directly truncating transactions, splitting each transaction into multiple subtransactions that are all retained, and distributing the weight evenly among them, significantly reduces information loss. Although the precision of PFP drops slightly, the number of frequent item sets found increases markedly, because the reduced information loss allows more frequent item sets to be generated. Under the scenario described above, PFP still obtains better F-score values on the Pumsb, POS and Retail data sets.

Running time:

On the four data sets, the running time of PFP, FP-growth (FP), TT and PB is measured for querying the top k frequent item sets, with k in the range [10, 200]. For PFP, preprocessing is executed only once and is independent of the threshold chosen by the user, so the running time does not include preprocessing time.

Fig. 9 shows that the performance of PFP is comparable to that of FP-growth, and that PFP is more time-efficient than TT and PB. The results are explained as follows: compared with FP-growth, the support estimation method and the dynamic descent method add little overhead to the mining phase of PFP; compared with TT, the FP-growth algorithm used in PFP outperforms the Apriori algorithm, so PFP is more efficient than TT.

Formal analysis and extensive experiments show that the differentially private frequent item set mining algorithm based on transaction splitting achieves better results in privacy, mining utility and running efficiency.

The above embodiments merely illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore fall within the scope of the invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims

1. A frequent item set mining method, characterized by comprising:

2. The frequent item set mining method according to claim 1, characterized in that step S1 specifically comprises:

S1.3.2: set the result set R of splitting transaction t to empty;

S1.3.3: construct the i-th subtransaction t_i through the following steps a) to e):

e) store t_i in the result set R;

S1.3.4: repeat step S1.3.3 q times;

S1.3.6: return the result set R.

3. The frequent item set mining method according to claim 2, characterized in that, after splitting, each subtransaction t_i is assigned the weight 1/q.

4. The frequent item set mining method according to claim 3, characterized in that, in step S2, the support estimation method specifically comprises:

S2.2: estimate ω' from \tilde{ω}; by Bayes' rule:

\Pr(ω' \mid \tilde{ω}) = \frac{\Pr(\tilde{ω} \mid ω') \cdot \Pr(ω')}{\Pr(\tilde{ω})};

\Pr(ω' \mid \tilde{ω}) \propto e^{-ϵ |ω' - \tilde{ω}|};

S2.3: estimate the support ω of X in the original database; let the maximum transaction length limit be m, and let t be a transaction of length p that contains X, with p > m; t is split into q = ⌈p/m⌉ subsets, each of length no greater than m; suppose further that q' = ⌊p/m⌋ subsets have length m and the remaining subset has length less than m, namely a = p − q'm; after t is split in this way, the probability that X is contained in one subtransaction is:

β_{p} = \begin{cases} \frac{q' C_{p-i}^{m-i}}{C_{p}^{m}} & \text{if } a < i \\ \frac{q' C_{p-i}^{m-i}}{C_{p}^{m}} + \frac{C_{p-i}^{a-i}}{C_{p}^{a}} & \text{if } a \geq i \end{cases};

Let:

The average support of X in the original database is estimated as:

avg(ω') = \frac{ω'}{ratio(i)};

Using the ρ-lower quantile, the maximum support of X in the original database is estimated as:

\max(ω') = \begin{cases} \frac{ω' - \ln ρ + \sqrt{\ln^{2} ρ - 2 ω' \ln ρ}}{ratio(i)} & \text{if } \ln ρ \leq 2ω' \\ avg(ω') & \text{if } \ln ρ > 2ω' \end{cases};

\max\_supp(\tilde{ω}) = \int_{ω' = \tilde{ω} - 5}^{ω' = \tilde{ω} + 5} \Pr(ω' \mid \tilde{ω}) \max(ω') \, dω'

avg\_supp(\tilde{ω}) = \int_{ω' = \tilde{ω} - 5}^{ω' = \tilde{ω} + 5} \Pr(ω' \mid \tilde{ω}) \, avg(ω') \, dω';

5. The frequent item set mining method according to claim 4, characterized in that, in step S2, the dynamic descent method specifically comprises:

r_{p} = Σ_{u = 1}^{q} C_{| S_{1} | - u}^{p - | Y | - 1}