CN110096900A

CN110096900A - An efficient frequent pattern mining method with differential privacy protection

Info

Publication number: CN110096900A
Application number: CN201910363516.5A
Authority: CN
Inventors: 张亚玲; 罗沛
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2019-08-06

Abstract

The invention discloses an efficient frequent pattern mining method with differential privacy protection. First, a transaction dataset D, a privacy budget ε, a number of frequent patterns k, and a minimum support threshold min_sup are input, and the privacy budget ε is split into ε₁, ε₂, and ε₃ such that ε = ε₁ + ε₂ + ε₃. The sets Cset, Sset, and Pset are initialized as empty, where Cset is the set of candidate frequent itemsets, Sset is the set formed by the top-k frequent patterns selected by the exponential mechanism, and Pset is the set of frequent patterns after noise is added. The transaction dataset D is then truncated so that only the first l_opt items of each record are retained. In addition, after all candidate frequent patterns have been mined by the FP-Growth algorithm, the concept of closed frequent patterns is introduced to reduce the size of the candidate set. The invention addresses the problem in the prior art of serious information leakage during frequent pattern mining in big data applications.

Description

An efficient frequent pattern mining method with differential privacy protection

Technical field

The invention belongs to the field of information security technology, and in particular relates to an efficient frequent pattern mining method with differential privacy protection.

Background technique

With the rapid development of electronic information technology and the arrival of the big data era, the demand for storing, collecting, analyzing, and publishing data keeps growing. Extracting and analyzing information from such data lets people learn more about the real world, and this demand has greatly promoted the sharing, analysis, and publication of data. Although analyzing such data benefits people's lives and brings value to the data holders, it also exposes a serious problem: user privacy is under great threat.

Frequent pattern mining (FPM) is a technique for mining patterns in data. It lays the foundation for clustering, classification, and association rules, and is widely used in application systems such as personalized websites and recommender systems. However, frequent patterns and their frequencies contain privacy-sensitive user data.

Privacy protection methods include methods based on k-anonymity and methods based on differential privacy, among others. Differential privacy is an effective way to hide users' private information while publishing frequent patterns and their frequencies. Differential privacy (DP) [7] is a privacy protection model proposed by Dwork [8] in 2006 to address privacy leakage in statistical databases. Under this definition, the result of a computation over a dataset is insensitive to a change in any single record: whether or not a given record is in the dataset has a negligible effect on the result. The privacy risk that a record incurs by being added to the dataset is therefore confined to a very small, acceptable range, so an attacker cannot obtain accurate information about an individual record by analyzing the computed results.
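For reference, the informal description above corresponds to the standard formal statement of ε-differential privacy. The symbols used here (the randomized mechanism M, the neighboring datasets D and D' that differ in a single record, and the output set S) are notation introduced for this illustration and do not appear in the patent text.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
\quad \text{for all neighboring } D, D' \text{ and all } S \subseteq \mathrm{Range}(\mathcal{M})
```

A smaller ε forces the two output distributions closer together, which is why the privacy budget ε in the method below is treated as a resource to be divided among the steps.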

Differential privacy protection can therefore effectively hide users' private information while frequent patterns and their frequencies are published, so that users' sensitive information is protected from disclosure. Frequent pattern mining algorithms with differential privacy protection thus have important practical significance.

Summary of the invention

The object of the present invention is to provide an efficient frequent pattern mining method with differential privacy protection, which addresses the problem in the prior art of serious information leakage during frequent pattern mining in big data applications.

The technical solution adopted by the invention is an efficient frequent pattern mining method with differential privacy protection, implemented according to the following steps:

Step 1: input the transaction dataset D, the privacy budget ε, the number of frequent patterns k, and the minimum support threshold min_sup. Split the privacy budget ε into ε₁, ε₂, and ε₃, where ε₁ is the privacy budget allocated to selecting the top-k frequent itemsets with the exponential mechanism, ε₂ is the privacy budget allocated to adding Laplace noise to the k selected frequent itemsets, and ε₃ is the privacy budget for the Laplace noise added during length selection, such that ε = ε₁ + ε₂ + ε₃. Initialize the sets Cset, Sset, and Pset as empty, where Cset is the set of candidate frequent itemsets, Sset is the set formed by the top-k frequent patterns selected by the exponential mechanism, and Pset is the set of frequent patterns after noise is added;

Step 2: solve for the optimal transaction length l_opt, and truncate the transaction dataset D so that only the first l_opt items of each record are retained;

Step 3: use the FP-Growth method to mine the frequent pattern set Cset of all patterns whose support count is not less than min_sup;

Step 4: compress the size of the mined frequent pattern set Cset using closed frequent patterns;

Step 5: use the exponential mechanism to select the top-k frequent patterns, which together with their true support counts form the set Sset; each pattern p in the set satisfies:

Pr(p) ∝ exp(ε₁ × Rank(D, p) / (2k)),

where Rank(D, p) is the score of pattern p, Rank(D, p) = Σ_{t_i ∈ D} c(t_i, p), with c(t_i, p) = 1 when p ∈ t_i and c(t_i, p) = 0 when p ∉ t_i;

Step 6: add Lap(k/ε₂) noise to the support counts of the k selected patterns to form Pset;

Step 7: apply consistency-constraint processing to the noisy pattern support counts in Pset to improve the utility of the patterns;

Step 8: output the top-k frequent patterns and the set RC of noisy counts.
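Steps 5 and 6 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes that Rank(D, p) is the support count of p, that the k patterns are drawn one at a time without replacement, and that patterns are represented as frozensets. The function names `support_count`, `select_top_k`, and `add_laplace_noise` are hypothetical.

```python
import random
import math

def support_count(dataset, pattern):
    """Number of transactions that contain every item of `pattern`."""
    p = frozenset(pattern)
    return sum(1 for t in dataset if p <= set(t))

def select_top_k(dataset, candidates, k, eps1, rng):
    """Step 5 (sketch): draw k distinct patterns with
    Pr(p) proportional to exp(eps1 * Rank(D, p) / (2k))."""
    scores = {p: support_count(dataset, p) for p in candidates}
    remaining = list(candidates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        top = max(scores[p] for p in remaining)
        # Subtracting the maximum score rescales all weights by the same
        # factor, so the probabilities are unchanged and exp() cannot overflow.
        weights = [math.exp(eps1 * (scores[p] - top) / (2 * k)) for p in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append((pick, scores[pick]))
        remaining.remove(pick)
    return chosen

def add_laplace_noise(selected, k, eps2, rng):
    """Step 6 (sketch): add Lap(k / eps2) noise to each true support count.
    The difference of two i.i.d. exponential variables with rate 1/scale
    follows a Laplace distribution with that scale."""
    scale = k / eps2
    return [(p, c + rng.expovariate(1 / scale) - rng.expovariate(1 / scale))
            for p, c in selected]

# Toy data, for illustration only.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
Cset = [frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")]
rng = random.Random(0)
Sset = select_top_k(D, Cset, k=2, eps1=0.2, rng=rng)
Pset = add_laplace_noise(Sset, k=2, eps2=0.2, rng=rng)
```

With realistic budgets the noise scale k/ε₂ is large relative to small support counts, which is one reason a consistency step such as Step 7 is applied afterwards.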

The invention is further characterized as follows.

Step 2 is implemented according to the following steps:

Step 2.1: input the original dataset D and the privacy budget ε₃, and solve for the optimal length l_opt;

Step 2.2: initialize the constrained dataset as empty;

Step 2.3: for each record r in the original dataset D, truncate r so that only its first l_opt items are retained, and add the result to the constrained dataset.

Step 2.1, solving for the optimal length l_opt, is implemented according to the following steps:

Step 2.1.1: let Z = <z₁, z₂, …, z_i, …, z_|D|>, where z_i is the length of the i-th record in D, and let R = <rank(z₁), rank(z₂), …, rank(z_i), …, rank(z_|D|)>, where rank(z_i) is the scoring function;

Step 2.1.2: compute the weight W of each length z_i:

W = <exp(ε₃ × rank(z₁)/2), exp(ε₃ × rank(z₂)/2), …, exp(ε₃ × rank(z_|D|)/2)>;

Step 2.1.3: sort the weights in W in descending order to obtain an ordered record dictionary;

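Step 2.1.4: compute the probability that length z_j is selected by normalizing its weight: Pr(z_j) = exp(ε₃ × rank(z_j)/2) / Σ_{i=1}^{|D|} exp(ε₃ × rank(z_i)/2);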

Step 2.1.5: select z_j from W as the optimal length l_opt.
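The length selection and truncation of Step 2 can be sketched as below. The patent text does not define the scoring function rank(), so the sketch takes it as a caller-supplied parameter, and the example rank used at the end is purely hypothetical. The selection probability is assumed to be each weight divided by the sum of all weights, as is usual for the exponential mechanism. The names `choose_l_opt` and `truncate` are illustrative.

```python
import math
import random

def choose_l_opt(dataset, eps3, rank, rng):
    """Pick a truncation length from the record lengths of `dataset`."""
    Z = [len(r) for r in dataset]        # Step 2.1.1: record lengths
    R = [rank(z) for z in Z]             # scores from the caller-supplied rank()
    top = max(R)
    # Step 2.1.2: weights exp(eps3 * rank / 2). Subtracting the maximum score
    # rescales every weight by the same factor and avoids overflow.
    W = [math.exp(eps3 * (r - top) / 2) for r in R]
    # Step 2.1.3: record indices ordered by descending weight.
    order = sorted(range(len(Z)), key=lambda i: W[i], reverse=True)
    # Step 2.1.4 (assumed): selection probability = weight / sum of weights.
    total = sum(W)
    probs = [W[i] / total for i in order]
    # Step 2.1.5: draw one length according to those probabilities.
    j = rng.choices(order, weights=probs, k=1)[0]
    return Z[j]

def truncate(dataset, l_opt):
    """Steps 2.2 and 2.3: keep only the first l_opt items of every record."""
    return [list(r[:l_opt]) for r in dataset]

# Toy data, for illustration only.
D = [["a", "b", "c", "d"], ["a", "b"], ["b", "c", "d"], ["a"], ["a", "c"]]
# Hypothetical scoring function: number of records no longer than z.
rank = lambda z: sum(1 for r in D if len(r) <= z)
rng = random.Random(1)
l_opt = choose_l_opt(D, eps3=0.2, rank=rank, rng=rng)
D_trunc = truncate(D, l_opt)
```

Because shorter records are left unchanged by slicing, truncation only affects records longer than l_opt.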

Step 4 is specifically implemented according to the following steps:

Step 4.1: take the frequent pattern set Cset generated by the FP-Growth algorithm in Step 3 as the candidate set Cset;

Step 4.2: sort the input candidate set Cset in descending order of pattern size, and begin the check from the largest itemset size, which makes candidate selection more efficient;

Step 4.3: check each pattern Cset_i, i = 1, …, n, in the sorted set Cset. If pattern Cset_i is contained in a pattern Cset_j, j = 1, …, n, and the support counts of Cset_i and Cset_j are equal (the condition tested by the definition of a closed frequent pattern), then Cset_i has a proper super-pattern; set Cset_i to empty and add Cset_j to the result set. The result set is the new set, compressed in size while still carrying the information of the full candidate set.
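The compression of Step 4 can be sketched as follows, assuming the candidate set is held as a dictionary from frozenset patterns to support counts; the name `compress_to_closed` is hypothetical. A pattern is dropped when a proper super-pattern with the same support count has already been kept, which leaves exactly the closed frequent patterns.

```python
def compress_to_closed(cset):
    """cset: dict mapping frozenset pattern -> support count.
    Returns the sub-dictionary of closed patterns."""
    # Step 4.2: process larger patterns first, so every proper superset of a
    # pattern has been examined before the pattern itself.
    ordered = sorted(cset, key=len, reverse=True)
    kept = []
    for p in ordered:
        # Step 4.3: p is redundant if a kept proper super-pattern has the
        # same support count (`p < q` is the proper-subset test).
        if any(p < q and cset[q] == cset[p] for q in kept):
            continue
        kept.append(p)
    return {p: cset[p] for p in kept}

# Support counts for the toy dataset [{a,b}, {a,b}, {a,b,c}].
Cset = {
    frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 1,
    frozenset("ab"): 3, frozenset("ac"): 1, frozenset("bc"): 1,
    frozenset("abc"): 1,
}
closed = compress_to_closed(Cset)
```

Checking only against kept patterns is sufficient: if the equal-support superset of p was itself dropped, it was dropped in favor of a still larger kept pattern with the same support, which is also a superset of p.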

The beneficial effect of the invention is an efficient frequent pattern mining method with differential privacy protection. On the one hand, the original transaction database is truncated while retaining as much useful information as possible, which improves efficiency. On the other hand, after all candidate frequent patterns have been mined by the FP-Growth algorithm, the concept of closed frequent patterns is introduced to reduce the size of the candidate set. With these two steps, the differentially private top-k frequent pattern mining method achieves good algorithmic efficiency; in particular, on transaction databases that contain many transactions and whose transaction datasets keep growing, the method can mine frequent patterns efficiently while providing privacy protection.

Brief description of the drawings

Fig. 1 compares the mining accuracy on the pumsb-star dataset of the efficient differentially private frequent pattern mining method of the invention with that of the general differentially private top-k frequent pattern mining method;

Fig. 2 compares the mining accuracy on the kosarak dataset of the efficient differentially private frequent pattern mining method of the invention with that of the general differentially private top-k frequent pattern mining method;

Fig. 3 compares how the frequent pattern size changes with data size for the method of the invention and the general differentially private top-k frequent pattern mining method;

Fig. 4 compares how the running time changes with data size for the method of the invention and the general differentially private top-k frequent pattern mining method;

Fig. 5 compares how the frequent itemset size changes with min_sup for the method of the invention and the general differentially private top-k frequent pattern mining method;

Fig. 6 compares how the running time changes with min_sup for the method of the invention and the general differentially private top-k frequent pattern mining method.

Specific embodiment

The following describes the present invention in detail with reference to the accompanying drawings and specific embodiments.

The efficient frequent pattern mining method with differential privacy protection of the invention is implemented according to Steps 1 to 8 set out above, with Step 2 (solving for the optimal transaction length l_opt and truncating the transaction dataset D) carried out through Steps 2.1 to 2.3, including Steps 2.1.1 to 2.1.5, and Step 4 (compressing the mined frequent pattern set Cset using closed frequent patterns) carried out through Steps 4.1 to 4.3.

Experiments on the original algorithm DP-topkP and the improved algorithm DP-OPtopkP cover both the utility and the efficiency of the algorithms. The data used in the first part come from http://fimi.ua.ac.be/data/: the pumsb-star dataset mainly describes census records, and the kosarak dataset mainly describes web click-stream records; the characteristics of the two datasets are given in Table 1. The second part uses simulated datasets generated automatically by a program, with the maximum record length fixed at l = 40, the minimum length at s = 2, and the number of transaction items at n = 20. In the first group of efficiency experiments, the dataset size is varied to observe how DP-topkP and DP-OPtopkP behave. In the second group, simulated datasets are again used, the dataset size is fixed, and the minimum threshold min_sup is varied to observe how the two algorithms behave. The experimental datasets are described in Table 1:

Table 1. Description of the test datasets

(1) Utility test procedure and results for the privacy-preserving mining algorithm

The utility of the mining results is measured by the average relative error (ARE) of the algorithm, which mainly quantifies the error introduced into the top-k frequent itemsets published from D. In the ARE formula, the noisy support count of pattern p_i is compared with its true support count TC(p_i, TP_k(D)). A smaller ARE value indicates higher utility of the algorithm.
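As an illustration, the ARE can be computed as below. The formula itself is not reproduced in this text, so the sketch assumes the usual definition: the mean, over the published patterns, of the absolute difference between noisy and true support counts divided by the true count. The function name and the example counts are hypothetical.

```python
def average_relative_error(noisy, true):
    """noisy, true: dicts mapping each published pattern to its noisy and
    true support count. True counts are assumed to be positive."""
    errors = [abs(noisy[p] - true[p]) / true[p] for p in noisy]
    return sum(errors) / len(errors)

# Made-up counts, for illustration only.
true_counts = {frozenset("a"): 100, frozenset("ab"): 50}
noisy_counts = {frozenset("a"): 110, frozenset("ab"): 45}
are = average_relative_error(noisy_counts, true_counts)
```

Here both patterns are off by 10 percent of their true count, so the ARE is about 0.1.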

The ARE values (the average relative errors of the algorithm runs) obtained experimentally on the pumsb-star and kosarak data are shown in Fig. 1 and Fig. 2.

In Fig. 1 and Fig. 2, the x-axis is the value of k in top-k and the y-axis is the average relative error (ARE) obtained after running the algorithm; both figures compare the average relative error of the two algorithms. The figures show that the ARE remains relatively stable for both DP-topkP and DP-OPtopkP, and that within the same group of experiments the ARE values of the two algorithms do not differ much. This indicates that the algorithm of the invention, DP-OPtopkP, preserves the utility of the output while improving efficiency.

(2) Algorithm running-efficiency tests

1) Behavior of the algorithms on simulated datasets as the dataset size changes

The experiment uses simulated datasets with min_sup = 10, ε = 0.6, and k = 100 for top-k, and the dataset size varies from 6 KB to 96 KB. The results are shown in Fig. 3 and Fig. 4, respectively.

In Fig. 3 and Fig. 4, the x-axis is the dataset size. In Fig. 3, the y-axis is the candidate frequent itemset size obtained when the two algorithms run. The data in the figure show that for both DP-OPtopkP and DP-topkP the candidate frequent pattern size grows as the dataset grows, and that the candidate size obtained by DP-OPtopkP is smaller than that of DP-topkP; the reasons are the preprocessing of the database on the one hand and the compression of the candidate set on the other. In Fig. 4, the y-axis is the time consumed when the two algorithms run. The figure shows that DP-OPtopkP takes less time than DP-topkP, because the smaller candidate set improves running efficiency.

2) Comparison of results on simulated datasets as the minimum threshold changes

This group of experiments also uses simulated datasets, with ε = 0.6, k = 100 for top-k, and a dataset size of 128 KB; the minimum threshold min_sup is varied. The results are shown in Fig. 5 and Fig. 6, respectively.

When the minimum support threshold min_sup lies in [150, 450], Fig. 5 shows the candidate frequent pattern sizes obtained by DP-OPtopkP and DP-topkP; the two are almost identical. This is because, when min_sup is large, both the size of the candidate set and the length of the candidates are limited, so truncating the transaction database and compressing to closed frequent itemsets have little effect on the candidate set size; instead, a small amount of time is spent on the truncation and compression operations.

When min_sup ≤ 150, Fig. 5 shows that the candidate frequent pattern size obtained by DP-OPtopkP is smaller than that obtained by DP-topkP, and Fig. 6 shows that DP-OPtopkP also takes less time than DP-topkP.

The invention optimizes the DP-topkP algorithm by truncating the transaction database and reducing the candidate set size. The new algorithm performs better on transaction databases that contain many transactions and whose transaction datasets keep growing, and in terms of the utility of the mining results the improved algorithm stays at a level comparable to the reference algorithm.

Claims

1. An efficient frequent pattern mining method with differential privacy protection, characterized in that it is implemented according to the following steps:

Step 1: input the transaction dataset D, the privacy budget ε, the number of frequent patterns k, and the minimum support threshold min_sup. Split the privacy budget ε into ε₁, ε₂, and ε₃, where ε₁ is the privacy budget allocated to selecting the top-k frequent itemsets with the exponential mechanism, ε₂ is the privacy budget allocated to adding Laplace noise to the k selected frequent itemsets, and ε₃ is the privacy budget for the Laplace noise added during length selection, such that ε = ε₁ + ε₂ + ε₃. Initialize the sets Cset, Sset, and Pset as empty, where Cset is the set of candidate frequent itemsets, Sset is the set formed by the top-k frequent patterns selected by the exponential mechanism, and Pset is the set of frequent patterns after noise is added;

Step 3: use the FP-Growth method to mine the frequent pattern set Cset of all patterns whose support count is not less than min_sup;

Step 5: use the exponential mechanism to select the top-k frequent patterns, which together with their true support counts form the set Sset; each pattern p in the set satisfies:

Pr(p) ∝ exp(ε₁ × Rank(D, p) / (2k)),

where Rank(D, p) is the score of pattern p, Rank(D, p) = Σ_{t_i ∈ D} c(t_i, p), with c(t_i, p) = 1 when p ∈ t_i and c(t_i, p) = 0 when p ∉ t_i;

Step 7: apply consistency-constraint processing to the noisy pattern support counts in Pset to improve the utility of the patterns;

Step 8: output the top-k frequent patterns and the set RC of noisy counts.

2. The efficient frequent pattern mining method with differential privacy protection according to claim 1, characterized in that Step 2 is implemented according to the following steps:

Step 2.2: initialize the constrained dataset as empty;

3. The efficient frequent pattern mining method with differential privacy protection according to claim 2, characterized in that Step 2.1, solving for the optimal length l_opt, is implemented according to the following steps:

Step 2.1.1: let Z = <z₁, z₂, …, z_i, …, z_|D|>, where z_i is the length of the i-th record in D, and let R = <rank(z₁), rank(z₂), …, rank(z_i), …, rank(z_|D|)>, where rank(z_i) is the scoring function;

Step 2.1.2: compute the weight W of each length z_i: W = <exp(ε₃ × rank(z₁)/2), exp(ε₃ × rank(z₂)/2), …, exp(ε₃ × rank(z_|D|)/2)>;

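Step 2.1.4: compute the probability that length z_j is selected by normalizing its weight: Pr(z_j) = exp(ε₃ × rank(z_j)/2) / Σ_{i=1}^{|D|} exp(ε₃ × rank(z_i)/2);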

Step 2.1.5: select z_j from W as the optimal length l_opt.

4. The efficient frequent pattern mining method with differential privacy protection according to claim 1, characterized in that Step 4 is implemented according to the following steps:

Step 4.1: take the frequent pattern set Cset generated by the FP-Growth algorithm in Step 3 as the candidate set Cset;

Step 4.2: sort the input candidate set Cset in descending order of pattern size, and begin the check from the largest itemset size, which makes candidate selection more efficient;

Step 4.3: check each pattern Cset_i, i = 1, …, n, in the sorted set Cset. If pattern Cset_i is contained in a pattern Cset_j, j = 1, …, n, and the support counts of Cset_i and Cset_j are equal (the condition tested by the definition of a closed frequent pattern), then Cset_i has a proper super-pattern; set Cset_i to empty and add Cset_j to the result set. The result set is the new set, compressed in size while still carrying the information of the full candidate set.