CN105740245A - Frequent item set mining method - Google Patents


Info

Publication number
CN105740245A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410746488.2A
Other languages
Chinese (zh)
Inventor
程祥
苏森
许胜之
徐鹏
双锴
王玉龙
张忠宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201410746488.2A
Publication of CN105740245A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical fields of data mining and data privacy, and discloses a frequent itemset mining method. The method comprises the following steps. S1: split each transaction in the original database whose length exceeds a length limit into multiple sub-transactions, so that every transaction in the split database has a length no greater than the limit. S2: given a pre-specified support threshold, mine the frequent itemsets in the split database using a support estimation method and a dynamic reduction method. The method provides higher mining efficiency and more useful mining results while satisfying differential privacy.

Description

Frequent itemset mining method
Technical field
The present invention relates to the technical fields of data mining and data privacy, and in particular to a frequent itemset mining method.
Background
Frequent itemset mining is a fundamental problem in data mining and is widely applied in many fields. It can be described as follows. A transaction database is given in which each transaction corresponds to the personal record of one user, and each transaction is a set of items. For a given itemset (a set of items), its support is the number of transactions that contain it. An itemset whose support is not less than a given threshold is called a frequent itemset. Given a transaction database and a threshold, frequent itemset mining finds all frequent itemsets that occur in the database.
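The definitions above can be illustrated with a minimal sketch. It is a brute-force enumeration for a toy database, not FP-growth or the method of the invention; the function names and the toy data are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for t in transactions if target <= frozenset(t))

def frequent_itemsets(transactions, threshold):
    """Brute-force enumeration; exponential cost, for illustration only."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= threshold:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            break  # downward closure: no larger itemset can be frequent
    return result

# Hypothetical toy database: each set is one user's transaction.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
frequent = frequent_itemsets(toy_db, threshold=2)
```

With a threshold of 2, the itemsets {a}, {b}, {c}, {a, b} and {a, c} are frequent in this toy database, while {d} and {b, c} are not.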
Among frequent itemset mining algorithms, FP-growth [2] is widely used. FP-growth is a depth-first traversal algorithm that uses an FP-tree, a special kind of prefix tree, to speed up the whole mining process. With the FP-tree, FP-growth needs only two scans of the database, which greatly improves mining efficiency.
If the transactions (personal records) in the database are sensitive, directly publishing the frequent itemsets can leak users' personal records. How to protect user privacy during frequent itemset mining has therefore attracted growing attention. Differential privacy [3] offers a feasible approach to privacy problems in data analysis. Unlike k-anonymity [4] and l-diversity [5], differential privacy protects user privacy with a theoretical guarantee by adding noise.
Existing work has applied differential privacy to frequent pattern mining. The algorithms in [6] and [7] publish transaction databases under differential privacy to protect user privacy; the published database can then be used for frequent pattern mining. Specifically, [6] proposes a database publishing algorithm guided by a context-free tree structure. It partitions the database top-down and finally publishes a synthetic database for frequent itemset mining. [7] provides a differentially private transaction database publishing algorithm for incrementally changing data. [8] proposes PrivBasis, a differentially private algorithm for mining the top-k frequent itemsets. [9] finds that limiting the length of transactions improves the trade-off between data utility and privacy protection; using a truncation method, it designs a differentially private frequent itemset mining algorithm based on the Apriori algorithm [1]. However, all of these algorithms fall short in the utility of their mining results and in mining efficiency, which hinders the application of differential privacy to frequent pattern mining.
References:
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in VLDB, 1994.
[2] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in SIGMOD, 2000.
[3] C. Dwork, "Differential privacy," in ICALP, 2006.
[4] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Base Syst., 2002.
[5] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," in ICDE, 2006.
[6] R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong, "Publishing set-valued data via differential privacy," in VLDB, 2011.
[7] X. Zhang, X. Meng, and R. Chen, "Differentially private set valued data release against incremental updates," in DASFAA, 2013.
[8] N. Li, W. Qardaji, D. Su, and J. Cao, "Privbasis: frequent itemset mining with differential privacy," in VLDB, 2012.
[9] C. Zeng, J. F. Naughton, and J.-Y. Cai, "On differentially private frequent itemset mining," in VLDB, 2012.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to improve the utility of the mining results.
(2) Technical solution
To solve the above technical problem, the invention provides a frequent itemset mining method, comprising:
S1: splitting each transaction in the original database whose length exceeds the length limit into multiple sub-transactions, so that the length of every transaction in the split database is not greater than the length limit;
S2: according to a pre-specified support threshold, mining the frequent itemsets in the split database using a support estimation method and a dynamic reduction method.
Wherein step S1 specifically comprises:
S1.1: constructing an undirected weighted graph from the original database, in which each vertex represents an item of the database; when the itemset formed by the items of two vertices occurs in the same transaction, the two vertices are connected by an edge, and the weight of the edge is the support, in the transaction database, of the itemset formed by the two items;
S1.2: using the Louvain algorithm to find communities in the undirected weighted graph, and using the intermediate results of the Louvain iterations to build a tree structure, the CR-tree; in the CR-tree, the nodes of each layer represent the communities found in one iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities of its child nodes, and the correlation between two items is represented by the length of the shortest path between the leaf nodes that contain them;
S1.3: using the generated CR-tree to split the transactions of the original database whose length exceeds the length limit, producing the split database; the transaction, its length, the maximum transaction length limit, and the constructed CR-tree are denoted t, p, m and T respectively, and the splitting proceeds as follows:
S1.3.1: compute the number q of sub-transactions from p and m, namely $q = \lceil p/m \rceil$;
S1.3.2: set the result set R of splitting t to empty;
S1.3.3: construct the i-th sub-transaction $t_i$ through the following steps a) to e):
a) from the leaf nodes of the CR-tree, select the nodes that contain items of t, and remove from these nodes the items that are not in t; the resulting nodes form the leaf-layer set $N_l$;
b) from $N_l$, select the node $n_l$ that contains the most items, and add the items of $n_l$ to $t_i$;
c) sort the remaining nodes of $N_l$ in descending order of their correlation with $n_l$, as given by their distance to $n_l$ in T; nodes with equal correlation are sorted in descending order of capacity;
d) traverse the remaining nodes of $N_l$ in order; if the sum of a node's capacity and the capacity of $n_l$ is not greater than m, remove the node from $N_l$ and put it into $t_i$;
e) store $t_i$ in the result set R;
S1.3.4: repeat S1.3.3 q times;
S1.3.5: if nodes remain in $N_l$, randomly place the items of each remaining node into sub-transactions in R whose length is less than m;
S1.3.6: return the result set R.
Wherein, after splitting, each sub-transaction $t_i$ is given the weight 1/q.
Wherein the support estimation method in step S2 specifically comprises:
S2.1: let an itemset X in the result set R have length i; its support in the original database is $\omega$, its support in the split database is $\omega'$, and its support in the split database after noise is added is $\tilde{\omega}$;
S2.2: estimate $\omega'$ from $\tilde{\omega}$; by Bayes' rule:
$$\Pr(\omega' \mid \tilde{\omega}) = \frac{\Pr(\tilde{\omega} \mid \omega') \cdot \Pr(\omega')}{\Pr(\tilde{\omega})};$$
if $\omega'$ follows a uniform prior distribution, its conditional probability distribution satisfies:
$$\Pr(\omega' \mid \tilde{\omega}) \propto e^{-\epsilon \lvert \omega' - \tilde{\omega} \rvert};$$
S2.3: estimate the support $\omega$ of X in the original database; let the maximum transaction length limit be m, and let a transaction t of length p contain X with p > m; t is split into $\lceil p/m \rceil$ subsets, each of length at most m; assume further that $q' = \lfloor p/m \rfloor$ subsets have length m and the one remaining subset has length $a = p - q'm$, which is less than m; then, after t is split, the probability that X is contained in one of the sub-transactions is:
$$\beta_p = \begin{cases} q' \dfrac{C_{p-i}^{m-i}}{C_p^m} & \text{if } a < i \\[2ex] q' \dfrac{C_{p-i}^{m-i}}{C_p^m} + \dfrac{C_{p-i}^{a-i}}{C_p^a} & \text{if } a \ge i \end{cases};$$
Let $\alpha_k$ denote the number of transactions of length k that contain X, and let n be the maximum transaction length; the expectation of $\omega'$ is calculated as:
Let ratio(i) denote the last bracketed factor on the right-hand side of this expectation.
The average support of X in the original database is estimated as:
$$\operatorname{avg}(\omega') = \frac{\omega'}{\operatorname{ratio}(i)};$$
Using the $\rho$-lower quantile, the maximum support of X in the original database is estimated as:
$$\max(\omega') = \begin{cases} \dfrac{\omega' - \ln\rho + \sqrt{\ln^2\rho - 2\omega'\ln\rho}}{\operatorname{ratio}(i)} & \text{if } \ln\rho \le 2\omega' \\[2ex] \operatorname{avg}(\omega') & \text{if } \ln\rho > 2\omega' \end{cases};$$
From the noisy support $\tilde{\omega}$ of X in the split database, its maximum support and average support in the original database are estimated respectively as:
$$\mathrm{max\_supp}(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, \max(\omega') \, d\omega'$$
$$\mathrm{avg\_supp}(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, \operatorname{avg}(\omega') \, d\omega';$$
If the average support of itemset X is greater than the support threshold, X is a frequent itemset; if the maximum support of X is greater than the given support threshold, X is used to generate candidate frequent itemsets.
Wherein the dynamic reduction method in step S2 specifically comprises:
Step 2.4: assign an initial value to the upper bound on the number of support queries for itemsets of length i;
Assuming the frequent itemset candidate set contains s frequent items, define an array that stores the upper bounds on the numbers of itemsets of each length to be computed, wherein the entry for length i is the upper bound on the number of itemsets of length i to be computed, and its initial value is
Step 2.5: during mining, dynamically lower the upper bound on the number of itemsets of length i to be computed, specifically:
Suppose the conditional FP-tree of itemset β is currently being mined, and the item order in the header table of this conditional FP-tree is $\{i_1, \ldots, i_k, \ldots, i_n\}$. The k-th element $i_k$ of the header table forms with β a new itemset $Y = \beta \cup i_k$. Let $S_1 = \{i_1, \ldots, i_{k-1}\}$, and let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of $S_2$, the itemset $X = Y \cup j$ is infrequent; by the downward closure property of frequent patterns, every itemset formed by X and any subset of $\{S_1 - j\}$ is therefore also infrequent. The resulting reduction amount is:
$$r_p = \sum_{u=1}^{q} C_{|S_1| - u}^{\,p - |Y| - 1}$$
where $q = \min\{|S_2|, |S_1| - (p - |Y| - 1)\}$, p is the length of the transaction, and q is the number of sub-transactions after splitting;
Step 2.6: take the updated upper bound as the sensitivity and apply the Laplace mechanism: the ratio of this sensitivity to the privacy parameter is used as the scale of the Laplace distribution, and the resulting noise is added to the support.
(3) Beneficial effects
The present invention provides higher mining efficiency and more useful mining results while satisfying differential privacy.
Brief description of the drawings
Fig. 1 is a flow chart of the frequent itemset mining method of the present invention;
Fig. 2 is a schematic diagram of a given transaction database in the present invention;
Fig. 3 is the undirected weighted graph constructed from Fig. 2;
Fig. 4 is the CR-tree constructed from Fig. 3;
Fig. 5 shows the F-score and RE of PFP, TT and PB on Pumsb;
Fig. 6 shows the F-score and RE of PFP, TT and PB on Accidents;
Fig. 7 shows the F-score and RE of PFP, TT and PB on POS;
Fig. 8 shows the F-score and RE of PFP, TT and PB on Retail;
Fig. 9 shows the running time of PFP, FP, TT and PB on the different data sets.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.
The flow of the frequent itemset mining method of this embodiment is shown in Fig. 1 and comprises the following steps:
Step S1: preprocess the original database. Using a smart splitting method, split each transaction in the original database whose length exceeds the specified length limit into multiple sub-transactions, so that the length of every transaction in the transformed database is not greater than the specified limit.
Step S2: mine the frequent itemsets while satisfying differential privacy. According to the threshold specified by the user, mine the frequent itemsets in the transformed database. During mining, the support estimation method reduces the information loss caused by transaction splitting; at the same time, the dynamic reduction method reduces the amount of noise added during mining, which improves the utility of the results.
The smart splitting method used in preprocessing, the support estimation method, and the dynamic reduction method are described in detail below.
For step S1, in order to improve the utility of the mining results while keeping a high level of privacy protection, the original database undergoes a new kind of transformation, namely transaction splitting. When the length of a record exceeds the length limit, the transaction is split into several sub-transactions, each of which satisfies the maximum length limit. Simply cutting up the original database, however, loses a large amount of information, so that some frequent itemsets are no longer frequent, which harms the utility of the mining results. In addition, once the database has been split, the privacy protection level of a differentially private mining algorithm declines. To solve these problems, the present invention proposes a smart splitting method. The specific steps are as follows:
Step 1: construct an undirected weighted graph from the original database, in which each vertex represents an item of the database. When the itemset formed by the items of two vertices occurs in the same transaction, the two vertices are connected by an edge. The weight of the edge is the support, in the transaction database, of the itemset formed by the two items. For a given transaction database (Fig. 2), the undirected weighted graph constructed by this process is shown in Fig. 3.
Step 2: use the Louvain algorithm to find communities in the constructed undirected weighted graph. Louvain is an iterative algorithm. The intermediate results of its iterations are used to build a tree structure, the CR-tree. In the CR-tree, the nodes of each layer represent the communities found in one iteration, the height of the tree represents the number of iterations, and a parent node represents the new community formed by merging the communities of its child nodes. The correlation between two items is represented by the length of the shortest path between the leaf nodes that contain them. For example, for the undirected weighted graph of Fig. 3, the CR-tree constructed with the Louvain algorithm is shown in Fig. 4.
Step 3: use the generated CR-tree to split the transactions of the original database whose length exceeds the length limit, producing the split database. The transaction, its length, the maximum transaction length limit, and the constructed CR-tree are denoted t, p, m and T respectively. The splitting proceeds as follows:
1) compute the number q of sub-transactions from p and m, namely $q = \lceil p/m \rceil$;
2) set the result set R of splitting t to empty;
3) construct the i-th sub-transaction $t_i$:
a) from the leaf nodes of the CR-tree, select the nodes that contain elements (that is, items) of t, and remove from these nodes the items that are not in t; the resulting nodes form the leaf-layer set $N_l$;
b) from $N_l$, select the node $n_l$ that contains the most items, and add the items of $n_l$ to $t_i$;
c) sort the remaining nodes of $N_l$ in descending order of their correlation with $n_l$, as given by their distance to $n_l$ in T; nodes with equal correlation are sorted in descending order of capacity;
d) traverse the remaining nodes of $N_l$ in order; if the sum of a node's capacity and the capacity of $n_l$ is not greater than m, remove the node from $N_l$ and put it into $t_i$;
e) store $t_i$ in the result set R;
4) repeat step 3) q times;
5) if nodes remain in $N_l$, randomly place the items of each remaining node into sub-transactions in R whose length is less than m;
6) finally, return the result set R.
Through steps 1 to 3 above, the original database is transformed into a new database in which the length of every transaction satisfies the maximum length limit.
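Steps 1 to 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure itself: the Louvain algorithm is not run, so the leaf communities and merge levels are hand-written hypothetical inputs; a shorter tree distance is taken to mean a higher correlation; the capacity check in d) uses the current size of the sub-transaction being built; and leaf nodes larger than m are chunked before splitting. All function and variable names are illustrative.

```python
import math
import random
from collections import Counter
from itertools import combinations

def build_graph(transactions):
    """Step 1: edge weight = number of transactions containing both items."""
    edges = Counter()
    for t in transactions:
        for u, v in combinations(sorted(set(t)), 2):
            edges[frozenset((u, v))] += 1
    return dict(edges)

def leaf_distance(levels, x, y):
    """Step 2 (distance only): shortest CR-tree path between leaves x and y.

    levels[0][leaf] is the leaf's own id; levels[k][leaf] is the id of its
    ancestor community after k merge passes (hypothetical representation).
    """
    for height, level in enumerate(levels):
        if level[x] == level[y]:
            return 2 * height      # climb `height` edges, then descend again
    return 2 * len(levels)         # assumed virtual root above all levels

def split_transaction(t, m, leaves, dist, rng=None):
    """Step 3: split a non-empty `t` into parts of at most `m` items.

    Assumes every item of `t` occurs in some leaf. Returns (parts, weight),
    where weight = 1/q is the weight given to each part.
    """
    rng = rng or random.Random(0)
    t = set(t)
    q = math.ceil(len(t) / m)
    # a) restrict each leaf to the items of t; chunk oversized nodes (assumption)
    nodes = []
    for idx, leaf in enumerate(leaves):
        items = sorted(leaf & t)
        for start in range(0, len(items), m):
            nodes.append((idx, items[start:start + m]))
    parts = []
    for _ in range(q):
        if not nodes:
            break
        # b) seed the sub-transaction with the largest remaining node
        seed = max(nodes, key=lambda n: len(n[1]))
        nodes.remove(seed)
        sub = set(seed[1])
        # c) most correlated (closest) first; ties broken by larger capacity
        nodes.sort(key=lambda n: (dist(seed[0], n[0]), -len(n[1])))
        # d) add whole nodes while the length limit still holds
        for node in list(nodes):
            if len(sub) + len(node[1]) <= m:
                sub.update(node[1])
                nodes.remove(node)
        parts.append(sub)          # e)
    # 5) scatter leftover items into parts that still have room
    for _, items in nodes:
        for item in items:
            rng.choice([s for s in parts if len(s) < m]).add(item)
    return parts, 1.0 / q

# Hypothetical inputs: four leaf communities and one merge pass.
leaves = [{"a", "b"}, {"c"}, {"d", "e"}, {"f"}]
levels = [{0: 0, 1: 1, 2: 2, 3: 3}, {0: 0, 1: 0, 2: 1, 3: 1}]
graph = build_graph([{"a", "b", "c"}, {"a", "b"}, {"b", "d"}])
parts, weight = split_transaction(
    {"a", "b", "c", "d", "e"}, 3, leaves,
    lambda i, j: leaf_distance(levels, i, j))
```

In this toy run the transaction of length 5 is split with m = 3 into {a, b, c} and {d, e}, each with weight 1/2, so items from closely related communities stay together.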
Analysis shows that if a transaction is split into k sub-transactions, a frequent itemset mining algorithm that satisfies ε-differential privacy on the transformed database guarantees only kε-differential privacy with respect to the original database. To address this, the present invention proposes a weighted splitting operation, defined as follows. When the maximum transaction capacity is m and a transaction t has length greater than m, the splitting operation f (steps 1 to 3 above) cuts t into $t_1, \ldots, t_k$ such that $|t_i| \le m$, and assigns a weight $w_i$ to each $t_i$. If the weights $w_i$ assigned by f satisfy the required conditions, f is called a weighted splitting operation. It has been proven theoretically that when the weighted splitting operation is used to transform the original transaction database, a frequent itemset mining algorithm that satisfies ε-differential privacy on the transformed database also guarantees ε-differential privacy with respect to the original database.
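The effect of the weights can be sketched as follows, assuming each sub-transaction of a transaction split into q parts carries the weight 1/q and that support in the split database is the sum of the weights of the sub-transactions containing the itemset. The function name and the toy data are hypothetical.

```python
def weighted_support(itemset, weighted_db):
    """Sum of the weights of the sub-transactions that contain `itemset`."""
    target = frozenset(itemset)
    return sum(w for items, w in weighted_db if target <= frozenset(items))

# One original transaction {a, b, c, d, e} split into two parts of weight 1/2,
# plus one short transaction kept whole with weight 1.
weighted_db = [({"a", "b", "c"}, 0.5), ({"d", "e"}, 0.5), ({"a", "b"}, 1.0)]
ab_support = weighted_support({"a", "b"}, weighted_db)
```

Because the weights of one user's sub-transactions sum to 1 in this sketch, that user's record changes the weighted support of any itemset by at most 1, just as an unsplit transaction would.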
Further, after the transaction splitting of step S1, when a sub-transaction $t_1$ is added to the database, its support increases by 1/q rather than by 1. The support of $t_1$ computed in the transformed database is therefore smaller than its true value, and some subsets of t are lost, which causes a certain loss of information. To compensate for this loss, step S2 uses a support estimation method. The method has two main steps: first, from the noisy support of an itemset obtained during mining (noise is generally added with the Laplace mechanism), estimate its exact support in the split database; then, from this estimate, estimate its support in the original database. The calculation is as follows:
Let X be an itemset of length i; its support in the original database is $\omega$, its support in the split database is $\omega'$, and its support in the split database after noise is added is $\tilde{\omega}$.
First, $\omega'$ is estimated from $\tilde{\omega}$. By Bayes' rule:
$$\Pr(\omega' \mid \tilde{\omega}) = \frac{\Pr(\tilde{\omega} \mid \omega') \cdot \Pr(\omega')}{\Pr(\tilde{\omega})}$$
Assuming $\omega'$ follows a uniform prior distribution, its conditional probability distribution satisfies:
$$\Pr(\omega' \mid \tilde{\omega}) \propto e^{-\epsilon \lvert \omega' - \tilde{\omega} \rvert}$$
Then the support $\omega$ of X in the original database is estimated.
Let the maximum transaction length limit be m, and let a transaction t of length p contain X with p > m. To improve the utility of the mining results and ensure a high level of privacy protection, t is split into $\lceil p/m \rceil$ subsets, each of length at most m. Assume further that $q' = \lfloor p/m \rfloor$ subsets have length m and that the one remaining subset, which may be shorter than m, has length $a = p - q'm$. Then, after t is split, the probability that X is contained in one of the sub-transactions is:
$$\beta_p = \begin{cases} q' \dfrac{C_{p-i}^{m-i}}{C_p^m} & \text{if } a < i \\[2ex] q' \dfrac{C_{p-i}^{m-i}}{C_p^m} + \dfrac{C_{p-i}^{a-i}}{C_p^a} & \text{if } a \ge i \end{cases};$$
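The probability $\beta_p$ can be evaluated directly with binomial coefficients. The following is a minimal sketch of the formula above; the function name `beta` and the guard for itemsets longer than m are illustrative additions.

```python
from math import comb

def beta(p, m, i):
    """Probability that an itemset of length i survives in one sub-transaction
    when a transaction of length p > m is split as described above."""
    q_full = p // m            # q': sub-transactions of length exactly m
    a = p - q_full * m         # length of the one remaining sub-transaction
    prob = 0.0
    if i <= m:                 # guard: an itemset longer than m cannot fit
        prob += q_full * comb(p - i, m - i) / comb(p, m)
    if a >= i:
        prob += comb(p - i, a - i) / comb(p, a)
    return prob

single_item = beta(5, 3, 1)    # a single item always lands in some part
pair = beta(5, 3, 2)
```

For p = 5 and m = 3, a single item is kept with probability 1, while a pair of items stays together with probability 3/10 + 1/10 = 0.4, which shows how quickly longer itemsets lose support under splitting.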
Let $\alpha_k$ denote the number of transactions of length k that contain X, and let n be the maximum transaction length; the expectation of $\omega'$ can then be calculated as:
For convenience, the latter part of the right-hand side of the equation (the content of the last bracket) is abbreviated as ratio(i), that is:
The average support of X in the original database can then be estimated as:
$$\operatorname{avg}(\omega') = \frac{\omega'}{\operatorname{ratio}(i)}$$
Using the $\rho$-lower quantile, the maximum support of X in the original database can be estimated as:
$$\max(\omega') = \begin{cases} \dfrac{\omega' - \ln\rho + \sqrt{\ln^2\rho - 2\omega'\ln\rho}}{\operatorname{ratio}(i)} & \text{if } \ln\rho \le 2\omega' \\[2ex] \operatorname{avg}(\omega') & \text{if } \ln\rho > 2\omega' \end{cases}$$
Based on the above analysis, from the noisy support $\tilde{\omega}$ of X in the transformed database, its maximum support and average support in the original database can be estimated respectively as:
$$\mathrm{max\_supp}(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, \max(\omega') \, d\omega'$$
$$\mathrm{avg\_supp}(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, \operatorname{avg}(\omega') \, d\omega'$$
If the average support of itemset X is greater than the support threshold, X is a frequent itemset; if the maximum support of X is greater than the given support threshold, X is used to generate candidate frequent itemsets.
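The two estimates can be approximated numerically. The sketch below makes several assumptions: ratio(i) is passed in as a plain number because its formula is not reproduced in the text, $0 < \rho < 1$, the posterior is normalized over the window $[\tilde{\omega} - 5, \tilde{\omega} + 5]$, and the integrals are evaluated with a midpoint rule. All function names are illustrative.

```python
import math

def avg_estimate(w, ratio):
    """avg(w') = w' / ratio(i)."""
    return w / ratio

def max_estimate(w, ratio, rho):
    """max(w') from the piecewise formula above (assumes 0 < rho < 1)."""
    ln_rho = math.log(rho)
    if ln_rho <= 2 * w:
        root = math.sqrt(ln_rho ** 2 - 2 * w * ln_rho)
        return (w - ln_rho + root) / ratio
    return avg_estimate(w, ratio)

def posterior_expectation(noisy, eps, func, half_width=5.0, steps=1000):
    """Midpoint-rule average of func(w') weighted by exp(-eps * |w' - noisy|)."""
    h = 2 * half_width / steps
    num = den = 0.0
    for k in range(steps):
        w = noisy - half_width + (k + 0.5) * h
        weight = math.exp(-eps * abs(w - noisy))
        num += weight * func(w)
        den += weight
    return num / den           # normalizing over the window is an assumption

# Hypothetical values: noisy support 100, eps = 1, ratio(i) = 0.5, rho = 0.1.
avg_supp = posterior_expectation(100.0, 1.0, lambda w: avg_estimate(w, 0.5))
max_supp = posterior_expectation(100.0, 1.0, lambda w: max_estimate(w, 0.5, 0.1))
```

An itemset would then be reported as frequent when `avg_supp` exceeds the threshold, and kept for candidate generation when `max_supp` does, as the rule above states.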
Further, after the support of an itemset in the split database has been computed, an appropriate amount of noise is added as differential privacy requires. The noise level is proportional to the sensitivity of the computation. For itemsets of length i, the sensitivity of the support computation equals the number of length-i itemsets whose support is computed during mining. Because FP-growth is a depth-first traversal algorithm, it is hard to count the number of length-i itemsets exactly during mining. Adapting FP-growth so that all itemsets of the same length are generated together would cause a large storage overhead. To solve this problem, the present invention proposes a lightweight dynamic reduction method. Its core idea is to use the downward closure property of frequent patterns to dynamically lower the upper bound on the number of length-i itemsets to be computed, and thereby reduce the sensitivity of computing the length-i itemsets. The process is as follows:
Step 1: assign an initial value to the upper bound on the number of support queries for itemsets of length i;
Assuming the database contains s frequent items, define an array that stores the upper bounds on the numbers of itemsets of each length to be computed, where the entry for length i is the upper bound on the number of itemsets of length i to be computed, and its initial value is
Step 2: during mining, dynamically lower the upper bound on the number of itemsets of length i to be computed;
Suppose the conditional FP-tree of itemset β is currently being mined (the conditional FP-tree of β is the FP-tree built from the small database formed by all transactions that contain β). The item order in the header table of this conditional FP-tree is $\{i_1, \ldots, i_k, \ldots, i_n\}$. The k-th element $i_k$ of the header table forms with β a new itemset $Y = \beta \cup i_k$. Let $S_1 = \{i_1, \ldots, i_{k-1}\}$. In addition, let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of $S_2$, the itemset $X = Y \cup j$ is infrequent. By the downward closure property of frequent patterns, every itemset formed by X and any subset of $\{S_1 - j\}$ is therefore also infrequent. The resulting reduction amount is:
$$r_p = \sum_{u=1}^{q} C_{|S_1| - u}^{\,p - |Y| - 1}$$
where $q = \min\{|S_2|, |S_1| - (p - |Y| - 1)\}$, p is the length of the transaction, and q is the number of sub-transactions after splitting.
Step 3: take the updated upper bound as the sensitivity and apply the Laplace mechanism: the ratio of this sensitivity to the privacy parameter is used as the scale of the Laplace distribution, and the resulting noise is added to the support.
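The three steps can be sketched as follows. Several points are assumptions rather than statements of the source: the initial bound for length i is taken to be the binomial count $C_s^i$, which the text does not show; p is read as the itemset length whose bound is being lowered; and the Laplace sample is drawn as the difference of two exponential variates. Function names are illustrative.

```python
import math
import random

def initial_bounds(s, max_len):
    """Assumed initial bound: C(s, i) itemsets of length i from s frequent items."""
    return {i: math.comb(s, i) for i in range(1, max_len + 1)}

def reduction(p, size_s1, size_s2, size_y):
    """r_p = sum_{u=1}^{q} C(|S1| - u, p - |Y| - 1), q = min(|S2|, |S1| - (p - |Y| - 1))."""
    k = p - size_y - 1
    if k < 0:
        return 0
    q = min(size_s2, size_s1 - k)
    return sum(math.comb(size_s1 - u, k) for u in range(1, q + 1))

def laplace_noise(sensitivity, eps, rng):
    """Sample Laplace(0, sensitivity / eps) as a scaled difference of two Exp(1)."""
    scale = sensitivity / eps
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

# Hypothetical run: 5 frequent items; while mining, |S1| = 4, |S2| = 2, |Y| = 2.
bounds = initial_bounds(5, 4)
r4 = reduction(4, size_s1=4, size_s2=2, size_y=2)
bounds[4] = max(bounds[4] - r4, 0)       # lower the bound for length-4 itemsets
rng = random.Random(0)
noisy_support = 120 + laplace_noise(bounds[4], eps=1.0, rng=rng)
```

In this toy run the bound for length-4 itemsets drops from 5 to 0, so no noise remains for that length; in general a smaller bound means a smaller Laplace scale and a more accurate noisy support.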
As the above analysis shows, the dynamic reduction method involves only simple additions and multiplications and does not introduce a complex computational cost. By continually lowering the upper bound on the number of itemsets to be computed, it reduces the amount of noise added during mining and so improves the utility of the mining results while satisfying differential privacy. In other words, the dynamic reduction method continually adjusts the number of candidate frequent itemsets, so that only a small amount of noise needs to be added after the support of an itemset in the split database has been computed, which improves data utility.
Formal analysis shows that the differentially private frequent itemset mining method based on transaction splitting (PFP) of the present invention provides higher mining efficiency and more useful mining results while satisfying differential privacy.
Compared with the algorithm (TT) proposed in [9] and the algorithm (PB) proposed in [8], the proposed PFP algorithm has a clear advantage in the utility of the mining results and in running efficiency. To better illustrate this advantage, PFP is compared with TT and PB in two respects: utility of the mining results and running time. For utility, the correctness of the generated frequent itemsets is first measured with a combined metric, the F-score, computed as follows:
$$\text{F-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
where $\text{precision} = |U_p \cap U_C| / |U_p|$, $\text{recall} = |U_p \cap U_C| / |U_C|$, $U_p$ is the set of frequent itemsets generated by the private algorithm, and $U_C$ is the set of true frequent itemsets.
In addition, to measure the error of the published supports of the frequent itemsets relative to their true supports, the relative error (RE) is used, computed as follows:
$$RE = \operatorname{median}_{X} \frac{\lvert sup'(x) - sup(x) \rvert}{sup(x)}$$
where X denotes all generated frequent itemsets, sup(x) denotes the true support of itemset x, and sup'(x) denotes the noisy support of itemset x.
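Both metrics are straightforward to compute. The sketch below assumes itemsets are represented as frozensets, that the true support is known and nonzero for every released itemset, and that the F-score is 0 when no released itemset is correct; the names and toy values are hypothetical.

```python
from statistics import median

def f_score(private_sets, true_sets):
    """Harmonic mean of precision and recall over sets of itemsets."""
    private_sets, true_sets = set(private_sets), set(true_sets)
    hits = len(private_sets & true_sets)
    if hits == 0:
        return 0.0
    precision = hits / len(private_sets)
    recall = hits / len(true_sets)
    return 2 * precision * recall / (precision + recall)

def relative_error(noisy_support, true_support):
    """Median over released itemsets of |sup'(x) - sup(x)| / sup(x)."""
    return median(abs(noisy_support[x] - true_support[x]) / true_support[x]
                  for x in noisy_support)

fs = frozenset
released = {fs("a"), fs("b"), fs("ab")}
truth = {fs("a"), fs("b"), fs("c"), fs("ac")}
score = f_score(released, truth)         # precision 2/3, recall 1/2

noisy = {fs("a"): 12, fs("b"): 18, fs("ab"): 33}
exact = {fs("a"): 10, fs("b"): 20, fs("ab"): 30}
re_value = relative_error(noisy, exact)  # errors 0.2, 0.1, 0.1
```

A higher F-score means the released set of frequent itemsets is closer to the true one, and a lower RE means the released supports are closer to the true supports.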
Concrete experiment be provided that first, use four groups of real data set: Accidents to comprise vehicle accident data;Pumsb-star is from the census data of PUMS (PublicUseMicrodataSample);POS is from the data of the retail point of an electronic retailer;Retail comprises the basket marketing data of one anonymous retail shop of Belgium.Wherein, the first two belongs to density data collection, and latter two belongs to sparse data set.Secondly, all algorithms are realized by JAVA language.Finally, the experimental situation of test is IntelCore2DuoE8400CPU (3.0GHz) and 4GBRAM.
The performance of PFP algorithm is described below by analysis experimental data.
Result availability:
On four group data sets, by the different threshold value selected, algorithm PFP, the F-score index of TT and PP and RE index are measured respectively.Owing to PB algorithm is used to excavate top-k frequent item set, it is impossible to directly it compared with PP, it is contemplated that a kind of scene, namely when given threshold value, user chooses k.Experimental result is such as shown in Fig. 5~9.
From Fig. 5,6,7 and 8 it can be seen that concentrate four data, compared with TT, PFP can reach better effect.Interpretation is as follows.Compared with directly blocking affairs, affairs are divided into each affairs and retain multiple subtransactions and the weight between uniform distribution subtransaction, so can significantly reduce the loss of information.Although PFP degree of accuracy slightly reduces, but the quantity of frequent item set is significantly improved.This is because affairs are divided into each affairs retains multiple subtransactions and the weight between uniform distribution subtransaction, so can significantly reduce the loss of information, and then the quantity generating frequent item set can be increased.Under the scene set, data set Pumsb, POS and Retail, PFP are remained to obtain better F-score value.
Running time:
On the four datasets, the running time of the algorithms PFP, FP, TT and PB for querying the top-k frequent item sets is measured, where k ranges over [10, 200]. For PFP, preprocessing is executed exactly once and is independent of the threshold selected by the user, so the running time does not include the preprocessing time.
Fig. 9 shows that PFP performs comparably to FP-growth and achieves better time efficiency than TT and PB. The results are explained as follows: compared with FP-growth, the support estimation method and the dynamic descent method add little overhead to PFP in the mining phase; compared with TT, the FP-growth algorithm used in PFP outperforms the Apriori algorithm, so PFP is more efficient than TT.
Formal analysis and extensive experiments show that the differentially private frequent item set mining algorithm based on transaction splitting achieves better results in privacy, mining utility and running efficiency.
The above embodiments are merely intended to illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; therefore all equivalent technical solutions also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (5)

1. A frequent item set mining method, characterized by comprising:
S1: splitting each transaction in the original database whose length is greater than the length limit into multiple subtransactions, so that the length of every transaction in the split database is not greater than the length limit;
S2: according to a pre-specified support threshold, mining frequent item sets in the split database using a support estimation method and a dynamic descent method.
2. The frequent item set mining method according to claim 1, characterized in that step S1 specifically comprises:
S1.1: constructing an undirected weighted graph from the original database, wherein each vertex represents an item in the database; when the item set formed by the items of two vertices appears in a same transaction, the two vertices are connected by an edge, and the weight of the edge is the support, in the transaction database, of the item set formed by the items of the two vertices;
S1.2: finding communities in the undirected weighted graph using the Louvain algorithm, and constructing a tree structure, the CR-tree, from the intermediate results output during the iterations of the Louvain algorithm, wherein the nodes of each layer of the CR-tree represent the communities found in one iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities represented by its child nodes, and the correlation between two items is represented by the length of the shortest path between the leaf nodes containing those items;
S1.3: using the generated CR-tree, splitting the transactions in the original database whose length is greater than the length limit to generate the split database, wherein the transaction, the transaction length, the maximum transaction length limit and the constructed CR-tree are denoted t, p, m and T respectively, and the splitting procedure is as follows:
S1.3.1: calculating, from p and m, the number q of subtransactions after splitting, namely $q = \lceil p/m \rceil$;
S1.3.2: setting the result set R of splitting transaction t to empty;
S1.3.3: constructing the i-th transaction t_i, including the following steps a) to e):
a) selecting, from the leaf nodes of the CR-tree, the nodes that contain elements of transaction t, removing from these nodes the items that are not in t, and forming a set from the resulting new nodes;
b) selecting, from the leaf layer N_l of the CR-tree, the node n_l that contains the most items, and adding the items of n_l to t_i;
c) sorting the remaining nodes in N_l from largest to smallest according to their distance to n_l in T, nodes with the same correlation being sorted from largest to smallest by capacity;
d) traversing the remaining nodes in N_l in order; if the sum of the capacity of a node and the capacity of n_l is not greater than m, removing the node from N_l and putting it into t_i;
e) storing t_i in the result set R;
S1.3.4: repeating the process of S1.3.1 q times;
S1.3.5: if nodes still remain in N_l, randomly putting the items of each remaining node into a subtransaction in R whose length is less than m;
S1.3.6: returning the result set R.
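The graph construction of step S1.1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `build_item_graph` and the toy database are hypothetical, and the edge weight is taken to be the absolute number of transactions that contain both items.

```python
from collections import Counter
from itertools import combinations


def build_item_graph(transactions):
    """Return {(a, b): weight} for an undirected weighted item graph.

    Vertices are items. An edge joins two items that co-occur in at least
    one transaction, and its weight is the number of transactions that
    contain both items (the support of the 2-item set).
    """
    edge_weights = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each unordered pair has one canonical key.
        items = sorted(set(transaction))
        for a, b in combinations(items, 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


# Hypothetical toy database, for illustration only.
toy_db = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["d"]]
graph = build_item_graph(toy_db)
```

The resulting dictionary could then be handed to any Louvain implementation for the community detection of step S1.2; that step is not shown here.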
3. The frequent item set mining method according to claim 2, characterized in that, after splitting, each subtransaction t_i is assigned a weight of 1/q.
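The effect of the 1/q weighting on support counting can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the number of subtransactions is assumed to be q = ceil(p/m), plain sequential chunking stands in for the CR-tree-guided grouping of step S1.3, and the names `split_with_weights` and `weighted_support` are hypothetical.

```python
import math
from fractions import Fraction


def split_with_weights(transaction, m):
    """Split a transaction into chunks of at most m items, each weighted 1/q.

    NOTE: sequential chunking is only a stand-in for the CR-tree-guided
    grouping; the 1/q weight per subtransaction is what this sketch shows.
    """
    p = len(transaction)
    if p <= m:
        # Short transactions are kept whole with weight 1.
        return [(frozenset(transaction), Fraction(1))]
    q = math.ceil(p / m)  # assumed number of subtransactions
    chunks = [transaction[k:k + m] for k in range(0, p, m)]
    return [(frozenset(chunk), Fraction(1, q)) for chunk in chunks]


def weighted_support(itemset, weighted_db):
    """Sum the weights of all (sub)transactions that contain the item set."""
    itemset = frozenset(itemset)
    return sum(weight for items, weight in weighted_db if itemset <= items)


# Hypothetical toy database with a length limit of m = 2.
db = [["a", "b", "c", "d", "e"], ["a", "b"]]
weighted_db = [pair for t in db for pair in split_with_weights(t, m=2)]
support_ab = weighted_support({"a", "b"}, weighted_db)
```

Because the q weights of one original transaction sum to 1, a single transaction can change the weighted support of an item set by at most 1, however many subtransactions it was split into.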
4. The frequent item set mining method according to claim 3, characterized in that, in step S2, the support estimation method specifically comprises:
S2.1: letting the length of an item set X in the result set R be i, its support in the original database be $\omega$, its support in the split database be $\omega'$, and its support in the split database after noise is added be $\tilde{\omega}$;
S2.2: estimating $\omega'$ from $\tilde{\omega}$; by Bayes' rule:
$$\Pr(\omega' \mid \tilde{\omega}) = \frac{\Pr(\tilde{\omega} \mid \omega') \cdot \Pr(\omega')}{\Pr(\tilde{\omega})};$$
if $\omega'$ follows a uniform prior distribution, its conditional probability distribution satisfies:
$$\Pr(\omega' \mid \tilde{\omega}) \propto e^{-\varepsilon\,|\omega' - \tilde{\omega}|};$$
S2.3: estimating the support $\omega$ of X in the original database: the maximum transaction length limit is m, and a transaction t of length p contains X, with $p > m$; t is split into $q = \lceil p/m \rceil$ subsets, the length of each of which does not exceed m; it is further assumed that $q' = \lfloor p/m \rfloor$ subsets have length m and that one other subset has length $a = p - q'm$, which is less than m; then, after transaction t is split, the probability that X is contained in one of the subtransactions is:
$$\beta_p = \begin{cases} \dfrac{q' \binom{p-i}{m-i}}{\binom{p}{m}} & \text{if } a < i \\[2ex] \dfrac{q' \binom{p-i}{m-i}}{\binom{p}{m}} + \dfrac{\binom{p-i}{a-i}}{\binom{p}{a}} & \text{if } a \ge i \end{cases};$$
Let $\alpha_k$ denote the number of transactions that contain X and have length k, and let n be the maximum transaction length; the expectation of $\omega'$ is calculated as:
Let:
The average support of X in the original database is estimated as:
$$\mathrm{avg}(\omega') = \frac{\omega'}{\mathrm{ratio}(i)};$$
Using the lower $\rho$-quantile, the maximum support of X in the original database is estimated as:
$$\max(\omega') = \begin{cases} \dfrac{\omega' - \ln\rho + \sqrt{\ln^2\rho - 2\omega'\ln\rho}}{\mathrm{ratio}(i)} & \text{if } \ln\rho \le 2\omega' \\[2ex] \mathrm{avg}(\omega') & \text{if } \ln\rho > 2\omega' \end{cases};$$
From the noisy support $\tilde{\omega}$ of X in the split database, its maximum support and average support in the original database are estimated respectively as:
$$\mathrm{max\_supp}(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, \max(\omega') \, \mathrm{d}\omega'$$
$$\mathrm{avg\_supp}(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, \mathrm{avg}(\omega') \, \mathrm{d}\omega';$$
If the average support of item set X is greater than the support threshold, X is a frequent item set; for frequent item sets, if the maximum support of item set X is greater than the given support threshold, X is used to generate the candidate set of frequent item sets.
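The estimation steps of claim 4 can be sketched numerically in Python. This is a sketch under several assumptions, not the patented implementation: q' is taken to be floor(p/m); ratio(i) is passed in as a plain number because its defining formula is not reproduced in this text; the integral over the window of half-width 5 is approximated on a discrete grid; and the posterior weights are normalised over that window. All function names are hypothetical.

```python
import math


def beta(p, m, i):
    """Probability that an i-item set survives in one subtransaction (p > m)."""
    if i > m:
        return 0.0  # an item set longer than the limit cannot fit
    q_full = p // m        # assumed q': subsets of length exactly m
    a = p - q_full * m     # length of the remaining subset
    value = q_full * math.comb(p - i, m - i) / math.comb(p, m)
    if a >= i:
        value += math.comb(p - i, a - i) / math.comb(p, a)
    return value


def avg_original(w_split, ratio):
    """avg(w') = w' / ratio(i)."""
    return w_split / ratio


def max_original(w_split, ratio, rho):
    """max(w'), falling back to avg(w') when the square root is undefined."""
    ln_rho = math.log(rho)
    if ln_rho <= 2 * w_split:
        root = math.sqrt(ln_rho ** 2 - 2 * w_split * ln_rho)
        return (w_split - ln_rho + root) / ratio
    return avg_original(w_split, ratio)


def estimate_supports(noisy, eps, ratio, rho, half_width=5.0, steps=1001):
    """Posterior-weighted (avg_supp, max_supp) on a grid around the noisy value."""
    grid = [noisy - half_width + 2 * half_width * k / (steps - 1)
            for k in range(steps)]
    # Unnormalised posterior exp(-eps * |w' - noisy|), normalised below.
    weights = [math.exp(-eps * abs(w - noisy)) for w in grid]
    total = sum(weights)
    avg = sum(wt * avg_original(w, ratio) for w, wt in zip(grid, weights)) / total
    mx = sum(wt * max_original(w, ratio, rho) for w, wt in zip(grid, weights)) / total
    return avg, mx


avg_supp, max_supp = estimate_supports(noisy=20.0, eps=1.0, ratio=0.5, rho=0.5)
```

An item set would then be reported as frequent when `avg_supp` exceeds the support threshold, and kept for candidate generation when `max_supp` does.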
5. The frequent item set mining method according to claim 4, characterized in that, in step S2, the dynamic descent method specifically comprises:
Step 2.4: assigning an initial value to the upper bound on the number of queries for item sets of length i;
Assuming that the candidate set of frequent item sets contains s frequent items, an array is defined to store the computed query counts for item sets of different lengths, wherein the i-th element is the computed query count for item sets of length i and is given an initial value;
Step 2.5: during mining, dynamically reducing the computed query count for item sets of length i, specifically:
The conditional FP-tree of item set β is currently being mined, and the order of the items in the header table of the conditional FP-tree of β is $\{i_1, \ldots, i_k, \ldots, i_n\}$. The k-th element $i_k$ in the header table forms, together with β, a new item set $Y = \beta \cup i_k$. Denote $S_1 = \{i_1, \ldots, i_{k-1}\}$, and let $S_2$ be the set of newly found infrequent items in the conditional pattern base of Y. For any element j in $S_2$, the item set $X = Y \cup j$ is infrequent; by the downward closure property of frequent patterns, any item set formed from X and an arbitrary subset of $\{S_1 - j\}$ is necessarily infrequent, so the corresponding reduction amount is:
$$r_p = \sum_{u=1}^{q} \binom{|S_1| - u}{p - |Y| - 1}$$
where $q = \min\{|S_2|,\ |S_1| - (p - |Y| - 1)\}$, p is the length of the transaction, and q is the number of subtransactions after splitting;
Step 2.6: taking the updated query count as the sensitivity and, using the Laplace mechanism with the ratio of this sensitivity to the privacy budget as the scale of the Laplace distribution, adding noise to the support.
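Steps 2.5 and 2.6 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the privacy budget is passed as `epsilon`, the starting bound of 10 is an arbitrary example value, and the Laplace sample is drawn as the difference of two exponential variates, which is one standard way to obtain it.

```python
import math
import random


def reduction(p, y_len, s1_size, s2_size):
    """r_p = sum_{u=1..q} C(|S1| - u, p - |Y| - 1),
    with q = min(|S2|, |S1| - (p - |Y| - 1))."""
    k = p - y_len - 1
    if k < 0:
        return 0  # no length-p item set can extend Y by a further item
    q = min(s2_size, s1_size - k)
    # range() is empty when q <= 0, so the sum is then 0.
    return sum(math.comb(s1_size - u, k) for u in range(1, q + 1))


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_support(support, sensitivity, epsilon, rng):
    """Add Laplace noise with scale = sensitivity / epsilon to a support."""
    return support + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: a bound of 10 is lowered by r_p, then used as sensitivity.
rng = random.Random(42)
updated_bound = 10 - reduction(p=3, y_len=1, s1_size=4, s2_size=2)
released = noisy_support(12.0, sensitivity=updated_bound, epsilon=1.0, rng=rng)
```

Lowering the bound before noise is added reduces the Laplace scale, so the released supports are less perturbed for the same privacy budget.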

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410746488.2A CN105740245A (en) 2014-12-08 2014-12-08 Frequent item set mining method

Publications (1)

Publication Number Publication Date
CN105740245A true CN105740245A (en) 2016-07-06

Family

ID=56237954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410746488.2A Pending CN105740245A (en) 2014-12-08 2014-12-08 Frequent item set mining method

Country Status (1)

Country Link
CN (1) CN105740245A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160706
