CN105740245A - Frequent item set mining method - Google Patents

Publication number
CN105740245A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410746488.2A
Other languages
Chinese (zh)
Inventor
程祥
苏森
许胜之
徐鹏
双锴
王玉龙
张忠宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201410746488.2A
Publication of CN105740245A
Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data mining and data privacy, and discloses a frequent item set mining method. The method comprises the following steps. S1: splitting each transaction in the original database whose length is greater than a limit length into a plurality of sub-transactions, so that the length of every transaction in the segmented database is less than or equal to the limit length. S2: according to a support threshold specified in advance, mining the frequent item sets in the segmented database using a support estimation method and a dynamic descent method. The method provides higher mining efficiency and better usability of the mining results while satisfying differential privacy.

Description

Frequent item set mining method
Technical Field
The invention relates to the technical field of data mining and data privacy, in particular to a frequent item set mining method.
Background
Frequent item set mining is a fundamental problem in data mining and is widely applied in many fields. It can be described as follows. A transaction database is given in which each transaction corresponds to the personal record of one user; a transaction is a set of items. The support of an item set is the number of transactions that contain it. An item set whose support is not less than a given threshold is called a frequent item set. Given a transaction database and a threshold, frequent item set mining finds all frequent item sets in the database.
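The definitions above can be illustrated with a minimal brute-force sketch. The function names and the toy database are hypothetical and chosen only for illustration; the enumeration is exponential and is not the mining algorithm of the invention.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= set(t))


def frequent_itemsets(transactions, threshold):
    """Enumerate all item sets whose support is not less than `threshold`.

    Brute force over every subset of the item universe, so it is only
    practical for tiny examples.
    """
    universe = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            s = support(candidate, transactions)
            if s >= threshold:
                result[frozenset(candidate)] = s
    return result


# Hypothetical toy database: four transactions over the items a, b, c.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
frequent = frequent_itemsets(toy_db, threshold=2)
```

With a threshold of 2, the item set {a, b} is frequent (it appears in two transactions), while {b, c} is not (it appears in only one).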
The FP-growth algorithm [2] is a widely used frequent item set mining algorithm. It is a depth-first traversal algorithm. During mining, FP-growth uses the FP-tree, a special prefix tree, to accelerate the whole mining process. With the FP-tree, FP-growth scans the database only twice, which greatly improves mining efficiency.
If the transactions (personal records) in the transaction database are sensitive, publishing the frequent item sets directly can leak the users' personal records. How to protect the privacy of user data during frequent item set mining is therefore receiving more and more attention. The differential privacy paradigm [3] offers a feasible way to address privacy in data analysis. In contrast to k-anonymity [4] and l-diversity [5], differential privacy provides privacy protection with theoretical guarantees by adding noise.
Some research already applies differential privacy to frequent pattern mining. The algorithms in [6][7] publish a transaction database under differential privacy to protect user privacy; the published database can then be used for frequent pattern mining. Specifically, [6] proposes a database publishing algorithm guided by a context-free tree structure. The algorithm partitions the database top-down and finally publishes a synthetic database for frequent item set mining. For the scenario of incremental data updates, [7] proposes a transaction database publishing algorithm that satisfies differential privacy. [8] proposes the PrivBasis algorithm, which satisfies differential privacy, for mining the top-k frequent item sets. [9] found that limiting the length of transactions can improve the trade-off between data usability and privacy protection, and used a truncation method to design an Apriori-based frequent item set mining algorithm that satisfies differential privacy. However, these algorithms fall short in the usability of the mining results and in mining efficiency, which hinders the application of differential privacy in frequent pattern mining.
The references are as follows:
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in VLDB, 1994.
[2] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in SIGMOD, 2000.
[3] C. Dwork, "Differential privacy," in ICALP, 2006.
[4] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Base Syst., 2002.
[5] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," in ICDE, 2006.
[6] R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong, "Publishing set-valued data via differential privacy," in VLDB, 2011.
[7] X. Zhang, X. Meng, and R. Chen, "Differentially private set-valued data release against incremental updates," in DASFAA, 2013.
[8] N. Li, W. Qardaji, D. Su, and J. Cao, "PrivBasis: frequent itemset mining with differential privacy," in VLDB, 2012.
[9] C. Zeng, J. F. Naughton, and J.-Y. Cai, "On differentially private frequent itemset mining," in VLDB, 2012.
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is as follows: how to improve the usability of mining results.
(II) Technical scheme
In order to solve the technical problem, the invention provides a frequent item set mining method, which comprises the following steps:
S1: splitting each transaction in the original database whose length is greater than the limit length into a plurality of sub-transactions, so that the length of every transaction in the segmented database is not greater than the limit length;
S2: mining the frequent item sets in the segmented database using a support estimation method and a dynamic descent method, according to a support threshold specified in advance.
Step S1 specifically comprises:
S1.1: constructing an undirected weighted graph from the original database, wherein each vertex represents an item in the database; two vertices are connected by an edge when the two corresponding items appear in the same transaction, and the weight of the edge is the support, in the transaction database, of the item set formed by the two items;
S1.2: discovering communities in the undirected weighted graph using the Louvain algorithm, and constructing a tree structure, the CR-tree, from the intermediate results of the iterations of the Louvain algorithm, wherein the nodes of each layer of the CR-tree represent the communities discovered in the same iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities represented by its child nodes, and the correlation between items is represented by the length of the shortest path between the leaf nodes that contain them;
S1.3: using the generated CR-tree to split the transactions in the original database whose length is greater than the limit length, thereby generating a segmented database; the transaction, its length, the maximum transaction length and the constructed CR-tree are denoted t, p, m and T respectively, and the splitting process is as follows:
S1.3.1: calculating the number q of sub-transactions from p and m, namely $q = \lceil p/m \rceil$;
S1.3.2: setting the result set R for transaction t to empty;
S1.3.3: constructing the i-th sub-transaction $t_i$ by steps a) to e):
a) selecting, from the leaf nodes of the CR-tree, the nodes that contain items of transaction t, removing from those nodes the items not contained in t, and forming the set $N_l$ from the resulting nodes;
b) selecting from $N_l$ the node $n_l$ that contains the most items, and adding the items of $n_l$ to $t_i$;
c) sorting the other nodes of $N_l$ by their correlation with $n_l$ in T from large to small (correlation being measured by distance in T), and sorting nodes with the same correlation by capacity from large to small;
d) traversing the remaining nodes of $N_l$ in that order; if the capacity of a node plus the capacity of $n_l$ does not exceed m, removing the node from $N_l$ and adding its items to $t_i$;
e) storing $t_i$ in the result set R;
S1.3.4: repeating step S1.3.3 q times;
S1.3.5: if nodes remain in $N_l$, randomly placing the items of each remaining node into sub-transactions in R whose length is less than m;
S1.3.6: returning the result set R.
Each sub-transaction $t_i$ obtained by splitting is assigned the weight 1/q.
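Step S1.1 can be sketched as follows. This is a minimal illustration that stores the graph as a mapping from item pairs to edge weights; the function name, the representation and the toy database are assumptions made for the example, and community detection (step S1.2) is not shown.

```python
from collections import Counter
from itertools import combinations


def build_item_graph(transactions):
    """Return the undirected weighted item graph as {(u, v): weight}.

    Each key is an item pair with u < v; the weight is the number of
    transactions containing both items, i.e. the support of {u, v}.
    """
    edges = Counter()
    for t in transactions:
        for u, v in combinations(sorted(set(t)), 2):
            edges[(u, v)] += 1
    return edges


# Hypothetical toy database.
toy_db = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
graph = build_item_graph(toy_db)
```

Items that never co-occur, such as a and d here, have no edge between them.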
The support estimation method in step S2 specifically comprises:
S2.1: letting the length of an item set X in the result set R be i, its support in the original database be $\omega$, its support in the segmented database be $\omega'$, and its noisy support in the segmented database be $\tilde{\omega}$;
S2.2: estimating $\omega'$ from $\tilde{\omega}$; by Bayes' rule:
$$\Pr(\omega' \mid \tilde{\omega}) = \frac{\Pr(\tilde{\omega} \mid \omega') \cdot \Pr(\omega')}{\Pr(\tilde{\omega})};$$
assuming $\omega'$ follows a uniform prior distribution, its conditional probability distribution satisfies:
$$\Pr(\omega' \mid \tilde{\omega}) \propto e^{-\epsilon |\omega' - \tilde{\omega}|};$$
S2.3: estimating the support $\omega$ of X in the original database: let the maximum transaction length be m, and let a transaction t of length p contain X, with p > m; t is split into $q = \lceil p/m \rceil$ subsets, each of length at most m, of which $q' = \lfloor p/m \rfloor$ subsets have length m and the remaining subset has length $a = p - q' \cdot m$, which is less than m; after t is split, the probability that X is contained in one of the sub-transactions is:
$$\beta_p = \begin{cases} q' \dfrac{C_{p-i}^{m-i}}{C_p^m} & \text{if } a < i \\ q' \dfrac{C_{p-i}^{m-i}}{C_p^m} + \dfrac{C_{p-i}^{a-i}}{C_p^a} & \text{if } a \ge i \end{cases};$$
letting $\alpha_k$ denote the number of transactions that contain X and have length k, and n denote the maximum transaction length, the expectation of $\omega'$ is calculated as:
[equation not preserved in the source text]
letting:
[equation defining ratio(i) not preserved in the source text]
the average support of X in the original database is estimated as:
$$avg(\omega') = \frac{\omega'}{ratio(i)};$$
using the lower $\rho$-quantile, the maximum support of X in the original database is estimated as:
$$\max(\omega') = \begin{cases} \dfrac{\omega' - \ln\rho + \sqrt{\ln^2\rho - 2\omega'\ln\rho}}{ratio(i)} & \text{if } \ln\rho \le 2\omega' \\ avg(\omega') & \text{if } \ln\rho > 2\omega' \end{cases};$$
from the noisy support $\tilde{\omega}$ of X in the converted database, the maximum support and the average support of X in the original database can be estimated as follows:
$$max\_supp(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \max(\omega') \, d\omega'$$
$$avg\_supp(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, avg(\omega') \, d\omega';$$
if the average support of item set X is greater than the support threshold, X is a frequent item set; in addition, if the maximum support of item set X is greater than the given support threshold, X is used to generate candidate frequent item sets.
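The probability $\beta_p$ in step S2.3 can be evaluated directly with binomial coefficients. The sketch below assumes $q' = \lfloor p/m \rfloor$, which is what $a = p - q' \cdot m < m$ implies; the function name and the guard for item sets longer than m are additions made for illustration.

```python
from math import comb


def containment_probability(p, m, i):
    """beta_p: chance that an item set of length i survives the split.

    p: length of the original transaction (p > m)
    m: maximum transaction length
    i: length of the item set X contained in the transaction
    """
    if i > m:
        # Added guard: no sub-transaction is long enough to contain X.
        return 0.0
    q_full = p // m            # q': sub-transactions of length exactly m
    a = p - q_full * m         # length of the remaining, shorter sub-transaction
    beta = q_full * comb(p - i, m - i) / comb(p, m)
    if a >= i:
        beta += comb(p - i, a - i) / comb(p, a)
    return beta
```

For example, with p = 5 and m = 3 a single item (i = 1) is always kept, so the probability is 1, whereas a pair (i = 2) is kept together with probability 0.3 + 0.1 = 0.4.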
The dynamic descent method in step S2 specifically comprises:
S2.4: assigning an initial value to the upper limit of the number of support queries for item sets of length i;
assuming the frequent item set candidate set contains s frequent items, an array is defined to store the upper limits on the number of computed item sets of each length, wherein the i-th entry of the array is the upper limit on the number of computed item sets of length i, with an initial value of $C_s^i$;
S2.5: dynamically reducing, during mining, the upper limit of the number of item sets of length i, specifically:
suppose the conditional FP-tree of item set $\beta$ is currently being mined and the items in the header table of that tree are ordered as $\{i_1, \ldots, i_k, \ldots, i_n\}$; the k-th element $i_k$ of the header table together with item set $\beta$ forms a new item set $Y = \beta \cup \{i_k\}$; let $S_1 = \{i_1, \ldots, i_{k-1}\}$ and let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y; for any element j of $S_2$, the item set $X = Y \cup \{j\}$ is infrequent, so by the downward closure property of frequent patterns, any item set formed from X and any subset of $S_1 \setminus \{j\}$ must also be infrequent; the resulting reduction of the upper limit for item sets of length p is:
$$r_p = \sum_{u=1}^{q} C_{|S_1| - u}^{\,p - |Y| - 1}$$
wherein $q = \min\{|S_2|,\ |S_1| - (p - |Y| - 1)\}$ and p is the length of the item sets whose upper limit is being reduced;
S2.6: taking the updated upper limit of the number of item sets as the sensitivity, and adding noise to the support using the Laplace mechanism, with the ratio of the sensitivity to the privacy budget $\epsilon$ (the safety coefficient) as the scale of the Laplace distribution.
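The noise addition in the last step can be sketched as below. The sketch assumes that the "safety coefficient" is the privacy budget $\epsilon$, so the Laplace scale is sensitivity / $\epsilon$; the function names are hypothetical, and the Laplace sample is drawn as the difference of two exponential variates using only the Python standard library.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a Laplace distribution with mean 0.

    The difference of two independent exponential variates with rate
    1 / scale follows Laplace(0, scale).
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_support(true_support, sensitivity, epsilon, rng=random):
    """Add Laplace noise with scale sensitivity / epsilon to a support."""
    return true_support + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical values: support 100, sensitivity 2, privacy budget 1.0.
rng = random.Random(0)
samples = [noisy_support(100, 2, 1.0, rng) for _ in range(20000)]
```

A smaller sensitivity gives a smaller scale and therefore less noise, which is why lowering the upper limit on the number of item sets improves the usability of the result.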
(III) Advantageous effects
The invention provides higher mining efficiency and better usability of the mining results while satisfying differential privacy.
Drawings
FIG. 1 is a flow chart of a frequent itemset mining method of the present invention;
FIG. 2 is a schematic diagram of a given transaction database in the present invention;
FIG. 3 is the undirected weighted graph constructed from FIG. 2;
FIG. 4 is the CR-tree constructed from FIG. 3;
FIG. 5 is the F-score and RE measurements of PFP, TT and PB in Pumsb;
FIG. 6 is the F-score and RE measurements of PFP, TT and PB in Accidents;
FIG. 7 is the F-score and RE measurements of PFP, TT and PB in POS;
FIG. 8 is the F-score and RE measurements of PFP, TT and PB in Retail;
FIG. 9 is the runtime of PFP, FP, TT, and PB under different data sets.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The flow of the frequent item set mining method of this embodiment is shown in FIG. 1. The method comprises the following steps:
Step S1: preprocess the original database. Using a smart splitting method, each transaction in the original database whose length is greater than the specified limit length is split into a plurality of sub-transactions, so that the length of every transaction in the converted database is not greater than the specified limit length.
Step S2: mine the frequent item sets while satisfying differential privacy. The frequent item sets are mined in the converted database according to a user-specified threshold. During mining, the support estimation method reduces the information loss caused by transaction splitting, and the dynamic descent method reduces the amount of noise added, which improves the usability of the results.
The smart splitting method used in preprocessing, the support estimation method and the dynamic descent method are described below.
For step S1, in order to improve the usability of the mining results while maintaining a high level of privacy protection, a new transformation, transaction splitting, is applied to the original database. When the length of a transaction exceeds the limit length, the transaction is split into a plurality of sub-transactions, each of which satisfies the maximum length constraint. However, naively splitting the transactions of the original database can cause a large loss of information, so that some frequent item sets are no longer frequent, which harms the usability of the mining results. In addition, once the database has been split, the privacy protection level of a mining algorithm that satisfies differential privacy is lowered. To solve these problems, the invention proposes a smart splitting method. The specific steps are as follows:
Step 1: construct an undirected weighted graph from the original database. Each vertex represents an item in the database. Two vertices are connected by an edge when the two corresponding items appear in the same transaction. The weight of the edge is the support, in the transaction database, of the item set formed by the two items. For the transaction database given in FIG. 2, the undirected weighted graph constructed in this way is shown in FIG. 3.
Step 2: find communities in the constructed undirected weighted graph using the Louvain algorithm. The Louvain algorithm is iterative, and the intermediate results of its iterations are used to construct a tree structure, the CR-tree. The nodes of each layer of the CR-tree represent the communities found in the same iteration, the height of the tree represents the number of iterations, and a parent node represents the new community formed by merging the communities represented by its child nodes. The correlation between items is represented by the length of the shortest path between the leaf nodes that contain them. For example, for the undirected weighted graph of FIG. 3, the CR-tree constructed using the Louvain algorithm is shown in FIG. 4.
Step 3: use the generated CR-tree to split the transactions in the original database whose length is greater than the limit length, thereby generating a segmented database. The transaction, its length, the maximum transaction length and the constructed CR-tree are denoted t, p, m and T respectively. The splitting process is as follows:
1) calculate the number q of sub-transactions from p and m, namely $q = \lceil p/m \rceil$;
2) set the result set R for transaction t to empty;
3) construct the i-th sub-transaction $t_i$:
a) from the leaf nodes of the CR-tree, select the nodes that contain elements (i.e. items) of transaction t, and remove from those nodes the items not contained in t. The resulting nodes form the set $N_l$;
b) from $N_l$, select the node $n_l$ that contains the most items, and add the items of $n_l$ to $t_i$;
c) sort the other nodes of $N_l$ by their correlation with $n_l$ in T from large to small (correlation being measured by distance in T). Nodes with the same correlation are sorted by capacity from large to small;
d) traverse the remaining nodes of $N_l$ in that order; if the capacity of a node plus the capacity of $n_l$ does not exceed m, remove the node from $N_l$ and add its items to $t_i$;
e) store $t_i$ in the result set R;
4) repeat step 3) q times;
5) if nodes remain in $N_l$, randomly place the items of each remaining node into sub-transactions in R whose length is less than m;
6) finally, return the result set R.
Thus, through steps 1 to 3, the original database is converted into a new database in which the length of every transaction satisfies the maximum length limit.
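The splitting procedure of step 3 can be sketched as follows. This is a simplified illustration under several assumptions that are not in the original: the CR-tree is reduced to a list of leaf item sets plus a hypothetical `distance(i, j)` function between leaves, "most correlated first" is read as "smallest tree distance first", the length check in step d) uses the running length of the sub-transaction, and every leaf restricted to the transaction holds at most m items.

```python
import math
import random


def split_transaction(t, m, leaves, distance, rng=random):
    """Split transaction `t` into weighted sub-transactions of at most m items.

    leaves:   list of item sets, one per CR-tree leaf (assumed format)
    distance: distance(i, j) -> path length between leaves i and j
    Returns a list of (sub_transaction, weight) pairs with weight 1/q.
    """
    t = set(t)
    q = math.ceil(len(t) / m)
    # Step a: keep the leaves that share items with t, restricted to t.
    pending = {idx: set(leaf) & t for idx, leaf in enumerate(leaves)}
    pending = {idx: items for idx, items in pending.items() if items}
    result = []
    for _ in range(q):
        if not pending:
            break
        # Step b: start from the pending node that holds the most items.
        seed = max(pending, key=lambda idx: len(pending[idx]))
        sub = pending.pop(seed)
        # Step c: closest nodes first, ties broken by larger capacity.
        order = sorted(
            pending, key=lambda idx: (distance(seed, idx), -len(pending[idx]))
        )
        # Step d: add nodes greedily while the length limit still holds.
        for idx in order:
            if len(sub) + len(pending[idx]) <= m:
                sub |= pending.pop(idx)
        result.append(sub)  # Step e
    # Step 5: scatter leftover items into sub-transactions with room left.
    for items in pending.values():
        for item in items:
            roomy = [s for s in result if len(s) < m]
            rng.choice(roomy).add(item)
    return [(frozenset(s), 1.0 / q) for s in result]


# Hypothetical leaves and distance function for a five-item transaction.
leaves = [{"a", "b"}, {"c"}, {"d", "e"}, {"f"}]
parts = split_transaction(
    {"a", "b", "c", "d", "e"}, 3, leaves, lambda i, j: abs(i - j)
)
```

In this example q = 2, so the transaction is split into {a, b, c} and {d, e}, each with weight 1/2.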
Analysis shows that if a transaction is split into k sub-transactions, a frequent item set mining algorithm that satisfies $\epsilon$-differential privacy on the converted transaction database can guarantee only $(k \cdot \epsilon)$-differential privacy for the original database. To address this, the invention proposes a weighted splitting operation, defined as follows. When the maximum transaction length is m and the length of a transaction t exceeds m, the splitting operation f (i.e. steps 1 to 3 above) splits t into $t_1, \ldots, t_k$ such that $|t_i| \le m$, and assigns each $t_i$ a weight $w_i$. If the weights satisfy the two conditions required of them [condition formulas not preserved in the source text], f is called a weighted splitting operation. It can be proven that when the original transaction database is converted by a weighted splitting operation, a frequent item set mining algorithm that satisfies differential privacy on the converted database also guarantees differential privacy for the original database.
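The effect of the weights can be seen in a small sketch of support counting over a weighted database. It assumes the uniform weights 1/q described for step S1; the function name and the toy data are hypothetical.

```python
def weighted_support(itemset, weighted_db):
    """Sum the weights of the (sub-)transactions that contain `itemset`.

    weighted_db: list of (frozenset_of_items, weight) pairs; unsplit
    transactions carry weight 1, sub-transactions carry weight 1/q.
    """
    items = frozenset(itemset)
    return sum(w for t, w in weighted_db if items <= t)


# Hypothetical converted database: the original transaction {a, b, c, d}
# was split into two sub-transactions (q = 2); {a, b} was short enough
# to be kept whole.
converted_db = [
    (frozenset({"a", "b"}), 0.5),
    (frozenset({"c", "d"}), 0.5),
    (frozenset({"a", "b"}), 1.0),
]
```

Here {a, b} has weighted support 1.5 although its true support is 2, and {a, c} has weighted support 0 although its true support is 1. Because the weights of one original transaction sum to at most 1 in this example, that transaction changes any support by at most 1.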
Further, when a transaction t is split in step S1 and a sub-transaction $t_1$ is added to the database, the support of the item sets contained in $t_1$ increases by 1/q rather than by 1, so the support computed in the converted database is smaller than the true value; in addition, some of the subsets of t are lost. Both effects cause information loss. To compensate, step S2 uses a support estimation method, which consists of two main steps. First, the exact support of an item set in the segmented database is estimated from its noisy support obtained during mining (the noise is generally added with the Laplace mechanism). Then the support of the item set in the original database is estimated from its estimated support in the segmented database. The detailed computation is as follows.
Let the length of item set X be i, its support in the original database be $\omega$, its support in the segmented database be $\omega'$, and its noisy support in the segmented database be $\tilde{\omega}$.
First, $\omega'$ is estimated from $\tilde{\omega}$. By Bayes' rule:
$$\Pr(\omega' \mid \tilde{\omega}) = \frac{\Pr(\tilde{\omega} \mid \omega') \cdot \Pr(\omega')}{\Pr(\tilde{\omega})}$$
Assuming $\omega'$ follows a uniform prior distribution, its conditional probability distribution satisfies:
$$\Pr(\omega' \mid \tilde{\omega}) \propto e^{-\epsilon |\omega' - \tilde{\omega}|}$$
Then the support $\omega$ of X in the original database is estimated.
Assume that the maximum transaction length is m and that a transaction t of length p contains X, with p > m. To improve the usability of the mining results and guarantee a higher level of privacy protection, t is split into $q = \lceil p/m \rceil$ subsets, each of length at most m. Assume further that $q' = \lfloor p/m \rfloor$ of the subsets have length m, and that the remaining subset, which may be shorter than m, has length $a = p - q' \cdot m$. After t is split, the probability that X is contained in one of the sub-transactions is:
$$\beta_p = \begin{cases} q' \dfrac{C_{p-i}^{m-i}}{C_p^m} & \text{if } a < i \\ q' \dfrac{C_{p-i}^{m-i}}{C_p^m} + \dfrac{C_{p-i}^{a-i}}{C_p^a} & \text{if } a \ge i \end{cases}$$
Let $\alpha_k$ denote the number of transactions that contain X and have length k, and let n be the maximum transaction length. The expectation of $\omega'$ can then be calculated as:
[equation not preserved in the source text]
For convenience, the second half of the right-hand side of this equation (i.e. the contents of its last parenthesis) is denoted ratio(i):
[equation defining ratio(i) not preserved in the source text]
The average support of X in the original database can then be estimated as:
$$avg(\omega') = \frac{\omega'}{ratio(i)}$$
Using the lower $\rho$-quantile, the maximum support of X in the original database can be estimated as:
$$\max(\omega') = \begin{cases} \dfrac{\omega' - \ln\rho + \sqrt{\ln^2\rho - 2\omega'\ln\rho}}{ratio(i)} & \text{if } \ln\rho \le 2\omega' \\ avg(\omega') & \text{if } \ln\rho > 2\omega' \end{cases}$$
Based on the above analysis, from the noisy support $\tilde{\omega}$ of X in the converted database, the maximum support and the average support of X in the original database can be estimated as follows:
$$max\_supp(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \max(\omega') \, d\omega'$$
$$avg\_supp(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\omega' = \tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, avg(\omega') \, d\omega'$$
If the average support of item set X is greater than the support threshold, X is a frequent item set. In addition, if the maximum support of item set X is greater than the given support threshold, X is used to generate candidate frequent item sets.
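The two estimates can be approximated numerically as in the sketch below. Several choices here are assumptions made for illustration: the integral over $[\tilde{\omega} - 5, \tilde{\omega} + 5]$ is replaced by a sum over unit steps with normalised weights, a square root is taken in the maximum estimate as the quantile bound suggests, and `ratio` is passed in as a number because its defining expression is not available in this text.

```python
import math


def posterior_weights(noisy, epsilon, radius=5):
    """Normalised weights proportional to exp(-epsilon * |w - noisy|)."""
    candidates = [noisy + k for k in range(-radius, radius + 1)]
    raw = [math.exp(-epsilon * abs(w - noisy)) for w in candidates]
    total = sum(raw)
    return [(w, r / total) for w, r in zip(candidates, raw)]


def avg_estimate(w, ratio):
    """avg(w'): point estimate of the original support."""
    return w / ratio


def max_estimate(w, ratio, rho):
    """max(w'): upper estimate of the original support for quantile rho."""
    log_rho = math.log(rho)
    if log_rho <= 2 * w:
        root = math.sqrt(log_rho ** 2 - 2 * w * log_rho)
        return (w - log_rho + root) / ratio
    return avg_estimate(w, ratio)


def avg_supp(noisy, epsilon, ratio):
    """Posterior-weighted average support in the original database."""
    return sum(p * avg_estimate(w, ratio)
               for w, p in posterior_weights(noisy, epsilon))


def max_supp(noisy, epsilon, ratio, rho):
    """Posterior-weighted maximum support in the original database."""
    return sum(p * max_estimate(w, ratio, rho)
               for w, p in posterior_weights(noisy, epsilon))
```

Under this sketch an item set would be reported as frequent when `avg_supp` exceeds the threshold, and kept for candidate generation when `max_supp` exceeds it. The values of `ratio` and `rho` used in any call are illustrative inputs, not values given by the source.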
Furthermore, after the support of an item set in the segmented database has been computed, an appropriate amount of noise is added as required by differential privacy. The amount of noise is proportional to the sensitivity of the computation. For item sets of length i, the sensitivity of the support computation equals the number of item sets of length i whose support is computed during mining. Because FP-growth is a depth-first traversal algorithm, it is difficult to count exactly the number of item sets of length i computed during mining. Adjusting FP-growth so that it generates all item sets of the same length at the same time would incur a large storage overhead. To solve these problems, the invention provides a lightweight dynamic descent method. Its core idea is to use the downward closure property of frequent patterns to dynamically reduce the upper limit on the number of item sets of length i that are computed, and thereby to reduce the sensitivity of computing the supports of item sets of length i. The process is as follows:
Step 1: assign an initial value to the upper limit of the number of support queries for item sets of length i.
Assuming that the database contains s frequent items, an array is defined to store the upper limits on the number of computed item sets of each length. The i-th entry of the array is the upper limit on the number of computed item sets of length i, with an initial value of $C_s^i$.
Step 2: during mining, dynamically reduce the upper limit of the number of item sets of length i.
Assume that the conditional FP-tree of item set $\beta$ is currently being mined (the conditional FP-tree of $\beta$ is the FP-tree built from the small database consisting of all transactions that contain $\beta$). The items in the header table of this conditional FP-tree are ordered as $\{i_1, \ldots, i_k, \ldots, i_n\}$. The k-th element $i_k$ of the header table together with item set $\beta$ forms a new item set $Y = \beta \cup \{i_k\}$. Let $S_1 = \{i_1, \ldots, i_{k-1}\}$, and let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of $S_2$, the item set $X = Y \cup \{j\}$ is infrequent; therefore, by the downward closure property of frequent patterns, any item set formed from X and any subset of $S_1 \setminus \{j\}$ is necessarily infrequent. The resulting reduction of the upper limit for item sets of length p is:
$$r_p = \sum_{u=1}^{q} C_{|S_1| - u}^{\,p - |Y| - 1}$$
where $q = \min\{|S_2|,\ |S_1| - (p - |Y| - 1)\}$ and p is the length of the item sets whose upper limit is being reduced.
Step 3: take the updated upper limit of the number of item sets as the sensitivity, and add noise to the support using the Laplace mechanism, with the ratio of the sensitivity to the privacy budget $\epsilon$ (the safety coefficient) as the scale of the Laplace distribution.
As the above analysis shows, the dynamic descent method involves only simple additions and multiplications and adds no significant computational overhead. By continually lowering the upper limit on the number of computed item sets, it reduces the amount of noise added during mining and thus improves the usability of the mining results while satisfying differential privacy. In other words, the dynamic descent method continually adjusts the number of candidate frequent item sets, so that less noise is added after the supports of the item sets in the segmented database are computed, which improves data usability.
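The reduction $r_p$ is a short sum of binomial coefficients, as the following sketch shows. It reads p as the length of the item sets whose upper limit is reduced, which is an interpretation of the formula; the function name and the example sizes are hypothetical.

```python
from math import comb


def upper_bound_reduction(p, size_y, size_s1, size_s2):
    """r_p: how many length-p item sets can be ruled out as infrequent.

    p:       length of the item sets whose upper limit is reduced
    size_y:  |Y|, the length of the item set currently being extended
    size_s1: |S1|, number of header-table items preceding i_k
    size_s2: |S2|, number of newly found infrequent items
    """
    k = p - size_y - 1          # items still to be chosen from S1
    q = min(size_s2, size_s1 - k)
    if k < 0 or q <= 0:
        return 0
    return sum(comb(size_s1 - u, k) for u in range(1, q + 1))
```

For instance, with |Y| = 2, |S1| = 5 and |S2| = 2, the upper limit for item sets of length 3 drops by 2 (the two item sets Y ∪ {j}), and the upper limit for length 4 drops by C(4, 1) + C(3, 1) = 7.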
Formal analysis shows that the transaction-splitting-based frequent item set mining method satisfying differential privacy (PFP) provides higher mining efficiency and better usability of the mining results while satisfying differential privacy.
Comparison with the algorithm TT proposed in [9] and the algorithm PB proposed in [8] shows that the proposed PFP algorithm has significant advantages in both the usability of the mining results and running efficiency. To better illustrate these advantages, PFP is compared with TT and PB in terms of "usability of the mining results" and "running time". For the usability of the mining results, the F-score is first used to measure the correctness of the generated frequent item sets. The F-score is calculated as follows:
$$\text{F-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
where $\text{precision} = |U_p \cap U_c| / |U_p|$, $\text{recall} = |U_p \cap U_c| / |U_c|$, $U_p$ is the set of frequent item sets generated by the privacy-preserving algorithm, and $U_c$ is the set of real frequent item sets.
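The F-score computation can be sketched as follows, with item sets represented as frozensets. The function name and the example collections are hypothetical.

```python
def f_score(private_sets, true_sets):
    """F-score of the published frequent item sets against the real ones."""
    private_sets, true_sets = set(private_sets), set(true_sets)
    hits = len(private_sets & true_sets)
    if hits == 0:
        return 0.0
    precision = hits / len(private_sets)
    recall = hits / len(true_sets)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: three of the four published item sets are real,
# and three of the four real item sets are published.
published = [frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")]
real = [frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("bc")]
score = f_score(published, real)
```

In this example precision and recall are both 0.75, so the F-score is 0.75.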
In addition, the relative error (RE) is used to measure the error of the published support of each frequent item set with respect to its real support. The RE is calculated as follows:
$$\mathrm{RE} = \operatorname{median}_{X} \frac{|sup'(X) - sup(X)|}{sup(X)}$$
where the median is taken over all generated frequent item sets X, sup(X) is the true support of item set X, and sup'(X) is the noisy support of item set X.
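The RE metric can be sketched with the standard library median. The function name, the dictionary representation of the supports and the example values are assumptions made for illustration.

```python
from statistics import median


def relative_error(noisy_support, true_support):
    """Median relative error over the published frequent item sets.

    noisy_support: {item set: published (noisy) support}
    true_support:  {item set: real support}, covering the same item sets
    """
    return median(
        abs(noisy_support[x] - true_support[x]) / true_support[x]
        for x in noisy_support
    )


# Hypothetical supports for three published item sets.
true_sup = {frozenset("a"): 100, frozenset("b"): 50, frozenset("ab"): 40}
noisy_sup = {frozenset("a"): 110, frozenset("b"): 45, frozenset("ab"): 40}
re_value = relative_error(noisy_sup, true_sup)
```

The three relative errors here are 0.1, 0.1 and 0, so the median is 0.1.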
The experimental setup is as follows. First, four real data sets are used: Accidents contains traffic accident data; Pumsb-star contains census data from PUMS (Public Use Microdata Sample); POS contains data from a retail outlet of a large electronics retailer; Retail contains market basket data from an anonymous retail store in Belgium. The first two are dense data sets and the last two are sparse data sets. Second, all algorithms are implemented in Java. Finally, the experiments were run on an Intel Core 2 Duo E8400 CPU (3.0 GHz) with 4 GB of RAM.
The performance of the PFP algorithm is illustrated below by analyzing the experimental data.
Result availability:
On the four data sets, the F-score and RE indices of the algorithms PFP, TT and PB were measured for different thresholds. Since the PB algorithm mines the top-k frequent item sets and therefore cannot be compared with PFP directly, a scenario is considered in which the user chooses k for a given threshold. The experimental results are shown in FIGS. 5 to 9.
As can be seen from FIGS. 5, 6, 7 and 8, PFP achieves better results than TT on all four data sets. The results are analyzed as follows. Compared with directly truncating transactions, transaction splitting retains several sub-transactions for each transaction and distributes the weight uniformly among them, which significantly reduces the loss of information. Consequently, although the accuracy of PFP is slightly reduced, the number of frequent item sets it generates increases significantly. In the scenario where the user chooses k, PFP still obtains better F-score values on the data sets Pumsb, POS and Retail.
Algorithm runtime overhead:
The running times of the algorithms PFP, FP-growth, TT and PB for querying the top-k frequent item sets are measured on the four data sets, where k takes values in the range [10, 200]. For PFP, preprocessing is performed only once and is independent of the user-selected threshold, so the running time does not include the preprocessing time.
As can be seen from FIG. 9, the performance of PFP is comparable to that of FP-growth, and PFP achieves better time efficiency than TT and PB. The results are analyzed as follows: compared with FP-growth, the support estimation method and the dynamic descent method add little overhead to PFP in the mining stage; compared with TT, the FP-growth algorithm used in PFP performs better than the Apriori algorithm, so PFP is more efficient than TT.
Formal analysis and a large number of experiments show that the frequent item set mining algorithm based on transaction splitting, which satisfies differential privacy, performs better in terms of privacy, availability of mining results and operating efficiency.
The above embodiments are only for illustrating the invention and are not to be construed as limiting the invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention, therefore, all equivalent technical solutions also belong to the scope of the invention, and the scope of the invention is defined by the claims.

Claims (5)

1. A frequent item set mining method is characterized by comprising the following steps:
S1: dividing each transaction in the original database whose transaction length is larger than the limit length into a plurality of sub-transactions, so that the length of each transaction in the segmented database is not larger than the limit length;
S2: mining the frequent item sets in the segmented database by using a support estimation method and a dynamic descent method according to a support threshold specified in advance.
2. The frequent itemset mining method of claim 1, wherein the step S1 specifically includes:
S1.1: constructing an undirected weighted graph based on the original database, wherein vertices represent items in the database; when the items corresponding to two vertices appear in the same transaction, the two vertices are connected by an edge, and the weight of the edge is the support, in the transaction database, of the item set formed by the items corresponding to the two vertices;
S1.2: discovering communities in the undirected weighted graph by using the Louvain algorithm, and constructing a tree structure, the CR-tree, from the intermediate results output during the iterations of the Louvain algorithm, wherein the nodes of each layer of the CR-tree represent the communities discovered in the same iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities represented by its child nodes, and the correlation between items is represented by the length of the shortest path between the leaf nodes containing those items;
S1.3: using the generated CR-tree to split the transactions in the original database whose length is larger than the limit length, so as to generate a segmented database; a transaction, its length, the maximum transaction limit length and the constructed CR-tree are denoted t, p, m and T respectively, and the specific splitting process is as follows:
S1.3.1: calculating the number q of sub-transactions after splitting according to p and m, namely $q = \lceil p/m \rceil$;
S1.3.2: setting the result set R of the split transaction t to empty;
S1.3.3: constructing the i-th sub-transaction $t_i$ through the following steps a) to e):
a) selecting, from the leaf nodes of the CR-tree, the nodes that contain elements of the transaction t, removing from these nodes the items not contained in t, and forming a set $N_l$ from the resulting nodes;
b) selecting from $N_l$ the node $n_l$ containing the most items, and adding the items of $n_l$ to $t_i$;
c) sorting the other nodes in $N_l$ by their correlation with $n_l$ in T (given by their distance in T) from large to small, and sorting nodes with the same correlation by capacity from large to small;
d) traversing the remaining nodes of $N_l$ in order; if the sum of the capacity of a node and the capacity of $n_l$ is not more than m, the node is taken out of $N_l$ and put into $t_i$;
e) storing $t_i$ into the result set R;
s1.3.4: s1.3.1 repeating the process q times;
S1.3.5: if nodes still remain in $N_l$, randomly putting the items of each remaining node of $N_l$ into sub-transactions in R whose length is smaller than m;
S1.3.6: returning the result set R.
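The splitting procedure of steps S1.3.1 to S1.3.6 can be sketched in Python as follows. This is a simplified illustration under several assumptions rather than the claimed implementation. The CR-tree is represented only by its leaf communities `leaves` and a caller-supplied `distance` function standing in for the path length in T. Every item of t is assumed to belong to some leaf. A shorter path is read as a higher correlation. Step d) is applied against the current size of the sub-transaction being built, and a leaf larger than m is cut into pieces. All names are hypothetical.

```python
import math
import random


def split_transaction(t, m, leaves, distance, rng=None):
    """Split transaction t into sub-transactions of length at most m."""
    rng = rng or random.Random(0)
    t = set(t)
    if len(t) <= m:
        return [t]
    q = math.ceil(len(t) / m)                       # S1.3.1
    result = []                                     # S1.3.2
    # Step a): keep the leaves that intersect t, restricted to items of t.
    nodes = {k: set(leaf) & t for k, leaf in enumerate(leaves) if set(leaf) & t}
    for _ in range(q):                              # S1.3.4
        if not nodes:
            break
        # Step b): start from the node holding the most items.
        seed = max(nodes, key=lambda k: (len(nodes[k]), -k))
        items = sorted(nodes.pop(seed))
        t_i = set(items[:m])
        if len(items) > m:                          # oversized leaf (assumption)
            nodes[seed] = set(items[m:])
        # Step c): closest leaves first, larger leaves first on ties.
        order = sorted(
            (k for k in nodes if k != seed),
            key=lambda k: (distance(seed, k), -len(nodes[k])),
        )
        # Step d): add whole nodes while the limit m is respected.
        for k in order:
            if len(t_i) + len(nodes[k]) <= m:
                t_i |= nodes.pop(k)
        result.append(t_i)                          # Step e)
    # S1.3.5: scatter leftover items into sub-transactions with room left.
    for item in sorted(t - set().union(*result)):
        room = [s for s in result if len(s) < m]
        rng.choice(room).add(item)
    return result                                   # S1.3.6


# Hypothetical leaf communities; |a - b| stands in for the CR-tree path length.
leaves = [{"a", "b"}, {"c"}, {"d", "e", "f"}, {"g", "x"}]
subs = split_transaction(set("abcdefg"), 3, leaves, lambda a, b: abs(a - b))
```

Since q·m is at least p, a sub-transaction with free space always exists while leftover items remain, so the scatter step in this sketch cannot run out of room.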
3. The frequent itemset mining method of claim 2, characterized in that, after splitting, each sub-transaction $t_i$ is given the weight 1/q.
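A short Python sketch may help show how the weight 1/q could enter support counting in the segmented database. The data layout (one inner list of sub-transactions per original transaction) and the function name `weighted_support` are assumptions for illustration only.

```python
def weighted_support(split_db, itemset):
    """Weighted support of an item set in a segmented database.

    split_db: list with one entry per original transaction; each entry is
              the list of its q sub-transactions (sets of items).
    Each sub-transaction carries the weight 1/q.
    """
    x = frozenset(itemset)
    total = 0.0
    for subs in split_db:
        weight = 1.0 / len(subs)
        total += sum(weight for s in subs if x <= s)
    return total


# Hypothetical segmented database built from three original transactions.
split_db = [
    [{"a", "b", "c"}, {"d"}],      # q = 2, each sub-transaction weighs 1/2
    [{"a", "b"}],                  # q = 1, not split
    [{"a", "c"}, {"b", "d"}],      # q = 2
]
support_ab = weighted_support(split_db, {"a", "b"})
```

With this weighting, the sub-transactions of one original transaction carry a total weight of 1, the same as an unsplit transaction.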
4. The frequent itemset mining method of claim 3, wherein the support degree estimation method in step S2 specifically includes:
S2.1: let the length of the item set X in the result set R be i, its support in the original database be ω, its support in the segmented database be ω', and its support after adding noise in the segmented database be $\tilde{\omega}$;
S2.2: estimating ω' according to $\tilde{\omega}$; by Bayes' rule:
$$\Pr(\omega' \mid \tilde{\omega}) = \frac{\Pr(\tilde{\omega} \mid \omega') \cdot \Pr(\omega')}{\Pr(\tilde{\omega})};$$
assuming ω' obeys a uniform prior distribution, its conditional probability distribution satisfies:
$$\Pr(\omega' \mid \tilde{\omega}) \propto e^{-\varepsilon |\omega' - \tilde{\omega}|};$$
S2.3: estimating the support ω of X in the original database: let the maximum transaction limit length be m, and let a transaction t of length p contain X, with p > m; t is divided into $q = \lceil p/m \rceil$ subsets, each of length not exceeding m, of which $q' = \lfloor p/m \rfloor$ subsets have length m and the remaining subset has a length $a = p - q' \cdot m$ that is less than m; after the transaction t is split, the probability that X is contained in one of the resulting sub-transactions is therefore:
$$\beta_p = \begin{cases} q' \dfrac{C_{p-i}^{m-i}}{C_p^m} & \text{if } a < i \\[2ex] q' \dfrac{C_{p-i}^{m-i}}{C_p^m} + \dfrac{C_{p-i}^{a-i}}{C_p^a} & \text{if } a \ge i \end{cases};$$
let $\alpha_k$ denote the number of transactions that contain X and have length k, and let n be the maximum transaction length; the expectation of ω' is then calculated as:
let:
the average support of X in the original database is estimated as:
$$avg(\omega') = \frac{\omega'}{ratio(i)};$$
using the lower ρ-quantile, the maximum support of X in the original database is estimated as:
$$max(\omega') = \begin{cases} \dfrac{\omega' - \ln\rho + \sqrt{\ln^2\rho - 2\omega'\ln\rho}}{ratio(i)} & \text{if } \ln\rho \le 2\omega' \\[2ex] avg(\omega') & \text{if } \ln\rho > 2\omega' \end{cases};$$
according to the noisy support $\tilde{\omega}$ of X in the transformed database, its maximum support and average support in the original database can be estimated as follows:
$$max\_supp(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, max(\omega') \, d\omega'$$
$$avg\_supp(\tilde{\omega}) = \int_{\omega' = \tilde{\omega} - 5}^{\tilde{\omega} + 5} \Pr(\omega' \mid \tilde{\omega}) \, avg(\omega') \, d\omega';$$
if the average support degree of the item set X is larger than the support degree threshold value, X is a frequent item set; for a frequent item set, if the maximum support of item set X is greater than a given support threshold, then X is used to generate a frequent item set candidate.
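The estimation steps of this claim can be sketched numerically in Python. The sketch rests on several assumptions. The integral over [ω̃ − 5, ω̃ + 5] is replaced by a sum over a unit-step grid. The quantity ratio(i) is passed in as a plain number `ratio`, because its defining formula is not reproduced in the text. The square root in max(ω') follows one reading of the garbled formula. ρ is assumed to lie in (0, 1) and i is assumed not to exceed m. All function names are hypothetical.

```python
import math


def beta(p, m, i):
    """Probability that a length-i item set survives the split of a
    length-p transaction with limit m (requires i <= m < p)."""
    q_full = p // m                     # q': number of sub-transactions of length m
    a = p - q_full * m                  # length of the remaining sub-transaction
    prob = q_full * math.comb(p - i, m - i) / math.comb(p, m)
    if a >= i:
        prob += math.comb(p - i, a - i) / math.comb(p, a)
    return prob


def posterior(noisy, eps, half_width=5):
    """Discretised Pr(w' | noisy), proportional to exp(-eps * |w' - noisy|)."""
    grid = [noisy + k for k in range(-half_width, half_width + 1)]
    weights = [math.exp(-eps * abs(w - noisy)) for w in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def avg_of(omega_p, ratio):
    return omega_p / ratio


def max_of(omega_p, ratio, rho):
    ln_rho = math.log(rho)              # negative for 0 < rho < 1
    if ln_rho <= 2 * omega_p:
        root = math.sqrt(ln_rho ** 2 - 2 * omega_p * ln_rho)
        return (omega_p - ln_rho + root) / ratio
    return avg_of(omega_p, ratio)


def estimate(noisy, eps, ratio, rho):
    """Return (avg_supp, max_supp) for a noisy support value."""
    grid, probs = posterior(noisy, eps)
    avg_supp = sum(pr * avg_of(w, ratio) for w, pr in zip(grid, probs))
    max_supp = sum(pr * max_of(w, ratio, rho) for w, pr in zip(grid, probs))
    return avg_supp, max_supp


# Hypothetical inputs: noisy support 100, eps = 1, ratio(i) = 0.5, rho = 0.1.
avg_supp, max_supp = estimate(100.0, 1.0, 0.5, 0.1)
survive = beta(5, 3, 2)
```

In this sketch the posterior is symmetric around the noisy support, so the average estimate reduces to the noisy support divided by `ratio`, while the maximum estimate is always at least as large.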
5. The frequent itemset mining method of claim 4, wherein the dynamic descent method in step S2 specifically includes:
S2.4: assigning an initial value to the upper limit on the number of queried item sets of length i;
assume that the frequent item set candidates contain s frequent items; define an array to store the upper limits on the number of computed item sets of different lengths, wherein the entry for length i represents the upper limit on the number of computed item sets of length i, with an initial value of
S2.5: during the mining process, dynamically reducing the upper limit on the number of item sets of length i, specifically:
suppose the items currently in the header table of the conditional FP-tree of item set β are, in order, $\{i_1, \ldots, i_k, \ldots, i_n\}$; the k-th element $i_k$ in the header table forms, together with the item set β, a new item set $Y = \beta \cup i_k$; let $S_1 = \{i_1, \ldots, i_{k-1}\}$, and let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y; since for any element j of $S_2$ the item set $X = Y \cup j$ is infrequent, by the downward closure property of frequent patterns any item set formed by X together with any subset of $S_1 - \{j\}$ must also be infrequent, so the upper limit on the number of item sets is reduced by:
$$r_p = \sum_{u=1}^{q} C_{|S_1| - u}^{\,p - |Y| - 1}$$
where $q = \min\{|S_2|, |S_1| - (p - |Y| - 1)\}$, p is the length of the transaction and q is the number of sub-transactions after segmentation;
S2.6: taking the updated upper limit on the number of item sets as the sensitivity, and adding noise to the support by using the Laplace mechanism, with the ratio of the sensitivity to the privacy budget (safety coefficient) as the scale of the Laplace probability distribution.
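The reduction $r_p$ and the Laplace noise of the last step can be sketched in Python as follows. The sketch makes several assumptions. p is read as the item set length whose upper limit is being reduced. S2 is taken to be a subset of S1. The privacy budget is named `eps`. Laplace noise is drawn as the difference of two exponential variates, a standard construction. The function names are hypothetical.

```python
import math
import random


def reduction(s1_size, s2_size, y_len, p):
    """r_p: number of length-p item sets that can be ruled out as infrequent.

    s1_size = |S1|, s2_size = |S2|, y_len = |Y|.
    """
    need = p - y_len - 1                # items still to choose from S1
    if need < 0:
        return 0
    q = min(s2_size, s1_size - need)
    return sum(math.comb(s1_size - u, need) for u in range(1, q + 1))


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_support(support, sensitivity, eps, rng=None):
    """Add Laplace noise with scale = sensitivity / eps to a support value."""
    rng = rng or random.Random()
    return support + laplace_noise(sensitivity / eps, rng)


# Hypothetical numbers: |S1| = 5, |S2| = 2, |Y| = 1, item set length p = 3.
r3 = reduction(5, 2, 1, 3)
upper_limit = 20 - r3                   # assumed previous upper limit of 20
released = noisy_support(42.0, upper_limit, 1.0, random.Random(7))
```

A smaller upper limit gives a smaller sensitivity and hence a smaller noise scale for the same privacy budget, which is the purpose of reducing the limit during mining.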
CN201410746488.2A 2014-12-08 2014-12-08 Frequent item set mining method Pending CN105740245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410746488.2A CN105740245A (en) 2014-12-08 2014-12-08 Frequent item set mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410746488.2A CN105740245A (en) 2014-12-08 2014-12-08 Frequent item set mining method

Publications (1)

Publication Number Publication Date
CN105740245A true CN105740245A (en) 2016-07-06

Family

ID=56237954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410746488.2A Pending CN105740245A (en) 2014-12-08 2014-12-08 Frequent item set mining method

Country Status (1)

Country Link
CN (1) CN105740245A (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107870913A (en) * 2016-09-23 2018-04-03 腾讯科技(深圳)有限公司 The high of effective time it is expected weight item collection method for digging, device and processing equipment
CN107870913B (en) * 2016-09-23 2021-12-14 腾讯科技(深圳)有限公司 Efficient time high expectation weight item set mining method and device and processing equipment
CN107247995A (en) * 2016-09-29 2017-10-13 上海交通大学 Transmission line of electricity running status association rule mining and Forecasting Methodology based on Bayesian model
CN107066587A (en) * 2017-04-17 2017-08-18 贵州大学 A kind of efficient Mining Frequent Itemsets based on group chained list
CN107092837A (en) * 2017-04-25 2017-08-25 华中科技大学 A kind of Mining Frequent Itemsets and system for supporting difference privacy
CN107590733A (en) * 2017-08-08 2018-01-16 杭州灵皓科技有限公司 Platform methods of risk assessment is borrowed based on the net of geographical economy and social networks
CN107908665A (en) * 2017-10-20 2018-04-13 国网浙江省电力公司经济技术研究院 A kind of frequent node method for digging of directed acyclic graph power grid enterprises and digging system
CN108346085A (en) * 2018-01-30 2018-07-31 南京邮电大学 Electric business platform personalized recommendation method based on weighted frequent items mining algorithm
CN108475292B (en) * 2018-03-20 2021-08-24 深圳大学 Frequent item set mining method, device, equipment and medium for large-scale data set
WO2019178733A1 (en) * 2018-03-20 2019-09-26 深圳大学 Method and apparatus for mining frequent item sets of large-scale data set, device, and medium
CN108475292A (en) * 2018-03-20 2018-08-31 深圳大学 Mining Frequent Itemsets, device, equipment and the medium of large-scale dataset
CN108932658B (en) * 2018-07-13 2021-07-06 京东数字科技控股有限公司 Data processing method, device and computer readable storage medium
CN108932658A (en) * 2018-07-13 2018-12-04 北京京东金融科技控股有限公司 Data processing method, device and computer readable storage medium
CN109299436A (en) * 2018-09-17 2019-02-01 北京邮电大学 A kind of ordering of optimization preference method of data capture meeting local difference privacy
CN109299436B (en) * 2018-09-17 2021-10-15 北京邮电大学 Preference sorting data collection method meeting local differential privacy
CN109783464A (en) * 2018-12-21 2019-05-21 昆明理工大学 A kind of Mining Frequent Itemsets based on Spark platform
CN109783464B (en) * 2018-12-21 2022-11-04 昆明理工大学 Spark platform-based frequent item set mining method
CN109657498A (en) * 2018-12-28 2019-04-19 广西师范大学 The difference method for secret protection that top-k Symbiotic Model excavates in a plurality of stream
CN109657498B (en) * 2018-12-28 2021-09-24 广西师范大学 Differential privacy protection method for top-k symbiotic mode mining in multiple streams
CN110096629A (en) * 2019-05-15 2019-08-06 重庆大学 A method of the Mining Frequent based on effective weight tree weights item collection
CN110096629B (en) * 2019-05-15 2023-07-28 重庆大学 Memory optimization method for transaction processing
WO2020253221A1 (en) * 2019-06-19 2020-12-24 江南大学 Method for analyzing relationship between communication path and heat resistance of lipase
CN110287240A (en) * 2019-06-27 2019-09-27 浪潮软件集团有限公司 A kind of mining algorithm based on Top-K frequent item set
CN110471957B (en) * 2019-08-16 2021-10-26 安徽大学 Localized differential privacy protection frequent item set mining method based on frequent pattern tree
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN110490000B (en) * 2019-08-23 2022-04-05 广西师范大学 Differential privacy protection method for frequent subgraph mining in multi-graph data
CN110490000A (en) * 2019-08-23 2019-11-22 广西师范大学 The difference method for secret protection that Frequent tree mining excavates in more diagram datas
CN112434089A (en) * 2020-12-23 2021-03-02 龙马智芯(珠海横琴)科技有限公司 Frequent item mining method and device, server and readable storage medium
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN113282686B (en) * 2021-06-03 2023-11-07 光大科技有限公司 Association rule determining method and device for unbalanced sample
CN115810272A (en) * 2023-02-09 2023-03-17 北京华录高诚科技有限公司 Vehicle safety supervision method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160706
