CN105740245A - Frequent item set mining method - Google Patents
- Publication number
- CN105740245A (application CN201410746488.2A)
- Authority
- CN
- China
- Prior art keywords
- omega
- length
- prime
- transaction
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data mining and data privacy, and discloses a frequent item set mining method. The method comprises the following steps: S1: segmenting each transaction in the original database whose length is greater than a limit length into a plurality of sub-transactions, so that the length of every transaction in the segmented database is less than or equal to the limit length; and S2: according to a support threshold specified in advance, mining the frequent item sets in the segmented database using a support estimation method and a dynamic descent method. The method can provide higher mining efficiency and higher availability of mining results while satisfying differential privacy protection.
Description
Technical Field
The invention relates to the technical field of data mining and data privacy, in particular to a frequent item set mining method.
Background
Frequent item set mining is a fundamental problem in data mining and is widely applied in many fields. It can be described as follows. A transaction database is given in which each transaction corresponds to the personal record of a user, and a transaction is a collection of items. The support of an item set (a collection of items) is the number of transactions that contain the item set. When the support of an item set is not less than a given threshold, the item set is called a frequent item set. Given a transaction database and a threshold, frequent item set mining finds all the frequent item sets that appear in the database.
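The definitions above can be illustrated with a minimal Python sketch. It is a brute-force enumeration for a toy database, not the mining algorithm of the invention; the function names `support` and `frequent_itemsets` and the sample data are assumptions made for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for t in transactions if target <= t)

def frequent_itemsets(transactions, threshold):
    """Brute force: all item sets whose support is not less than `threshold`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= threshold:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            break  # downward closure: no larger item set can be frequent
    return result

# Hypothetical toy database: each transaction is a set of items.
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bd")]
freq = frequent_itemsets(db, threshold=2)
```

With a threshold of 2, the item set {a, b} is frequent because two of the four transactions contain it, while {d} is not.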
The FP-growth algorithm [2] is a widely used frequent item set mining algorithm. It is a depth-first traversal algorithm that uses the FP-tree, a special prefix tree, to accelerate the mining of frequent item sets. With the FP-tree, the FP-growth algorithm scans the database only twice, which greatly improves mining efficiency.
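The two database scans can be sketched as follows. This is a simplified FP-tree construction only (no recursive mining step); the class `FPNode`, the function `build_fp_tree`, the tie-breaking rule and the sample data are assumptions for illustration and are not taken from the patent.

```python
from collections import Counter

class FPNode:
    """A prefix-tree node holding an item and the count of paths through it."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, threshold):
    # First scan: count the support of every single item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= threshold}
    root = FPNode(None)
    header = {i: [] for i in frequent}  # item -> tree nodes carrying that item
    # Second scan: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken by item name, an assumption).
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Hypothetical toy database.
db = ["abc", "ab", "ac", "bd"]
root, header = build_fp_tree(db, threshold=2)
```

Summing the counts of the nodes listed under one item in `header` gives back that item's support, which is the property that lets FP-growth mine from the tree without rescanning the database.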
If the transactions (personal records) in the transaction database are sensitive, publishing the frequent item sets directly can leak users' personal records. How to protect the privacy of user data during frequent item set mining is receiving more and more attention. The differential privacy protection paradigm [3] provides a feasible solution to the privacy problem in data analysis. In contrast to k-anonymity [4] and l-diversity [5], the differential privacy paradigm provides user privacy protection with theoretical guarantees by adding noise.
Currently, there is research on privacy protection for frequent pattern mining using the differential privacy paradigm. The algorithms in [6] and [7] publish a transaction database under differential privacy in order to protect user privacy, and the published database can then be used for frequent pattern mining. In particular, document [6] provides a database publishing algorithm guided by a context-free tree structure; the algorithm partitions the database in a top-down manner and finally publishes a synthetic database for frequent item set mining. Document [7] provides a transaction database publishing algorithm that satisfies differential privacy in the scenario of incremental data updates. Document [8] provides the PrivBasis algorithm, which satisfies differential privacy, for mining the top-k frequent item sets. Document [9] found that limiting the length of transactions can improve the balance between data availability and privacy protection, and used a truncation method to design a frequent item set mining algorithm that is based on the Apriori algorithm and satisfies differential privacy. However, the above algorithms fall short in the availability of mining results and in mining efficiency, which hinders the application of differential privacy in frequent pattern mining.
The references are as follows:
[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in VLDB, 1994.
[2] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in SIGMOD, 2000.
[3] C. Dwork, “Differential privacy,” in ICALP, 2006.
[4] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertain. Fuzziness Knowl.-Base Syst., 2002.
[5] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” in ICDE, 2006.
[6] R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong, “Publishing set-valued data via differential privacy,” in VLDB, 2011.
[7] X. Zhang, X. Meng, and R. Chen, “Differentially private set-valued data release against incremental updates,” in DASFAA, 2013.
[8] N. Li, W. Qardaji, D. Su, and J. Cao, “Privbasis: frequent itemset mining with differential privacy,” in VLDB, 2012.
[9] C. Zeng, J. F. Naughton, and J.-Y. Cai, “On differentially private frequent itemset mining,” in VLDB, 2012.
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is as follows: how to improve the availability of mining results.
(II) technical scheme
In order to solve the technical problem, the invention provides a frequent item set mining method, which comprises the following steps:
S1: dividing each transaction in the original database whose length is greater than the limit length into a plurality of sub-transactions, so that the length of each transaction in the segmented database is not greater than the limit length;
S2: mining the frequent item sets in the segmented database using a support estimation method and a dynamic descent method, according to a support threshold specified in advance.
Wherein, the step S1 specifically includes:
S1.1: constructing an undirected weighted graph based on the original database, wherein the vertices represent items in the database; two vertices are connected by an edge when the items corresponding to the two vertices appear in the same transaction, and the weight of the edge is the support, in the transaction database, of the item set formed by those two items;
S1.2: discovering communities in the undirected weighted graph using the Louvain algorithm, and constructing a tree structure, the CR-tree, from the intermediate outputs of the iterations of the Louvain algorithm, wherein the nodes of each layer of the CR-tree represent the communities discovered in the same iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities represented by its child nodes, and the correlation between items is represented by the length of the shortest path between the leaf nodes containing the items;
S1.3: using the generated CR-tree to segment the transactions in the original database whose length is greater than the limit length, thereby generating a segmented database, wherein a transaction, its length, the maximum transaction limit length and the constructed CR-tree are denoted t, p, m and T respectively, and the specific segmentation process is as follows:
S1.3.1: calculating the number q of sub-transactions after segmentation from p and m, namely $q = \lceil p/m \rceil$;
S1.3.2: setting the result set R of the segmentation of transaction t to empty;
S1.3.3: constructing the i-th sub-transaction $t_i$ by the following steps a) to e):
a) selecting, from the leaf nodes of the CR-tree, the nodes that contain items of transaction t, removing from these nodes the items not contained in t, and forming a set from the resulting nodes;
b) selecting, from the leaf layer $N_l$ of the CR-tree, the node $n_l$ that contains the most items, and adding the items of $n_l$ to $t_i$;
c) sorting the other nodes in $N_l$ by their distance to $n_l$ in T, from large to small, and sorting nodes with the same correlation by capacity, from large to small;
d) traversing the remaining nodes of $N_l$ in that order, and, if the sum of a node's capacity and the capacity of $n_l$ is not more than m, taking the node out of $N_l$ and putting its items into $t_i$;
e) storing $t_i$ in the result set R;
S1.3.4: repeating step S1.3.3 q times;
S1.3.5: if nodes still remain in $N_l$, randomly putting the items of each remaining node into sub-transactions in R whose length is smaller than m;
S1.3.6: returning the result set R.
Wherein each sub-transaction $t_i$ obtained by the segmentation is given the weight 1/q.
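The graph construction of step S1.1 can be sketched as below. This is a minimal illustration under stated assumptions: the graph is stored as a `Counter` keyed by sorted item pairs, the function name `build_item_graph` and the sample database are hypothetical, and community detection (step S1.2) is not included.

```python
from collections import Counter
from itertools import combinations

def build_item_graph(transactions):
    """Undirected weighted graph: edge (u, v) -> support of the item set {u, v}."""
    edges = Counter()
    for t in transactions:
        # Each unordered pair of distinct items in a transaction adds 1
        # to the weight of the edge joining the two items.
        for u, v in combinations(sorted(set(t)), 2):
            edges[(u, v)] += 1
    return edges

# Hypothetical toy database.
db = ["abc", "ab", "ac", "bd"]
graph = build_item_graph(db)
```

Items that never co-occur, such as a and d in this toy database, get no edge at all.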
The support estimation method in step S2 specifically includes:
S2.1: letting the length of item set X in the result set R be i, its support in the original database be ω, its support in the segmented database be ω', and its support after noise is added in the segmented database be $\hat{\omega}$;
S2.2: estimating ω' from $\hat{\omega}$, where Bayes' rule gives:
(formula not reproduced in this text)
and, assuming that ω' follows a uniform prior distribution, its conditional probability distribution satisfies:
(formula not reproduced in this text)
S2.3: estimating the support ω of X in the original database: let the maximum limit length of a transaction be m, and let a transaction t of length p contain X, with p > m; t is divided into $q = \lceil p/m \rceil$ subsets, each of length not exceeding m, of which $q' = \lfloor p/m \rfloor$ subsets have length m and the other subset has length $a = p - q' \cdot m$, which is less than m; after transaction t is split, the probability that X is contained in one of the resulting sub-transactions is:
(formula not reproduced in this text)
letting $\alpha_k$ denote the number of transactions that contain X and have length k, and n be the maximum transaction length, the expectation of ω' is:
(formula not reproduced in this text)
letting:
(formula not reproduced in this text)
the average support of X in the original database is estimated as:
(formula not reproduced in this text)
using the ρ-lower quantile, the maximum support of X in the original database is estimated as:
(formula not reproduced in this text)
from the noisy support $\hat{\omega}$ of X in the converted database, the maximum support and the average support of X in the original database can therefore be estimated as:
(formula not reproduced in this text)
If the average support of item set X is greater than the support threshold, X is a frequent item set; if the maximum support of item set X is greater than the given support threshold, X is used to generate candidate frequent item sets.
The dynamic descent method in step S2 specifically includes:
S2.4: assigning an initial value to the upper limit of the number of queried item sets of length i;
assuming that the candidate set contains s frequent items, an array is defined to store the upper limits on the number of computed item sets of different lengths, wherein the i-th entry of the array is the upper limit on the number of computed item sets of length i, with an initial value of (formula not reproduced in this text);
S2.5: during mining, dynamically reducing the upper limit of the number of item sets of length i, specifically:
assume that the conditional FP-tree of item set β is currently being mined, and that the items in the header table of this conditional FP-tree are ordered as $\{i_1, \ldots, i_k, \ldots, i_n\}$; the k-th element $i_k$ of the header table forms, together with item set β, a new item set $Y = \beta \cup \{i_k\}$; let $S_1 = \{i_1, \ldots, i_{k-1}\}$ and let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y; since, for any element j of $S_2$, the item set $X = Y \cup \{j\}$ is infrequent, it follows from the downward closure property of frequent patterns that any item set formed from X and any subset of $S_1 - \{j\}$ must be infrequent, so the corresponding entry of the array is reduced by:
(formula not reproduced in this text)
wherein q is min { | S2|,|S1- | - (p-Y | -1) }, wherein p is the length of the transaction and q is the number of the sub-transactions after segmentation;
S2.6: taking the updated upper limit of the number of item sets as the sensitivity, and using the Laplace mechanism to add noise to the support, with the ratio of the sensitivity to the privacy budget (the safety coefficient) as the scale of the Laplace distribution.
(III) advantageous effects
The method can provide higher mining efficiency and higher availability of mining results while satisfying differential privacy protection.
Drawings
FIG. 1 is a flow chart of a frequent itemset mining method of the present invention;
FIG. 2 is a schematic diagram of a given transaction database in the present invention;
FIG. 3 is an undirected weighted graph constructed from FIG. 2;
FIG. 4 is a CR-tree constructed in accordance with FIG. 3;
FIG. 5 is the F-score and RE measurements of PFP, TT and PB in Pumsb;
FIG. 6 is the F-score and RE measurements of PFP, TT and PB in Accidents;
FIG. 7 is the F-score and RE measurements of PFP, TT and PB in POS;
FIG. 8 is the F-score and RE measurements of PFP, TT and PB in Retail;
FIG. 9 is the runtime of PFP, FP, TT, and PB under different data sets.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The flow of the frequent item set mining method of the embodiment of the invention is shown in FIG. 1, and the method comprises the following steps:
Step S1: the original database is preprocessed. Using an intelligent segmentation method, each transaction in the original database whose length is greater than the specified limit length is segmented into a plurality of sub-transactions, so that the length of each transaction in the converted database is not greater than the specified limit length.
Step S2: frequent item sets are mined while satisfying differential privacy protection. The frequent item sets in the converted database are mined according to a threshold specified by the user. During mining, the support estimation method is used to reduce the information loss caused by transaction segmentation; meanwhile, the dynamic descent method is used to reduce the amount of noise added during mining, so as to improve the availability of the results.
The following describes the intelligent segmentation method used in preprocessing, the support estimation method, and the dynamic descent method.
For step S1, in order to improve the availability of the mining results while maintaining a high privacy protection level, a new transformation, namely transaction segmentation, is applied to the original database. When the length of a record exceeds the limit length, the transaction is divided into several sub-transactions, each of which satisfies the maximum length constraint. However, simply segmenting the original database can cause a large loss of information, so that some frequent item sets are no longer frequent, which affects the availability of the mining results. Moreover, after the database is segmented, the privacy protection level of a mining algorithm that satisfies differential privacy is lowered. To solve these problems, the invention proposes an intelligent segmentation method. The specific steps are as follows:
Step 1: based on the original database, an undirected weighted graph is constructed. The vertices represent items in the database. Two vertices are connected by an edge when the items corresponding to the two vertices appear in the same transaction. The weight of the edge is the support, in the transaction database, of the item set formed by those two items. Given a transaction database (FIG. 2), the undirected weighted graph constructed in this manner is shown in FIG. 3.
Step 2: communities are found in the constructed undirected weighted graph using the Louvain algorithm, which is an iterative algorithm. A tree structure, the CR-tree, is constructed from the intermediate outputs of the iterations of the Louvain algorithm. The nodes of each layer of the CR-tree represent the communities found in the same iteration, the height of the tree represents the number of iterations, and a parent node represents the new community formed by merging the communities represented by its child nodes. The correlation between items is represented by the length of the shortest path between the leaf nodes containing the items. For example, given the undirected weighted graph of FIG. 3, the CR-tree constructed using the Louvain algorithm is shown in FIG. 4.
Step 3: the generated CR-tree is used to segment the transactions in the original database whose length is greater than the limit length, generating a segmented database. A transaction, its length, the maximum transaction limit length and the constructed CR-tree are denoted t, p, m and T respectively. The specific segmentation process is as follows:
1) calculate the number q of sub-transactions after segmentation from p and m, namely $q = \lceil p/m \rceil$;
2) set the result set R of the segmentation of transaction t to empty;
3) construct the i-th sub-transaction $t_i$:
a) from the leaf nodes of the CR-tree, select the nodes that contain elements (i.e. items) of transaction t, and remove from these nodes the items not contained in t. Form a set from the resulting nodes;
b) from the leaf layer $N_l$ of the CR-tree, select the node $n_l$ that contains the most items, and add the items of $n_l$ to $t_i$;
c) sort the other nodes in $N_l$ by their distance to $n_l$ in T, from large to small. Sort nodes with the same correlation by capacity, from large to small;
d) traverse the remaining nodes of $N_l$ in that order and, if the sum of a node's capacity and the capacity of $n_l$ is not more than m, take the node out of $N_l$ and put its items into $t_i$;
e) store $t_i$ in the result set R;
4) repeat process 3) q times;
5) if nodes still remain in $N_l$, randomly put the items of each remaining node into sub-transactions in R whose length is smaller than m;
6) finally, return the result set R.
Thus, through the steps 1 to 3, the original database is converted into a new database, and the length of each transaction in the segmented database is ensured to meet the limit of the maximum length.
Analysis shows that if a transaction is divided into k sub-transactions, a frequent item set mining algorithm that satisfies differential privacy on the converted transaction database can guarantee only k-differential privacy for the original database. To this end, the invention proposes a weighted partitioning operation, defined as follows: when the maximum transaction capacity is m and the length of a transaction t exceeds m, the transaction-splitting operation f (i.e. steps 1 to 3 above) divides t into $t_1, \ldots, t_k$ so that $|t_i| \le m$, and assigns each $t_i$ a weight $w_i$; if the weights satisfy the required conditions (conditions not reproduced in this text), f is called a weighted partitioning operation. It can be proved that when the original transaction database is converted using a weighted partitioning operation, a frequent item set mining algorithm that satisfies differential privacy on the converted database also guarantees differential privacy for the original database.
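A minimal sketch of splitting with weights is given below. It makes two loud assumptions: the sub-transactions are formed by naive order-based chunking rather than by the CR-tree-guided grouping of steps 1 to 3, and each sub-transaction receives the weight 1/q stated earlier in the text, so that the weights of one original transaction sum to 1. The function names and the sample database are hypothetical.

```python
import math

def weighted_split(transaction, m):
    """Split into q = ceil(p/m) sub-transactions of length <= m, each weighted 1/q.

    Assumption: items are chunked in their given order; the patent instead
    groups correlated items using the CR-tree.
    """
    items = list(transaction)
    p = len(items)
    if p <= m:
        return [(frozenset(items), 1.0)]  # short transactions are kept whole
    q = math.ceil(p / m)
    return [(frozenset(items[k * m:(k + 1) * m]), 1.0 / q) for k in range(q)]

def weighted_support(itemset, split_db):
    """Support in the segmented database: sum of weights of containing sub-transactions."""
    target = frozenset(itemset)
    return sum(w for sub, w in split_db if target <= sub)

# Hypothetical toy database with limit length m = 2.
db = [list("abcde"), list("ab"), list("abc")]
split_db = [pair for t in db for pair in weighted_split(t, m=2)]
```

In this toy example the item set {a, b} has support 3 in the original database but only 1/3 + 1 + 1/2 in the segmented one, which is the kind of information loss that the support estimation method is meant to compensate.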
Further, when a transaction t is split in step S1 and a sub-transaction $t_1$ is added to the database, the support of $t_1$ is increased by 1/q instead of 1, so the support of $t_1$ computed in the converted database is smaller than the true value; in addition, some of the subsets of t are lost. Both cause a certain information loss. To compensate for this loss, a support estimation method is used in step S2. The method has two steps: first, the exact support of an item set in the segmented database is estimated from its noisy support obtained during mining (the noise is generally added with the Laplace mechanism); then, the support of the item set in the original database is estimated from its estimated support in the segmented database. The detailed computation is as follows:
Assume that the length of item set X is i, its support in the original database is ω, its support in the segmented database is ω', and its support after noise is added in the segmented database is $\hat{\omega}$.
First, ω' is estimated from $\hat{\omega}$. Bayes' rule gives:
(formula not reproduced in this text)
Assuming that ω' follows a uniform prior distribution, its conditional probability distribution satisfies:
(formula not reproduced in this text)
Then, the support ω of X in the original database is estimated.
Assume that the maximum limit length of a transaction is m, and that a transaction t of length p contains X, with p > m. To improve the availability of the mining results and to guarantee a higher privacy protection level, t is split into $q = \lceil p/m \rceil$ subsets, each of which is no longer than m. Also assume that $q' = \lfloor p/m \rfloor$ subsets have length m, and that the other subset may be shorter than m, with length $a = p - q' \cdot m$. Thus, after transaction t is split, the probability that X is contained in one of the resulting sub-transactions is:
(formula not reproduced in this text)
Let $\alpha_k$ denote the number of transactions that contain X and have length k, and let n be the maximum transaction length. The expectation of ω' can then be calculated as:
(formula not reproduced in this text)
For convenience of description, the second half of the right-hand side of this equation (i.e. the contents of the last parenthesis) is referred to as ratio(i), i.e.:
(formula not reproduced in this text)
The average support of X in the original database can then be estimated as:
(formula not reproduced in this text)
Using the ρ-lower quantile, the maximum support of X in the original database can be estimated as:
(formula not reproduced in this text)
Based on the above analysis, from the noisy support $\hat{\omega}$ of X in the converted database, the maximum support and the average support of X in the original database can be estimated as:
(formula not reproduced in this text)
If the average support of item set X is greater than the support threshold, X is a frequent item set; if the maximum support of item set X is greater than the given support threshold, X is used to generate candidate frequent item sets.
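The first estimation step (recovering ω' from the noisy support under a uniform prior) can be illustrated numerically. The sketch below is not the patent's estimator: it assumes Laplace noise with a known scale, a hypothetical finite grid of candidate values for ω', and it simply normalizes the Laplace likelihood over that grid, which is what Bayes' rule reduces to when the prior is uniform. The function name and all numbers are illustrative.

```python
import math

def posterior_over_support(noisy, scale, candidates):
    """Posterior P(w' | noisy) on `candidates`, assuming a uniform prior on w'
    and Laplace noise with the given scale (both are assumptions)."""
    # With a uniform prior, the posterior is proportional to the likelihood
    # exp(-|noisy - w'| / scale) of the Laplace noise.
    likelihood = [math.exp(-abs(noisy - w) / scale) for w in candidates]
    total = sum(likelihood)
    return [value / total for value in likelihood]

# Hypothetical values: observed noisy support 7.4, noise scale 2.0,
# candidate supports 0..20.
candidates = list(range(0, 21))
post = posterior_over_support(noisy=7.4, scale=2.0, candidates=candidates)
map_estimate = candidates[max(range(len(post)), key=post.__getitem__)]
```

Under these assumptions the most probable candidate is the grid value closest to the noisy observation; a different prior or noise model would change the shape of the posterior.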
Furthermore, after the support of an item set in the segmented database is calculated, an appropriate amount of noise is added according to the differential privacy requirement. The magnitude of the noise is proportional to the sensitivity of the computation. For item sets of length i, the sensitivity of the support computation equals the number of item sets of length i whose support is computed during mining. Since FP-growth is a depth-first traversal algorithm, it is difficult to count exactly the number of item sets of length i during mining. If FP-growth were adjusted to generate all item sets of the same length at the same time, a large storage overhead would result. To solve these problems, the invention provides a lightweight dynamic descent method. Its core idea is to use the downward closure property of frequent patterns to dynamically reduce the upper limit of the number of computed item sets of length i, thereby reducing the sensitivity of computing item sets of length i. The specific process is as follows:
Step 1: an initial value is assigned to the upper limit of the number of queried item sets of length i.
Assuming that the database contains s frequent items, an array is defined to store the upper limits on the number of computed item sets of different lengths. The i-th entry of the array is the upper limit on the number of computed item sets of length i, with an initial value of (formula not reproduced in this text).
Step 2: during mining, the upper limit of the number of item sets of length i is dynamically reduced.
Assume that the conditional FP-tree of item set β is currently being mined (the conditional FP-tree of β is the FP-tree built from the small database consisting of all transactions that contain β). The items in the header table of the conditional FP-tree of β are ordered as $\{i_1, \ldots, i_k, \ldots, i_n\}$. The k-th element $i_k$ of the header table forms, together with item set β, a new item set $Y = \beta \cup \{i_k\}$. Let $S_1 = \{i_1, \ldots, i_{k-1}\}$. In addition, let $S_2$ be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of $S_2$, the item set $X = Y \cup \{j\}$ is infrequent, and thus, by the downward closure property of frequent patterns, any item set formed from X and any subset of $S_1 - \{j\}$ is necessarily infrequent. The corresponding entry of the array can therefore be reduced by:
(formula not reproduced in this text)
wherein q is min { | S2|,|S1- | - (p-Y | -1) }, where p is the length of the transaction and q is the number of sub-transactions after splitting.
Step 3: the updated upper limit of the number of item sets is taken as the sensitivity, and the Laplace mechanism is used to add noise to the support, with the ratio of the sensitivity to the privacy budget (the safety coefficient) as the scale of the Laplace distribution.
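The noise-adding step can be sketched as follows. The sketch assumes that the "safety coefficient" is the privacy budget ε of differential privacy, so that the Laplace scale is sensitivity / ε; the function names and the numeric values are hypothetical. Laplace samples are drawn as the difference of two exponential variables, using only the Python standard library.

```python
import random

def laplace_noise(scale, rng=random):
    """One sample from a Laplace distribution with location 0 and the given scale.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows this distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_support(true_support, sensitivity, epsilon, rng=random):
    """Support plus Laplace noise of scale sensitivity / epsilon.

    Assumption: `epsilon` plays the role of the safety coefficient in the text.
    """
    return true_support + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical check: the mean absolute value of Laplace noise equals its scale.
rng = random.Random(0)
samples = [laplace_noise(2.0, rng) for _ in range(20000)]
mean_abs = sum(abs(x) for x in samples) / len(samples)
```

Because the scale grows linearly with the sensitivity, lowering the upper limit on the number of item sets of length i directly lowers the expected magnitude of the noise added to each support.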
From the above analysis, it can be seen that the dynamic descent method only involves simple addition and multiplication operations, and does not bring about complex computation overhead. And the amount of noise added in the mining process is reduced by continuously reducing the upper limit of the item set calculation. On the premise of meeting the differential privacy, the usability of the mining result is improved. That is to say, the number of candidate frequent item sets is continuously adjusted by using a dynamic descent method, so that a small amount of noise is added after the support degree of the item sets in the segmented data is calculated, and the data availability is improved.
Through formal analysis, the transaction segmentation-based frequent item set mining method (PFP) meeting the differential privacy can provide higher mining efficiency and mining result availability while meeting the differential privacy protection.
Compared with the algorithm (TT) proposed in document [9] and the algorithm (PB) proposed in document [8], the proposed PFP algorithm has significant advantages in the availability of mining results and in running efficiency. To better illustrate these advantages, the PFP algorithm is compared with the TT and PB algorithms in terms of "availability of mining results" and "algorithm runtime overhead". For the availability of mining results, the F-score is first used to measure the correctness of the generated frequent item sets. The F-score is calculated as follows:
$F\text{-score} = 2 \cdot \dfrac{precision \cdot recall}{precision + recall}$
where $precision = |U_p \cap U_C| / |U_p|$, $recall = |U_p \cap U_C| / |U_C|$, $U_p$ is the set of frequent item sets generated by the privacy-preserving algorithm, and $U_C$ is the set of true frequent item sets.
In addition, the relative error (RE) is used to measure the error of the published support of a frequent item set relative to its true support. The RE is calculated as follows:
(formula not reproduced in this text)
where X ranges over all the generated frequent item sets, sup(X) denotes the true support of item set X, and sup'(X) denotes the noisy support of item set X.
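The F-score defined above can be computed as in the following sketch. It assumes the standard definition of the F-score as the harmonic mean of precision and recall; the function name `f_score`, the handling of empty inputs and the sample sets are assumptions for illustration.

```python
def f_score(private_sets, true_sets):
    """Harmonic mean of precision and recall between two collections of item sets."""
    up, uc = set(private_sets), set(true_sets)
    if not up or not uc:
        return 0.0  # assumption: score 0 when either collection is empty
    hit = len(up & uc)
    precision = hit / len(up)
    recall = hit / len(uc)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: U_p is the published result, U_C the true result.
u_p = {frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")}
u_c = {frozenset("a"), frozenset("b"), frozenset("ab"),
       frozenset("d"), frozenset("ad")}
score = f_score(u_p, u_c)
```

Here three of the four published item sets are truly frequent (precision 3/4) and three of the five true item sets are recovered (recall 3/5).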
The specific experimental setup is as follows. First, four real data sets are used: Accidents contains traffic accident data; Pumsb-star contains census data from PUMS (Public Use Microdata Sample); POS contains data from a retail outlet of a large electronics retailer; Retail contains market basket data from an anonymous retail store in Belgium. The first two are dense data sets and the last two are sparse data sets. Second, all algorithms are implemented in Java. Finally, the experiments were run on an Intel Core 2 Duo E8400 CPU (3.0 GHz) with 4 GB of RAM.
The performance of the PFP algorithm is illustrated by analyzing the experimental data below.
Result availability:
On the four data sets, the F-score and RE of the algorithms PFP, TT and PB were measured under different thresholds. Since the PB algorithm mines the top-k frequent item sets and cannot be compared with PFP directly, the scenario considered is one in which the user chooses k for a given threshold. The experimental results are shown in FIGS. 5 to 8.
As can be seen from FIGS. 5, 6, 7 and 8, PFP achieves better results than TT on all four data sets. The results are analyzed as follows. Compared with directly truncating transactions, transaction segmentation keeps several sub-transactions for each transaction and distributes the weight uniformly among them, which significantly reduces the loss of information. Although the accuracy of PFP is slightly reduced, the number of frequent item sets it generates increases significantly. In the scenario set up above, PFP still obtains better F-score values on the data sets Pumsb, POS and Retail.
Algorithm runtime overhead:
The running times of PFP, FP (FP-growth), TT and PB when querying the top-k frequent item sets were measured on the four data sets, with k in the range [10, 200]. For PFP, preprocessing is performed only once and is independent of the user-selected threshold, so the running time does not include the preprocessing time.
As can be seen from FIG. 9, the performance of PFP is comparable to that of FP-growth, and PFP achieves better time efficiency than TT and PB. The results are analyzed as follows: compared with FP-growth, the support estimation method and the dynamic descent method add little overhead to PFP in the mining stage; compared with TT, PFP builds on the FP-growth algorithm, which performs better than the Apriori algorithm, so PFP is more efficient than TT.
Formal analysis and extensive experiments show that the differentially private frequent item set mining algorithm based on transaction segmentation performs well in terms of privacy, mining utility and running efficiency.
The above embodiments only illustrate the invention and are not to be construed as limiting it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, which is defined by the claims.
Claims (5)
1. A frequent item set mining method, characterized by comprising the following steps:
S1: dividing each transaction in the original database whose length is greater than the limit length into a plurality of sub-transactions, so that the length of every transaction in the divided database is not greater than the limit length;
S2: mining frequent item sets in the divided database by using a support estimation method and a dynamic descent method, according to a support threshold specified in advance.
2. The frequent itemset mining method of claim 1, wherein the step S1 specifically includes:
S1.1: constructing an undirected weighted graph based on the original database, wherein the vertices represent the items in the database; when the items corresponding to two vertices appear in the same transaction, the two vertices are connected by an edge, and the weight of the edge is the support, in the transaction database, of the item set formed by the items corresponding to the two vertices;
S1.2: discovering communities in the undirected weighted graph by using the Louvain algorithm, and constructing a tree structure, the CR-tree, from the intermediate outputs of the iterations of the Louvain algorithm, wherein the nodes of each layer of the CR-tree represent the communities discovered in the same iteration, the height of the tree represents the number of iterations, a parent node represents the new community formed by merging the communities represented by its child nodes, and the correlation between items is represented by the length of the shortest path between the leaf nodes containing the items;
S1.3: using the generated CR-tree to segment the transactions in the original database whose length is greater than the limit length, thereby generating the segmented database; denoting a transaction, its length, the maximum transaction limit length and the constructed CR-tree by t, p, m and T respectively, the specific segmentation process is as follows:
S1.3.1: calculating the number q of sub-transactions after segmentation from p and m, namely q = ⌈p/m⌉;
S1.3.2: initializing the result set R of the segmentation of transaction t as empty;
S1.3.3: constructing the i-th sub-transaction t_i through the following steps a) to e):
a) selecting, from the leaf nodes of the CR-tree, the nodes that contain items of transaction t, removing from these nodes the items not contained in t, and forming the set N_l from the resulting nodes;
b) selecting from N_l the node n_l that contains the most items, and adding the items of n_l to t_i;
c) sorting the other nodes in N_l by their distance to n_l in T from large to small, and sorting nodes with the same correlation by capacity from large to small;
d) traversing the remaining nodes of N_l in order; if the sum of a node's capacity and the capacity of n_l is not greater than m, taking the node out of N_l and putting it into t_i;
e) storing t_i in the result set R;
S1.3.4: repeating step S1.3.3 q times;
S1.3.5: if nodes still remain in N_l, randomly putting the items of each remaining node into sub-transactions in R whose length is smaller than m;
S1.3.6: returning the result set R.
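The segmentation procedure of steps S1.3.1 to S1.3.6 can be sketched as follows. This is an illustrative sketch under several assumptions, not the claimed implementation: each CR-tree leaf is given as a hypothetical (path, items) pair, where path is the tuple of child indices from the root to the leaf; every item appears in exactly one leaf; q is taken as ⌈p/m⌉; a shorter tree distance is read as a higher correlation, so nearer leaves are visited first; the capacity check in step d) uses the current size of t_i; and a leaf holding more than m items of t is truncated, with the remainder returned to the pool.

```python
import math
import random

def tree_distance(path_a, path_b):
    """Shortest-path length between two leaves, given root-to-leaf index paths."""
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

def split_transaction(t, m, leaves, rng=random):
    """Split transaction t into sub-transactions of at most m items each."""
    t = set(t)
    if len(t) <= m:
        return [t]
    q = math.ceil(len(t) / m)                      # S1.3.1 (assumed formula)
    result = []                                    # S1.3.2
    # a) keep only leaves that share items with t, restricted to those items
    pool = [(path, set(items) & t) for path, items in leaves]
    pool = [node for node in pool if node[1]]
    for _ in range(q):                             # S1.3.3 repeated q times (S1.3.4)
        if not pool:
            break
        # b) seed the sub-transaction with the largest remaining node
        seed = max(pool, key=lambda node: len(node[1]))
        pool.remove(seed)
        seed_path, seed_items = seed
        sub = set(sorted(seed_items)[:m])          # truncate oversized leaves (assumption)
        if seed_items - sub:
            pool.append((seed_path, seed_items - sub))
        # c) closest (most correlated) nodes first, larger nodes first on ties
        pool.sort(key=lambda node: (tree_distance(seed_path, node[0]), -len(node[1])))
        # d) greedily add every node that still fits
        for node in list(pool):
            if len(sub) + len(node[1]) <= m:
                sub |= node[1]
                pool.remove(node)
        result.append(sub)                         # e)
    # S1.3.5: scatter leftover items into sub-transactions that still have room
    for _, items in pool:
        for item in sorted(items):
            rng.choice([s for s in result if len(s) < m]).add(item)
    return result                                  # S1.3.6

# Hypothetical two-level CR-tree with four leaves
leaves = [((0, 0), {"a", "b"}), ((0, 1), {"c"}),
          ((1, 0), {"d", "e"}), ((1, 1), {"f"})]
print(split_transaction("abcdef", 3, leaves))
```

With m = 3, the six-item transaction is split into two sub-transactions: a, b and c end up together because their leaves share a parent, and d, e and f form the second sub-transaction.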
3. The frequent itemset mining method of claim 2, characterized in that, after segmentation, each sub-transaction t_i is assigned a weight of 1/q.
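The effect of the 1/q weighting on support counting can be illustrated with a short sketch. It assumes that the support of an item set in the segmented database is the sum of the weights of the sub-transactions containing it; the function name and the sample database are hypothetical.

```python
from fractions import Fraction

def weighted_support(itemset, split_db):
    """Sum the weights of all sub-transactions that contain the item set.

    split_db holds one inner list per original transaction; each inner list
    contains the q sub-transactions of that transaction, each weighted 1/q.
    """
    itemset = set(itemset)
    total = Fraction(0)
    for subs in split_db:
        weight = Fraction(1, len(subs))
        total += sum(weight for s in subs if itemset <= set(s))
    return total

# Three original transactions; the first and third were split into two parts
split_db = [
    [{"a", "b", "c"}, {"d", "e", "f"}],
    [{"a", "b"}],
    [{"a", "d"}, {"b", "c"}],
]
print(weighted_support({"a", "b"}, split_db))
print(weighted_support({"a"}, split_db))
```

Here {a, b} occurs in all three original transactions, but its weighted support in the segmented database is only 3/2, because the third transaction separates a from b and the first contributes only half its weight. This gap is what the support estimation step has to compensate for.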
4. The frequent itemset mining method of claim 3, wherein the support degree estimation method in step S2 specifically includes:
S2.1: let the length of an item set X in the result set R be i, its support in the original database be ω, its support in the segmented database be ω', and its support in the segmented database after noise is added be ω
S2.2: estimating ω' from the noisy support; by Bayes' rule:
Assuming that ω' follows a uniform prior distribution, its conditional probability distribution satisfies:
S2.3: estimating the support ω of X in the original database. Let the maximum limit length of a transaction be m, and let a transaction t of length p contain X, with p > m. t is divided into q = ⌈p/m⌉ subsets, each of length not exceeding m, where q' = ⌊p/m⌋ subsets have length m and the remaining subset, if any, has length a = p − q'·m, which is less than m. After t is split, the probability that X is contained in one of the resulting sub-transactions is as follows:
Let α_k denote the number of transactions that contain X and have length k, and let n be the maximum transaction length; the expectation of ω' is then:
Let:
The average support of X in the original database is estimated as:
Using the lower ρ-quantile, the maximum support of X in the original database is estimated as:
From the noisy support of X in the transformed database, the maximum support and the average support of X in the original database can be estimated as follows:
If the average support of the item set X is greater than the support threshold, X is a frequent item set; for a frequent item set, if the maximum support of the item set X is greater than the given support threshold, X is used to generate frequent item set candidates.
5. The frequent itemset mining method of claim 4, wherein the dynamic descent method in step S2 specifically includes:
S2.4: assigning an initial value to the upper limit of the number of queried item sets of length i;
Assume that the frequent item set candidates contain s frequent items. An array is defined to store the upper limits on the number of item sets computed for each length, in which the i-th entry is the upper limit on the number of computed item sets of length i, with an initial value of
S2.5: during mining, dynamically reducing the upper limit of the number of item sets of length i, specifically:
Let the order of the items currently in the header table of the conditional FP-tree of item set β be {i_1, ..., i_k, ..., i_n}. The k-th element i_k of the header table together with the item set β forms a new item set Y = β ∪ {i_k}. Let S1 = {i_1, ..., i_{k-1}}, and let S2 be the set of infrequent items newly found in the conditional pattern base of Y. For any element j of S2, the item set X = Y ∪ {j} is infrequent, so by the downward closure property of frequent patterns, any item set formed by X and any subset of S1 − {j} must also be infrequent; the resulting reduction of the upper limit is:
where q = min{|S2|, |S1| − (p − |Y| − 1)}, p is the length of the transaction and q is the number of sub-transactions after segmentation;
S2.6: taking the updated upper limit of the number of item sets as the sensitivity, and using the Laplace mechanism to add noise to the support, with the ratio of the sensitivity to the safety coefficient as the scale of the Laplace distribution.
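The noise addition in the last step can be sketched as follows. This is a minimal illustration that assumes the "safety coefficient" plays the role of the privacy budget ε of differential privacy, so that the Laplace scale is sensitivity/ε. The function name and the numeric values are hypothetical. The Laplace sample is drawn as the difference of two independent exponential variates with the same mean, which is a standard way to sample a zero-mean Laplace distribution.

```python
import random

def noisy_support(support, sensitivity, epsilon, rng=random):
    """Return support plus Laplace noise with scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two Exp(mean=scale) variates is Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return support + noise

rng = random.Random(0)  # fixed seed only to make the demo repeatable
print(noisy_support(100.0, sensitivity=2.0, epsilon=1.0, rng=rng))
```

Because the scale grows with the sensitivity, lowering the upper limit on the number of item sets during mining directly reduces the amount of noise added to each support.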
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410746488.2A CN105740245A (en) | 2014-12-08 | 2014-12-08 | Frequent item set mining method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105740245A true CN105740245A (en) | 2016-07-06 |
Family
ID=56237954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410746488.2A Pending CN105740245A (en) | 2014-12-08 | 2014-12-08 | Frequent item set mining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740245A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | |
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160706 |