CN104537025A - Frequent sequence mining method - Google Patents
- Publication number
- CN104537025A (application CN201410802280.8A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- frequent
- candidate
- node
- length
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the technical fields of data privacy and data mining and discloses a frequent sequence mining method. The method comprises: a first step of calculating the maximum length constraint lmax of sequences from a raw database and obtaining beta = {beta1, ..., betai, ..., betan}, wherein betai is the maximum support of the sequences of length i; and a second step of searching the raw database for frequent sequences under the differential privacy paradigm, using sampling-based candidate set pruning according to lmax and beta. The frequent sequence mining method (PFS2), which realizes candidate set pruning based on sampling while satisfying differential privacy, achieves high utility of the mining results while satisfying differential privacy protection.
Description
Technical field
The present invention relates to the technical fields of data privacy and data mining, and in particular to a frequent sequence mining method.
Background art
Frequent sequence mining is a fundamental problem in data mining and has wide applications in many fields. It can be described as follows. A sequence database is given, in which each sequence is an ordered list of items and can be regarded as the record of a single user. For two sequences $S = s_1 s_2 \ldots s_{|S|}$ and $T = t_1 t_2 \ldots t_{|T|}$, if there exist integers $w_1 < w_2 < \ldots < w_{|S|}$ such that $s_i = t_{w_i}$ for every $1 \le i \le |S|$, then $T$ is said to contain $S$. The support of a sequence is the number of sequences in the database that contain it. A sequence whose support is not less than a given threshold is called a frequent sequence. Given a sequence database and a threshold, frequent sequence mining finds all frequent sequences in the database.
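The definitions of containment, support and frequency above can be illustrated with a minimal Python sketch. The function names and the toy database are assumptions made for illustration only; sequences are modelled as strings whose characters are items, and the threshold is taken as an absolute count.

```python
def contains(t, s):
    """Return True if s is a subsequence of t (order preserved, gaps allowed)."""
    it = iter(t)
    # `item in it` advances the iterator, so matches must occur in order.
    return all(item in it for item in s)


def support(database, s):
    """Number of sequences in the database that contain s."""
    return sum(1 for t in database if contains(t, s))


def frequent(database, candidates, threshold):
    """Candidates whose support is not less than the threshold."""
    return [s for s in candidates if support(database, s) >= threshold]


# Hypothetical toy database: each string is one user's sequence.
db = ["abcbbce", "abbe", "acd", "ba"]
result = frequent(db, ["ab", "ae", "ba"], threshold=2)
```

This is the non-private baseline; it adds no noise and therefore gives no privacy guarantee.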
During mining, if the sequences in the sequence database involve sensitive information, publishing the frequent sequences directly can leak user privacy. How to protect user privacy during frequent sequence mining has received wide attention from academia and industry. The differential privacy paradigm [1] provides a feasible solution to the privacy problems that arise in data analysis. It protects user privacy with a theoretical guarantee by adding noise.
At present, research on privacy-preserving frequent pattern mining under differential privacy can be divided into three classes according to the type of object mined: frequent sequence mining, frequent itemset mining and frequent graph mining. For frequent sequence mining, [2] proposes a two-phase differentially private algorithm for mining frequent sequences of consecutive items; it first uses a prefix tree to find candidate frequent sequences and then uses a database transformation technique to refine the supports of the candidate sequences. References [3] and [4] address the problem of publishing a sequence database under differential privacy: [3] proposes a prefix-tree-based differentially private sequence database publishing algorithm, and [4] uses a variable-length n-gram model to extract the necessary information from the sequence database and uses an exploration tree to reduce the amount of added noise. For frequent itemset mining, [5] proposes the differentially private PrivBasis algorithm for mining the top-k frequent itemsets. Reference [6] finds that limiting the length of transactions effectively improves the trade-off between data utility and privacy protection; using a truncation method, it designs an Apriori-based frequent itemset mining algorithm that satisfies differential privacy. For frequent subgraph mining, [7] integrates the subgraph mining process and privacy protection into a Markov chain Monte Carlo framework and proposes a new frequent subgraph mining algorithm that satisfies differential privacy.
However, the above methods fall short in both the utility of the mining results and the level of privacy protection, which hinders the application of differential privacy to frequent pattern mining.
The references cited in the background art are:
[1] C. Dwork, "Differential privacy," in ICALP, 2006.
[2] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Base Syst., 2002.
[3] R. Chen, B. C. M. Fung, and B. C. Desai, "Differentially private transit data publication: A case study on the Montreal transportation system," in KDD, 2012.
[4] R. Chen, G. Acs, and C. Castelluccia, "Differentially private sequential data publication via variable-length n-grams," in CCS, 2012.
[5] N. Li, W. Qardaji, D. Su, and J. Cao, "PrivBasis: Frequent itemset mining with differential privacy," in VLDB, 2012.
[6] C. Zeng, J. F. Naughton, and J.-Y. Cai, "On differentially private frequent itemset mining," in VLDB, 2012.
[7] E. Shen and T. Yu, "Mining frequent graph patterns with differential privacy," in KDD, 2013.
[8] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in EDBT, 1996.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to provide higher utility of the mining results while satisfying differential privacy.
(2) Technical solution
To solve the above technical problem, the invention provides a frequent sequence mining method comprising the steps of:
S1: calculating the maximum length constraint $l_{max}$ of sequences from the raw database, and obtaining $\beta = \{\beta_1, \ldots, \beta_i, \ldots, \beta_n\}$, where $\beta_i$ is the maximum support of the sequences of length $i$;
S2: according to $l_{max}$ and $\beta$, searching the raw database for frequent sequences under the differential privacy paradigm, using sampling-based candidate set pruning.
In step S1, calculating the maximum length constraint of sequences specifically comprises:
obtaining the maximum length $l_1$ of the sequences in the raw database;
calculating $l_2$ such that $\sum_{i=1}^{l_2} \alpha_i / \sum_{j=1}^{n} \alpha_j$ is not less than a predetermined value, where $\alpha_i$ and $\alpha_j$ are the numbers of sequences of length $i$ and $j$ in the raw database, respectively;
calculating the maximum length constraint of sequences $l_{max} = \min\{l_1, l_2\}$.
The predetermined value is 85%.
When $l_2$ is calculated, noise is added to each $\alpha_i$.
Step S1 further comprises adding noise to each $\beta_i$.
Step S2 specifically comprises:
S2.1: for a given threshold $\theta$, estimating the maximal frequent sequence length $L_f$ from $\beta$: $L_f$ is the integer $y$ such that $\beta_y$ is the smallest value in $\beta$ that is greater than $\theta$;
S2.2: randomly dividing the raw database into $L_f$ mutually disjoint databases used as sample databases, which form the set dbSet; each sample database contains $|D|/L_f$ sequences, where $|D|$ is the number of sequences in the raw database;
S2.3: generating candidate frequent sequences: when frequent 1-sequences are mined, the candidate frequent 1-sequences are the items in the database; afterwards, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, from which the frequent k-sequences are mined;
S2.4: in a sample database, for each sequence whose length exceeds the maximum length constraint $l_{max}$, limiting its length with a sequence shrinking method, and at the same time relaxing the user-specified threshold used to decide whether a sequence is frequent in the sample database;
S2.5: calculating the noisy support of each candidate sequence in the raw database, and outputting the candidate sequences whose noisy support is greater than $\theta$ as frequent sequences.
The sequence shrinking method mainly comprises the following steps:
Step one, irrelevant item deletion: an item of a sequence that is not contained in any candidate sequence is called an irrelevant item (outlier) and is deleted;
Step two, consecutive pattern compression: if a pattern occurs consecutively in a sequence, it is called a consecutive pattern and is compressed as follows:
Let $p^k$ denote pattern $p$ occurring $k$ times consecutively, let $T_1|T_2|T_3$ denote the sequence formed by concatenating the sequences $T_1$, $T_2$ and $T_3$, and let $Contain_k(T)$ denote the set of candidate frequent k-sequences contained in $T$. Then, for the sequences $T_1|p^j|T_2$ and $T_1|p^k|T_2$ with $j > k$, $Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2)$;
Step three, sequence reconstruction: if the sequence length still does not satisfy the maximum length constraint after steps one and two, the sequence is reconstructed as follows:
A candidate sequence tree (CS-tree) is formed, in which the common prefixes of different candidate sequences are placed in the same branch. The height of the tree equals the length of the candidate sequences; each node in the tree is labeled with an item and is associated with the sequence formed by the items on the path from the root to that node. For a CS-tree built from a set of candidate frequent k-sequences, the nodes at level k are associated with the candidate frequent k-sequences and are called c-nodes; the nodes at level k-1 are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences. The sequence reconstruction proceeds as follows:
a) A CS-tree, denoted CT, is built from the candidate frequent k-sequences $C_k$. For a sequence $S$, the candidate frequent k-sequences contained in $S$, denoted $C'_k$, are found, and a new CS-tree, denoted subCT, is built from them; subCT is a subtree of CT. An empty sequence $S'$ is created at the same time;
b) A candidate sequence is chosen from $C'_k$ and appended to $S'$. For any candidate sequence cs in $C'_k$, the set of its (k-1)-subsequences may contain the generating sequences of other candidate sequences in $C'_k$. Each such generating sequence corresponds to a g-node, and each g-node has a group of child nodes. The total number of child nodes of these g-nodes is called the score of cs, denoted c-score. The candidate sequence with the largest c-score is chosen; if several candidate sequences have the same c-score, the one containing the most distinct items is chosen;
c) The c-node corresponding to the sequence chosen in step b) is removed from subCT;
d) subCT is used to find the generating sequences contained in the set of (k-1)-subsequences of $S'$; these generating sequences correspond to a group of g-nodes. If some nodes in this group have child nodes, i.e. c-nodes, each of these child nodes corresponds to a labeled item, and different child nodes may have the same labeled item; the item that corresponds to the most c-nodes is appended to $S'$, and all of those c-nodes are removed from subCT. If no node in this group has child nodes, a candidate sequence that is not contained in $S'$ is chosen from $C'_k$ according to step b) and appended to $S'$. Items are appended to $S'$ in this way until it meets the maximum length constraint.
The threshold relaxation method specifically comprises:
For a given sample database $D_s$ and a set of candidate frequent k-sequences, a candidate frequent k-sequence $t_k$ is assumed whose true support equals the specified threshold $\theta$; then the cumulative distribution function $F_k$ of the noisy support of $t_k$ in $D_s$ is calculated; finally, $F_k(\theta') = \zeta$ is set and the corresponding $\theta'$ is calculated. The sequences whose support in the sample database is greater than $\theta'$ are potential frequent sequences.
(3) Beneficial effects
The frequent sequence mining method of the present invention (PFS2), which realizes candidate set pruning based on sampling while satisfying differential privacy, provides higher utility of the mining results while satisfying differential privacy.
Brief description of the drawings
Fig. 1 is a schematic flow chart of a frequent sequence mining method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a CS-tree;
Fig. 3 (a) to (f) compare the score and recall rate of the method of the present invention (the PFS2 algorithm) with those of the Prefix algorithm and the n-gram algorithm under different thresholds;
Fig. 4 (a) to (d) compare the score and recall rate of the method of the present invention (the PFS2 algorithm) with those of the Prefix algorithm and the n-gram algorithm under different privacy parameters;
Fig. 5 (a) to (d) show the impact of the sequence shrinking method and the threshold relaxation method on the performance of the PFS2 algorithm.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.
The frequent sequence mining method of this embodiment realizes candidate set pruning based on sampling while satisfying differential privacy, and specifically comprises:
Step S1: calculating the maximum length constraint $l_{max}$ of sequences from the raw database, and obtaining $\beta = \{\beta_1, \ldots, \beta_i, \ldots, \beta_n\}$, where $\beta_i$ is the maximum support of the sequences of length $i$;
Step S2: searching for frequent sequences under the differential privacy paradigm, using sampling-based candidate set pruning. In this embodiment, the sequence shrinking method and the threshold relaxation method are used during candidate sequence pruning.
The implementation of steps S1 and S2 is described in detail below. Step S1 specifically comprises:
Step S1.1: calculating the maximum length constraint. Given a database, $l_{max}$ is determined heuristically by setting $l_{max} = \min\{l_1, l_2\}$. Here $l_1$ is the maximum length of the sequences in the database, which determines the maximum error introduced by the noise added during support computation, and $l_2$ is calculated from the database as follows. First, let $\alpha = \{\alpha_1, \ldots, \alpha_n\}$, where $\alpha_i$ is the number of input sequences of length $i$ in the raw database. Then $l_2$ is calculated such that $\sum_{i=1}^{l_2} \alpha_i / \sum_{j=1}^{n} \alpha_j$ is not less than 85%. Because the calculation of $\alpha$ involves private data, an appropriate amount of noise is added to each $\alpha_i$. The noise is added according to the Laplace mechanism, that is, a random number drawn from a Laplace distribution is added to the exact result.
Step S1.2: calculating $\beta = \{\beta_1, \ldots, \beta_i, \ldots, \beta_n\}$, where $\beta_i$ is the maximum support among all sequences of length $i$. It is used in step S2 to estimate the maximal frequent sequence length $L_f$. Because calculating $\beta$ exactly is computationally infeasible, this embodiment uses the non-private frequent sequence mining algorithm GSP [8] to calculate $\beta$. Since $\beta$ is an intrinsic property of the database and may leak private data, an appropriate amount of noise is added to each $\beta_i$.
Step S1 thus yields the required maximum length constraint $l_{max}$ and $\beta$. Step S2 specifically comprises:
S2.1: for a given threshold $\theta$, estimating the maximal frequent sequence length $L_f$ from $\beta$: $L_f$ is the integer $y$ such that $\beta_y$ is the smallest value in $\beta$ that is greater than $\theta$.
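The estimate in S2.1 amounts to a small selection over $\beta$. The sketch below is one reading of it; the function name, the 1-based list layout, the tie-breaking rule and the return value 0 when no entry exceeds $\theta$ are assumptions.

```python
def estimate_max_frequent_length(beta, theta):
    """Return y such that beta[y-1] is the smallest entry of beta above theta.

    beta[0] holds beta_1, beta[1] holds beta_2, and so on. Returns 0 when
    no entry exceeds theta (assumed convention). Ties pick the shortest y.
    """
    above = [(b, y) for y, b in enumerate(beta, start=1) if b > theta]
    if not above:
        return 0
    return min(above)[1]


# Hypothetical noisy maximum supports for lengths 1..4 (relative supports).
L_f = estimate_max_frequent_length([0.9, 0.6, 0.35, 0.1], theta=0.3)
```

Without noise, $\beta_i$ cannot increase with $i$ (a longer sequence is never more frequent than its subsequences), so the selected $y$ is simply the largest length whose maximum support still exceeds $\theta$.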
S2.2: randomly dividing the raw database into $L_f$ mutually disjoint databases used as sample databases, which form the set dbSet; each sample database contains approximately $|D|/L_f$ sequences, where $|D|$ is the number of sequences in the raw database;
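A random disjoint partition of this kind can be sketched as a shuffle followed by striding. The function name and the `seed` parameter are assumptions for illustration.

```python
import random


def partition_database(database, num_parts, seed=None):
    """Split the database into num_parts disjoint random samples of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(database)
    rng.shuffle(shuffled)
    # Striding over the shuffled list yields disjoint parts whose sizes
    # differ by at most one sequence.
    return [shuffled[i::num_parts] for i in range(num_parts)]


# Hypothetical database of 10 sequences split into L_f = 3 sample databases.
db_set = partition_database([f"seq{i}" for i in range(10)], num_parts=3, seed=1)
```

Each sequence lands in exactly one sample database, which is what makes the samples mutually disjoint.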
S2.3: generating candidate frequent sequences. When frequent 1-sequences are mined (an x-sequence is a sequence of length x), the candidate frequent 1-sequences are the items in the database. Afterwards, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, from which the frequent k-sequences are mined.
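The text does not specify how candidates are joined, so the following sketch assumes a GSP-style join for sequences of single items: two frequent (k-1)-sequences are joined when dropping the first item of one equals dropping the last item of the other, and a candidate is kept only if every (k-1)-subsequence obtained by deleting one item is frequent (downward closure). The function name is hypothetical.

```python
def generate_candidates(frequent_prev):
    """Candidate k-sequences from the frequent (k-1)-sequences (strings of items)."""
    prev = set(frequent_prev)
    # Join step: s1 without its first item must equal s2 without its last item.
    joined = {s1 + s2[-1:] for s1 in prev for s2 in prev if s1[1:] == s2[:-1]}
    # Prune step: every subsequence with one item deleted must be frequent.
    return sorted(
        c for c in joined
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    )


candidates_2 = generate_candidates(["a", "b"])
candidates_3 = generate_candidates(["ab", "bc", "bd", "ac"])
```

In the second call, abd is produced by the join but pruned because its subsequence ad is not frequent.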
S2.4: given a sample database, pruning the candidate set with the sampling-based candidate set pruning algorithm proposed by the present invention (comprising the sequence shrinking method and the threshold relaxation method). Specifically, in the sample database, each sequence whose length exceeds the maximum length constraint $l_{max}$ is shortened with the sequence shrinking method. At the same time, the user-specified threshold is relaxed and used to decide whether a sequence is frequent in the sample database.
S2.5: calculating the noisy support of each candidate sequence in the raw database, and outputting the candidate sequences whose noisy support is greater than $\theta$ as frequent sequences.
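The final selection in S2.5 can be sketched as below. The text does not state how the Laplace scale is calibrated here, so `noise_scale` is left as a caller-supplied parameter and is an assumption, as are the function names and the use of an absolute count for $\theta$.

```python
import random


def contains(t, s):
    """True if s is a subsequence of t (order preserved, gaps allowed)."""
    it = iter(t)
    return all(item in it for item in s)


def noisy_frequent_sequences(database, candidates, theta, noise_scale, seed=None):
    """Return {candidate: noisy support} for candidates whose noisy support exceeds theta."""
    rng = random.Random(seed)
    output = {}
    for cand in candidates:
        exact = sum(1 for t in database if contains(t, cand))
        # Laplace(0, noise_scale) noise as a difference of two exponentials.
        noise = rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
        if exact + noise > theta:
            output[cand] = exact + noise
    return output


# Hypothetical toy run; a tiny noise scale makes the outcome easy to check.
db = ["abcbbce", "abbe", "acd", "ba"]
released = noisy_frequent_sequences(db, ["ab", "ae", "ba"], theta=1.5,
                                    noise_scale=1e-9, seed=0)
```

Only the noisy supports are released; the exact supports stay inside the function.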
The sequence shrinking method and the threshold relaxation method used in this step are introduced below. The sequence shrinking method is as follows:
In a sample database, the support of a sequence in that sample database, i.e. its local support, is used to estimate whether it is a potential frequent sequence. Because of the privacy requirement, noise must be added to the local support of each sequence. To make the estimate more accurate, the added noise should be as small as possible. In frequent itemset mining under differential privacy, it has been found that limiting the length of transactions effectively reduces the amount of added noise. However, because sequences and itemsets are essentially different, the method proposed there is not applicable to frequent sequence mining. The present invention therefore proposes the sequence shrinking method.
For ease of explanation, consider an example: given a sequence A = abcbbce and four candidate frequent 2-sequences ab, be, bb and ae, the sequences $A_1$ = abbbe and $A_2$ = abbe contain the same candidate frequent 2-sequences as A.
The sequence shrinking method mainly comprises the following steps:
Step one: irrelevant item deletion. An item of a sequence that is not contained in any candidate sequence is called an irrelevant item (outlier), such as c in the example above. Irrelevant items have no effect on the candidate sequences, so deleting them causes no information loss. This converts A into $A_1$.
Step two: consecutive pattern compression. A pattern may occur consecutively in a sequence; such a pattern is called a consecutive pattern. The following theorem guarantees that compressing consecutive patterns to a certain degree does not lose any frequency information.
Theorem: let $p^k$ denote pattern $p$ occurring $k$ times consecutively, let $T_1|T_2|T_3$ denote the sequence formed by concatenating the sequences $T_1$, $T_2$ and $T_3$, and let $Contain_k(T)$ denote the set of candidate frequent k-sequences contained in $T$. Then, for the sequences $T_1|p^j|T_2$ and $T_1|p^k|T_2$ with $j > k$, $Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2)$. Consecutive patterns can therefore be compressed.
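Steps one and two can be checked on the running example A = abcbbce. The sketch below is a simplification under stated assumptions: it only compresses runs of a single repeated item (the theorem also covers multi-item patterns), sequences are strings, and all function names are hypothetical.

```python
from itertools import groupby


def contains(t, s):
    """True if s is a subsequence of t (order preserved, gaps allowed)."""
    it = iter(t)
    return all(item in it for item in s)


def delete_irrelevant_items(seq, candidates):
    """Step one: drop items that appear in no candidate sequence."""
    keep = set().union(*candidates)
    return "".join(x for x in seq if x in keep)


def compress_runs(seq, k):
    """Step two (single-item patterns only): cap every run of one item at k repeats."""
    return "".join(item * min(len(list(run)), k) for item, run in groupby(seq))


def contained_candidates(seq, candidates):
    """Contain_k(seq): the candidates that seq contains."""
    return {c for c in candidates if contains(seq, c)}


cands = ["ab", "be", "bb", "ae"]          # candidate frequent 2-sequences
A = "abcbbce"
A1 = delete_irrelevant_items(A, cands)    # c occurs in no candidate
A2 = compress_runs(A1, k=2)               # the run bbb is capped at two b's
```

All three sequences contain the same four candidates, so shortening A from 7 items to 4 loses no information about the candidate 2-sequences.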
Step three: sequence reconstruction. If the sequence length still does not satisfy the maximum length constraint after steps one and two, some items must be removed from the sequence. Removing items at random would lose a large amount of frequency information and lead to wrong estimates of the potential frequent sequences, while enumerating all possibilities to find the sequence that satisfies the maximum length constraint and shares the most candidate sequences with the original sequence is computationally infeasible. The present invention therefore proposes a novel sequence reconstruction method.
First, the candidate sequence tree (CS-tree) is defined. The common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences; each node in the tree is labeled with an item and is associated with the sequence formed by the items on the path from the root to that node. Fig. 2 shows a CS-tree formed from the set of candidate frequent 3-sequences {abc, bcd, bda, bdb}.
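A CS-tree as defined above is a prefix tree over the candidate sequences. The sketch below represents it with nested dictionaries, which is an implementation choice made here for illustration rather than something the text prescribes; the function names are hypothetical.

```python
def build_cs_tree(candidates):
    """Prefix tree: each key is an item label, each value the subtree below it."""
    root = {}
    for cand in candidates:
        node = root
        for item in cand:
            node = node.setdefault(item, {})  # shared prefixes share a branch
    return root


def sequences_at_depth(tree, depth, prefix=""):
    """Sequences associated with the nodes at the given depth (root is depth 0)."""
    if depth == 0:
        return [prefix]
    found = []
    for item, child in tree.items():
        found.extend(sequences_at_depth(child, depth - 1, prefix + item))
    return found


ct = build_cs_tree(["abc", "bcd", "bda", "bdb"])
level_3 = sorted(sequences_at_depth(ct, 3))  # sequences of the level-3 nodes
level_2 = sorted(sequences_at_depth(ct, 2))  # sequences of the level-2 nodes
```

For this candidate set the level-2 sequences are ab, bc and bd, and the node for bd has two children (a and b) because bda and bdb share the prefix bd.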
For a CS-tree built from a set of candidate frequent k-sequences, the nodes at level k are associated with the candidate frequent k-sequences and are called c-nodes. Likewise, the nodes at level k-1 are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences. The sequence reconstruction proceeds as follows:
a) A CS-tree, denoted CT, is built from the candidate frequent k-sequences $C_k$. For a sequence $S$, the candidate frequent k-sequences contained in $S$, denoted $C'_k$, are found, and a new CS-tree, denoted subCT, is built from them. subCT is a subtree of CT. An empty sequence $S'$ is created at the same time.
b) A candidate sequence is chosen from $C'_k$ and appended to $S'$. For any candidate sequence cs in $C'_k$, the set of its (k-1)-subsequences may contain the generating sequences of other candidate sequences in $C'_k$. Each such generating sequence corresponds to a g-node, and each g-node has a group of child nodes. The total number of child nodes of these g-nodes is called the score of cs, denoted c-score. The candidate sequence with the largest c-score is chosen. If several candidate sequences have the same c-score, the one containing the most distinct items is chosen.
c) The c-node corresponding to the sequence chosen in b) is removed from subCT.
d) subCT is used to find the generating sequences contained in the set of (k-1)-subsequences of $S'$; they correspond to a group of g-nodes. If some nodes in this group have child nodes, i.e. c-nodes, each of these child nodes corresponds to a labeled item, and different child nodes may have the same labeled item; the item that corresponds to the most c-nodes is appended to $S'$, and all of those c-nodes are removed from subCT. If no node in this group has child nodes, a candidate sequence that is not contained in $S'$ is chosen from $C'_k$ according to b) and appended to $S'$.
Items are appended to $S'$ in this way until it meets the maximum length constraint. This method reconstructs the sequence efficiently while making the number of candidate sequences shared by $S'$ and $S$ as large as possible. The threshold relaxation method is as follows:
Because the sequences in a sample database are drawn at random from the raw database, many frequent sequences are no longer frequent in the sample database. If the user-specified threshold were still used to decide which sequences are potential frequent sequences, the estimate would be wrong. To solve this problem, the present invention proposes the threshold relaxation method. The process is as follows. For a given sample database $D_s$ and a set of candidate frequent k-sequences, first assume a candidate frequent k-sequence $t_k$ whose true support equals the specified threshold $\theta$. Then calculate the cumulative distribution function $F_k$ of the noisy support of $t_k$ in $D_s$. Denote the noisy support of $t_k$ in $D_s$ by $z$; $z$ is the sum of two random variables, $x$ (the true support of $t_k$ in $D_s$) and the Laplace noise $y$, where $x$ follows the Gaussian distribution $Normal(\mu, \sigma^2)$ and $y$ follows a Laplace distribution with location parameter $\lambda$ and a scale parameter. The cumulative distribution function of $z$ is then expressed in terms of these parameters and the function erfc.
Here $\mu$ and $\sigma^2$ are the parameters of the Gaussian distribution: $\mu$ is the mean of the distribution and $\sigma^2$ is its variance. $\lambda$ is the location parameter of the Laplace distribution, which also has a scale parameter. erfc is the complementary error function, $erfc(u) = \frac{2}{\sqrt{\pi}} \int_u^{\infty} e^{-p^2} \, dp$, where $p$ is the integration variable. Finally, set $F_k(\theta') = \zeta$ (with $\theta'$ taking the place of $z$ in the expression for $F_k$), calculate the corresponding $\theta'$, and use it to estimate which candidate sequences in the sample database are potential frequent sequences: the sequences whose support in the sample database is greater than $\theta'$ are potential frequent sequences. Experiments show that setting $\zeta$ to 0.3 usually gives good results. Theoretical analysis shows that the method satisfies the differential privacy paradigm.
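The relaxed threshold can be found numerically without the closed-form expression. The sketch below is not the formula used in the patent: it evaluates the cumulative distribution function of the sum of a Gaussian and a Laplace variable directly from its definition, $F(z) = \int \Phi\big((z - t - \mu)/\sigma\big) f_{Lap}(t)\,dt$, by trapezoidal integration, and then solves $F(\theta') = \zeta$ by bisection. The function names, the symbol `scale` for the Laplace scale parameter, the integration range and the example parameter values are all assumptions.

```python
import math


def normal_cdf(v, mu, sigma):
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))


def laplace_pdf(t, loc, scale):
    return math.exp(-abs(t - loc) / scale) / (2.0 * scale)


def noisy_support_cdf(z, mu, sigma, loc, scale, steps=4000):
    """P(x + y <= z) for x ~ Normal(mu, sigma^2), y ~ Laplace(loc, scale)."""
    lo, hi = loc - 40.0 * scale, loc + 40.0 * scale  # Laplace tails beyond are negligible
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0      # trapezoid rule
        total += weight * normal_cdf(z - t, mu, sigma) * laplace_pdf(t, loc, scale)
    return total * h


def relaxed_threshold(zeta, mu, sigma, loc, scale, tol=1e-6):
    """Solve F(theta') = zeta by bisection (F is increasing in z)."""
    lo = mu + loc - 10.0 * sigma - 40.0 * scale
    hi = mu + loc + 10.0 * sigma + 40.0 * scale
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if noisy_support_cdf(mid, mu, sigma, loc, scale) < zeta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical parameters: expected sample support 100, std 5, Laplace(0, 2) noise.
theta_relaxed = relaxed_threshold(zeta=0.3, mu=100.0, sigma=5.0, loc=0.0, scale=2.0)
```

Because $\zeta = 0.3$ is below one half and the noise is centred at zero, the relaxed threshold falls below the expected sample support, which is the intended relaxation.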
In a sample database, each sequence whose length exceeds the maximum length constraint $l_{max}$ is shortened with the sequence shrinking method, which reduces the sensitivity, hence the amount of Laplace noise added, and ultimately improves the utility of the frequent itemset supports. When a sample database is used to judge in advance whether an itemset is frequent, some frequent itemsets may be misjudged as infrequent. To reduce this error, the threshold is lowered to a certain degree; this is the threshold relaxation method. The sequence shrinking method and the threshold relaxation method thus act at different stages: the sequence shrinking method reduces the noise added when itemset supports are counted in the sample database, and the threshold relaxation method reduces the number of frequent itemsets misjudged as infrequent.
Formal analysis shows that the frequent sequence mining algorithm of the present invention (PFS2), which realizes candidate set pruning based on sampling while satisfying differential privacy, provides higher utility of the mining results while satisfying differential privacy.
Compared with the algorithm proposed in [3] (Prefix) and the algorithm proposed in [4] (n-gram), the proposed PFS2 algorithm has a clear advantage in both the utility of the mining results and the level of privacy protection. To better illustrate the advantages of the algorithm of the present invention, the widely used metrics F-score [6] and relative error (RE) [5] are used to compare the PFS2 algorithm with the Prefix algorithm and the n-gram algorithm. F-score measures the utility of the generated frequent sequences, and RE measures the error of their supports relative to the exact supports.
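The two metrics can be sketched as follows. F-score is the usual harmonic mean of precision and recall over the released and the true frequent sequences. For RE, the text does not state how the per-sequence errors are aggregated; the median over sequences present in both results is an assumption made here, as are the function names and the toy values.

```python
from statistics import median


def f_score(released, truth):
    """Harmonic mean of precision and recall of the released frequent sequences."""
    released, truth = set(released), set(truth)
    hits = len(released & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(released)
    recall = hits / len(truth)
    return 2.0 * precision * recall / (precision + recall)


def relative_error(noisy_support, true_support):
    """Median of |noisy - exact| / exact over sequences present in both mappings."""
    errors = [
        abs(noisy_support[s] - true_support[s]) / true_support[s]
        for s in noisy_support
        if s in true_support and true_support[s] > 0
    ]
    return median(errors) if errors else 0.0


# Hypothetical values for illustration only.
fs = f_score({"ab", "ae", "ba"}, {"ab", "ae", "bb", "be"})
re_value = relative_error({"ab": 11.0, "ae": 18.0}, {"ab": 10.0, "ae": 20.0})
```

In this toy case two of the three released sequences are truly frequent and two of the four true ones are found, giving an F-score of 4/7.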
When Prefix and n-gram are used to find frequent sequences privately, the usual approach is adopted: they are first run on the original database to generate an anonymized database, and the non-private frequent sequence mining algorithm GSP is then run on the anonymized database.
The experimental setup is as follows.
Three real data sets are used. Because the data in the House_Power data set are a time series, the values are discretized and one sequence is constructed from every 50 consecutive samples. The characteristics of the data in the three databases are shown in Table 1:
Table 1: Characteristics of the databases
Database | Number of sequences | Number of items | Maximum length | Average length |
MSNBC | 989818 | 17 | 14795 | 4.75 |
BIBLE | 36369 | 19305 | 100 | 21.64 |
House_Power | 40986 | 21 | 50 | 50 |
All algorithms are implemented in Java. The experiments were run on an Intel Core 2 Duo E8400 CPU (3.0 GHz) with 4 GB of RAM.
The performance of the PFS2 algorithm is illustrated below by analyzing the experimental data.
First, the performance of the PFS2 algorithm, the Prefix algorithm and the n-gram algorithm is compared under different thresholds. Because the number of items in BIBLE is too large, it is difficult for the Prefix algorithm to build a prefix tree on it, so the performance of the Prefix algorithm on BIBLE is not shown. The experimental results are shown in Fig. 3 (a) to (f).
Fig. 3 shows that the PFS2 algorithm clearly outperforms the Prefix algorithm and the n-gram algorithm. The results can be explained as follows. To satisfy the maximum length constraint, the Prefix algorithm and the n-gram algorithm directly delete the items of an input sequence that exceed the limit, which loses a large amount of frequency information. In contrast, the PFS2 algorithm uses the sequence shrinking method, which effectively preserves the frequency information in each input sequence and therefore significantly improves the utility of the frequent sequences.
Next, the data sets MSNBC ($\theta$ = 0.015) and House_Power ($\theta$ = 0.34) are used to compare the performance of the PFS2 algorithm, the Prefix algorithm and the n-gram algorithm under different privacy parameters. The experimental results are shown in Fig. 4 (a) to (d).
Fig. 4 shows that, at the same privacy level, the PFS2 algorithm consistently outperforms the Prefix algorithm and the n-gram algorithm. The three algorithms also behave consistently: as $\epsilon$ increases, the quality of the frequent sequences improves. Here $\epsilon$ is the privacy parameter of $\epsilon$-differential privacy and represents the strength of the privacy protection. The larger $\epsilon$ is, the weaker the privacy protection and the less noise needs to be added, so the quality of the frequent sequences is higher.
Finally, the databases BIBLE and House_Power are used to measure the impact of the sequence shrinking method and the threshold relaxation method on the performance of the PFS2 algorithm. In this embodiment, RR denotes deleting items from sequences at random without relaxing the user-specified threshold in the sample database, and SR denotes using the sequence shrinking method without the threshold relaxation method.
Fig. 5 (a) to (d) show that RR, which uses neither the sequence shrinking method nor the threshold relaxation method, cannot produce reasonable results. The sequence shrinking method and the threshold relaxation method effectively improve the F-score of the PFS2 algorithm. In terms of RE, performance decreases slightly after the two methods are applied, but it remains good. This is because, with the sequence shrinking method and the threshold relaxation method, more truly frequent sequences are retained in the sample database, which slightly increases the amount of noise added for each candidate sequence.
The extensive experiments above show that the proposed PFS2 algorithm has a clear advantage in both the utility of the mining results and the level of privacy protection.
The above embodiments are intended only to illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention. All equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.
Claims (8)
1. A frequent sequence mining method, characterized by comprising the steps of:
S1: computing the maximum constraint length $l_{max}$ of sequences from a raw database, and obtaining $\beta = \{\beta_1, \ldots, \beta_i, \ldots, \beta_n\}$, where $\beta_i$ denotes the maximum support of the sequences of length $i$;
S2: according to said $l_{max}$ and $\beta = \{\beta_1, \ldots, \beta_i, \ldots, \beta_n\}$, searching the raw database for frequent sequences on the basis of a sampling-based candidate set pruning technique, under the condition that differential privacy is satisfied.
2. The frequent sequence mining method according to claim 1, characterized in that, in step S1, computing the maximum constraint length of sequences specifically comprises:
obtaining the maximum length $l_1$ of the sequences in the raw database;
computing $l_2$ such that an expression in $\alpha_i$ and $\alpha_j$ (the formula itself is missing from this text) is not less than a predetermined value, where $\alpha_i$ and $\alpha_j$ denote the numbers of sequences of length $i$ and $j$, respectively, in the raw database;
computing the maximum constraint length of the sequences as $l_{max} = \min\{l_1, l_2\}$.
3. The frequent sequence mining method according to claim 2, characterized in that the predetermined value is 85%.
4. The frequent sequence mining method according to claim 2, characterized in that noise is added to each $\alpha_i$ when computing $l_2$.
5. The frequent sequence mining method according to claim 2, characterized in that step S1 further comprises adding noise to each $\beta_i$.
6. The frequent sequence mining method according to any one of claims 1 to 5, characterized in that step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length $L_f$, where $L_f$ is the integer $y$ such that $\beta_y$ is the smallest value in β that is greater than θ;
S2.2: randomly dividing the raw database into $L_f$ mutually disjoint databases that serve as sample databases and form the set dbSet, each database containing $|D|/L_f$ sequences, where $|D|$ denotes the number of sequences in the raw database;
S2.3: generating candidate frequent sequences: when mining frequent 1-sequences, the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used to mine the frequent k-sequences;
S2.4: in the sample database, for sequences whose length exceeds the maximum constraint length $l_{max}$, using the sequence shrinking method to limit their length, and at the same time relaxing the user-specified threshold, which is used to judge whether a sequence is frequent in the sample database;
S2.5: computing the noisy support of each candidate sequence in the raw database, and outputting the candidate sequences whose noisy support is greater than the threshold θ as frequent sequences.
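Steps S2.1 to S2.3 of claim 6 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the function names are hypothetical, sequences are modeled as tuples of single items, leftover sequences that do not divide evenly among the samples are simply dropped, and candidate generation uses a common Apriori-style join followed by pruning.

```python
import random


def estimate_max_frequent_length(beta, theta):
    """S2.1: return y such that beta_y is the smallest value in beta above theta."""
    above = [(b, i) for i, b in enumerate(beta, start=1) if b > theta]
    if not above:
        return 0
    smallest = min(b for b, _ in above)
    # On ties, take the largest index (an assumption of this sketch).
    return max(i for b, i in above if b == smallest)


def split_database(database, n_parts, rng):
    """S2.2: randomly split into n_parts disjoint samples of |D| // n_parts sequences."""
    shuffled = list(database)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_parts
    return [shuffled[i * size:(i + 1) * size] for i in range(n_parts)]


def generate_candidates(frequent_prev):
    """S2.3: build candidate k-sequences from frequent (k-1)-sequences."""
    candidates = set()
    for s1 in frequent_prev:
        for s2 in frequent_prev:
            # Join when s1 without its first item equals s2 without its last.
            if s1[1:] == s2[:-1]:
                cand = s1 + s2[-1:]
                # Downward closure: every (k-1)-subsequence obtained by
                # deleting one item must itself be frequent.
                if all(cand[:i] + cand[i + 1:] in frequent_prev
                       for i in range(len(cand))):
                    candidates.add(cand)
    return candidates


beta = [0.9, 0.6, 0.4, 0.2, 0.05]
l_f = estimate_max_frequent_length(beta, theta=0.3)
samples = split_database([("a", "b")] * 10, l_f, random.Random(1))
print(l_f, [len(s) for s in samples])
print(generate_candidates({("a", "b"), ("b", "c"), ("a", "c")}))
```

Each sample database is used for one sequence length only, which is why the number of samples equals the estimated maximal frequent sequence length.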
7. The frequent sequence mining method according to claim 6, characterized in that the sequence shrinking method specifically comprises the following steps:
Step 1, outlier deletion: an item of a sequence that is not contained in any candidate sequence is called an outlier, and such items are deleted;
Step 2, consecutive pattern compression: in a sequence, a pattern may occur several times in a row; such a pattern is called a consecutive pattern, and consecutive patterns are compressed in the following manner:
let $p^k$ denote that pattern $p$ occurs $k$ times consecutively, let $T_1|T_2|T_3$ denote the sequence formed by concatenating the sequences $T_1$, $T_2$ and $T_3$, and let $Contain_k(T)$ denote the set of candidate frequent k-sequences contained in $T$; for the sequences $T_1|p^j|T_2$ and $T_1|p^k|T_2$ (where $j > k$), $Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2)$;
Step 3, sequence reconstruction: if the sequence length still does not satisfy the maximum constraint length after steps 1 and 2, sequence reconstruction is performed in the following manner:
a candidate sequence tree (CS-tree) is formed, in which the common prefixes of different candidate sequences are placed on the same branch; the height of the tree equals the length of the candidate sequences, each node in the tree is labeled with an item, and the node is associated with the sequence formed by all items on the path from the root to that node; for a CS-tree built from a set of candidate frequent k-sequences, the nodes on level k are associated with the candidate frequent k-sequences and are called c-nodes, the nodes on level k-1 are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with the g-nodes are called generating sequences; the sequence reconstruction process is as follows:
a) a CS-tree, denoted CT, is built from the candidate frequent k-sequences $C_k$; for a sequence $S$, the candidate frequent k-sequences $C'_k$ contained in $S$ are found, and a new CS-tree, denoted subCT, is built from them, subCT being a subtree of CT; at the same time, an empty sequence $S'$ is created;
b) a candidate sequence is chosen from $C'_k$ and appended to $S'$; for any candidate sequence cs in $C'_k$, the set of its (k-1)-subsequences may contain the generating sequences of other candidate sequences in $C'_k$; each of these generating sequences corresponds to a g-node, and each g-node has a group of child nodes; the sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score; the candidate sequence with the largest c-score is chosen, and if several candidate sequences have the same c-score, the one containing the most distinct items is chosen;
c) the c-node corresponding to the sequence chosen in step b) is removed from subCT;
d) subCT is used to find the generating sequences contained in the set of (k-1)-subsequences of $S'$; these generating sequences correspond to a group of g-nodes; if some node in this group of g-nodes has child nodes, that is, c-nodes, each of these child nodes corresponds to a label item, and the label items of different child nodes may be identical; the item corresponding to the most c-nodes is appended to $S'$, and all of the corresponding c-nodes are removed from subCT; if no node in this group of g-nodes has child nodes, a candidate sequence that is not contained in $S'$ is chosen from $C'_k$ according to step b) and appended to $S'$; items are repeatedly added to $S'$ until it meets the maximum constraint length.
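Steps 1 and 2 of claim 7 can be illustrated with a short sketch. It makes two simplifying assumptions that are not taken from the source: "contained in" is read as the ordinary subsequence relation (gaps allowed), and only consecutive patterns consisting of a single repeated item are compressed. All function names are hypothetical, and step 3 (reconstruction with the CS-tree) is not covered.

```python
from itertools import groupby


def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def contain_k(sequence, candidates):
    """Set of candidate k-sequences contained in `sequence`."""
    return {c for c in candidates if is_subsequence(c, sequence)}


def delete_outliers(sequence, candidates):
    """Step 1: drop items that appear in no candidate sequence."""
    useful = {item for c in candidates for item in c}
    return [item for item in sequence if item in useful]


def compress_runs(sequence, k):
    """Step 2 (single-item patterns only): cap each run of an item at k copies."""
    out = []
    for item, run in groupby(sequence):
        out.extend([item] * min(k, len(list(run))))
    return out


candidates = {("a", "b"), ("b", "b"), ("b", "c")}
original = list("axbbbbyc")
shrunk = compress_runs(delete_outliers(original, candidates), k=2)
print("".join(shrunk))
# The shrunk sequence supports the same candidates as the original one.
print(contain_k(original, candidates) == contain_k(shrunk, candidates))
```

The final comparison reflects the property stated in step 2: shortening a run of more than k repetitions to k repetitions leaves the set of contained candidate k-sequences unchanged.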
8. The frequent sequence mining method according to claim 6, characterized in that the threshold relaxation method specifically comprises:
for a given sample database $D_s$ and a set of candidate frequent k-sequences, assuming a candidate frequent k-sequence $t_k$ whose true support equals the specified threshold θ; then computing the cumulative distribution function $F_k$ of the noisy support of $t_k$ in $D_s$; finally, setting $F_k(\theta') = \zeta$ and computing the corresponding $\theta'$; the sequences whose support in the sample database is greater than $\theta'$ are potential frequent sequences.
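The idea of claim 8 can be illustrated numerically. The text does not state which distribution $F_k$ follows, so the sketch below assumes, purely for illustration, that the support count of $t_k$ in a sample of `sample_size` sequences is binomially distributed with success probability θ. The function name `relaxed_threshold` and its parameters are hypothetical.

```python
from math import comb


def relaxed_threshold(theta, sample_size, zeta):
    """Return theta' as the zeta-quantile of the assumed support distribution.

    Assumption (not from the source): the support count in the sample is
    Binomial(sample_size, theta); theta' is the smallest relative support
    whose cumulative probability reaches zeta.
    """
    if not 0 < theta < 1 or not 0 < zeta <= 1 or sample_size <= 0:
        raise ValueError("invalid parameters")
    cumulative = 0.0
    for count in range(sample_size + 1):
        cumulative += (comb(sample_size, count)
                       * theta ** count
                       * (1 - theta) ** (sample_size - count))
        if cumulative >= zeta:
            return count / sample_size
    return 1.0


# With a small zeta the relaxed threshold lies below the original one,
# so a truly frequent sequence is unlikely to be pruned in the sample.
print(relaxed_threshold(theta=0.5, sample_size=10, zeta=0.2))
```

A smaller ζ gives a lower θ' and therefore keeps more candidates, at the cost of adding noise to more candidate sequences in the final step.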
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410802280.8A (CN104537025B) | 2014-12-19 | 2014-12-19 | Frequent sequence mining method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104537025A (en) | 2015-04-22 |
CN104537025B (en) | 2017-10-10 |
Family
ID=52852553
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201410802280.8A | Active | CN104537025B (en) | 2014-12-19 | 2014-12-19 | Frequent sequence mining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104537025B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050240582A1 (en) * | 2004-04-27 | 2005-10-27 | Nokia Corporation | Processing data in a computerised system |
CN101561854A (en) * | 2009-05-22 | 2009-10-21 | 江苏大学 | Private data guard method in sequential mode mining |
CN101931570A (en) * | 2010-02-08 | 2010-12-29 | 中国航天科技集团公司第七一○研究所 | Method for reconstructing network attack path based on frequent pattern-growth algorithm |
CN102254034A (en) * | 2011-08-08 | 2011-11-23 | 浙江鸿程计算机系统有限公司 | Online analytical processing (OLAP) query log mining and recommending method based on efficient mining of frequent closed sequences (BIDE) |
CN103150311A (en) * | 2011-12-07 | 2013-06-12 | 微软公司 | Frequent object mining method based on data partitioning |
Non-Patent Citations (1)
Title |
---|
Ding Liping et al.: "A Survey of Differential Privacy Protection for Frequent Pattern Mining", Journal on Communications (通信学报) * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339609A (en) * | 2016-09-19 | 2017-01-18 | 四川大学 | Heuristic mining method of optimal comparing sequence mode of free interval constraint |
CN106682514A (en) * | 2016-12-15 | 2017-05-17 | 哈尔滨工程大学 | System call sequence characteristic mode set generation method based on subgraph mining |
CN106682514B (en) * | 2016-12-15 | 2020-07-28 | 哈尔滨工程大学 | System calling sequence feature pattern set generation method based on subgraph mining |
CN107729762A (en) * | 2017-08-31 | 2018-02-23 | 徐州医科大学 | A kind of DNA based on difference secret protection model closes frequent motif discovery method |
CN107491557A (en) * | 2017-09-06 | 2017-12-19 | 徐州医科大学 | A kind of TopN collaborative filtering recommending methods based on difference privacy |
CN107844540A (en) * | 2017-10-25 | 2018-03-27 | 电子科技大学 | A kind of time series method for digging for electric power data |
CN108280366A (en) * | 2018-01-17 | 2018-07-13 | 上海理工大学 | A kind of batch linear query method based on difference privacy |
US11055492B2 (en) | 2018-06-02 | 2021-07-06 | Apple Inc. | Privatized apriori algorithm for sequential data discovery |
CN109409128A (en) * | 2018-10-30 | 2019-03-01 | 南京邮电大学 | A kind of Mining Frequent Itemsets towards difference secret protection |
CN109409128B (en) * | 2018-10-30 | 2022-05-17 | 南京邮电大学 | Differential privacy protection-oriented frequent item set mining method |
CN109861858B (en) * | 2019-01-28 | 2020-06-26 | 北京大学 | Error checking method for root cause node of micro-service system |
CN109861858A (en) * | 2019-01-28 | 2019-06-07 | 北京大学 | Wrong investigation method of the micro services system root because of node |
CN110471957A (en) * | 2019-08-16 | 2019-11-19 | 安徽大学 | Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree) |
CN110471957B (en) * | 2019-08-16 | 2021-10-26 | 安徽大学 | Localized differential privacy protection frequent item set mining method based on frequent pattern tree |
CN110490000A (en) * | 2019-08-23 | 2019-11-22 | 广西师范大学 | The difference method for secret protection that Frequent tree mining excavates in more diagram datas |
CN112884614A (en) * | 2019-11-29 | 2021-06-01 | 北京金山云网络技术有限公司 | Frequent sequence based route recommendation method and device and electronic equipment |
CN112884614B (en) * | 2019-11-29 | 2024-05-14 | 北京金山云网络技术有限公司 | Route recommendation method and device based on frequent sequences and electronic equipment |
CN113282686A (en) * | 2021-06-03 | 2021-08-20 | 光大科技有限公司 | Method and device for determining association rule of unbalanced sample |
CN113282686B (en) * | 2021-06-03 | 2023-11-07 | 光大科技有限公司 | Association rule determining method and device for unbalanced sample |
CN115859132A (en) * | 2023-02-27 | 2023-03-28 | 广州帝隆科技股份有限公司 | Big data risk management and control method and system based on neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN104537025B (en) | 2017-10-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||