CN104537025A - Frequent sequence mining method - Google Patents

Frequent sequence mining method

Info

Publication number
CN104537025A
Authority
CN
China
Prior art keywords
sequence
frequent
candidate
node
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410802280.8A
Other languages
Chinese (zh)
Other versions
CN104537025B (en)
Inventor
苏森
程祥
许胜之
双锴
徐鹏
王玉龙
张忠宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201410802280.8A
Publication of CN104537025A
Application granted
Publication of CN104537025B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of data privacy and data mining and discloses a frequent sequence mining method. The method includes a first step of calculating the maximum constraint length l_max of sequences from a raw database and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i stands for the maximum support of sequences of length i; and a second step of searching the raw database for frequent sequences according to l_max and β = {β_1, ..., β_i, ..., β_n}, using a sampling-based candidate set pruning technique, under the differential privacy protection paradigm. The frequent sequence mining method (PFS²), which performs candidate set pruning based on sampling while satisfying differential privacy, achieves high utility of the mining results while providing differential privacy protection.

Description

Frequent sequence mining method
Technical field
The present invention relates to the technical field of data privacy and data mining, and in particular to a frequent sequence mining method.
Background technology
Frequent sequence mining is a basic problem in data mining and has a wide range of applications in numerous areas. Frequent sequence mining can be described as follows: given a sequence database, where each sequence is an ordered list of items and can be regarded as the record of a single user, consider two sequences S = s_1 s_2 ... s_{|S|} and T = t_1 t_2 ... t_{|T|}. If there exist integers w_1 < w_2 < ... < w_{|S|} such that t_{w_1} = s_1, t_{w_2} = s_2, ..., t_{w_{|S|}} = s_{|S|}, then T is said to contain S. The support of a sequence is the number of sequences in the database that contain it. When the support of a sequence is not less than a given threshold, the sequence is called a frequent sequence. Given a sequence database and a threshold, frequent sequence mining is the task of finding all frequent sequences occurring in the database.
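As an illustration of the containment and support definitions above, the following minimal Python sketch checks whether a sequence T contains a sequence S and counts supports in a toy database (the helper names contains and support are illustrative and not part of the patent):

```python
from typing import Sequence

def contains(T: Sequence[str], S: Sequence[str]) -> bool:
    """True if T contains S, i.e. S is an ordered (not necessarily
    contiguous) subsequence of T."""
    it = iter(T)
    return all(item in it for item in S)  # each 'in' consumes 'it' left to right

def support(db: list, S: Sequence[str]) -> int:
    """Number of sequences in the database that contain S."""
    return sum(1 for T in db if contains(T, S))

db = [list("abcbbce"), list("abce"), list("bbd")]
print(support(db, list("ab")))  # 2: contained by the first two sequences
print(support(db, list("bb")))  # 2: contained by the first and third sequences
```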
If the sequences in the database contain sensitive information, directly publishing the frequent sequences can leak user privacy. How to protect user privacy during frequent sequence mining has therefore received wide attention from both academia and industry. The differential privacy paradigm [1] provides a feasible scheme for the privacy problems that arise in data analysis: by adding noise, it offers user privacy protection with a theoretical guarantee.
At present, research on protecting frequent pattern mining with differential privacy can be divided into three classes according to the type of the mined object, namely frequent sequence mining, frequent itemset mining and frequent graph mining. For frequent sequence mining, document [2] proposes a two-stage differentially private algorithm for mining frequent consecutive item sequences; it first uses a prefix tree to search for candidate frequent sequences, and then uses a database transformation technique to improve the supports of the candidate sequences. Documents [3][4] address the problem of publishing sequence databases under differential privacy: document [3] proposes a prefix-tree-based differentially private sequence database publishing algorithm, while document [4] uses a variable-length n-gram model to extract the necessary information from the sequence database and uses a parse tree to reduce the amount of added noise. For frequent itemset mining, document [5] proposes the PrivBasis algorithm, which satisfies differential privacy and mines top-k frequent itemsets. Document [6] finds that limiting the length of transactions can effectively improve the trade-off between data utility and privacy protection; using a truncation method, it designs an Apriori-based frequent itemset mining algorithm that satisfies differential privacy. For frequent graph mining, document [7] incorporates the frequent subgraph mining process and privacy protection into a Markov Chain Monte Carlo framework and proposes a new differentially private frequent subgraph mining algorithm.
However, the above methods all have shortcomings in the utility of the mining results and in the level of privacy protection, which hinders the application of differential privacy techniques in frequent pattern mining.
The relevant literature of the background art is listed below:
[1] C. Dwork, "Differential privacy," in ICALP, 2006.
[2] L. Sweeney, "k-anonymity: A model for protecting privacy," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 2002.
[3] R. Chen, B. C. M. Fung, and B. C. Desai, "Differentially private transit data publication: A case study on the Montreal transportation system," in KDD, 2012.
[4] R. Chen, G. Acs, and C. Castelluccia, "Differentially private sequential data publication via variable-length n-grams," in CCS, 2012.
[5] N. Li, W. Qardaji, D. Su, and J. Cao, "PrivBasis: frequent itemset mining with differential privacy," in VLDB, 2012.
[6] C. Zeng, J. F. Naughton, and J.-Y. Cai, "On differentially private frequent itemset mining," in VLDB, 2012.
[7] E. Shen and T. Yu, "Mining frequent graph patterns with differential privacy," in KDD, 2013.
[8] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in EDBT, 1996.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is: how to provide high utility of the mining results while satisfying differential privacy protection.
(2) Technical solution
To solve the above technical problem, the invention provides a frequent sequence mining method, comprising the steps of:
S1: calculating the maximum constraint length l_max of sequences from a raw database, and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i represents the maximum support of sequences of length i;
S2: according to the l_max and β = {β_1, ..., β_i, ..., β_n}, searching the raw database for frequent sequences based on a sampling-based candidate set pruning technique, under the condition of satisfying the differential privacy protection paradigm.
Wherein, in the step S1, calculating the maximum constraint length of sequences specifically comprises:
obtaining the maximum length l_1 of the sequences in the raw database;
calculating l_2 such that (Σ_{i=1}^{l_2} α_i) / (Σ_{j=1}^{n} α_j) is not less than a predetermined value, where α_i and α_j represent the numbers of sequences of length i and j, respectively, in the raw database;
calculating the maximum constraint length of the sequences as l_max = min{l_1, l_2}.
Wherein, the predetermined value is 85%.
Wherein, when calculating l_2, noise is added to each α_i.
Wherein, the step S1 further comprises adding noise to each β_i.
Wherein, the step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length L_f: let L_f be the integer y such that β_y is the smallest value in β that is greater than θ;
S2.2: randomly dividing the raw database into L_f mutually disjoint databases as sample databases, which form a set dbSet, each database containing |D|/L_f sequences, where |D| denotes the number of sequences in the database;
S2.3: generating candidate frequent sequences: when mining frequent 1-sequences, the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used for mining the frequent k-sequences;
S2.4: in the sample databases, for sequences whose length exceeds the maximum constraint length l_max, applying a sequence shrinking method to limit their length, and at the same time relaxing the user-specified threshold for judging whether a sequence in a sample database is frequent;
S2.5: computing the noisy supports of the candidate sequences in the raw database, and outputting the candidate sequences whose noisy support is greater than the θ as frequent sequences.
Wherein, the sequence shrinking method specifically comprises the following steps:
Step 1, outlier deletion: an item in a sequence that is not contained in any candidate sequence is called an outlier, and such items are deleted;
Step 2, continuous pattern compression: in a sequence, a pattern that may occur continuously is called a continuous pattern, and continuous pattern compression is carried out in the following manner:
let p^k denote that pattern p occurs k times continuously, let T_1|T_2|T_3 denote the sequence obtained by concatenating the sequences T_1, T_2 and T_3, and let Contain_k(T) denote the set of candidate frequent k-sequences contained in T; then for the sequences T_1|p^j|T_2 and T_1|p^k|T_2 (where j > k), Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2);
Step 3, sequence reconstruction: if the sequence length still cannot meet the maximum length constraint after the above Step 1 and Step 2, sequence reconstruction is carried out in the following manner:
a candidate sequence tree CS-tree is formed, in which the common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences, each node in the tree is labeled with an item, and each node is associated with the sequence formed by the items on the path from the root to that node; for a CS-tree built from a group of candidate frequent k-sequences, the nodes on the k-th level are associated with the candidate frequent k-sequences and are called c-nodes, the nodes on the (k-1)-th level are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences; the sequence reconstruction process is as follows:
a) using the candidate frequent k-sequences C_k to build a CS-tree, denoted CT; for a sequence S, finding the candidate frequent k-sequences C'_k contained in S, and generating a new CS-tree from these candidate frequent k-sequences, denoted subCT, which is a subtree of CT; at the same time building an empty sequence S';
b) choosing a candidate sequence from C'_k and appending it to S': for any candidate sequence cs in C'_k, the set of its (k-1)-subsequences may contain the generating sequences of some other candidate sequences in C'_k, each such generating sequence corresponds to a g-node, each g-node has a group of child nodes, and the sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score; the candidate sequence with the largest c-score is chosen, and if several candidate sequences have the same c-score, the one containing the most distinct items is chosen;
c) removing from subCT the c-node corresponding to the sequence chosen in step b);
d) using subCT to find the generating sequences contained in the set of (k-1)-subsequences of S', where these generating sequences correspond to a group of g-nodes; if some nodes in this group of g-nodes have child nodes, which are c-nodes, each of these child nodes corresponds to a labeled item and the labeled items of different child nodes may be identical, then the item corresponding to the most c-nodes is chosen and appended to S', and all of its corresponding c-nodes are removed from subCT; if none of the nodes in this group of g-nodes has child nodes, then a candidate sequence that is not already contained by S' is chosen from C'_k according to step b) and appended to S'; items are continuously appended to S' until it meets the maximum constraint length requirement.
Wherein, the threshold relaxation method specifically comprises:
for a given sample database D_s and a group of candidate frequent k-sequences, assuming a candidate frequent k-sequence t_k whose true support equals the specified threshold θ; then computing the cumulative distribution function F_k of the noisy support of t_k in D_s; finally setting F_k(θ') = ζ, computing the corresponding θ', and taking the sequences whose support in the sample database is greater than θ' as potential frequent sequences.
(3) Beneficial effects
The frequent sequence mining method of the present invention (PFS²), which performs sampling-based candidate set pruning while satisfying differential privacy, can provide high utility of the mining results while satisfying differential privacy protection.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a frequent sequence mining method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a CS-tree;
Fig. 3(a)-(f) are comparison diagrams of the F-score and the relative error (RE) of the method of the present invention (i.e. the PFS² algorithm), the Prefix algorithm and the n-gram algorithm under different thresholds;
Fig. 4(a)-(d) are comparison diagrams of the F-score and the relative error (RE) of the method of the present invention (i.e. the PFS² algorithm), the Prefix algorithm and the n-gram algorithm under different privacy parameters;
Fig. 5(a)-(d) show the impact of the sequence shrinking method and the threshold relaxation method on the performance of the PFS² algorithm.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, but not to limit its scope.
The frequent sequence mining method of the embodiment of the present invention is a frequent sequence mining method that performs sampling-based candidate set pruning while satisfying differential privacy, and specifically comprises:
Step S1: calculating the maximum constraint length l_max of sequences from the raw database, and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i represents the maximum support of sequences of length i;
Step S2: using the sampling-based candidate set pruning technique to search for frequent sequences under the condition of satisfying the differential privacy protection paradigm. In the candidate sequence pruning process, the present embodiment employs a sequence shrinking method and a threshold relaxation method.
The implementation of step S1 and step S2 is described in detail below. Step S1 specifically comprises:
Step S1.1: calculating the maximum constraint length. Given a database, l_max is determined heuristically by setting l_max = min{l_1, l_2}, where l_1 denotes the maximum length of the sequences in the database, which determines the maximum error brought by the noise added during support computation, and l_2 is calculated from the database. The calculation of l_2 can be described as follows: first, let α = {α_1, ..., α_n}, where α_i is the number of input sequences of length i in the raw database; then l_2 is calculated such that (Σ_{i=1}^{l_2} α_i) / (Σ_{j=1}^{n} α_j) is not less than 85%. Because the calculation of α involves data privacy, appropriate noise is added to each α_i. The noise is added according to the Laplace mechanism, that is, a random number obeying the Laplace distribution is added to the exact result.
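A minimal Python sketch of step S1.1 under the assumptions above (the 85% ratio is computed over noisy length counts; the Laplace scale is left to the caller, and the helper names are illustrative, not taken from the patent):

```python
import numpy as np

def max_constraint_length(db, scale, ratio=0.85, seed=0):
    """Heuristically compute l_max = min{l_1, l_2}: l_1 is the maximum
    sequence length, l_2 the smallest length such that the (noisy)
    fraction of sequences of length <= l_2 is at least `ratio`."""
    rng = np.random.default_rng(seed)
    lengths = [len(s) for s in db]
    l1 = max(lengths)
    alpha = np.zeros(l1 + 1)                 # alpha[i] = number of sequences of length i
    for L in lengths:
        alpha[L] += 1
    noisy = np.maximum(alpha + rng.laplace(0.0, scale, alpha.shape), 0)
    l2 = int(np.searchsorted(np.cumsum(noisy), ratio * noisy.sum()))
    return min(l1, max(l2, 1))
```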
Step S1.2: calculating β = {β_1, ..., β_i, ..., β_n}, where β_i is the maximum support among all sequences of length i. β will be used in step S2 to estimate the maximal frequent sequence length L_f. Since computing β exactly is computationally infeasible, in the present embodiment the non-private frequent sequence mining algorithm GSP [8] is used to obtain β. Because β is an intrinsic property of the database and may leak data privacy, appropriate noise is added to each β_i.
In this way, the required maximum constraint length l_max of the sequences and β are extracted in step S1. Step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length L_f: let L_f be the integer y such that β_y is the smallest value in β that is greater than θ.
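A small sketch of S2.1, reading the noisy β as a list indexed by length (the helper name is illustrative, not from the patent):

```python
def estimate_Lf(beta, theta):
    """L_f is the length y whose beta_y is the smallest value in beta
    that is still greater than theta (index 0 of beta is unused)."""
    above = [(b, y) for y, b in enumerate(beta) if y >= 1 and b > theta]
    return min(above)[1] if above else 1  # fall back to length 1

print(estimate_Lf([0.0, 0.9, 0.6, 0.3, 0.1], theta=0.2))  # 3
```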
S2.2: randomly dividing the raw database into L_f mutually disjoint databases as sample databases, which together form the set dbSet, each database containing approximately |D|/L_f sequences, where |D| denotes the number of sequences in the database;
S2.3: generating candidate frequent sequences. When mining frequent 1-sequences (the expression x-sequence denotes a sequence of length x), the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used for mining the frequent k-sequences.
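The patent does not spell out the join step, so the following sketch shows one conventional GSP-style way of generating candidate k-sequences from frequent (k-1)-sequences; it is a common reading of the downward closure property, not necessarily the exact procedure used here:

```python
from itertools import product

def gen_candidates(freq_prev, k, items):
    """Candidate k-sequences from frequent (k-1)-sequences (tuples of items)."""
    if k == 1:
        return {(it,) for it in items}
    cands = set()
    for a, b in product(freq_prev, repeat=2):
        # join: if a without its first item equals b without its last item,
        # then a extended by b's last item is a candidate k-sequence
        if a[1:] == b[:-1]:
            cands.add(a + (b[-1],))
    # downward closure pruning: every (k-1)-subsequence must itself be frequent
    return {c for c in cands
            if all(c[:i] + c[i + 1:] in freq_prev for i in range(len(c)))}

f2 = {("a", "b"), ("b", "c"), ("a", "c")}
print(gen_candidates(f2, 3, set()))  # {('a', 'b', 'c')}
```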
S2.4: given the sample databases, pruning the candidate set with the sampling-based candidate set pruning algorithm proposed by the present invention (comprising the sequence shrinking method and the threshold relaxation method). Specifically, in the sample databases, for sequences whose length exceeds the maximum constraint length l_max, the sequence shrinking method is applied to limit their length; meanwhile, the user-specified threshold is relaxed for judging whether a sequence in a sample database is frequent.
S2.5: computing the noisy supports of the candidate sequences in the raw database, and outputting the candidate sequences whose noisy support is greater than θ as frequent sequences.
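A sketch of S2.5 using the Laplace mechanism mentioned earlier; it repeats the support check from the first sketch so that it is self-contained. In practice the noise scale would be derived from the privacy budget and the sensitivity, which are not fixed here, so the parameters are illustrative:

```python
import numpy as np

def _support(db, S):
    """True support of S in db (S is an ordered subsequence of a database sequence)."""
    def contains(T, S):
        it = iter(T)
        return all(item in it for item in S)
    return sum(1 for T in db if contains(T, S))

def noisy_supports(db, candidates, scale, seed=0):
    """Add Laplace noise to each candidate's true support in the raw database."""
    rng = np.random.default_rng(seed)
    return {c: _support(db, c) + rng.laplace(0.0, scale) for c in candidates}

def output_frequent(db, candidates, theta_count, scale):
    """Return the candidates whose noisy support exceeds the threshold."""
    return [c for c, s in noisy_supports(db, candidates, scale).items()
            if s > theta_count]
```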
The sequence shrinking method and the threshold relaxation method used in this step are introduced below. The sequence shrinking method specifically comprises:
In a sample database, the support of a sequence in that sample database, i.e. its local support, is used to estimate whether it is a potential frequent sequence. Due to the privacy requirement, noise must be added to the local support of each sequence. To make the estimation more accurate, the added noise should be as small as possible. In differentially private frequent itemset mining, it has been found that limiting the length of transactions can effectively reduce the amount of added noise. However, because sequences and itemsets differ in nature, the method proposed there is no longer applicable to frequent sequence mining. For this reason, the present invention proposes the sequence shrinking method.
For convenience, consider an example: given a sequence A = abcbbce and four candidate frequent 2-sequences ab, be, bb and ae, it can be seen that the sequences A_1 = abbbe and A_2 = abbe contain the same candidate frequent 2-sequences as the sequence A.
The sequence shrinking method mainly comprises the following steps:
Step 1: outlier deletion. An item in a sequence that is not contained in any candidate sequence is called an outlier, such as the item c in the above example. Since outliers do not affect the candidate sequences, deleting them causes no information loss. In this way A is converted into A_1.
Step 2: continuous pattern compression. In a sequence, a certain pattern may occur continuously; such a pattern is called a continuous pattern. The following theorem guarantees that compressing continuous patterns to a certain extent does not damage the frequency information.
Theorem: let p^k denote that pattern p occurs k times continuously, let T_1|T_2|T_3 denote the sequence obtained by concatenating the sequences T_1, T_2 and T_3, and let Contain_k(T) denote the set of candidate frequent k-sequences contained in T. Then, for the sequences T_1|p^j|T_2 and T_1|p^k|T_2 (where j > k), Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2). Continuous patterns can therefore be compressed.
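A compact sketch of Step 1 and Step 2 on the example above; the compression handles only runs of a single repeated item, capping them at k occurrences, which is a simplification of the general theorem (the names are illustrative):

```python
def delete_outliers(seq, candidates):
    """Step 1: drop items that appear in no candidate sequence."""
    kept = {item for c in candidates for item in c}
    return [x for x in seq if x in kept]

def compress_runs(seq, k):
    """Step 2 (simplified): cap each run of one repeated item at k
    occurrences, which preserves the contained candidate k-sequences."""
    out = []
    for x in seq:
        run = 0
        for y in reversed(out):
            if y != x:
                break
            run += 1
        if run < k:
            out.append(x)
    return out

cands = [("a", "b"), ("b", "e"), ("b", "b"), ("a", "e")]
A1 = delete_outliers(list("abcbbce"), cands)  # ['a', 'b', 'b', 'b', 'e']
A2 = compress_runs(A1, k=2)                   # ['a', 'b', 'b', 'e']
```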
Step 3: sequence reconstruction. If the sequence length still cannot meet the maximum length constraint after Step 1 and Step 2 above, some items in the sequence must be removed further. Removing them at random would cause a large amount of frequency information loss and thus misjudgment of potential frequent sequences, while using enumeration to find the sequence that meets the maximum length constraint and shares the largest number of candidate sequences with the original sequence is computationally infeasible. Therefore, the present invention proposes a novel sequence reconstruction method.
First, the definition of the candidate sequence tree, i.e. the CS-tree, is given. The common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences; each node in the tree is labeled with an item, and each node is associated with the sequence formed by the items on the path from the root to that node. Fig. 2 shows a CS-tree formed by the candidate frequent 3-sequence set {abc, bcd, bda, bdb}.
For a CS-tree formed from a group of candidate frequent k-sequences, the nodes on the k-th level are associated with the candidate frequent k-sequences and are called c-nodes; similarly, the nodes on the (k-1)-th level are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences. The sequence reconstruction process is described in detail below:
a) The candidate frequent k-sequences C_k are used to build a CS-tree, denoted CT. For a sequence S, the candidate frequent k-sequences contained in S, denoted C'_k, are found, and a new CS-tree is generated from these candidate frequent k-sequences, denoted subCT. It can be seen that subCT is a subtree of CT. At the same time, an empty sequence S' is built.
b) A candidate sequence is chosen from C'_k and appended to S'. For any candidate sequence cs in C'_k, the set of its (k-1)-subsequences may contain the generating sequences of some other candidate sequences in C'_k; each such generating sequence corresponds to a g-node, and each g-node has a group of child nodes. The sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score. The candidate sequence with the largest c-score is chosen; if several candidate sequences have the same c-score, the one containing the most distinct items is chosen.
c) The c-node corresponding to the sequence chosen in step b) is removed from subCT.
d) subCT is used to find the generating sequences contained in the set of (k-1)-subsequences of S'; these generating sequences correspond to a group of g-nodes. If some nodes in this group of g-nodes have child nodes, which are c-nodes, each of these child nodes corresponds to a labeled item, and the labeled items of different child nodes may be identical; the item corresponding to the most c-nodes is chosen and appended to S', and all of its corresponding c-nodes are removed from subCT. If none of the nodes in this group of g-nodes has child nodes, a candidate sequence that is not already contained by S' is chosen from C'_k according to step b) and appended to S'.
In this way, items are continuously appended to S' until it meets the maximum constraint length requirement. With this method, sequence reconstruction can be carried out efficiently while making the number of candidate sequences shared by S' and S as large as possible. The threshold relaxation method specifically comprises:
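To make the CS-tree concrete, the following sketch builds the tree of Fig. 2 and computes the c-score used in step b); it illustrates the data structure only, not the full reconstruction loop, and the class and function names are illustrative:

```python
class CSNode:
    """One node of a CS-tree; children are keyed by item label."""
    def __init__(self, item=None):
        self.item = item
        self.children = {}

def build_cs_tree(candidates):
    """Insert each candidate sequence so that common prefixes share a branch."""
    root = CSNode()
    for cand in candidates:
        node = root
        for item in cand:
            node = node.children.setdefault(item, CSNode(item))
    return root

def c_score(cs, root):
    """Sum of the child counts of the g-nodes reached by the distinct
    (k-1)-subsequences of cs that exist as prefixes (generating sequences)
    in the tree."""
    score = 0
    for sub in {cs[:i] + cs[i + 1:] for i in range(len(cs))}:
        node = root
        for item in sub:
            node = node.children.get(item)
            if node is None:
                break
        else:                          # reached a g-node at depth k-1
            score += len(node.children)
    return score

tree = build_cs_tree([tuple("abc"), tuple("bcd"), tuple("bda"), tuple("bdb")])
print(c_score(tuple("bda"), tree))  # 2: the g-node for prefix bd has children a and b
```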
Because the sequences in a sample database are randomly drawn from the raw database, many frequent sequences may become no longer frequent in the sample database. If the user-specified threshold were still used to judge which sequences are potential frequent sequences, the estimation would be erroneous. To address this problem, the present invention proposes the threshold relaxation method. Its process can be described as follows: for a given sample database D_s and a group of candidate frequent k-sequences, first assume a candidate frequent k-sequence t_k whose true support equals the specified threshold θ; then compute the cumulative distribution function F_k of the noisy support of t_k in D_s. Denote the noisy support of t_k in D_s by z; z is the sum of two random variables x (the true support of t_k in D_s) and the Laplace noise y, where x follows the Gaussian distribution Normal(μ, σ²) and y follows the Laplace distribution Laplace(λ, b). The cumulative distribution function of z is therefore the convolution of the distributions of x and y:
F_k(z) = P(x + y ≤ z) = ∫ Φ((z - t - μ)/σ) · (1/(2b)) exp(-|t - λ|/b) dt,
where Φ is the standard Gaussian cumulative distribution function and the integral runs over all real t; this expression has a closed form in terms of the complementary error function erfc. Here μ and σ² are the parameters of the Gaussian distribution, where μ is the expectation and σ² is the variance; λ and b are the parameters of the Laplace distribution, where λ is the location parameter and b is the scale parameter. erfc is the complementary error function, whose expression is
erfc(x) = (2/√π) ∫_x^∞ exp(-p²) dp,
where p is the integration variable. Finally, set F_k(θ') = ζ (the expression of F_k(θ') is the above formula with z replaced by θ'), compute the corresponding θ', and use it to estimate which candidate sequences in the sample database are potential frequent sequences, i.e. the sequences whose support in the sample database is greater than θ' are potential frequent sequences. Experiments show that setting ζ to 0.3 usually gives good results, and theoretical analysis shows that the method satisfies the requirement of the differential privacy paradigm.
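A numerical sketch of the threshold relaxation step, solving F_k(θ') = ζ by evaluating the Gaussian-Laplace convolution CDF with quadrature (scipy-based; the concrete values of μ, σ and the Laplace scale b below are illustrative, not prescribed by the patent):

```python
import numpy as np
from scipy import integrate, optimize, stats

def cdf_noisy_support(z, mu, sigma, loc, b):
    """CDF of z = x + y with x ~ Normal(mu, sigma^2) and y ~ Laplace(loc, b),
    computed as the convolution integral E_y[ Phi((z - y - mu) / sigma) ]."""
    integrand = lambda t: (stats.norm.cdf((z - t - mu) / sigma)
                           * np.exp(-abs(t - loc) / b) / (2 * b))
    val, _ = integrate.quad(integrand, loc - 40 * b, loc + 40 * b)
    return val

def relaxed_threshold(theta, zeta, sigma, b, loc=0.0):
    """Solve F_k(theta') = zeta for a candidate whose true support has mean theta."""
    f = lambda tp: cdf_noisy_support(tp, theta, sigma, loc, b) - zeta
    span = 20 * (sigma + b)
    return optimize.brentq(f, theta - span, theta + span)

# e.g. theta = 100 (a support count), sigma = 5, Laplace scale b = 4, zeta = 0.3
print(relaxed_threshold(100.0, 0.3, 5.0, 4.0))  # a value somewhat below 100
```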
In a sample database, for sequences whose length exceeds the maximum constraint length l_max, the sequence shrinking method is applied to limit their length, thereby reducing the sensitivity, which in turn reduces the Laplace noise that must be added and ultimately improves the utility of the supports. When the sample databases are used to pre-judge whether a candidate sequence is frequent, some frequent sequences may be misjudged as infrequent; to reduce this error, the threshold is lowered to a certain degree, and this is called the threshold relaxation method. That is, the sequence shrinking method and the threshold relaxation method act at different stages: the sequence shrinking method reduces the noise added when supports are counted in the sample databases, while the threshold relaxation method reduces the number of frequent sequences misjudged as infrequent.
Formal analysis shows that the frequent sequence mining algorithm of the present invention (PFS²), which performs sampling-based candidate set pruning while satisfying differential privacy, can provide high utility of the mining results while satisfying differential privacy protection.
By comparison with the algorithm proposed in document [3] (Prefix) and the algorithm proposed in document [4] (n-gram), it can be determined that the proposed PFS² algorithm has an obvious advantage in the utility of the mining results and in the level of privacy protection. To better illustrate the advantage of the algorithm of the present invention, the widely used metrics F-score [6] and relative error (RE) [5] are adopted to compare the PFS² algorithm with the Prefix algorithm and the n-gram algorithm. The F-score measures the utility of the generated frequent sequences, and the RE measures the error of their supports relative to the exact supports.
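For reference, the following sketch shows one common way to compute these two metrics over a mined set versus the exact set of frequent sequences; the aggregation used for RE (here the median relative error of the supports) is an assumption, since explicit formulas are not given here:

```python
def f_score(mined: set, exact: set) -> float:
    """Harmonic mean of precision and recall of the mined frequent sequences."""
    tp = len(mined & exact)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(mined), tp / len(exact)
    return 2 * precision * recall / (precision + recall)

def relative_error(noisy_sup: dict, exact_sup: dict) -> float:
    """Median relative error of the noisy supports w.r.t. the exact supports
    (the aggregation is an assumption; the cited papers may differ)."""
    errs = sorted(abs(noisy_sup[s] - exact_sup[s]) / exact_sup[s]
                  for s in exact_sup if s in noisy_sup)
    return errs[len(errs) // 2] if errs else 0.0
```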
When using Prefix and n-gram to find frequent sequences under privacy, a general approach is adopted: they are first run on the original database to generate an anonymized database, and then the non-private frequent sequence mining algorithm GSP is run on the anonymized database.
The concrete experimental settings are as follows.
Three real data sets are used. Because the data in the data set House_Power are time series, the values are discretized and a sequence is constructed from every 50 samples. The specific characteristics of the data in the three databases are shown in Table 1 below:
Table 1. Data characteristics of the databases

Database      Number of sequences   Number of items   Maximum length   Average length
MSNBC         989818                17                14795            4.75
BIBLE         36369                 19305             100              21.64
House_Power   40986                 21                50               50
All algorithms are implemented in the Java language. The test environment is an Intel Core 2 Duo E8400 CPU (3.0 GHz) with 4 GB of RAM.
The performance of the PFS² algorithm is described below by analyzing the experimental data.
First, the performance of the PFS² algorithm, the Prefix algorithm and the n-gram algorithm is compared under different thresholds. Because the number of items in BIBLE is too large for the Prefix algorithm to build a prefix tree, the performance of the Prefix algorithm on BIBLE is not shown. The experimental results are shown in Fig. 3(a)-(f).
As can be seen from Fig. 3, the performance of the PFS² algorithm is clearly better than that of the Prefix algorithm and the n-gram algorithm. The results can be interpreted as follows. To meet the maximum constraint length requirement, the Prefix and n-gram algorithms directly delete the items exceeding the limit from each input sequence, causing a large amount of frequency information loss. In contrast, the PFS² algorithm employs the sequence shrinking method, which effectively preserves the frequency information in each input sequence, so it can significantly improve the utility of the frequent sequences.
Then, the data sets MSNBC (θ = 0.015) and House_Power (θ = 0.34) are used to compare the performance of the PFS² algorithm, the Prefix algorithm and the n-gram algorithm under different privacy parameters. The experimental results are shown in Fig. 4(a)-(d).
As can be seen from Fig. 4, under the same level of privacy parameters, the performance of the PFS² algorithm is always better than that of the Prefix algorithm and the n-gram algorithm. At the same time, the three algorithms exhibit a consistent characteristic: as ε increases, the quality of the frequent sequences improves. Here ε refers to the privacy parameter in ε-differential privacy, which represents the strength of differential privacy protection: the larger ε is, the weaker the privacy protection, the less noise needs to be added, and therefore the higher the quality of the frequent sequences.
Finally, the databases BIBLE and House_Power are used to measure the impact of the sequence shrinking method and the threshold relaxation method on the performance of the PFS² algorithm. In the present embodiment, RR denotes deleting items from sequences at random and not relaxing the user-specified threshold in the sample databases, and SR denotes using the sequence shrinking method but not the threshold relaxation method.
As can be seen from Fig. 5(a)-(d), RR, which uses neither the sequence shrinking method nor the threshold relaxation method, cannot produce reasonable results. Meanwhile, the sequence shrinking method and the threshold relaxation method effectively improve the F-score performance of the PFS² algorithm; in terms of RE, although the performance decreases slightly after these two methods are employed, it still remains good. This is because, by using the sequence shrinking method and the threshold relaxation method, more true frequent sequences are retained in the sample databases, which slightly increases the amount of noise that must be added for each candidate sequence.
From the above extensive experiments, it can be determined that the proposed PFS² algorithm has an obvious advantage in the utility of the mining results and in the level of privacy protection.
The above embodiments are only intended to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; therefore, all equivalent technical solutions also fall within the scope of the present invention, and the patent protection scope of the present invention shall be defined by the claims.

Claims (8)

1. A frequent sequence mining method, characterized by comprising the steps of:
S1: calculating the maximum constraint length l_max of sequences from a raw database, and obtaining β = {β_1, ..., β_i, ..., β_n}, where β_i represents the maximum support of sequences of length i;
S2: according to the l_max and β = {β_1, ..., β_i, ..., β_n}, searching the raw database for frequent sequences based on a sampling-based candidate set pruning technique, under the condition of satisfying the differential privacy protection paradigm.
2. The frequent sequence mining method according to claim 1, characterized in that, in the step S1, calculating the maximum constraint length of sequences specifically comprises:
obtaining the maximum length l_1 of the sequences in the raw database;
calculating l_2 such that (Σ_{i=1}^{l_2} α_i) / (Σ_{j=1}^{n} α_j) is not less than a predetermined value, where α_i and α_j represent the numbers of sequences of length i and j, respectively, in the raw database;
calculating the maximum constraint length of the sequences as l_max = min{l_1, l_2}.
3. The frequent sequence mining method according to claim 2, characterized in that the predetermined value is 85%.
4. The frequent sequence mining method according to claim 2, characterized in that, when calculating l_2, noise is added to each α_i.
5. The frequent sequence mining method according to claim 2, characterized in that the step S1 further comprises adding noise to each β_i.
6. The frequent sequence mining method according to any one of claims 1 to 5, characterized in that the step S2 specifically comprises:
S2.1: for a given threshold θ, using β to estimate the maximal frequent sequence length L_f: let L_f be the integer y such that β_y is the smallest value in β that is greater than θ;
S2.2: randomly dividing the raw database into L_f mutually disjoint databases as sample databases, which form a set dbSet, each database containing |D|/L_f sequences, where |D| denotes the number of sequences in the database;
S2.3: generating candidate frequent sequences: when mining frequent 1-sequences, the candidate frequent 1-sequences are the items in the database; thereafter, according to the downward closure property, the frequent (k-1)-sequences are used to generate the candidate frequent k-sequences, which are used for mining the frequent k-sequences;
S2.4: in the sample databases, for sequences whose length exceeds the maximum constraint length l_max, applying a sequence shrinking method to limit their length, and at the same time relaxing the user-specified threshold for judging whether a sequence in a sample database is frequent;
S2.5: computing the noisy supports of the candidate sequences in the raw database, and outputting the candidate sequences whose noisy support is greater than the threshold θ as frequent sequences.
7. The frequent sequence mining method according to claim 6, characterized in that the sequence shrinking method specifically comprises the following steps:
Step 1, outlier deletion: an item in a sequence that is not contained in any candidate sequence is called an outlier, and such items are deleted;
Step 2, continuous pattern compression: in a sequence, a pattern that may occur continuously is called a continuous pattern, and continuous pattern compression is carried out in the following manner:
let p^k denote that pattern p occurs k times continuously, let T_1|T_2|T_3 denote the sequence obtained by concatenating the sequences T_1, T_2 and T_3, and let Contain_k(T) denote the set of candidate frequent k-sequences contained in T; then for the sequences T_1|p^j|T_2 and T_1|p^k|T_2 (where j > k), Contain_k(T_1|p^j|T_2) = Contain_k(T_1|p^k|T_2);
Step 3, sequence reconstruction: if the sequence length still cannot meet the maximum length constraint after the above Step 1 and Step 2, sequence reconstruction is carried out in the following manner:
a candidate sequence tree CS-tree is formed, in which the common prefixes of different candidate sequences are placed in the same branch; the height of the tree equals the length of the candidate sequences, each node in the tree is labeled with an item, and each node is associated with the sequence formed by the items on the path from the root to that node; for a CS-tree built from a group of candidate frequent k-sequences, the nodes on the k-th level are associated with the candidate frequent k-sequences and are called c-nodes, the nodes on the (k-1)-th level are associated with (k-1)-sequences and are called g-nodes, and the sequences associated with them are called generating sequences; the sequence reconstruction process is as follows:
a) using the candidate frequent k-sequences C_k to build a CS-tree, denoted CT; for a sequence S, finding the candidate frequent k-sequences C'_k contained in S, and generating a new CS-tree from these candidate frequent k-sequences, denoted subCT, which is a subtree of CT; at the same time building an empty sequence S';
b) choosing a candidate sequence from C'_k and appending it to S': for any candidate sequence cs in C'_k, the set of its (k-1)-subsequences may contain the generating sequences of some other candidate sequences in C'_k, each such generating sequence corresponds to a g-node, each g-node has a group of child nodes, and the sum of the numbers of child nodes of these g-nodes is called the score of the candidate sequence cs, denoted c-score; the candidate sequence with the largest c-score is chosen, and if several candidate sequences have the same c-score, the one containing the most distinct items is chosen;
c) removing from subCT the c-node corresponding to the sequence chosen in step b);
d) using subCT to find the generating sequences contained in the set of (k-1)-subsequences of S', where these generating sequences correspond to a group of g-nodes; if some nodes in this group of g-nodes have child nodes, which are c-nodes, each of these child nodes corresponds to a labeled item and the labeled items of different child nodes may be identical, then the item corresponding to the most c-nodes is chosen and appended to S', and all of its corresponding c-nodes are removed from subCT; if none of the nodes in this group of g-nodes has child nodes, then a candidate sequence that is not already contained by S' is chosen from C'_k according to step b) and appended to S'; items are continuously appended to S' until it meets the maximum constraint length requirement.
8. The frequent sequence mining method according to claim 6, characterized in that the threshold relaxation method specifically comprises:
for a given sample database D_s and a group of candidate frequent k-sequences, assuming a candidate frequent k-sequence t_k whose true support equals the specified threshold θ; then computing the cumulative distribution function F_k of the noisy support of t_k in D_s; finally setting F_k(θ') = ζ, computing the corresponding θ', and taking the sequences whose support in the sample database is greater than θ' as potential frequent sequences.
CN201410802280.8A 2014-12-19 2014-12-19 Frequent sequence mining method Active CN104537025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410802280.8A CN104537025B (en) 2014-12-19 2014-12-19 Frequent sequence mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410802280.8A CN104537025B (en) 2014-12-19 2014-12-19 Frequent sequence mining method

Publications (2)

Publication Number Publication Date
CN104537025A true CN104537025A (en) 2015-04-22
CN104537025B CN104537025B (en) 2017-10-10

Family

ID=52852553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410802280.8A Active CN104537025B (en) 2014-12-19 2014-12-19 Frequent sequence mining method

Country Status (1)

Country Link
CN (1) CN104537025B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339609A (en) * 2016-09-19 2017-01-18 四川大学 Heuristic mining method of optimal comparing sequence mode of free interval constraint
CN106682514A (en) * 2016-12-15 2017-05-17 哈尔滨工程大学 System call sequence characteristic mode set generation method based on subgraph mining
CN107491557A (en) * 2017-09-06 2017-12-19 徐州医科大学 A kind of TopN collaborative filtering recommending methods based on difference privacy
CN107729762A (en) * 2017-08-31 2018-02-23 徐州医科大学 A kind of DNA based on difference secret protection model closes frequent motif discovery method
CN107844540A (en) * 2017-10-25 2018-03-27 电子科技大学 A kind of time series method for digging for electric power data
CN108280366A (en) * 2018-01-17 2018-07-13 上海理工大学 A kind of batch linear query method based on difference privacy
CN109409128A (en) * 2018-10-30 2019-03-01 南京邮电大学 A kind of Mining Frequent Itemsets towards difference secret protection
CN109861858A (en) * 2019-01-28 2019-06-07 北京大学 Wrong investigation method of the micro services system root because of node
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN110490000A (en) * 2019-08-23 2019-11-22 广西师范大学 The difference method for secret protection that Frequent tree mining excavates in more diagram datas
CN112884614A (en) * 2019-11-29 2021-06-01 北京金山云网络技术有限公司 Frequent sequence based route recommendation method and device and electronic equipment
US11055492B2 (en) 2018-06-02 2021-07-06 Apple Inc. Privatized apriori algorithm for sequential data discovery
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN115859132A (en) * 2023-02-27 2023-03-28 广州帝隆科技股份有限公司 Big data risk management and control method and system based on neural network model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240582A1 (en) * 2004-04-27 2005-10-27 Nokia Corporation Processing data in a computerised system
CN101561854A (en) * 2009-05-22 2009-10-21 江苏大学 Private data guard method in sequential mode mining
CN101931570A (en) * 2010-02-08 2010-12-29 中国航天科技集团公司第七一○研究所 Method for reconstructing network attack path based on frequent pattern-growth algorithm
CN102254034A (en) * 2011-08-08 2011-11-23 浙江鸿程计算机系统有限公司 Online analytical processing (OLAP) query log mining and recommending method based on efficient mining of frequent closed sequences (BIDE)
CN103150311A (en) * 2011-12-07 2013-06-12 微软公司 Frequent object mining method based on data partitioning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240582A1 (en) * 2004-04-27 2005-10-27 Nokia Corporation Processing data in a computerised system
CN101561854A (en) * 2009-05-22 2009-10-21 江苏大学 Private data guard method in sequential mode mining
CN101931570A (en) * 2010-02-08 2010-12-29 中国航天科技集团公司第七一○研究所 Method for reconstructing network attack path based on frequent pattern-growth algorithm
CN102254034A (en) * 2011-08-08 2011-11-23 浙江鸿程计算机系统有限公司 Online analytical processing (OLAP) query log mining and recommending method based on efficient mining of frequent closed sequences (BIDE)
CN103150311A (en) * 2011-12-07 2013-06-12 微软公司 Frequent object mining method based on data partitioning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
丁丽萍 et al., "A survey of differential privacy protection for frequent pattern mining", Journal on Communications *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339609A (en) * 2016-09-19 2017-01-18 四川大学 Heuristic mining method of optimal comparing sequence mode of free interval constraint
CN106682514A (en) * 2016-12-15 2017-05-17 哈尔滨工程大学 System call sequence characteristic mode set generation method based on subgraph mining
CN106682514B (en) * 2016-12-15 2020-07-28 哈尔滨工程大学 System calling sequence feature pattern set generation method based on subgraph mining
CN107729762A (en) * 2017-08-31 2018-02-23 徐州医科大学 A kind of DNA based on difference secret protection model closes frequent motif discovery method
CN107491557A (en) * 2017-09-06 2017-12-19 徐州医科大学 A kind of TopN collaborative filtering recommending methods based on difference privacy
CN107844540A (en) * 2017-10-25 2018-03-27 电子科技大学 A kind of time series method for digging for electric power data
CN108280366A (en) * 2018-01-17 2018-07-13 上海理工大学 A kind of batch linear query method based on difference privacy
US11055492B2 (en) 2018-06-02 2021-07-06 Apple Inc. Privatized apriori algorithm for sequential data discovery
CN109409128A (en) * 2018-10-30 2019-03-01 南京邮电大学 A kind of Mining Frequent Itemsets towards difference secret protection
CN109409128B (en) * 2018-10-30 2022-05-17 南京邮电大学 Differential privacy protection-oriented frequent item set mining method
CN109861858B (en) * 2019-01-28 2020-06-26 北京大学 Error checking method for root cause node of micro-service system
CN109861858A (en) * 2019-01-28 2019-06-07 北京大学 Wrong investigation method of the micro services system root because of node
CN110471957A (en) * 2019-08-16 2019-11-19 安徽大学 Localization difference secret protection Mining Frequent Itemsets based on frequent pattern tree (fp tree)
CN110471957B (en) * 2019-08-16 2021-10-26 安徽大学 Localized differential privacy protection frequent item set mining method based on frequent pattern tree
CN110490000A (en) * 2019-08-23 2019-11-22 广西师范大学 The difference method for secret protection that Frequent tree mining excavates in more diagram datas
CN112884614A (en) * 2019-11-29 2021-06-01 北京金山云网络技术有限公司 Frequent sequence based route recommendation method and device and electronic equipment
CN112884614B (en) * 2019-11-29 2024-05-14 北京金山云网络技术有限公司 Route recommendation method and device based on frequent sequences and electronic equipment
CN113282686A (en) * 2021-06-03 2021-08-20 光大科技有限公司 Method and device for determining association rule of unbalanced sample
CN113282686B (en) * 2021-06-03 2023-11-07 光大科技有限公司 Association rule determining method and device for unbalanced sample
CN115859132A (en) * 2023-02-27 2023-03-28 广州帝隆科技股份有限公司 Big data risk management and control method and system based on neural network model

Also Published As

Publication number Publication date
CN104537025B (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN104537025A (en) Frequent sequence mining method
CN102289507B (en) Method for mining data flow weighted frequent mode based on sliding window
CN105808696B (en) It is a kind of based on global and local feature across line social network user matching process
CN103955542B (en) Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method
CN104182527B (en) Association rule mining method and its system between Sino-British text word based on partial order item collection
CN104216874A (en) Chinese interword weighing positive and negative mode excavation method and system based on relevant coefficients
Yu et al. Random walk with restart over dynamic graphs
CN104462184A (en) Large-scale data abnormity recognition method based on bidirectional sampling combination
CN105550189A (en) Ontology-based intelligent retrieval system for information security event
Lavor et al. Bayesian spatio‐temporal reconstruction reveals rapid diversification and Pleistocene range expansion in the widespread columnar cactus Pilosocereus
Xu et al. Differentially private frequent sequence mining via sampling-based candidate pruning
CN104317794B (en) Chinese Feature Words association mode method for digging and its system based on dynamic item weights
CN103020283B (en) A kind of semantic retrieving method of the dynamic restructuring based on background knowledge
CN103440308B (en) A kind of digital thesis search method based on form concept analysis
Tatti Probably the best itemsets
Cheng et al. Differentially private maximal frequent sequence mining
Liu et al. Randomized perturbation for privacy-preserving social network data publishing
Hong et al. Hiding sensitive itemsets by inserting dummy transactions
CN108664548B (en) Network access behavior characteristic group dynamic mining method and system under degradation condition
CN105740907A (en) Local community mining method
Yi-Yang et al. Data mining and analysis of our agriculture based on the decision tree
CN104182386A (en) Word pair relation similarity calculation method
CN103077181B (en) Method for automatically generating approximate functional dependency rule
Hamedanian et al. An efficient prefix tree for incremental frequent pattern mining
CN106776607A (en) Search engine operation behavior treating method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant