WO2019178733A1 - Frequent itemset mining method, apparatus, device, and medium for large-scale data sets - Google Patents
- Publication number
- WO2019178733A1, PCT/CN2018/079554
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data set
- scale data
- noise
- preset
- sample
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
Definitions
- The invention belongs to the field of information technology, and in particular relates to a frequent itemset mining method, apparatus, device, and medium for large-scale data sets.
- Traditional privacy protection methods are mostly based on K-anonymity and its extension models. Such methods require certain assumptions; once those assumptions are violated, privacy is difficult to protect.
- A shortcoming of K-anonymity and its extension models is that the attack model is not rigorously defined and the attacker's knowledge is not quantified.
- Some existing attack models also challenge the privacy effectiveness of such methods.
- Dwork proposed a strong privacy protection model based on data distortion, ∈-differential privacy, characterized by a rigorous privacy definition and independence from the attacker's background knowledge. Although differential privacy provides good privacy guarantees, it commonly suffers from high sensitivity and poor usability.
- The object of the present invention is to provide a frequent itemset mining method, apparatus, device, and storage medium for large-scale data sets, aiming to solve the problem that prior-art frequent itemset mining methods for large-scale data sets cannot simultaneously account for the privacy, usability, sensitivity, and computational intensity of data mining.
- In one aspect, the present invention provides a frequent itemset mining method for large-scale data sets, the method comprising the following steps:
- In another aspect, the present invention provides a frequent itemset mining apparatus for large-scale data sets, the apparatus comprising:
- a sample size estimating unit configured to receive a large-scale data set input by the user, and estimate a sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold;
- a sampling and mining unit configured to perform simple random sampling on the large-scale data set, generate a sample data set of the estimated sample size, and mine the closed frequent itemsets in the sample data set;
- a data set reduction unit configured to calculate a maximum length constraint corresponding to the large-scale data set from the sample data set, and generate a reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint;
- an FP-Tree construction unit configured to construct a noise FP-Tree of the large-scale data set from the reduced data set, and distribute a preset privacy budget evenly across the layers of the noise FP-Tree; and
- a frequent itemset screening unit configured to select a candidate set from the noise FP-Tree according to a preset noise threshold, strengthen the privacy protection of the candidate set with preset geometric-mechanism noise, and select a preset number of frequent itemsets from the candidate set.
- In another aspect, the present invention further provides a computing device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the frequent itemset mining method for large-scale data sets described above.
- In another aspect, the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the frequent itemset mining method for large-scale data sets described above.
- The invention samples from the large-scale data set a sample data set of the estimated sample size and mines the closed frequent itemsets in the sample data set, reducing the computational intensity of frequent itemset mining on large data sets; it calculates the maximum length constraint from the sample data set and, through the closed frequent itemsets and the maximum length constraint, generates the reduced data set corresponding to the large-scale data set, lowering the global sensitivity of the mining.
- The noise FP-Tree is constructed from the reduced data set, with the privacy budget distributed evenly across its layers during construction.
- After construction, a candidate set is selected from the noise FP-Tree by the noise threshold, the privacy protection of the candidate set is strengthened with geometric-mechanism noise, and the frequent itemsets are selected from the candidate set. The computational intensity of frequent itemset mining on large-scale data sets is thereby reduced, the privacy of data mining is guaranteed, global sensitivity is lowered, and the availability of the data and the mining results is improved.
- FIG. 1 is a flowchart of an implementation of a frequent item set mining method for a large-scale data set according to Embodiment 1 of the present invention
- FIG. 2 is a schematic structural diagram of a frequent item set mining apparatus for a large-scale data set according to Embodiment 2 of the present invention
- FIG. 3 is a schematic diagram of a preferred structure of a frequent item set mining device for a large-scale data set according to Embodiment 2 of the present invention
- FIG. 4 is a schematic structural diagram of a computing device according to Embodiment 3 of the present invention.
- Embodiment 1:
- FIG. 1 shows the implementation flow of the frequent itemset mining method for large-scale data sets provided by Embodiment 1 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:
- In step S101, the large-scale data set input by the user is received, and the sample size corresponding to the large-scale data set is estimated according to a preset precision threshold and a preset confidence threshold.
- A large-scale data set is composed of a series of transactions; for example, in supermarket shopping, all items purchased by one person in a single visit can be regarded as one transaction, and the purchases of thousands of individuals constitute a data set.
- Frequent itemset mining must be performed on the large-scale data set; to reduce the computational intensity of the mining process, the data is first pre-processed.
- During pre-processing, a preset precision threshold and a confidence threshold are used to estimate the sample size corresponding to the large-scale data set.
- n is the sample size to be estimated;
- f_n is the number of times the frequent itemset appears in the random sample;
- δ is the precision threshold;
- a is the confidence threshold.
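As an illustration, the estimate described above can be sketched in Python. The helper name `estimate_sample_size` is hypothetical, and since the patent's exact formula is not legible in this text, the standard binomial/normal sample-size bound n ≥ Z_a² · p(1 − p) / δ² is assumed (p = 0.5 gives the worst case):

```python
import math
from statistics import NormalDist

def estimate_sample_size(delta: float, a: float, p: float = 0.5) -> int:
    """Sample size n such that the estimate f_n / n of the itemset
    frequency p stays within absolute error `delta` with confidence `a`.

    Uses the normal approximation to the binomial model; p = 0.5 is the
    worst case and gives the largest n.
    """
    # two-sided critical value Z_a from the normal distribution table
    z = NormalDist().inv_cdf(0.5 + a / 2)
    return math.ceil(z * z * p * (1 - p) / (delta * delta))
```

For example, a precision threshold of 0.01 at 95% confidence yields a sample of a little under ten thousand transactions, regardless of how large the original data set is.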
- In step S102, simple random sampling is performed on the large-scale data set to generate a sample data set of the estimated sample size, and the closed frequent itemsets in the sample data set are mined.
- The large-scale data set can be sampled by simple random sampling with a sampling tool (for example SAS, the Statistical Analysis System), and the collected samples form a sample data set of the estimated size.
- For a large-scale data set, processing a sample data set that meets the precision requirement achieves the same processing goal, so the sample data set can subsequently be mined to obtain multiple closed frequent itemsets of the sample data set, effectively reducing the computational intensity.
- As an example, the sample data set can be mined with the Apriori mining algorithm to obtain the closed frequent itemsets of the sample data set.
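A minimal sketch of closed frequent itemset mining: a levelwise, Apriori-style enumeration followed by a closedness filter (an itemset is closed if no frequent proper superset has the same support). The patent does not specify an implementation; this toy version is illustrative, not optimized:

```python
from collections import defaultdict

def closed_frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) enumeration of frequent itemsets,
    keeping only the closed ones: itemsets with no frequent proper
    superset of identical support."""
    supports = {}
    k = 1
    candidates = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    while candidates:
        counts = defaultdict(int)
        for t in transactions:
            ts = frozenset(t)
            for c in candidates:
                if c <= ts:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        supports.update(frequent)
        # join step: build candidate (k+1)-itemsets from frequent k-itemsets
        prev = list(frequent)
        candidates = list({a | b for a in prev for b in prev if len(a | b) == k + 1})
        k += 1
    # closedness filter: drop itemsets with an equal-support frequent superset
    return {s: n for s, n in supports.items()
            if not any(s < t and m == n for t, m in supports.items())}
```

For instance, with transactions `[['a','b'], ['a','b'], ['a']]` and minimum support 2, `{b}` is not closed (its superset `{a,b}` has the same support), while `{a}` and `{a,b}` are.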
- In step S103, a maximum length constraint corresponding to the large-scale data set is calculated from the sample data set, and a reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint.
- A preset heuristic method can estimate the distribution {z_1, ..., z_i, ..., z_n} of all transaction sequence lengths in the sample data set, where z_i denotes the number of transactions of sequence length i.
- Starting from sequence length 1, the number of transactions of each length i is computed incrementally and accumulated until a stopping condition involving the constraint parameter η is satisfied; the smallest such i value is set as the maximum length constraint.
- η is a preset constraint parameter.
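The heuristic can be sketched as follows. The exact stopping condition is not legible in this text, so a cumulative-fraction threshold is assumed: stop at the smallest i such that transactions of length at most i cover at least a fraction η of the sample. The helper name `max_length_constraint` is hypothetical:

```python
from collections import Counter

def max_length_constraint(sample_transactions, eta):
    """Return the smallest length i such that transactions of length
    <= i cover at least a fraction `eta` of the sample (assumed
    stopping condition for the heuristic described above)."""
    z = Counter(len(t) for t in sample_transactions)  # z_i = #transactions of length i
    total, covered = len(sample_transactions), 0
    for i in range(1, max(z) + 1):
        covered += z[i]
        if covered >= eta * total:
            return i
    return max(z)
```

With ten sample transactions of lengths 1, 1, 2, 2, 3, 4, 5, 5, 5, 6 and η = 0.8, the cumulative count first reaches 8 at length 5, so l_max = 5.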
- After the maximum length constraint is calculated, the large-scale data set can be processed according to the closed frequent itemsets and the maximum length constraint (including reducing the number of transactions in the large-scale data set and shortening the sequence lengths of its transactions).
- the large-scale data set is processed by the following steps to obtain a reduced data set corresponding to the large-scale data set:
- A 1-itemset whose support in the large-scale data set exceeds the preset support threshold is a frequent 1-itemset; the collection of all such 1-itemsets is arranged in order of decreasing support.
- According to the sorted 1-itemset collection, the items within each closed frequent itemset are sorted, and the sorted closed frequent itemsets are then combined into the element set.
- As an example, when the support-sorted 1-itemset collection is {a, c, e, b, d, f}, the closed frequent itemsets {c, b}, {f, d, e}, {a, b, c} are sorted separately to obtain {c, b}, {e, d, f}, {a, c, b}, and these sorted closed frequent itemsets are then combined into the element set {c, b, e, d, f, a}.
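The sorting-and-combining step in the example above can be sketched as follows (the helper name `sort_and_combine` is hypothetical; "combine" is taken to mean a first-occurrence union of the sorted itemsets, which reproduces the element set {c, b, e, d, f, a} from the example):

```python
def sort_and_combine(closed_sets, item_order):
    """Sort the items inside each closed frequent itemset by the global
    support-descending 1-itemset order, then combine them into the
    element set (first occurrence of each item wins)."""
    rank = {item: i for i, item in enumerate(item_order)}
    sorted_sets = [sorted(c, key=rank.get) for c in closed_sets]
    seen, elements = set(), []
    for c in sorted_sets:
        for item in c:
            if item not in seen:
                seen.add(item)
                elements.append(item)
    return sorted_sets, elements
```

The element set returned here is in combination order; as described next, it is subsequently re-sorted by the same support-descending order.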
- Because the element set is obtained by combining the sorted closed frequent itemsets, the order of its items (or elements) may not follow decreasing support, so the element set must be re-sorted according to the sorted 1-itemset collection; items that do not appear in the element set are then removed from all transactions in the large-scale data set, which both reduces the number of transactions and shortens their sequence lengths.
- Each transaction in the large-scale data set whose sequence length exceeds the maximum length constraint is string-matched against the closed frequent itemsets and truncated according to the most similar string; matching and truncation may be repeated until the transaction's sequence length no longer exceeds the constraint.
- Shortening over-long transactions by similar strings avoids the information loss caused by truncating transactions directly at the maximum length constraint, thereby lowering global sensitivity, improving data availability, and greatly reducing the computational intensity of the subsequent mining process.
- As an example, a longest common subsequence algorithm can be used to implement the string matching between over-long transactions and the closed frequent itemsets.
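A sketch of the similarity-based truncation using a longest-common-subsequence score. The patent names the matching algorithm but not the truncation rule, so this version keeps the items shared with the best-matching closed frequent itemset and falls back to a hard cut if no progress is made; `truncate_transaction` and that fallback are assumptions:

```python
def lcs_length(a, b):
    """Classic O(|a|*|b|) longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def truncate_transaction(txn, closed_sets, l_max):
    """Shrink an over-long transaction toward its most similar closed
    frequent itemset instead of cutting it blindly at l_max."""
    while len(txn) > l_max and closed_sets:
        best = max(closed_sets, key=lambda c: lcs_length(txn, c))
        reduced = [item for item in txn if item in set(best)]
        if len(reduced) >= len(txn):      # no progress: fall back to a hard cut
            break
        txn = reduced
    return txn[:l_max]
```

Because the truncation keeps items that co-occur in a mined closed frequent itemset, less support-relevant information is lost than with a blind prefix cut.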
- In step S104, the noise FP-Tree of the large-scale data set is constructed from the reduced data set, and a preset privacy budget is distributed evenly across the layers of the noise FP-Tree.
- After the reduced data set is obtained, the noise FP-Tree of the large-scale data set can be constructed from it.
- Computing the exact count of each node of the noise FP-Tree would breach privacy, so noise must be added to every node of the noise FP-Tree to initialize its count.
- l_max represents the maximum length constraint;
- ∈_1 represents the preset privacy budget (the resulting FP-Tree satisfies ∈_1-differential privacy).
- Based on the depth of the FP-Tree (i.e. the maximum length constraint), the privacy budget ∈_1 can be divided evenly, assigning a budget of ∈_1/l_max to each layer of the FP-Tree; this per-layer budget is used to add Laplace noise to each layer, where Δf is the sensitivity of the current data mining stage.
- Each node of the FP-Tree corresponds to a 1-itemset in the reduced data set, so removing or adding a path in the FP-Tree has little impact on the tree as a whole; that is, the sensitivity is small.
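The per-layer noising can be sketched as follows. The noise expression itself is not legible in this text, so the standard Laplace-mechanism scale Δf/(∈_1/l_max) is assumed, and the helper names are hypothetical:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Draw Laplace(0, scale) noise by inverse-transform sampling."""
    u = 0.0
    while u == 0.0:            # reject 0 so the log argument stays positive
        u = rng.random()
    u -= 0.5                   # u is now uniform on (-0.5, 0.5)
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def noisy_node_count(count, sensitivity, epsilon1, l_max, rng=random):
    """Perturb one FP-Tree node count: each of the l_max layers gets
    epsilon1 / l_max of the budget, so the Laplace scale for a node is
    sensitivity / (epsilon1 / l_max)."""
    per_layer_eps = epsilon1 / l_max
    return count + laplace_noise(sensitivity / per_layer_eps, rng)
```

The noise is zero-mean, so averaged over many nodes the noisy counts remain centered on the true counts while each individual count is protected.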
- In step S105, a candidate set is selected from the noise FP-Tree according to a preset noise threshold, the privacy protection of the candidate set is strengthened with preset geometric-mechanism noise, and a preset number of frequent itemsets are selected from the candidate set.
- From the noise FP-Tree, a noisy count can be obtained for each frequent itemset in the reduced data set; each count is compared with the preset noise threshold, and the frequent itemsets whose counts exceed the threshold form the candidate set.
- Geometric-mechanism noise is then added to each frequent itemset in the candidate set to further strengthen privacy protection.
- Finally, a preset number of frequent itemsets are selected from the candidate set to complete the frequent itemset mining of the large-scale data set, for example by selecting the top K frequent itemsets from the candidate set.
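The threshold-then-perturb selection can be sketched as follows. Two-sided geometric (discrete Laplace) noise with α = e^(−ε/Δf) is the standard geometric mechanism; the function names and the parameter split are assumptions, not the patent's exact construction:

```python
import math
import random

def geometric_noise(epsilon, sensitivity, rng=random):
    """Two-sided geometric (discrete Laplace) noise with
    alpha = exp(-epsilon / sensitivity)."""
    alpha = math.exp(-epsilon / sensitivity)

    def one_sided():
        # geometric sample via inverse transform: floor(log(1-u) / log(alpha))
        return int(math.log(1 - rng.random()) / math.log(alpha))

    # the difference of two i.i.d. geometric draws is two-sided geometric
    return one_sided() - one_sided()

def top_k_frequent(noisy_counts, threshold, k, epsilon2, sensitivity=1, rng=random):
    """Keep itemsets whose noisy count exceeds the threshold, perturb the
    survivors with geometric noise, and return the top K by noisy count."""
    candidates = {s: c for s, c in noisy_counts.items() if c > threshold}
    perturbed = {s: c + geometric_noise(epsilon2, sensitivity, rng)
                 for s, c in candidates.items()}
    return sorted(perturbed, key=perturbed.get, reverse=True)[:k]
```

With a large per-query budget the added noise is almost always zero and the selection reduces to a plain thresholded top-K; smaller budgets trade ranking accuracy for stronger protection.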
- In this embodiment, the closed frequent itemsets of the sample data set are mined, the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint, and the noise FP-Tree is constructed from the reduced data set with the privacy budget distributed evenly across it during construction.
- After construction, the candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are selected from the candidate set.
- Embodiment 2:
- FIG. 2 shows the structure of the frequent itemset mining apparatus for large-scale data sets provided by Embodiment 2 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, including:
- The sample size estimating unit 21 is configured to receive a large-scale data set input by the user, and estimate the sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold.
- In this embodiment, the sample size corresponding to the large-scale data set can be estimated based on the preset precision threshold and confidence threshold.
- Preferably, since the distribution of frequent itemsets in the large-scale data set follows a binomial probability model, the sample size is estimated using the normal distribution table, the precision threshold, and the confidence threshold, such that the absolute error of the estimate does not exceed the precision threshold and the confidence of the estimate is not less than the confidence threshold, effectively improving the accuracy of the sample size estimation.
- The estimation formula for the sample size can be expressed as n ≥ Z_a² · p(1 − p) / δ², where:
- p is the overall probability of the frequent itemset in the large-scale data set, estimated by f_n / n;
- n is the sample size to be estimated;
- f_n is the number of times the frequent itemset appears in the random sample;
- δ is the precision threshold;
- a is the confidence threshold, Z_a being the corresponding critical value of the normal distribution.
- The sampling and mining unit 22 is configured to perform simple random sampling on the large-scale data set, generate a sample data set of the estimated sample size, and mine the closed frequent itemsets in the sample data set.
- The large-scale data set can be sampled by simple random sampling with a sampling tool, and the collected samples form a sample data set of the estimated size.
- For a large-scale data set, processing a sample data set that meets the precision requirement achieves the same processing goal, so the sample data set can subsequently be mined to obtain multiple closed frequent itemsets of the large-scale data set, effectively reducing the computational intensity.
- The data set reduction unit 23 is configured to calculate a maximum length constraint corresponding to the large-scale data set from the sample data set, and generate a reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint.
- A preset heuristic method can estimate the distribution {z_1, ..., z_i, ..., z_n} of all transaction sequence lengths in the sample data set, where z_i denotes the number of transactions of sequence length i.
- Starting from sequence length 1, the number of transactions of each length i is computed incrementally and accumulated until a stopping condition involving the constraint parameter η is satisfied; the smallest such i value is set as the maximum length constraint.
- η is a preset constraint parameter.
- The large-scale data set can be processed according to the closed frequent itemsets and the maximum length constraint to obtain the corresponding reduced data set, thereby lowering the global sensitivity of frequent itemset mining on the large-scale data set and improving the availability of the data and the mining results.
- The FP-Tree construction unit 24 is configured to construct a noise FP-Tree of the large-scale data set from the reduced data set, and distribute the preset privacy budget evenly across the layers of the noise FP-Tree.
- After the reduced data set is obtained, the noise FP-Tree of the large-scale data set can be constructed from it.
- Computing the exact count of each node of the noise FP-Tree would breach privacy, so noise must be added to every node of the noise FP-Tree to initialize its count.
- l_max represents the maximum length constraint;
- ∈_1 represents the preset privacy budget (the resulting FP-Tree satisfies ∈_1-differential privacy).
- Based on the depth of the FP-Tree, the privacy budget ∈_1 can be divided evenly, assigning ∈_1/l_max to each layer of the FP-Tree; this per-layer budget is used to add Laplace noise to each layer.
- Δf is the sensitivity of the current data mining stage.
- Each node of the FP-Tree corresponds to a 1-itemset in the reduced data set, so removing or adding a path in the FP-Tree has little impact on the tree as a whole; that is, the sensitivity is small.
- The frequent itemset screening unit 25 is configured to select a candidate set from the noise FP-Tree according to a preset noise threshold, strengthen the privacy protection of the candidate set with preset geometric-mechanism noise, and select a preset number of frequent itemsets from the candidate set.
- A noisy count for each frequent itemset in the reduced data set can be obtained from the noise FP-Tree; each count is compared with the preset noise threshold, and the frequent itemsets whose counts exceed the threshold form the candidate set.
- Geometric-mechanism noise is then added to each frequent itemset in the candidate set to further strengthen privacy protection.
- Finally, a preset number of frequent itemsets are selected from the candidate set, completing the frequent itemset mining of the large-scale data set.
- the data set reduction unit 23 includes:
- a length distribution estimating unit 331, configured to obtain the number of transactions of each sequence length in the sample data set by estimating the distribution of all transaction sequence lengths in the sample data set; and
- a length constraint calculating unit 332, configured to calculate the maximum length constraint from the number of transactions of each sequence length in the sample data set and the preset constraint parameter.
- Preferably, the data set reduction unit 23 further includes a 1-itemset sorting unit 333, a closed frequent itemset sorting unit 334, an item culling unit 335, and a transaction truncation unit 336, wherein:
- The 1-itemset sorting unit 333 is configured to scan the 1-itemsets whose support in the large-scale data set exceeds a preset support threshold, and arrange the collection of all such 1-itemsets in order of decreasing support.
- The closed frequent itemset sorting unit 334 is configured to sort each closed frequent itemset according to the sorted 1-itemset collection, and generate the corresponding element set from all the sorted closed frequent itemsets.
- In this embodiment, the items within each closed frequent itemset are sorted according to the sorted 1-itemset collection, and the sorted closed frequent itemsets are then combined into the element set.
- As an example, when the support-sorted 1-itemset collection is {a, c, e, b, d, f}, the closed frequent itemsets {c, b}, {f, d, e}, {a, b, c} are sorted separately to obtain {c, b}, {e, d, f}, {a, c, b}, and these are then combined into the element set {c, b, e, d, f, a}.
- The item culling unit 335 is configured to sort the element set according to the sorted 1-itemset collection, and remove redundant items from all transactions in the large-scale data set according to the sorted element set.
- In this embodiment, because the element set is obtained by combining the sorted closed frequent itemsets, the order of its items (or elements) may not follow decreasing support, so the element set must be re-sorted according to the sorted 1-itemset collection; items that do not appear in it are then removed from all transactions in the large-scale data set, which both reduces the number of transactions and shortens their sequence lengths.
- The transaction truncation unit 336 is configured to perform similarity matching between the transactions in the large-scale data set whose sequence length exceeds the maximum length constraint and the closed frequent itemsets, and to truncate those transactions according to the matching results.
- In this embodiment, each such transaction is string-matched against the closed frequent itemsets and truncated according to the most similar string; matching and truncation may be repeated until the transaction's sequence length no longer exceeds the constraint. Shortening over-long transactions by similar strings avoids the information loss caused by truncating transactions directly at the maximum length constraint, thereby lowering global sensitivity, improving data availability, and greatly reducing the computational intensity of the subsequent mining process.
- In this embodiment, the closed frequent itemsets in the sample data set are mined, the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint, and the noise FP-Tree is constructed from the reduced data set with the privacy budget distributed evenly across it during construction.
- After construction, the candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are selected from the candidate set.
- Each unit of the frequent itemset mining apparatus for large-scale data sets may be implemented by corresponding hardware or software units; each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the invention.
- Embodiment 3:
- FIG. 4 shows the structure of a computing device provided by Embodiment 3 of the present invention. For the convenience of description, only parts related to the embodiment of the present invention are shown.
- the computing device 4 of an embodiment of the present invention includes a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40.
- When executing the computer program 42, the processor 40 implements the steps in the above-described method embodiment, such as steps S101 to S105 shown in FIG. 1.
- When the processor 40 executes the computer program 42, the functions of the units in the above-described apparatus embodiment are implemented, such as the functions of units 21 to 25 shown in FIG. 2.
- In this embodiment, the closed frequent itemsets in the sample data set are mined, the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint, and the noise FP-Tree is constructed from the reduced data set with the privacy budget distributed evenly across it during construction.
- After construction, the candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are selected from the candidate set.
- Embodiment 4:
- A computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps in the foregoing method embodiment, for example steps S101 to S105 shown in FIG. 1.
- When executed by the processor, the computer program also implements the functions of the various units in the above-described apparatus embodiment, such as the functions of units 21 to 25 shown in FIG. 2.
- In this embodiment, the closed frequent itemsets in the sample data set are mined, the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint, and the noise FP-Tree is constructed from the reduced data set with the privacy budget distributed evenly across it during construction.
- After construction, the candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are selected from the candidate set.
- The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium such as ROM/RAM, a magnetic disk, an optical disk, or a flash memory.
Abstract
A frequent itemset mining method, apparatus, device, and medium for large-scale data sets. The method comprises: estimating a sample size; sampling from the large-scale data set a sample data set of that size; mining the closed frequent itemsets in the sample data set and calculating the maximum length constraint corresponding to the large-scale data set, so as to generate a reduced data set corresponding to the large-scale data set; constructing the noise FP-Tree of the large-scale data set from the reduced data set; distributing the privacy budget evenly across the layers of the noise FP-Tree; selecting a candidate set via the noise FP-Tree and a noise threshold; strengthening the privacy protection of the candidate set with geometric-mechanism noise; and selecting a preset number of frequent itemsets from the candidate set. This reduces the computational intensity of frequent itemset mining on large-scale data sets, guarantees the privacy of data mining, lowers global sensitivity, and improves the availability of the data and the mining results.
Description
The invention belongs to the field of information technology, and in particular relates to a frequent itemset mining method, apparatus, device, and medium for large-scale data sets.
In recent years, with the explosive growth of data and the rapid development of information technology (especially network technology and data storage technology), every industry has accumulated massive data through various channels; discovering useful knowledge from this massive data and applying it across industries (business decision-making, potential-customer analysis, and so on) has become an urgent problem.
Because local computing resources are limited and cloud computing is developing rapidly, it is often wiser for an enterprise or individual to outsource data mining to the cloud rather than mine locally, saving manpower and material resources. However, outsourcing data mining to the cloud raises the risk that enterprise or personal privacy will be leaked; both data providers and data-service providers hope to mine meaningful data for decision-making without disclosing privacy or exposing the data itself. At present there is no effective privacy-preserving mining method for large-scale data sets that simultaneously accounts for the privacy and usability of frequent pattern mining while reducing computational intensity.
Traditional privacy protection methods are mostly based on K-anonymity and its extension models. Such methods require certain assumptions; once those assumptions are violated, privacy is difficult to protect. A shortcoming of K-anonymity and its extension models is that the attack model is not rigorously defined and the attacker's knowledge is not quantified. In addition, some existing attack models challenge the privacy effectiveness of such methods. Dwork proposed a strong privacy protection model based on data distortion, ∈-differential privacy, characterized by a rigorous privacy definition and independence from the attacker's background knowledge; although differential privacy provides good privacy guarantees, it commonly suffers from high sensitivity and poor usability.
Summary of the Invention
The object of the present invention is to provide a frequent itemset mining method, apparatus, device, and storage medium for large-scale data sets, aiming to solve the problem that prior-art frequent itemset mining methods for large-scale data sets cannot simultaneously account for the privacy, usability, sensitivity, and computational intensity of data mining.
In one aspect, the present invention provides a frequent itemset mining method for large-scale data sets, the method comprising the following steps:
receiving a large-scale data set input by a user, and estimating a sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold;
performing simple random sampling on the large-scale data set, generating a sample data set of the estimated sample size, and mining the closed frequent itemsets in the sample data set;
calculating a maximum length constraint corresponding to the large-scale data set from the sample data set, and generating a reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint;
constructing a noise FP-Tree of the large-scale data set from the reduced data set, and distributing a preset privacy budget evenly across the layers of the noise FP-Tree; and
selecting a candidate set from the noise FP-Tree according to a preset noise threshold, strengthening the privacy protection of the candidate set with preset geometric-mechanism noise, and selecting a preset number of frequent itemsets from the candidate set.
In another aspect, the present invention provides a frequent itemset mining apparatus for large-scale data sets, the apparatus comprising:
a sample size estimating unit, configured to receive a large-scale data set input by a user, and estimate a sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold;
a sampling and mining unit, configured to perform simple random sampling on the large-scale data set, generate a sample data set of the estimated sample size, and mine the closed frequent itemsets in the sample data set;
a data set reduction unit, configured to calculate a maximum length constraint corresponding to the large-scale data set from the sample data set, and generate a reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint;
an FP-Tree construction unit, configured to construct a noise FP-Tree of the large-scale data set from the reduced data set, and distribute a preset privacy budget evenly across the layers of the noise FP-Tree; and
a frequent itemset screening unit, configured to select a candidate set from the noise FP-Tree according to a preset noise threshold, strengthen the privacy protection of the candidate set with preset geometric-mechanism noise, and select a preset number of frequent itemsets from the candidate set.
In another aspect, the present invention further provides a computing device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of the frequent itemset mining method for large-scale data sets described above.
In another aspect, the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the frequent itemset mining method for large-scale data sets described above.
The present invention samples from the large-scale data set a sample data set of the estimated sample size and mines the closed frequent itemsets in the sample data set, reducing the computational intensity of frequent itemset mining on large data sets; it calculates the maximum length constraint from the sample data set and, through the closed frequent itemsets and the maximum length constraint, generates the reduced data set corresponding to the large-scale data set, lowering the global sensitivity of the mining; the noise FP-Tree is then constructed from the reduced data set, with the privacy budget distributed evenly across it during construction; after construction, a candidate set is selected from the noise FP-Tree by the noise threshold, geometric-mechanism noise strengthens the privacy protection of the candidate set, and the frequent itemsets are selected from the candidate set. The computational intensity of frequent itemset mining on large-scale data sets is thereby reduced, the privacy of data mining is guaranteed, global sensitivity is lowered, and the availability of the data and the mining results is improved.
FIG. 1 is a flowchart of the implementation of the frequent itemset mining method for large-scale data sets provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic structural diagram of the frequent itemset mining apparatus for large-scale data sets provided by Embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of a preferred structure of the frequent itemset mining apparatus for large-scale data sets provided by Embodiment 2 of the present invention; and
FIG. 4 is a schematic structural diagram of the computing device provided by Embodiment 3 of the present invention.
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
The specific implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment 1:
FIG. 1 shows the implementation flow of the frequent itemset mining method for large-scale data sets provided by Embodiment 1 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:
In step S101, a large-scale data set input by the user is received, and the sample size corresponding to the large-scale data set is estimated according to a preset precision threshold and a preset confidence threshold.
In the embodiment of the present invention, a large-scale data set is composed of a series of transactions; for example, in supermarket shopping, all items purchased by one person in a single visit can be regarded as one transaction, and the purchases of thousands of individuals constitute a data set. After the large-scale data set input by the user is received, frequent itemset mining must be performed on it; to reduce the computational intensity of the mining process, the large-scale data is first pre-processed, during which the sample size corresponding to the large-scale data set can be estimated according to the preset precision threshold and confidence threshold.
Preferably, since the distribution of frequent itemsets in the large-scale data set follows a binomial probability model, the sample size corresponding to the large-scale data set is estimated using the normal distribution table, the precision threshold, and the confidence threshold, where the absolute error of the estimate does not exceed the precision threshold and the confidence of the estimate is not less than the confidence threshold, effectively improving the accuracy of the sample size estimation. Specifically, the estimation starts from the requirement P(|f_n/n − p| ≤ δ) ≥ a, where p denotes the overall probability of the frequent itemset in the large-scale data set, n denotes the sample size to be estimated, f_n denotes the number of times the frequent itemset appears in the random sample (so that f_n/n estimates p), δ is the precision threshold, and a is the confidence threshold. From this requirement and the normal distribution table it can be derived that the sample size n satisfies n ≥ Z_a² · p(1 − p) / δ², where Z_a is the critical value of the normal distribution from the normal distribution table.
In step S102, simple random sampling is performed on the large-scale data set to generate a sample data set of the estimated sample size, and the closed frequent itemsets in the sample data set are mined.
In the embodiment of the present invention, after the sample size is estimated during pre-processing, the large-scale data set can be sampled by simple random sampling with a sampling tool (for example SAS, the Statistical Analysis System), the collected samples forming a sample data set of the estimated size. For a large-scale data set, processing a sample data set that meets the precision requirement achieves the same processing goal, so the sample data set can subsequently be mined to obtain multiple closed frequent itemsets, effectively reducing the computational intensity.
As an example, the sample data set can be mined with the Apriori mining algorithm to obtain the closed frequent itemsets of the sample data set.
In step S103, the maximum length constraint corresponding to the large-scale data set is calculated from the sample data set, and the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint.
In the embodiment of the present invention, a preset heuristic method can estimate the distribution {z_1, ..., z_i, ..., z_n} of all transaction sequence lengths in the sample data set, where z_i denotes the number of transactions of sequence length i in the sample data set. Starting from sequence length 1, the number of transactions of each length i is computed incrementally and accumulated until a stopping condition involving the preset constraint parameter η is satisfied; the smallest i value satisfying the condition is set as the maximum length constraint.
In an embodiment of the present invention, once the maximum length constraint has been computed, the large-scale data set can be processed according to the closed frequent itemsets and the maximum length constraint (including reducing the number of transactions in the large-scale data set and reducing the sequence length of its transactions) to obtain the corresponding reduced data set, thereby lowering the global sensitivity of frequent itemset mining over the large-scale data set and improving the usability of the data and of the mining results.
Preferably, the large-scale data set is processed into its reduced data set by the following steps:
(1) Scan the large-scale data set for the 1-itemsets whose support exceeds a preset support threshold, and sort the set of all such 1-itemsets in descending order of support.
In an embodiment of the present invention, the 1-itemsets of the large-scale data set whose support exceeds the preset support threshold are the frequent 1-itemsets.
(2) Sort each closed frequent itemset according to the sorted 1-itemset set, and generate the corresponding element set from all the sorted closed frequent itemsets.
In an embodiment of the present invention, the items within each closed frequent itemset are ordered according to the sorted 1-itemset set, and the sorted closed frequent itemsets are then combined into an element set.
As an example, when the 1-itemset set sorted by support is {a,c,e,b,d,f}, sorting the closed frequent itemsets {c,b}, {f,d,e}, {a,b,c} yields {c,b}, {e,d,f}, {a,c,b}, and combining these closed frequent itemsets yields the element set {c,b,e,d,f,a}.
(3) Sort the element set according to the sorted 1-itemset set, and remove the redundant items from every transaction in the large-scale data set according to the sorted element set.
In an embodiment of the present invention, because the element set is combined from the sorted closed frequent itemsets, the order of its items (or elements) may not follow the descending order of support, so the element set must first be re-sorted according to the sorted 1-itemset set; the items of each transaction in the large-scale data set that do not appear in the element set are then removed, which both reduces the number of transactions in the large-scale data set and shortens the sequence length of its transactions.
(4) Match the transactions of the large-scale data set whose sequence length exceeds the maximum length constraint against the closed frequent itemsets by similarity, and truncate those transactions according to the matching result.
In an embodiment of the present invention, after the redundant items have been removed from the transactions of the large-scale data set, each transaction whose sequence length exceeds the maximum length constraint is string-matched against the closed frequent itemsets, and the transaction is truncated according to the most similar string; matching and truncation may be repeated until the transaction's sequence length no longer exceeds the maximum length constraint. Shortening transactions through similar strings in this way avoids the information loss that would result from truncating transactions directly at the maximum length constraint, thereby lowering the global sensitivity and improving the usability of the data, while also greatly reducing the computational load of the subsequent mining.
As an example, the longest-common-subsequence algorithm can be used to string-match the transactions of the large-scale data set whose sequence length exceeds the maximum length constraint against the closed frequent itemsets.
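A minimal sketch of similarity-based truncation using LCS length as the similarity measure. The exact truncation rule is not fully specified in the text, so keeping only the items shared with the most similar closed itemset is an assumption:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def truncate_transaction(txn, closed_itemsets, l_max):
    """Repeatedly shrink an over-long transaction toward its most similar
    closed frequent itemset until it fits within l_max."""
    while len(txn) > l_max:
        best = max(closed_itemsets, key=lambda c: lcs_len(txn, c))
        reduced = [x for x in txn if x in set(best)]
        if len(reduced) >= len(txn):   # no progress: fall back to a hard cut
            return txn[:l_max]
        txn = reduced
    return txn

shortened = truncate_transaction(list("acbdef"), [list("acb"), list("ef")], 3)
# → ['a', 'c', 'b']
```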
In step S104, the noise FP-Tree of the large-scale data set is built from the reduced data set, and a preset privacy budget is distributed evenly over the layers of the noise FP-Tree.
In an embodiment of the present invention, after the reduced data set of the large-scale data set has been obtained, the noise FP-Tree of the large-scale data set can be built from it. When the noise FP-Tree is built, computing the count of each node would violate privacy, so noise must be added at every node of the noise FP-Tree to initialize its count (reconstructed from a garbled extraction: Laplace noise of scale l_max/∈_1, where l_max denotes the maximum length constraint and ∈_1 the preset privacy budget; the resulting FP-Tree satisfies ∈_1-differential privacy). Meanwhile, the privacy budget ∈_1 can be divided evenly according to the depth of the FP-Tree (i.e. the maximum length constraint), so that each layer of the FP-Tree receives a budget of ∈_1/l_max, which is used to add Laplace noise of scale Δf·l_max/∈_1 to that layer, where Δf is the sensitivity of the current mining stage. Since each node of the FP-Tree corresponds to one 1-itemset of the reduced data set, removing or adding a single path of the FP-Tree has little effect on the tree as a whole, i.e. the sensitivity is small.
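The per-layer budget split can be sketched as follows. The Laplace scale sensitivity·l_max/∈₁ follows from splitting ∈₁ evenly over l_max layers, but since the original formula is garbled, the unit sensitivity default is an assumption:

```python
import random

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise: the difference of two independent
    exponential draws with rate 1/scale is Laplace-distributed."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def per_layer_scale(epsilon_1: float, l_max: int, sensitivity: float = 1.0) -> float:
    """Budget epsilon_1 split evenly over l_max layers gives each layer
    epsilon_1 / l_max, hence a Laplace scale of sensitivity * l_max / epsilon_1."""
    return sensitivity * l_max / epsilon_1

def noisy_node_count(count: float, epsilon_1: float, l_max: int) -> float:
    """Perturbed count for one node of the noise FP-Tree."""
    return count + laplace_noise(per_layer_scale(epsilon_1, l_max))
```

Note the trade-off the text relies on: a smaller l_max leaves more budget per layer, so less noise is added at each node.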
In step S105, a candidate set is selected from the noise FP-Tree according to a preset noise threshold, the privacy protection of the candidate set is strengthened with preset geometric-mechanism noise, and a preset number of frequent itemsets is selected from the candidate set.
In an embodiment of the present invention, the noisy count of each frequent itemset in the reduced data set can be obtained from the noise FP-Tree. Each such count is compared with the preset noise threshold, and the frequent itemsets whose counts exceed the noise threshold form the candidate set. Geometric-mechanism noise is then added to every frequent itemset in the candidate set to further strengthen privacy protection (the exact noise parameter is garbled in the extraction; per the original it depends on the candidate set and on the size N of the large-scale data set). Finally, a preset number of frequent itemsets is selected from the candidate set, e.g. the top K, completing the frequent itemset mining of the large-scale data set.
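The threshold-then-perturb-then-top-K selection can be sketched as below. The two-sided geometric (discrete Laplace) construction and the way ε₂ enters are assumptions, since the original noise parameter is garbled:

```python
import math
import random

def two_sided_geometric(epsilon: float, sensitivity: float = 1.0) -> int:
    """Two-sided geometric noise: the difference of two geometric
    variables with ratio alpha = exp(-epsilon / sensitivity)."""
    alpha = math.exp(-epsilon / sensitivity)

    def geom() -> int:
        u = 1.0 - random.random()  # u in (0, 1]
        return int(math.log(u) / math.log(alpha)) if alpha > 0.0 else 0

    return geom() - geom()

def select_top_k(noisy_counts, threshold, k, epsilon_2):
    """Keep itemsets whose noisy FP-Tree count exceeds the threshold,
    perturb each surviving count with geometric-mechanism noise, and
    return the top-k itemsets by perturbed count."""
    candidates = {s: c + two_sided_geometric(epsilon_2)
                  for s, c in noisy_counts.items() if c > threshold}
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

top = select_top_k({"a": 10, "b": 5, "c": 1}, 2, 2, 50.0)
```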
In an embodiment of the present invention, the closed frequent itemsets of the sample data set are mined; the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint; a noise FP-Tree is built from the reduced data set, with the privacy budget distributed evenly over the noise FP-Tree during construction; once built, a candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are then selected from the candidate set. The computational load of frequent itemset mining over large-scale data sets is thereby reduced, the privacy of the mining is guaranteed, the global sensitivity is lowered, and the usability of the data and of the mining results is improved.
Embodiment 2:
Fig. 2 shows the structure of the frequent itemset mining apparatus for large-scale data sets provided by Embodiment 2 of the present invention. For ease of description, only the parts relevant to this embodiment are shown, comprising:
a sample size estimation unit 21, configured to receive a large-scale data set input by a user and to estimate the sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold.
In an embodiment of the present invention, after the large-scale data set input by the user is received, frequent itemset mining must be performed on it. To lower the computational load of the mining, the large-scale data is preprocessed first; during preprocessing, the sample size corresponding to the large-scale data set can be estimated from the preset precision threshold and confidence threshold.
Preferably, the distribution of frequent itemsets in the large-scale data set follows a binomial probability model, so the sample size corresponding to the large-scale data set is estimated from the normal distribution table, the precision threshold, and the confidence threshold, where the absolute error of the sample-size estimate does not exceed the precision threshold and the confidence of the estimate is not less than the confidence threshold, which effectively improves the accuracy of the estimate. Specifically (the formulas below are reconstructed from a garbled extraction of the original): let p denote the overall probability of a frequent itemset in the large-scale data set, n the sample size to be estimated, and f_n the number of occurrences of the frequent itemset in the random sample; δ is the precision threshold and a the confidence threshold. The estimation requirement is
P(|f_n/n − p| ≤ δ) ≥ a.
Since f_n/n is approximately normal with mean p and variance p(1−p)/n, the normal distribution table gives the condition
Z_a · √(p(1−p)/n) ≤ δ,
from which it can be determined that the sample size n satisfies
n ≥ Z_a² · p(1−p) / δ²,
where Z_a is the critical value taken from the normal distribution table.
a sampling and mining unit 22, configured to perform simple random sampling on the large-scale data set, to generate a sample data set of the estimated sample size, and to mine the closed frequent itemsets of the sample data set.
In an embodiment of the present invention, once the sample size has been estimated during preprocessing, a sampling tool can draw a simple random sample of that size from the large-scale data set, the collected samples forming the sample data set. Since processing a sample data set that meets the precision requirement serves the same purpose as processing the whole large-scale data set, the subsequent mining can be carried out on the sample data set to obtain the closed frequent itemsets of the large-scale data set, which effectively lowers the computational load.
a data set reduction unit 23, configured to compute the maximum length constraint corresponding to the large-scale data set from the sample data set, and to generate the reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint.
In an embodiment of the present invention, a preset heuristic can be applied to the distribution {z_1, ..., z_i, ..., z_n} of transaction sequence lengths in the sample data set, where z_i denotes the number of transactions of sequence length i in the sample data set. Starting from sequence length 1, the number of transactions at each length i is counted and accumulated until the cumulative sum satisfies the stopping condition (reconstructed from a garbled extraction as Σ_{j=1}^{i} z_j ≥ (1 − η)·|D_s|, with |D_s| the number of transactions in the sample data set); the smallest value of i satisfying the condition is set as the maximum length constraint. Here η is a preset constraint parameter.
In an embodiment of the present invention, once the maximum length constraint has been computed, the large-scale data set can be processed according to the closed frequent itemsets and the maximum length constraint to obtain the corresponding reduced data set, thereby lowering the global sensitivity of frequent itemset mining over the large-scale data set and improving the usability of the data and of the mining results.
an FP-Tree construction unit 24, configured to build the noise FP-Tree of the large-scale data set from the reduced data set, and to distribute a preset privacy budget evenly over the layers of the noise FP-Tree.
In an embodiment of the present invention, after the reduced data set of the large-scale data set has been obtained, the noise FP-Tree of the large-scale data set can be built from it. When the noise FP-Tree is built, computing the count of each node would violate privacy, so noise must be added at every node of the noise FP-Tree to initialize its count (reconstructed from a garbled extraction: Laplace noise of scale l_max/∈_1, where l_max denotes the maximum length constraint and ∈_1 the preset privacy budget; the resulting FP-Tree satisfies ∈_1-differential privacy). Meanwhile, the privacy budget ∈_1 can be divided evenly according to the depth of the FP-Tree, so that each layer of the FP-Tree receives a budget of ∈_1/l_max, which is used to add Laplace noise of scale Δf·l_max/∈_1 to that layer, where Δf is the sensitivity of the current mining stage. Since each node of the FP-Tree corresponds to one 1-itemset of the reduced data set, removing or adding a single path of the FP-Tree has little effect on the tree as a whole, i.e. the sensitivity is small.
a frequent itemset screening unit 25, configured to select a candidate set from the noise FP-Tree according to a preset noise threshold, to strengthen the privacy protection of the candidate set with preset geometric-mechanism noise, and to select a preset number of frequent itemsets from the candidate set.
In an embodiment of the present invention, the noisy count of each frequent itemset in the reduced data set can be obtained from the noise FP-Tree. Each such count is compared with the preset noise threshold, and the frequent itemsets whose counts exceed the noise threshold form the candidate set. Geometric-mechanism noise is then added to every frequent itemset in the candidate set to further strengthen privacy protection (the exact noise parameter is garbled in the extraction; per the original it depends on the candidate set and on the size N of the large-scale data set). Finally, a preset number of frequent itemsets is selected from the candidate set, completing the frequent itemset mining of the large-scale data set.
Preferably, as shown in Fig. 3, the data set reduction unit 23 comprises:
a length distribution estimation unit 331, configured to obtain the number of transactions at each sequence length in the sample data set by estimating the distribution of transaction sequence lengths in the sample data set; and
a length constraint computation unit 332, configured to compute the maximum length constraint from the numbers of transactions at each sequence length in the sample data set and a preset constraint parameter.
Preferably, the data set reduction unit 23 further comprises a 1-itemset sorting unit 333, a closed frequent itemset sorting unit 334, an item removal unit 335, and a transaction truncation unit 336, wherein:
the 1-itemset sorting unit 333 is configured to scan the large-scale data set for the 1-itemsets whose support exceeds a preset support threshold and to sort the set of all such 1-itemsets in descending order of support.
the closed frequent itemset sorting unit 334 is configured to sort each closed frequent itemset according to the sorted 1-itemset set and to generate the corresponding element set from all the sorted closed frequent itemsets;
In an embodiment of the present invention, the items within each closed frequent itemset are ordered according to the sorted 1-itemset set, and the sorted closed frequent itemsets are then combined into an element set.
As an example, when the 1-itemset set sorted by support is {a,c,e,b,d,f}, sorting the closed frequent itemsets {c,b}, {f,d,e}, {a,b,c} yields {c,b}, {e,d,f}, {a,c,b}, and combining these closed frequent itemsets yields the element set {c,b,e,d,f,a}.
the item removal unit 335 is configured to sort the element set according to the sorted 1-itemset set and to remove the redundant items from every transaction in the large-scale data set according to the sorted element set.
In an embodiment of the present invention, because the element set is combined from the sorted closed frequent itemsets, the order of its items (or elements) may not follow the descending order of support, so the element set must first be re-sorted according to the sorted 1-itemset set; the items of each transaction in the large-scale data set that do not appear in the element set are then removed, which both reduces the number of transactions in the large-scale data set and shortens the sequence length of its transactions.
the transaction truncation unit 336 is configured to match the transactions of the large-scale data set whose sequence length exceeds the maximum length constraint against the closed frequent itemsets by similarity and to truncate those transactions according to the matching result.
In an embodiment of the present invention, after the redundant items have been removed from the transactions of the large-scale data set, each transaction whose sequence length exceeds the maximum length constraint is string-matched against the closed frequent itemsets, and the transaction is truncated according to the most similar string; matching and truncation may be repeated until the transaction's sequence length no longer exceeds the maximum length constraint. Shortening transactions through similar strings in this way avoids the information loss that would result from truncating transactions directly at the maximum length constraint, thereby lowering the global sensitivity and improving the usability of the data, while also greatly reducing the computational load of the subsequent mining.
In an embodiment of the present invention, the closed frequent itemsets of the sample data set are mined; the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint; a noise FP-Tree is built from the reduced data set, with the privacy budget distributed evenly over the noise FP-Tree during construction; once built, a candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are then selected from the candidate set. The computational load of frequent itemset mining over large-scale data sets is thereby reduced, the privacy of the mining is guaranteed, the global sensitivity is lowered, and the usability of the data and of the mining results is improved.
In an embodiment of the present invention, each unit of the frequent itemset mining apparatus for large-scale data sets may be implemented by corresponding hardware or software units; the units may be independent software or hardware units or may be integrated into a single software or hardware unit, and this is not intended to limit the present invention.
Embodiment 3:
Fig. 4 shows the structure of the computing device provided by Embodiment 3 of the present invention. For ease of description, only the parts relevant to this embodiment are shown.
The computing device 4 of this embodiment of the present invention comprises a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When executing the computer program 42, the processor 40 implements the steps of the method embodiment above, e.g. steps S101 to S105 shown in Fig. 1; alternatively, when executing the computer program 42, the processor 40 implements the functions of the units of the apparatus embodiment above, e.g. units 21 to 25 shown in Fig. 2.
In an embodiment of the present invention, the closed frequent itemsets of the sample data set are mined; the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint; a noise FP-Tree is built from the reduced data set, with the privacy budget distributed evenly over the noise FP-Tree during construction; once built, a candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are then selected from the candidate set. The computational load of frequent itemset mining over large-scale data sets is thereby reduced, the privacy of the mining is guaranteed, the global sensitivity is lowered, and the usability of the data and of the mining results is improved.
Embodiment 4:
In an embodiment of the present invention, a computer-readable storage medium is provided that stores a computer program which, when executed by a processor, implements the steps of the method embodiment above, e.g. steps S101 to S105 shown in Fig. 1; alternatively, when executed by a processor, the computer program implements the functions of the units of the apparatus embodiment above, e.g. units 21 to 25 shown in Fig. 2.
In an embodiment of the present invention, the closed frequent itemsets of the sample data set are mined; the reduced data set corresponding to the large-scale data set is generated from the closed frequent itemsets and the maximum length constraint; a noise FP-Tree is built from the reduced data set, with the privacy budget distributed evenly over the noise FP-Tree during construction; once built, a candidate set is selected from the noise FP-Tree, its privacy protection is strengthened with geometric-mechanism noise, and the frequent itemsets are then selected from the candidate set. The computational load of frequent itemset mining over large-scale data sets is thereby reduced, the privacy of the mining is guaranteed, the global sensitivity is lowered, and the usability of the data and of the mining results is improved.
The computer-readable storage medium of an embodiment of the present invention may include any entity or device capable of carrying computer program code, or a recording medium, e.g. memory such as ROM/RAM, magnetic disk, optical disc, or flash memory.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
- A frequent itemset mining method for large-scale data sets, characterized in that the method comprises the steps of: receiving a large-scale data set input by a user, and estimating the sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold; performing simple random sampling on the large-scale data set to generate a sample data set of the sample size, and mining the closed frequent itemsets of the sample data set; computing the maximum length constraint corresponding to the large-scale data set from the sample data set, and generating the reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint; building the noise FP-Tree of the large-scale data set from the reduced data set, and distributing a preset privacy budget evenly over the layers of the noise FP-Tree; and selecting a candidate set from the noise FP-Tree according to a preset noise threshold, strengthening the privacy protection of the candidate set with preset geometric-mechanism noise, and selecting a preset number of frequent itemsets from the candidate set.
- The method of claim 1, characterized in that the step of estimating the sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold comprises: estimating the sample size corresponding to the large-scale data set according to a preset normal distribution table, wherein the absolute error of the sample-size estimate does not exceed the precision threshold and the confidence of the estimate is not less than the confidence threshold.
- The method of claim 1, characterized in that the step of computing the maximum length constraint corresponding to the large-scale data set from the sample data set comprises: obtaining the number of transactions at each sequence length in the sample data set by estimating the distribution of transaction sequence lengths in the sample data set; and computing the maximum length constraint from the numbers of transactions at each sequence length in the sample data set and a preset constraint parameter.
- The method of claim 1, characterized in that the step of generating the reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint comprises: scanning the large-scale data set for the 1-itemsets whose support exceeds a preset support threshold, and sorting the set of all such 1-itemsets in descending order of support; sorting each closed frequent itemset according to the sorted 1-itemset set, and generating the corresponding element set from all the sorted closed frequent itemsets; sorting the element set according to the sorted 1-itemset set, and removing the redundant items from every transaction in the large-scale data set according to the sorted element set; and matching the transactions of the large-scale data set whose sequence length exceeds the maximum length constraint against the closed frequent itemsets by similarity, and truncating those transactions according to the matching result.
- The method of claim 1, characterized in that the step of building the noise FP-Tree of the large-scale data set from the reduced data set comprises: adding noise to every node of the noise FP-Tree according to the privacy budget and the maximum length constraint to initialize the count of each node; and iteratively updating the count of each node of the noise FP-Tree according to the reduced data set.
- A frequent itemset mining apparatus for large-scale data sets, characterized in that the apparatus comprises: a sample size estimation unit, configured to receive a large-scale data set input by a user and to estimate the sample size corresponding to the large-scale data set according to a preset precision threshold and a preset confidence threshold; a sampling and mining unit, configured to perform simple random sampling on the large-scale data set, to generate a sample data set of the sample size, and to mine the closed frequent itemsets of the sample data set; a data set reduction unit, configured to compute the maximum length constraint corresponding to the large-scale data set from the sample data set and to generate the reduced data set corresponding to the large-scale data set from the closed frequent itemsets and the maximum length constraint; an FP-Tree construction unit, configured to build the noise FP-Tree of the large-scale data set from the reduced data set and to distribute a preset privacy budget evenly over the layers of the noise FP-Tree; and a frequent itemset screening unit, configured to select a candidate set from the noise FP-Tree according to a preset noise threshold, to strengthen the privacy protection of the candidate set with preset geometric-mechanism noise, and to select a preset number of frequent itemsets from the candidate set.
- The apparatus of claim 6, characterized in that the data set reduction unit comprises: a length distribution estimation unit, configured to obtain the number of transactions at each sequence length in the sample data set by estimating the distribution of transaction sequence lengths in the sample data set; and a length constraint computation unit, configured to compute the maximum length constraint from the numbers of transactions at each sequence length in the sample data set and a preset constraint parameter.
- The apparatus of claim 6, characterized in that the data set reduction unit further comprises: a 1-itemset sorting unit, configured to scan the large-scale data set for the 1-itemsets whose support exceeds a preset support threshold and to sort the set of all such 1-itemsets in descending order of support; a closed frequent itemset sorting unit, configured to sort each closed frequent itemset according to the sorted 1-itemset set and to generate the corresponding element set from all the sorted closed frequent itemsets; an item removal unit, configured to sort the element set according to the sorted 1-itemset set and to remove the redundant items from every transaction in the large-scale data set according to the sorted element set; and a transaction truncation unit, configured to match the transactions of the large-scale data set whose sequence length exceeds the maximum length constraint against the closed frequent itemsets by similarity and to truncate those transactions according to the matching result.
- A computing device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5.
- A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201880000191.8A CN108475292B (zh) | 2018-03-20 | 2018-03-20 | 大规模数据集的频繁项集挖掘方法、装置、设备及介质 |
PCT/CN2018/079554 WO2019178733A1 (zh) | 2018-03-20 | 2018-03-20 | 大规模数据集的频繁项集挖掘方法、装置、设备及介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019178733A1 true WO2019178733A1 (zh) | 2019-09-26 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637208A (zh) * | 2012-03-28 | 2012-08-15 | 南京财经大学 | 一种基于模式挖掘的噪音数据过滤方法 |
CN105740245A (zh) * | 2014-12-08 | 2016-07-06 | 北京邮电大学 | 频繁项集挖掘方法 |
CN107092837A (zh) * | 2017-04-25 | 2017-08-25 | 华中科技大学 | 一种支持差分隐私的频繁项集挖掘方法和系统 |
CN107577771A (zh) * | 2017-09-07 | 2018-01-12 | 北京海融兴通信息安全技术有限公司 | 一种大数据挖掘系统 |
CN107609110A (zh) * | 2017-09-13 | 2018-01-19 | 深圳大学 | 基于分类树的最大多样频繁模式的挖掘方法及装置 |
Non-Patent Citations (1)
Title |
---|
LEE, JAEWOO ET AL.: "Top-k frequent itemsets via differentially private FP- trees", PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 27 August 2014 (2014-08-27), XP058053796 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111222570A (zh) * | 2020-01-06 | 2020-06-02 | 广西师范大学 | 基于差分隐私的集成学习分类方法 |
CN111222570B (zh) * | 2020-01-06 | 2022-08-26 | 广西师范大学 | 基于差分隐私的集成学习分类方法 |
Also Published As
Publication number | Publication date |
---|---|
CN108475292B (zh) | 2021-08-24 |
CN108475292A (zh) | 2018-08-31 |