CN105989161A

CN105989161A - Big data processing method and apparatus

Info

Publication number: CN105989161A
Application number: CN201510095692.7A
Authority: CN
Inventors: 欧阳军; 范伟; 何诚
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2015-03-04
Filing date: 2015-03-04
Publication date: 2016-10-05

Abstract

Embodiments of the present invention provide a method and an apparatus for processing big data. The method includes: receiving a query instruction sent by a client, and determining a query function K according to the query instruction; querying a large data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; obtaining a sensitivity S(K) of the query function K, where S(K) characterizes the sensitivity of the query function K; determining, according to the query result R and the sensitivity S(K), noise N to be added to the query result R, where N = {N_j} and the noise components N_j correspond one-to-one to the query result components R_j; and adding noise to each query result component R_j according to the corresponding noise component N_j to obtain a noisy query result R' = {R'_j}. The embodiments of the present invention provide a way of adding noise to big data, so that data query results with differential privacy can be output.

Description

A method and apparatus for processing big data

Technical Field

The present invention relates to the field of data processing, and more particularly, to a method and an apparatus for processing big data.

Background

The purpose of privacy-preserving data mining is to protect personal private data while still promoting data sharing among users. Differential privacy is a rigorous theoretical model for describing and analyzing data publishing methods. Its aim is to maximize the accuracy of statistical queries against a statistical database while minimizing the chance of identifying individual records.

Currently feasible data processing with differential privacy can only be applied to small-scale data. For big data, however, each component of the query result vector has an independent coordinate, and each such coordinate is a random variable distributed on an exponential scale. Consequently, there is as yet no effective way to implement differential privacy on large-scale data.

Summary of the Invention

Embodiments of the present invention provide a method and an apparatus for processing big data that enable queries with differential privacy on large-scale data.

In a first aspect, an embodiment of the present invention provides a method for processing big data, including: receiving a query instruction sent by a client, and determining a query function K according to the query instruction; querying a large data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; obtaining a sensitivity S(K) of the query function K, where S(K) characterizes the sensitivity of the query function K; determining, according to the query result R and the sensitivity S(K), noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j; and adding noise to each query result component R_j according to the corresponding noise component N_j to obtain a noisy query result R' = {R'_j}.

With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the sensitivity S(K) of the query function K includes: obtaining a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and taking the minimum value of the difference between K(D1) and K(D2) in a metric space as the value of the sensitivity S(K), where D1 and D2 are two different subsets of the large data set D and differ in at most one record.

With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, determining the noise N according to the query result R and the sensitivity S(K) includes: generating, according to the query result R, noise N' that follows a Laplace distribution, where the noise components of N' are mutually independent; and calibrating N' according to the sensitivity S(K) to obtain the noise N, where each noise component N_j follows a Laplace distribution determined by the sensitivity S(K).

With reference to the first aspect or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, the query function K is a hash function F, and the method further includes: training the hash function F on a training set of the large data set D, where the training set is a subset of the large data set D and includes an attribute set X and classification labels Y; the attribute set X is the set of data in the training set that represents element attributes, and the classification labels Y are the set of data in the training set that represents element classification results.

In a second aspect, an embodiment of the present invention provides an apparatus for processing big data, including: a receiving module, configured to receive a query instruction sent by a client and determine a query function K according to the query instruction; a first determination module, configured to query a large data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; an acquisition module, configured to obtain a sensitivity S(K) of the query function K determined by the first determination module, where S(K) characterizes the sensitivity of the query function K; a second determination module, configured to determine, according to the query result R and the sensitivity S(K) obtained by the acquisition module, noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j; and a noise-adding module, configured to add noise to each query result component R_j according to the noise component N_j determined by the second determination module, to obtain a noisy query result R' = {R'_j}.

With reference to the second aspect, in a first possible implementation of the second aspect, the acquisition module is specifically configured to: obtain a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and set the minimum value of the norm of the difference between K(D1) and K(D2) as the value of the sensitivity S(K), where D1 and D2 are two different subsets of the large data set D and differ in at most one record.

With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the second determination module is specifically configured to: generate, according to the query result R, noise N' that follows a Laplace distribution, where the noise components of N' are mutually independent; and calibrate N' according to the sensitivity S(K) to obtain the noise N, where each noise component N_j of the noise N follows a Laplace distribution determined by the sensitivity S(K).

With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the query function K is a hash function F, and the first determination module is further configured to: train the hash function F on a training set of the large data set D, where the training set is a subset of the large data set D and includes an attribute set X and classification labels Y; the attribute set X is the set of data in the training set that represents element attributes, and the classification labels Y are the set of data in the training set that represents element classification results.

By determining the sensitivity of the query function for a large data set, determining from that sensitivity the noise to be added to the query result, and adding that noise to the query result, the embodiments of the present invention apply differentially private noise to the query result of the original large data set and thereby obtain a query result with differential privacy. The embodiments can therefore add noise to big data at scale, avoid leakage of sensitive data as far as possible, and achieve differentially private querying.

Brief Description of the Drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an example system scenario to which an embodiment of the present invention can be applied.

Fig. 2 is a flowchart of a method for processing big data according to an embodiment of the present invention.

Fig. 3 is a flowchart of a method for processing big data according to another embodiment of the present invention.

Fig. 4 is a flowchart of a method for processing big data according to another embodiment of the present invention.

Fig. 5 is a schematic block diagram of an apparatus for processing big data according to an embodiment of the present invention.

Fig. 6 is a schematic block diagram of an apparatus for processing big data according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Without differentially private noise processing, a client user submits an exact query request 1 directly to the database holding the original sensitive data, as shown by dashed arrow 1 in the figure, and the database returns an exact query result 2 to the client, as shown by dashed arrow 2. Individual private data in the original sensitive data can then easily be leaked at the client.

With a query mechanism added, the client user submits a statistical query request 3 to the database holding the original sensitive data through the query mechanism, as shown by solid arrow 3 in the figure. The database returns a statistical query result 4 through the query mechanism. Before it reaches the client, the result undergoes sensitivity-based noise processing, and the resulting noisy statistical query result 5 is returned to the client, as shown by solid arrow 5. The query result is thus privacy-protected, and leakage of individual private data is avoided as far as possible.

Fig. 2 is a flowchart of a method for processing big data according to an embodiment of the present invention. As shown in Fig. 2, the method includes:

Step 210: receive a query instruction sent by a client, and determine a query function K according to the query instruction.

Step 220: query a large data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.

Step 230: obtain a sensitivity S(K) of the query function K, where S(K) characterizes the sensitivity of the query function K.

Step 240: determine, according to the query result R and the sensitivity S(K), noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j.

Step 250: add noise to each query result component R_j according to the corresponding noise component N_j to obtain a noisy query result R' = {R'_j}.

By determining the sensitivity of the query function for a large data set and determining the noise from that sensitivity, the embodiments of the present invention apply differentially private noise to the query result of the original large data set and thereby obtain a query result with differential privacy. The embodiments can therefore add noise to big data at scale, avoid leakage of sensitive data as far as possible, and achieve differentially private querying.

Specifically, in step 210, after the query instruction sent by the client is received, a specific statistical query function K is selected according to the client's data and information query requirements for the relevant user data set. For example, the statistical query function may be a function such as sum or average, or a hash function trained on classification query results. The query result R is obtained by a statistical query on the large data set D according to the query function K.
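Steps 210 and 220 can be illustrated with a minimal Python sketch. The instruction format (a dictionary with `function` and `attribute` fields), the registry `QUERY_FUNCTIONS`, and the helper names are assumptions made for illustration only; the source does not specify how a query instruction is encoded.

```python
from statistics import fmean

# Hypothetical registry mapping a query name to a statistical query function K.
# Each function returns a list so that the result has the form R = {R_j}.
QUERY_FUNCTIONS = {
    "sum": lambda values: [sum(values)],
    "average": lambda values: [fmean(values)],
}


def determine_query_function(instruction):
    """Step 210: pick K from the (assumed) 'function' field of the instruction."""
    name = instruction["function"]
    if name not in QUERY_FUNCTIONS:
        raise ValueError(f"unsupported query function: {name}")
    return QUERY_FUNCTIONS[name]


def run_query(instruction, dataset):
    """Step 220: apply K to the requested attribute of every record in D."""
    k = determine_query_function(instruction)
    values = [record[instruction["attribute"]] for record in dataset]
    return k(values)  # R = {R_j}, here with m = 1
```

For example, `run_query({"function": "sum", "attribute": "age"}, records)` would return a one-component result vector holding the sum of the `age` field.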

Specifically, in step 230, the sensitivity S(K) of the query function K is obtained, where S(K) characterizes the sensitivity of the query function K. For any function F, the sensitivity S(F) is defined as the smallest value satisfying ||F(D1) - F(D2)||_M ≤ S(F), where the data sets D1 and D2 differ in at most one record and M denotes a metric space. It should be understood that D1 and D2 differing in one record means that, with the same number of data elements in D1 and D2, the value or value type of one element differs.

Optionally, as an embodiment of the present invention, in step 230, obtaining the sensitivity S(K) of the query function K includes: computing a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and taking the minimum value of the difference between K(D1) and K(D2) in a metric space as the value of the sensitivity S(K), where D1 and D2 differ in at most one record and are two different subsets of the large data set D. It should be noted that the difference between K(D1) and K(D2) in a metric space refers here to the absolute value of the difference between K(D1) and K(D2).

Specifically, obtaining the sensitivity S(K) of the query function K includes calculating S(K) according to the following formula: S(K) = min ||K(D1) - K(D2)||_M, where the data sets D1 and D2 differ in at most one record and M denotes a metric space.
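The following Python sketch estimates a sensitivity for a scalar query by enumerating neighbouring data sets. It reads the definition as the smallest bound S that holds for every neighbouring pair, which is the largest observed difference; this is the usual reading in the differential privacy literature and is an interpretation, not something the formula above states literally. The helper names and the finite `domain` of candidate record values are hypothetical, and the estimate is local to the one data set supplied.

```python
def neighbours(dataset, domain):
    """Yield every data set that differs from `dataset` in exactly one record value."""
    for i, original in enumerate(dataset):
        for value in domain:
            if value != original:
                yield dataset[:i] + [value] + dataset[i + 1:]


def empirical_sensitivity(query, dataset, domain):
    """Smallest S with |query(D1) - query(D2)| <= S over all neighbours D2 of `dataset`."""
    base = query(dataset)
    return max(
        (abs(base - query(d2)) for d2 in neighbours(dataset, domain)),
        default=0,
    )
```

With records restricted to the values 0 and 1, a sum query changes by at most 1 when one record changes, while an average over four records changes by at most 0.25.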

Optionally, as an embodiment of the present invention, determining the noise N according to the query result R and the sensitivity S(K) includes: generating, according to the query result, noise N' that follows a Laplace distribution, where the noise components of N' are mutually independent; and calibrating N' according to the sensitivity S(K) to obtain the noise N, where each noise component N_j follows a Laplace distribution determined by the sensitivity S(K).

Specifically, in step 240, a noise-adding mechanism is selected according to the Laplace differential privacy theorem, and noise N' = [N'_1, …, N'_j, …, N'_m] is generated, where the noise components of N' are mutually independent. This noise is calibrated according to the sensitivity S(K) of K to obtain the calibrated noise N = [N_1, …, N_j, …, N_m], where the noise components of the calibrated noise N are also mutually independent.

Specifically, adding noise to the query result components R_j according to the noise components N_j to obtain the differentially private query result components R'_j means adding the calibrated noise to the statistical query result and outputting the privacy-protected query result R' = [R'_1, …, R'_j, …, R'_m] = [R_1, …, R_j, …, R_m] + [N_1, …, N_j, …, N_m].
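A minimal Python sketch of steps 240 and 250 follows. It assumes that calibration means scaling independent unit Laplace samples by S(K)/ε, matching the Lap(S(F)/ε) form that the description uses for the hash-function embodiment; the function names and the `epsilon` parameter are introduced here for illustration. A unit Laplace variable is sampled as the difference of two independent Exp(1) variables, using only the standard library.

```python
import random


def laplace_unit(rng):
    """Sample N'_j ~ Laplace(0, 1) as the difference of two independent Exp(1) draws."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def add_calibrated_noise(result, sensitivity, epsilon, rng=None):
    """Return (R', N) where N_j = (S/epsilon) * N'_j and R'_j = R_j + N_j."""
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon                      # calibration factor (assumed form)
    unit_noise = [laplace_unit(rng) for _ in result]   # N': independent components
    noise = [scale * n for n in unit_noise]            # N: calibrated noise
    noisy = [r + n for r, n in zip(result, noise)]     # R' = R + N
    return noisy, noise
```

A smaller `epsilon` or a larger sensitivity widens the noise distribution, trading accuracy for privacy.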

Optionally, as an embodiment of the present invention, according to the client's query requirement for the large data set D, the query function K for the query result R is determined to be a hash function F, and the query result components of R are R_j, where 1 ≤ j ≤ m and m is a positive integer greater than or equal to 1.

Optionally, as an embodiment of the present invention, determining, according to the client's query requirement for the large data set D, that the query function K for the query result R is a hash function F includes: training the hash function F on a training set of the large data set D, and generating a first hash classification table T of the hash function F, where the training set includes an attribute set X and classification labels Y; the attribute set X is the set of data in the training set that represents element attributes, and the classification labels Y are the set of data in the training set that represents element classification results.

Optionally, as an embodiment of the present invention, adding noise to the query result components R_j according to the noise components N_j to obtain the differentially private query result R' includes: adding noise to the first hash classification table T of the hash function F(x) according to the noise components N_j, to obtain a second hash classification table T' with differential privacy as the query result R'.
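As a rough illustration of turning T into T', the Python sketch below adds independent Laplace noise to every count in a table. The table layout (a dictionary from key to a dictionary of per-label counts), the function name `noisy_table`, and the `epsilon` parameter are assumptions; the source does not fix a data structure. The sketch only perturbs entries that already exist in the table, so it is a simplification rather than a complete privacy mechanism.

```python
import random


def noisy_table(table, sensitivity, epsilon, rng=None):
    """Return T': every count in T plus independent Laplace(sensitivity/epsilon) noise."""
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return {
        key: {
            # Difference of two Exp(1) draws is a unit Laplace sample.
            label: count + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
            for label, count in bucket.items()
        }
        for key, bucket in table.items()
    }
```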

Fig. 3 is a flowchart of a method for processing big data according to an embodiment of the present invention. As shown in Fig. 3, the method includes:

Step 310: receive a query instruction sent by a client, and determine a query function F according to the query instruction, where the query function F is a hash function.

Step 320: query a large data set D according to the query function F to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.

Step 330: obtain a sensitivity S(F) of the query function F, where S(F) characterizes the sensitivity of the query function F.

Step 340: determine, according to the query result R and the sensitivity S(F), noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j.

Step 350: add noise to each query result component R_j according to the corresponding noise component N_j to obtain a noisy query result R' = {R'_j}.

By determining the sensitivity of the query function for a large data set, determining from that sensitivity the noise to be added to the query result, and adding that noise to the query result, the embodiments of the present invention apply differentially private noise to the query result of the original large data set and thereby obtain a query result with differential privacy. The embodiments can therefore add noise to big data at scale, avoid leakage of sensitive data as far as possible, and achieve differentially private querying.

It should be understood that, in step 310, the query requirement on the large data set D may be, for example, a statistical query function obtained by summing a certain attribute over a subset of D, where 1 ≤ j ≤ m indexes the query result components and m is a positive integer greater than or equal to 1. According to the query requirement on the large data set D, the hash function F can be trained by constructing Differentially Private Random Decision Hashing (DPRDH).

Optionally, as an embodiment of the present invention, obtaining the hash function F of the large data set D includes: constructing random decision hashing with differential privacy on a training set of the large data set D, so as to train the hash function F and generate a first hash classification table T of the hash function F, where the training set includes an attribute set X and classification labels Y; the attribute set X is the set of data in the training set that represents element attributes, and the classification labels Y are the set of data in the training set that represents element classification results. It should be understood that, according to the query requirement on the large data set D, m hash classification tables can be obtained by training, and these m hash classification tables together constitute the first hash classification table T. In other words, the first hash classification table T is a class of hash classification tables, containing at least one hash classification table, obtained by training on the training set of the large data set D.

Specifically, random decision hashing with differential privacy is constructed as follows. Input: the training set {attribute set X, classification labels Y}; the first hash classification table contains m initial hash classification tables and L classes. The attribute set X may be numerical, categorical, or binary, and the classification labels Y are the label symbols obtained by classifying according to the attribute set X; the classification labels Y correspond to L classes, where L is a positive integer greater than or equal to 1. Output: m hash classification tables, whose collection T = [T_1, …, T_j, …, T_m] is the first hash classification table T, where any sub-table T_j = [bk_key1, bk_key2, …, bk_keyL].

Specifically, with the above input and output parameters, random decision hashing with differential privacy is constructed by the following procedure:

1. Randomly generate m mask vectors (maskvector), where any one of them is denoted maskvector_j.

2. Uniformly encode the attribute set X of the input training set into binary form, obtaining m binary encodings Xbinary, where any one of them is denoted Xbinary_j.

3. Train the random hash with differential privacy as follows:

for i = 1; i ≤ |X|; ++i do
    for j = 1; j ≤ m; ++j do
        compute the key: key = maskvector_j AND Xbinary_j;
        assign the key to the hash classification table: bk_key = T_j[key];
        update the count in the hash classification table: bk_key(Y) += 1;
    end
end

It should be understood that the outer loop in step 3 runs the inner loop once for every element of the attribute set X, so that the corresponding keys are distributed into the m hash classification tables. The inner loop uses the result of the logical AND of the mask vector and the binary encoding as the key and assigns each key to a hash classification table T_j; it iterates m times to obtain the m hash classification tables T = [T_1, …, T_j, …, T_m].
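The training loop of steps 1 to 3 can be sketched in Python as follows. This is an illustrative reading under stated assumptions: each record is already encoded as an integer bit pattern of width `n_bits`, each mask vector is a random integer of the same width, and each table maps a key to a bucket of per-label counts. The function name `train_hash_tables` and its parameters are hypothetical. The sketch produces the raw count tables only; no noise has been applied at this point.

```python
import random
from collections import defaultdict


def train_hash_tables(x_binary, labels, m, n_bits, rng=None):
    """Build m hash classification tables T_1..T_m of per-label counts.

    x_binary: list of ints, each the binary encoding of one record's attributes.
    labels:   list of class labels Y, aligned with x_binary.
    """
    if rng is None:
        rng = random.Random()
    # Step 1: randomly generate m mask vectors (here: random n_bits-wide integers).
    masks = [rng.getrandbits(n_bits) for _ in range(m)]
    # tables[j][key][label] holds a count; the bucket bk_key is tables[j][key].
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(m)]
    for x, y in zip(x_binary, labels):       # outer loop over the records in X
        for j, mask in enumerate(masks):     # inner loop over the m tables
            key = mask & x                   # key = maskvector_j AND Xbinary
            bucket = tables[j][key]          # bk_key = T_j[key]
            bucket[y] += 1                   # bk_key(Y) += 1
    return masks, tables
```

Because the same mask is applied to every record within one table, records with identical encodings always land in the same bucket of that table.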

4. After step 3, m hash classification tables T = [T_1, …, T_j, …, T_m] are obtained, where any sub-table T_j = [bk_key1, bk_key2, …, bk_keyL] corresponds to Table 1, and each column vector Y = [Y_1bi, …, Y_ibi, …, Y_Lbi] of the table is called a mask vector.

Table 1

Classification label (L classes)  Bucket 1  Bucket 2  ...  Bucket i  ...  Bucket n
Y_1                               Y_1b1     Y_1b2     ...  Y_1bi     ...  Y_1bn
Y_2                               Y_2b1     Y_2b2     ...  Y_2bi     ...  Y_2bn
...
Y_i                               Y_ib1     Y_ib2     ...  Y_ibi     ...  Y_ibn
...
Y_L                               Y_Lb1     Y_Lb2     ...  Y_Lbi     ...  Y_Lbn

可选地，作为本发明一个实施例，计算查询函数F(x)的敏感度S(F)包括：计算数据集D1的查询结果K(D1)与数据集D2的查询结果K(D2)；将查询结果K(D1)与查询结果K(D2)在一个度量空间内差值的最小值作为所述敏感度S(K)的值，其中数据集D1和数据集D2之间至多相差一个记录数据,数据集D1和数据集D2是所述大数据集D的两个不同子集。Optionally, as an embodiment of the present invention, calculating the sensitivity S(F) of the query function F(x) includes: calculating the query result K(D1) of the data set D1 and the query result K(D2) of the data set D2; The minimum value of the difference between the query result K(D1) and the query result K(D2) in a metric space is used as the value of the sensitivity S(K), where there is at most one record difference between the data set D1 and the data set D2 Data, dataset D1 and dataset D2 are two different subsets of the large dataset D.

Optionally, as an embodiment of the present invention, the sensitivity S(F) of the above hash function F(x) is calculated by the following formula: S(F) = min ||F(D₁) - F(D₂)||_M, where the data sets D1 and D2 differ by at most one record and M denotes a metric space.
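For a scalar query, the sensitivity formula can be evaluated by brute force as in the following sketch. It assumes that neighbouring data sets are formed by replacing a single record with another value from a known domain, and that the metric M is the absolute difference; the function name `sensitivity` and the parameters `domain` and `reduce` are hypothetical. The default `reduce=min` follows the formula as written in the text, while `reduce=max` gives the worst-case (global) sensitivity to which the standard Laplace mechanism is usually calibrated.

```python
def sensitivity(query, dataset, domain, reduce=min):
    """Brute-force S(F) over data sets differing from `dataset` in one record."""
    base = query(dataset)
    diffs = []
    for i, old in enumerate(dataset):
        for new in domain:
            if new == old:
                continue  # identical record: not a different data set
            neighbour = dataset[:i] + [new] + dataset[i + 1:]
            # Assumption: the metric M is the absolute difference.
            diffs.append(abs(base - query(neighbour)))
    return reduce(diffs)


records = [1, 2, 3]
s_min = sensitivity(sum, records, range(4))              # formula in the text
s_max = sensitivity(sum, records, range(4), reduce=max)  # worst case
```

With these inputs `s_min` is 1 and `s_max` is 3, which shows how strongly the choice between minimum and maximum affects the amount of noise that is later added.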

Optionally, determining the noise N according to the query result R and the sensitivity S(K) includes: generating, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and correcting the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component N_j follows the Laplace noise distribution for the sensitivity S(K), that is, N_j follows Lap(S(F)/ε), so that the noisy query result R' has ε-differential privacy.
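The two-stage noise generation described above can be illustrated with the following sketch, which uses only the Python standard library. It assumes that the uncorrected noise N' has unit scale and that the correction is a multiplication by S/ε; a Laplace sample is drawn as the difference of two independent exponential samples. The function names `laplace_noise` and `add_noise` are hypothetical.

```python
import random


def laplace_noise(m, sensitivity, epsilon, rng=random):
    """Return m independent samples distributed as Lap(sensitivity / epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # N': independent unit-scale Laplace samples (difference of two Exp(1)).
    unit = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(m)]
    # Correction: rescale every component to the scale S / epsilon.
    scale = sensitivity / epsilon
    return [scale * n for n in unit]


def add_noise(result, noise):
    """R'_j = R_j + N_j for every component j."""
    return [r + n for r, n in zip(result, noise)]


noisy = add_noise([120.0, 45.0, 9.0], laplace_noise(3, sensitivity=1.0, epsilon=0.5))
```

A smaller ε gives a larger scale S/ε and therefore stronger privacy at the cost of less accurate results.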

Optionally, as an embodiment of the present invention, determining, according to the client's query requirements for the large data set D, that the query function K(x) for the query result R is a hash function F(x) includes: training on a training set of the large data set D to obtain the hash function F(x), and generating a first hash classification table T of the hash function F(x), where the training set includes the attribute set X and the classification label Y of the large data set D.

Optionally, as an embodiment of the present invention, performing noise addition on the query result component R_j according to the noise N_j to obtain the query result R' with differential privacy includes:

performing noise addition on the first hash classification table T of the hash function F(x) according to the noise N_j to obtain a second hash classification table T' corresponding to the query result R' with differential privacy.
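As an illustration of how the first hash classification table T might be turned into the second table T', the sketch below adds independent Laplace noise to every per-label count of every bucket. The table layout (a dictionary from bucket key to a list of per-label counts) and the name `noisy_table` are assumptions made for this example only.

```python
import random


def noisy_table(table, sensitivity, epsilon, rng=random):
    """Return T': a copy of T with Lap(sensitivity / epsilon) noise on each count."""
    scale = sensitivity / epsilon

    def lap():
        # Laplace sample as the difference of two independent Exp(1) samples.
        return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

    # Build a new table so that the original counts in T stay untouched.
    return {key: [count + lap() for count in counts]
            for key, counts in table.items()}


first_table = {0b10: [3, 1], 0b01: [0, 4]}   # bucket key -> per-label counts
second_table = noisy_table(first_table, sensitivity=1.0, epsilon=0.5)
```

Only the noisy table T' would be exposed to queries; the exact counts in T remain internal.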

Optionally, as an embodiment of the present invention, the query result R' with differential privacy can be predicted and output by constructing a Differentially Private Random Decision Hashing Classifier (DPRDHC).

Specifically, the differentially private random hashing classifier is constructed to predict and output the query result R' through the following procedure:

1. Input the set of m second hash classification tables T' = [T'₁, …, T'_j, …, T'_m] and the identification column X' to be classified;

2. Initialize the classification label vectors (label vectors) and the classification statistics (label count and label average);

3. Encode the column X' to be classified;

4. Run the prediction process of the differentially private random hashing classifier as follows:

For j = 1; j ≤ m; ++j do

Adjust the key value in the hash classification table: label count += bk_key;

End

5. Compute the arithmetic mean of the classification labels over the m second hash classification tables: label avg = label count / m;

6. Take the maximum among the m labels as the classification label value: Y' = argmax(label avg);

7. Output the classification label value Y'.
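The prediction procedure in steps 1 to 7 can be sketched as follows. This is a simplified illustration, not a definitive implementation: it assumes that each noisy table T'_j is a dictionary from bucket key to a list of per-label noisy counts, that the key is obtained by AND-ing the encoded value with that table's mask, and that a missing bucket contributes zero. The names `predict`, `masks` and `code` are hypothetical.

```python
def predict(noisy_tables, masks, code, num_labels):
    """Predict the classification label of one encoded value X'."""
    label_count = [0.0] * num_labels               # step 2: initialise statistics
    for table, mask in zip(noisy_tables, masks):   # step 4: loop over the m tables
        bucket = table.get(code & mask, [0.0] * num_labels)
        for k in range(num_labels):
            label_count[k] += bucket[k]            # label count += bk_key
    m = len(noisy_tables)
    label_avg = [c / m for c in label_count]       # step 5: arithmetic mean
    # Step 6: index of the largest average is the predicted label.
    return max(range(num_labels), key=label_avg.__getitem__)


masks = [0b1100, 0b0011]
noisy_tables = [{0b1000: [0.2, 5.1]}, {0b0010: [1.0, 3.0]}]
label = predict(noisy_tables, masks, code=0b1010, num_labels=2)
```

In this example the averaged counts are about 0.6 for label 0 and 4.05 for label 1, so the predicted label is 1.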

The embodiments of the present invention are described in more detail below with reference to specific steps.

Fig. 4 is a flowchart of a method for processing big data according to another embodiment of the present invention. As shown in Fig. 4, the method 400 includes the following steps:

Step 401: obtain a statistical query function F.

Step 402: generate mutually independent noise N' = [N'₁, …, N'_j, …, N'_m].

Step 403: calculate the standard deviation D = [D₁, …, D_j, …, D_m] of the noise N'.

Step 404: calculate the sensitivity S(F) of the statistical query function.

Step 405: calibrate the standard deviation D of the noise N' to obtain the calibrated noise N = [N₁, …, N_j, …, N_m].

Step 406: obtain the statistical query result R = [R₁, …, R_j, …, R_m].

Step 407: add the calibrated noise N to the statistical query result and output the privacy-protected query result R' = [R'₁, …, R'_j, …, R'_m] = R + N = [R₁, …, R_j, …, R_m] + [N₁, …, N_j, …, N_m].

Optionally, in step 401, a specific statistical query function F is selected according to the client's requirements for querying aggregate information about the relevant user data set. The function F may be, for example, a sum or average function, or a hash function obtained by training on classification query results.

Optionally, in step 402, a suitable noise mechanism may be selected according to the query result of the statistical query function F to generate mutually independent noise N' = [N'₁, …, N'_j, …, N'_m]. Every component of the noise N' is independent of the others; for example, if N' follows a Laplace noise distribution, then every component of N' is independent and follows a Laplace noise distribution. It should be understood that selecting a suitable noise mechanism means selecting the noise-adding mechanism according to the Laplace differential privacy theorem.

Optionally, in step 403, the standard deviation of each independent component of the noise N' is calculated separately to obtain the standard deviation D = [D₁, …, D_j, …, D_m].

Optionally, in step 404, the query result F(D1) of a data set D1 and the query result F(D2) of a data set D2 are calculated, and the minimum value of the difference between the query result F(D1) and the query result F(D2) in a metric space is taken as the value of the sensitivity S(F), where the data set D1 and the data set D2 differ by at most one record. Specifically, calculating the sensitivity S(F) of the statistical query function F includes calculating it according to the following formula: S(F) = min ||F(D₁) - F(D₂)||_M, where the data set D1 and the data set D2 differ by at most one record, M denotes a metric space, and the data set D1 and the data set D2 are two different subsets of the large data set D.

Optionally, in step 405, the standard deviation D of the noise N' is calibrated to obtain the calibrated noise N = [N₁, …, N_j, …, N_m], such that every component N_j of the calibrated noise N follows Lap(S(F)/ε), so that a query result with ε-differential privacy can be output. The value of ε lies in the range [0, 1] and may be specified by the user.

Optionally, in step 406, the statistical query result R = [R₁, …, R_j, …, R_m] is obtained according to the statistical query function F. It should be understood that this step may also be performed before the noise N' is generated; the present invention is not limited in this respect.

Optionally, in step 407, the calibrated noise N is added to the statistical query result, and the privacy-protected query result R' = [R'₁, …, R'_j, …, R'_m] = R + N = [R₁, …, R_j, …, R_m] + [N₁, …, N_j, …, N_m] is output. Because every component of the noise N is obtained by calibration against the sensitivity S(F) of the statistical function and follows the Lap(S(F)/ε) distribution, the output query result R' has ε-differential privacy.
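Steps 401 to 407 can be put together as in the following sketch. It rests on several assumptions: the sensitivity from step 404 is passed in as a number, the uncalibrated noise components are Laplace samples with known scales `raw_scales`, and calibration means rescaling each component so that its standard deviation equals that of Lap(S/ε). The sketch uses the fact that a Laplace distribution with scale b has standard deviation √2·b. The names `private_query` and `raw_scales` are hypothetical.

```python
import math
import random


def private_query(query, dataset, sensitivity, epsilon, raw_scales=None, rng=random):
    """Return the noisy result R' = R + N for a vector-valued query."""
    result = query(dataset)                        # step 406: R
    m = len(result)
    raw_scales = raw_scales or [1.0] * m
    # Step 402: independent Laplace noise N' (difference of two Exp(1) samples).
    raw = [b * (rng.expovariate(1.0) - rng.expovariate(1.0)) for b in raw_scales]
    # Step 403: standard deviation of each component, sqrt(2) * scale.
    std = [math.sqrt(2) * b for b in raw_scales]
    # Step 405: rescale so each component has the standard deviation of Lap(S/eps).
    target_std = math.sqrt(2) * sensitivity / epsilon
    noise = [n * target_std / d for n, d in zip(raw, std)]
    # Step 407: add the calibrated noise component by component.
    return [r + n for r, n in zip(result, noise)]


def stats(records):
    """Example query F returning [sum, count]."""
    return [sum(records), len(records)]


noisy_result = private_query(stats, [1, 2, 3], sensitivity=1.0, epsilon=0.5)
```

Because the query result does not influence the noise in this sketch, step 406 could equally be executed after the noise has been generated, as the text notes.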

Figs. 1 to 4 describe the process of processing big data in detail from the perspective of the method. The apparatus for processing big data is described in detail below with reference to Figs. 5 and 6.

Fig. 5 is a schematic block diagram of an apparatus for processing big data according to an embodiment of the present invention. As shown in Fig. 5, the apparatus 500 includes a receiving module 510, a first determining module 520, an acquiring module 530, a second determining module 540, and a noise-adding module 550.

The receiving module 510 is configured to receive a query instruction sent by a client and to determine a query function K according to the query instruction.

The first determining module 520 is configured to query the large data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.

The acquiring module 530 is configured to acquire the sensitivity S(K) of the query function K determined by the first determining module, where the sensitivity S(K) characterizes how sensitive the query function K is.

The second determining module 540 is configured to determine, according to the query result R and the sensitivity S(K) obtained by the acquiring module, the noise N that needs to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j.

The noise-adding module 550 is configured to add noise to the query result component R_j according to the noise N_j determined by the second determining module, to obtain a noisy query result R' = {R'_j}.

Specifically, the receiving module 510 selects a specific statistical query function K by obtaining the client's requirements for querying aggregate information about the relevant user data set. The statistical query function may be, for example, a sum or average function, or a hash function obtained by training on classification query results. The query result R is obtained by a statistical query on the large data set D according to the query function K.

Optionally, as an embodiment of the present invention, the acquiring module 530 is specifically configured to: calculate the query result K(D1) of a data set D1 and the query result K(D2) of a data set D2; and set the minimum value of the difference between the query result K(D1) and the query result K(D2) as the value of the sensitivity S(K), where the data set D1 and the data set D2 differ by at most one record and are two different subsets of the large data set D. It should be understood that the data sets D1 and D2 differing by one record means that D1 and D2 contain the same number of data elements and that the value or value type of one element differs. It should also be noted that the difference between the query result K(D1) and the query result K(D2) refers to the absolute value of their difference.

Specifically, the acquiring module 530 is further configured to calculate the sensitivity S(K) according to the following formula: S(K) = min ||K(D₁) - K(D₂)||_M, where the data sets D1 and D2 differ by at most one record, the data set D1 and the data set D2 are two different subsets of the large data set D, and M denotes a metric space.

Optionally, as an embodiment of the present invention, the second determining module 540 is specifically configured to: generate, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and correct the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component N_j follows the Laplace noise distribution for the sensitivity S(K).

Optionally, as an embodiment of the present invention, the first determining module 520 is further configured to: determine, according to the client's query requirements for the large data set D, that the query function K(x) for the query result R is a hash function F(x), where the query result components of the query result R are R_j, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.

Optionally, as an embodiment of the present invention, the first determining module 520 is further configured to: train on a training set of the large data set D to obtain the hash function F(x), and generate a first hash classification table T of the hash function F(x). The training set is a subset of the large data set D and includes an attribute set X and a classification label Y; the attribute set X is the collection of data in the training set that characterizes element attributes, and the classification label Y is the collection of data in the training set that characterizes element classification results.

Optionally, as an embodiment of the present invention, the first determining module 520 is further configured to perform noise addition on the first hash classification table T of the hash function F(x) according to the noise N_j, to obtain a second hash classification table T' for which the query result R'_j has differential privacy.

Fig. 6 is a schematic block diagram of an apparatus for processing big data according to another embodiment of the present invention. It should be noted that the device shown in Fig. 6 corresponds to the embodiments of Figs. 2 to 4 and can implement each process of the method for processing big data in the embodiments of Figs. 1 to 4; detailed descriptions are omitted where appropriate to avoid repetition. The apparatus for processing big data shown in Fig. 6 includes a processor 610, a memory 620, and a bus 630. The processor 610 and the memory 620 are connected through the bus 630, the memory 620 is configured to store instructions, and the processor 610 is configured to execute the instructions stored in the memory 620.

Specifically, the processor 610 is configured to: receive a query instruction sent by a client, and determine a query function K according to the query instruction; query the large data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; acquire the sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes how sensitive the query function K is; determine, according to the query result R and the sensitivity S(K), the noise N that needs to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j; and perform noise addition on the query result component R_j according to the noise component N_j to obtain a noisy query result R' = {R'_j}.

Optionally, as an embodiment of the present invention, the processor 610 is configured to: obtain the query result K(D1) of a data set D1 and the query result K(D2) of a data set D2; and take the minimum value of the difference between the query result K(D1) and the query result K(D2) in a metric space as the value of the sensitivity S(K), where the data set D1 and the data set D2 differ by at most one record and are two different subsets of the large data set D.

Specifically, the processor 610 is configured to obtain the sensitivity S(K) according to the following formula: S(K) = min ||K(D₁) - K(D₂)||_M, where the data set D1 and the data set D2 differ by at most one record, the data set D1 and the data set D2 are two different subsets of the large data set D, and M denotes a metric space.

Optionally, as an embodiment of the present invention, the processor 610 is configured to: generate, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and correct the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component N_j follows the Laplace noise distribution for the sensitivity S(K).

A person of ordinary skill in the art will appreciate that the method steps and units described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the steps and components of each embodiment have been described above in general terms according to their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A person of ordinary skill in the art may implement the described functions in different ways for each specific application, but such implementations should not be regarded as going beyond the scope of the present invention.

The methods or steps described in connection with the embodiments disclosed herein may be implemented in hardware, in a software program executed by a processor, or in a combination of the two. The software program may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Although the present invention has been described in detail with reference to the accompanying drawings and in connection with preferred embodiments, the present invention is not limited thereto. A person of ordinary skill in the art may make various equivalent modifications or replacements to the embodiments of the present invention without departing from its spirit and essence, and all such modifications or replacements shall fall within the scope of the present invention.

Claims

1. A method for processing big data, comprising:

Receiving a query instruction sent by a client, and determining a query function K according to the query instruction;

Querying the large data set D according to the query function K to obtain a query result R, the query result R = {R_j}, wherein 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1;

Acquiring the sensitivity S(K) of the query function K, the sensitivity S(K) representing the sensitivity of the query function K;

Determining the noise N that needs to be added to the query result R according to the query result R and the sensitivity S(K), the noise N = {N_j}, the noise components N_j of the noise N being in one-to-one correspondence with the query result components R_j;

Performing noise-adding processing on the query result component R_j according to the noise component N_j to obtain a noisy query result R' = {R'_j}.

2. The method according to claim 1, wherein acquiring the sensitivity S(K) of the query function K comprises:

Obtain the query result K(D1) of dataset D1 and the query result K(D2) of dataset D2;

Taking the minimum value of the difference between the query result K(D1) and the query result K(D2) in a metric space as the value of the sensitivity S(K), wherein the data set D1 and the data set D2 are two different subsets of the large data set D, and the data set D1 and the data set D2 differ by at most one record.

3. The method according to claim 1 or 2, wherein determining the noise N according to the query result R and the sensitivity S(K) comprises:

Generate noise N' satisfying Laplace noise distribution according to the query result R, wherein each noise component in the noise N' is independent of each other;

The noise N is obtained after correcting the noise N' according to the sensitivity S(K), wherein the noise component N_j of the noise N satisfies the Laplace noise distribution of the sensitivity S(K).

4. The method according to any one of claims 1 to 3, wherein the query function K is a hash function F, and the method comprises:

According to the training set of the large data set D, the hash function F is obtained through training;

Wherein the training set is a subset of the large data set D and includes an attribute set X and a classification label Y, the attribute set X is a collection of data representing element attributes in the training set, and the classification label Y is a collection of data representing element classification results in the training set.

5. A device for processing big data, comprising:

A receiving module, configured to receive a query instruction sent by a client and determine a query function K according to the query instruction;

A first determining module, configured to query the large data set D according to the query function K to obtain a query result R, the query result R = {R_j}, wherein 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1;

An acquiring module, configured to acquire the sensitivity S(K) of the query function K determined by the first determining module, the sensitivity S(K) characterizing the sensitivity of the query function K;

A second determining module, configured to determine the noise N that needs to be added to the query result R according to the query result R and the sensitivity S(K) obtained by the acquiring module, the noise N = {N_j}, the noise components N_j of the noise N being in one-to-one correspondence with the query result components R_j;

A noise adding module, configured to add noise to the query result component R_j according to the noise N_j determined by the second determining module, to obtain a noisy query result R' = {R'_j}.

6. The device according to claim 5, wherein the acquiring module is specifically used for:

Setting the minimum value of the norm of the difference between the query result K(D1) and the query result K(D2) as the value of the sensitivity S(K), wherein the data set D1 and the data set D2 are two different subsets of the large data set D, and the data set D1 and the data set D2 differ by at most one record.

7. The device according to claim 5 or 6, wherein the second determining module is specifically configured to:

8. The device according to any one of claims 5 to 7, wherein the query function K is a hash function F, and the first determining module is further configured to:

Wherein the training set is a subset of the large data set D and includes an attribute set X and a classification label Y, the attribute set X is a collection of data representing element attributes in the training set, and the classification label Y is a collection of data representing element classification results in the training set.