CN108280366B

CN108280366B - Batch linear query method based on differential privacy

Info

Publication number: CN108280366B
Application number: CN201810042656.8A
Authority: CN
Inventors: 王迪; 袁健; 申泽宇
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2018-01-17
Filing date: 2018-01-17
Publication date: 2021-10-01
Anticipated expiration: 2038-01-17
Also published as: CN108280366A

Abstract

A batch linear query method based on differential privacy comprises the following steps. Step 1: query an original data set R to obtain a data query result set M. Step 2: sort the attributes of R in descending order of frequency, screen out the attributes whose frequency is not greater than the minimum support, and discard those attributes together with their corresponding data; perform data independence processing on the attributes whose frequency is greater than the minimum support to obtain an independent data set D whose attribute frequencies are greater than the minimum support. Step 3: build an initial load matrix from M, build a data-independent load matrix W on that basis, and decompose W in parallel using a parallel gradient descent matrix decomposition technique to obtain the complete first matrix B and second matrix L of the decomposition of W. Step 4: perform adaptive noise addition based on differential privacy by adding Laplace noise to L and D, then restore the discarded attributes and data to obtain a noisy query result data set S. Step 5: return S to the user.

Description

Batch linear query method based on differential privacy

Technical Field

The invention relates to the technical field of computers, in particular to a batch linear query method based on differential privacy.

Background

With the development of the Internet, society has entered the era of big data. Batch linear queries are among the most common operations when processing big data, but the queries are very large in scale, the query process is complicated, and performance is low. In addition, sensitive information is easily leaked when big data is used, and query accuracy (data availability) and the degree of privacy protection cannot be guaranteed at the same time.

Prior-art algorithms cannot simultaneously guarantee algorithm performance, query accuracy and the degree of privacy protection for batch linear queries. In terms of performance, existing algorithms have high complexity and are unsuitable for large-scale batch linear queries. In terms of query accuracy, existing algorithms add noise to the query results and try to reduce the amount of noise required, thereby improving accuracy; however, when the query sequence is given arbitrarily by the user, the computation these mechanisms need to find the optimal noise distribution is very expensive, grows exponentially with the data dimension, and cannot be applied to large data sets. In terms of privacy protection, existing algorithms do not relate the amount of added noise to user authority, so they cannot ensure that an appropriate amount of noise is added for users with different authority: for a high-authority user, too much noise causes heavy interference and lowers query accuracy; for a low-authority user, too little noise gives insufficient privacy protection.

Disclosure of Invention

The present invention is made to solve the above problems, and an object of the present invention is to provide a batch linear query method based on differential privacy.

The invention provides a batch linear query method based on differential privacy, characterized by comprising the following steps:

Step 1: querying an original data set R to obtain a data query result set M;

Step 2: arranging the attributes of the original data set R in descending order of frequency, setting a minimum support, screening out the attributes whose frequency is not greater than the minimum support, and discarding those attributes and their corresponding data; for the attributes whose frequency is greater than the minimum support, obtaining the associated attributes of the data with an FP-tree and then performing data independence processing, to obtain an independent data set D whose attribute frequencies are greater than the minimum support;

Step 3: building an initial load matrix from the data query result set M, building a data-independent load matrix W on the basis of the initial load matrix using the attribute correlations from step 2, and decomposing W in parallel using a parallel gradient descent matrix decomposition technique to obtain the complete first matrix B and second matrix L of the decomposition of W;

Step 4: performing adaptive noise addition based on differential privacy by adding Laplace noise to the second matrix L and to the independent data set D, and restoring the attributes whose frequency is not greater than the minimum support, together with their corresponding data, that were discarded in step 2, to obtain a noisy query result data set S;

Step 5: returning the noisy query result data set S to the user.

The batch linear query method based on differential privacy provided by the invention may also have the following feature: the data independence processing based on relevance analysis in step 2 comprises the following steps:

Step 2-1: scanning the original data set R to obtain the frequency of each attribute, and sorting the attributes in descending order of frequency to obtain a descending attribute frequency list;

Step 2-2: setting a minimum support and, according to the descending attribute frequency list, removing the attributes whose frequency is not greater than the minimum support together with their corresponding data;

Step 2-3: storing the remaining data set R', that is, R with the attributes whose frequency is not greater than the minimum support and their corresponding data removed, in a prefix tree to form an FP-tree, and creating a linked list for nodes that appear for the first time;

Step 2-4: processing the FP-tree with the FP-growth algorithm to mine association patterns;

Step 2-5: determining whether the leaf node lies on a single path; if so, removing the leaf node, generating the set of prefix paths, and proceeding to step 2-6; if not, generating the set of prefix paths of each path to form a new FP-tree, and returning to step 2-4;

Step 2-6: taking the set of prefix paths generated in step 2-5 and defining it as the associated attributes of the data;

Step 2-7: performing data independence processing, removing redundant data by using the associations between attributes.

The batch linear query method based on differential privacy provided by the invention may also have the following feature: the parallel gradient descent matrix decomposition in step 3 comprises the following steps:

Step 3-1: generating an initial load matrix according to the user's query requirements, based on the data query result set M obtained in step 1;

Step 3-2: converting the initial load matrix into a data-independent load matrix W according to the associated attributes obtained by the relevance-analysis-based data independence processing in step 2;

Step 3-3: running the Map process: decomposing the data-independent load matrix W as W = BL, where B is the first matrix of the decomposition result and L is the second matrix; B has m rows and n columns and L has n rows and r columns, where m is the number of query records, r is the maximum query attribute scale, and n is the number of nodes. The gradients of the first matrix B and the second matrix L of the decomposition result are calculated by the following formula:

$$B = (\beta W L^{T} + \pi L^{T})(\beta L L^{T} + I)^{-1} \qquad (1)$$

In formulas (1) and (2), T is the transpose symbol; β is a positive penalty factor that needs to be initialized; I is the identity matrix; π is the Lagrange multiplier.

The matrix decomposition algorithm is executed in parallel: B is split into n matrices B_1, B_2, ..., B_i, ..., B_n, where B_i is the part of B held at the i-th node; L is split by rows into n matrices L_1, L_2, ..., L_i, ..., L_n, where L_i is the part of L held at the i-th node. In other words, the data-independent load matrix W is divided into n parts, each consisting of one B_i matrix and one L_i matrix; each part has m/n rows, where m is the number of rows of W and n is the number of nodes in the distributed system, and the part of W at the i-th node is W_i = B_i L_i.

The Map process of distributed computing is then applied: the decomposed data set is read and each row is traversed, recording its row number a; the output key is the group number, a/n rounded to an integer, and the value is the corresponding m/n rows of data.

The Combiner process follows: the data in each group are aggregated into the data to be processed, and the resulting parts are distributed to the n nodes.

Step 3-4: running the Reduce process: calculating, on each node, the positive penalty factor β and the matrix norm τ of the difference between the data-independent load matrix W and the product of its decomposition matrices, where τ = ‖W − B_i L_i‖; updating β and τ, and stopping the iteration when β > 1000 and τ < 0.001. The Reduce process of distributed computing is then applied: B and L are distributed to the nodes, each node computes its B_i and L_i, and these are written, together with the group number a/n, to the Reduce process of the cloud computing platform for integration; the B_i and L_i with the same group number are spliced by row number a, yielding the complete L and B.

The batch linear query method based on differential privacy provided by the invention may also have the following feature: the adaptive noise addition in step 4 comprises the following steps:

Step 4-1: calculating the upper bound of the privacy budget ε by formula (3), and selecting ε according to the authority of the user. In formula (3), ε is the privacy budget; L is the second matrix of the decomposition result of the load matrix; ρ is a correlation coefficient in the range [−1, 1]; Δq is the sensitivity;

Step 4-2: adding Laplace noise satisfying ε to L and D using the Laplace noise mechanism;

Step 4-3: restoring the attributes whose frequency is not greater than the minimum support, together with their corresponding data, that were discarded in step 2;

Step 4-4: obtaining the noisy query result data set S.

The batch linear query method based on the differential privacy provided by the invention can also have the following characteristics: the higher the authority of the user is, the closer the selected epsilon value is to the upper bound, the smaller the privacy protection degree is, and the higher the query precision is; the lower the authority of the user, the smaller the selected epsilon value, the greater the privacy protection degree and the lower the query precision.

Action and Effect of the invention

In view of the characteristics of batch linear queries, the method performs data independence processing based on relevance analysis to reduce redundant information, and improves query performance by using a parallel gradient descent matrix decomposition algorithm. In addition, building on a differential privacy protection algorithm, the method designs an adaptive noise-addition algorithm that takes user authority into account and generates a reasonable amount of noise, thereby achieving privacy protection. The batch linear query method based on differential privacy is therefore both an efficient linear query algorithm and a privacy protection algorithm that balances query accuracy against the degree of privacy protection.

Drawings

FIG. 1 is an overall flow diagram of a batch linear query method based on differential privacy in an embodiment of the invention;

FIG. 2 is a flow diagram of data independence processing based on relevance analysis in an embodiment of the invention;

FIG. 3 is a flow chart of a parallel gradient descent matrix decomposition in an embodiment of the invention; and

FIG. 4 is a flow chart of adaptive noise addition in an embodiment of the present invention.

Detailed Description

In order to make the technical means, the creation features, the achievement purposes and the effects of the present invention easy to understand, the following embodiments specifically describe the batch linear query method based on differential privacy in conjunction with the accompanying drawings.

FIG. 1 is an overall flowchart of a batch linear query method based on differential privacy in an embodiment of the present invention.

As shown in fig. 1, the batch linear query method based on differential privacy of the present invention includes the following steps:

Step 1: query an original data set R to obtain a data query result set M. The original data set R contains attributes and data, and the attributes may be duplicated; the data query result set M likewise contains attributes and data, with duplicated attributes.

Step 2: set a minimum support, screen out the attributes whose frequency is not greater than the minimum support, and discard those attributes and their corresponding data. For the attributes whose frequency is greater than the minimum support, use an FP-tree to obtain the associated attributes of the data, then perform data independence processing: remove the redundant attributes and add the data of each redundant attribute to the data of its associated attribute, obtaining an independent data set D whose attribute frequencies are greater than the minimum support. The discarded attributes and their data take no further part in step 2 and are restored in a later step.

FIG. 2 is a flow diagram of data independence processing based on relevance analysis in an embodiment of the invention.

The data independence processing based on the relevance analysis in the step 2 comprises the following steps:

Step 2-1: scan the original data set R to obtain the frequency of each attribute, and sort the attributes in descending order of frequency to obtain a descending attribute frequency list.

Step 2-2: set the minimum support and, according to the descending attribute frequency list, remove the attributes whose frequency is not greater than the minimum support together with their data.

Step 2-3: store the remaining data set R', that is, R without the attributes whose frequency is not greater than the minimum support, in a prefix tree data structure to form an FP-tree, and create a linked list for nodes that appear for the first time.

Step 2-4: process the FP-tree with the FP-growth algorithm to mine association patterns.

Step 2-5: determine whether the leaf node lies on a single path. If so, remove the leaf node, generate the set of prefix paths, and proceed to step 2-6; if not, generate the set of prefix paths of each path to form a new FP-tree, and return to step 2-4.

Step 2-6: take the set of prefix paths generated in step 2-5 and define it as the associated attributes of the data.

Step 2-7: perform data independence processing, removing redundant data by using the associations between attributes.
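Steps 2-1 to 2-3, and the prefix paths used in step 2-6, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the record layout (a list of attribute-name lists), the class `FPNode` and the functions `build_fp_tree` and `prefix_paths` are hypothetical names introduced here, and ties in frequency are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree (FP-tree)."""

    def __init__(self, attr, parent=None):
        self.attr = attr        # attribute name; None for the root
        self.count = 0          # number of records passing through this node
        self.parent = parent
        self.children = {}      # attribute name -> child FPNode
        self.next = None        # next node carrying the same attribute


def build_fp_tree(records, min_support):
    # Step 2-1: attribute frequencies, sorted in descending order.
    freq = Counter(attr for rec in records for attr in set(rec))
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # Step 2-2: keep only attributes whose frequency exceeds min_support.
    rank = {attr: i for i, (attr, c) in enumerate(ordered) if c > min_support}
    discarded = [attr for attr, c in ordered if c <= min_support]

    # Step 2-3: insert each filtered record into the prefix tree.
    root, header, tail = FPNode(None), {}, {}
    for rec in records:
        node = root
        for attr in sorted((a for a in set(rec) if a in rank), key=rank.get):
            child = node.children.get(attr)
            if child is None:
                child = FPNode(attr, node)
                node.children[attr] = child
                # Link nodes with the same attribute; header keeps the first one.
                if attr in header:
                    tail[attr].next = child
                else:
                    header[attr] = child
                tail[attr] = child
            child.count += 1
            node = child
    return root, header, discarded


def prefix_paths(header, attr):
    """Collect (path, count) pairs leading to every node that carries attr."""
    paths = []
    node = header.get(attr)
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.attr is not None:
            path.append(parent.attr)
            parent = parent.parent
        if path:
            paths.append((path[::-1], node.count))
        node = node.next
    return paths


records = [["age", "zip"], ["age", "job"], ["age", "zip"]]
root, header, discarded = build_fp_tree(records, min_support=1)
print(discarded)                    # attributes set aside for later restoration
print(prefix_paths(header, "zip"))  # prefix paths ending at "zip"
```

In this toy data set "job" occurs only once, so it is discarded, while "zip" always appears below "age" in the tree; such a prefix path is what step 2-6 treats as an associated attribute.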

Step 3: build an initial load matrix from the query result set M, and build a data-independent load matrix W on the basis of the initial load matrix using the attribute correlations obtained in step 2. Decompose W in parallel using the parallel gradient descent matrix decomposition technique to obtain the complete first matrix B and second matrix L of the decomposition of W.
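To make the idea of factoring W into B and L concrete, the following sketch runs plain, single-machine gradient descent on the squared Frobenius error of W − BL. It is a swapped-in simplification: it omits the penalty factor β, the Lagrange multiplier π and the parallel Map/Reduce stages described in this embodiment, and the names `factorize`, `step_size` and `steps` are assumptions introduced here.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_residual(W, B, L):
    """Frobenius norm of W - BL."""
    P = matmul(B, L)
    return math.sqrt(sum((w - p) ** 2 for wr, pr in zip(W, P) for w, p in zip(wr, pr)))


def factorize(W, B, L, step_size=0.05, steps=200):
    """Gradient descent on 0.5 * ||W - BL||_F^2, updating B and L together."""
    for _ in range(steps):
        P = matmul(B, L)
        R = [[w - p for w, p in zip(wr, pr)] for wr, pr in zip(W, P)]  # residual
        desc_B = matmul(R, transpose(L))   # negative gradient with respect to B
        desc_L = matmul(transpose(B), R)   # negative gradient with respect to L
        B = [[b + step_size * g for b, g in zip(br, gr)] for br, gr in zip(B, desc_B)]
        L = [[l + step_size * g for l, g in zip(lrow, gr)] for lrow, gr in zip(L, desc_L)]
    return B, L


W = [[1.0, 0.5], [0.5, 1.0]]
B0 = [[0.5, 0.1], [0.1, 0.5]]
L0 = [[0.5, 0.1], [0.1, 0.5]]
before = frobenius_residual(W, B0, L0)
B, L = factorize(W, B0, L0)
after = frobenius_residual(W, B, L)
print(before, after)   # the residual shrinks as BL approaches W
```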

FIG. 3 is a flow diagram of a parallel gradient descent matrix decomposition technique in an embodiment of the invention.

The parallel gradient descent matrix decomposition technology in the step 3 comprises the following steps:

Step 3-1: generate an initial load matrix according to the user's query requirements, based on the query result set M obtained in step 1. The initial load matrix contains only attributes, and the attributes may be duplicated.

Step 3-2: using the associated attributes obtained by the relevance-analysis-based data independence processing, transform the initial load matrix to eliminate the correlated data and generate the data-independent load matrix W. W is a matrix that contains only attributes, without duplicates.

Step 3-3: running the Map process:

Decompose the data-independent load matrix W as W = BL, where B is the first matrix of the decomposition result and L is the second matrix. B has m rows and n columns and L has n rows and r columns, where m is the number of query records, r is the maximum query attribute scale, and n is the number of nodes.

The gradients of the first matrix B and the second matrix L of the decomposition result are calculated by the following formula:

$$B = (\beta W L^{T} + \pi L^{T})(\beta L L^{T} + I)^{-1} \qquad (1)$$

In formulas (1) and (2), G represents a function of L; T is the transpose symbol; β is a positive penalty factor that needs to be initialized; I is the identity matrix; π is the Lagrange multiplier.
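As a consistency check on formula (1), the shapes of its factors can be worked out from the dimensions given in step 3-3, where W = BL implies that W has m rows and r columns. The shape of π is not stated in the text; the sketch below assumes it matches W, which is what the product πL^T requires.

```latex
\begin{aligned}
W &\in \mathbb{R}^{m \times r}, \qquad L \in \mathbb{R}^{n \times r}, \qquad
  \pi \in \mathbb{R}^{m \times r} \ \text{(assumed)} \\
\beta W L^{T} + \pi L^{T} &\in \mathbb{R}^{m \times n} \\
\beta L L^{T} + I &\in \mathbb{R}^{n \times n} \\
B = (\beta W L^{T} + \pi L^{T})(\beta L L^{T} + I)^{-1} &\in \mathbb{R}^{m \times n}
\end{aligned}
```

Because β is positive and LL^T is positive semidefinite, βLL^T + I is symmetric positive definite, so the inverse in formula (1) always exists, and the resulting B has the m rows and n columns stated above.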

The matrix decomposition algorithm is executed in parallel: B is split into n matrices B_1, B_2, ..., B_i, ..., B_n, where B_i is the part of B held at the i-th node; L is split by rows into n matrices L_1, L_2, ..., L_i, ..., L_n, where L_i is the part of L held at the i-th node. In other words, the data-independent load matrix W is divided into n parts, each consisting of one B_i matrix and one L_i matrix; each part has m/n rows, where m is the number of rows of W and n is the number of nodes in the distributed system, and the part of W at the i-th node is W_i = B_i L_i.

The Map process of distributed computing is then applied: the decomposed data set is read and each row is traversed, recording its row number a; the output key is the group number, a/n rounded to an integer, and the value is the corresponding m/n rows of data.

The Combiner process follows: the data in each group are aggregated into the data to be processed, and the resulting parts are distributed to the n nodes.
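The Map and Combiner stages can be imitated on a single machine as below. The text gives the key as a/n rounded to an integer; this sketch instead divides the row number by the number of rows per group, so that exactly n groups of about m/n rows result, which is an assumption about the intent. The function names `map_rows` and `combine` are hypothetical, and no real MapReduce framework is used.

```python
from collections import defaultdict


def map_rows(W, n_nodes):
    """Map: emit (group number, (row number a, row)) for every row of W."""
    rows_per_group = -(-len(W) // n_nodes)   # ceil(m / n)
    for a, row in enumerate(W):
        yield a // rows_per_group, (a, row)


def combine(pairs):
    """Combiner: gather the rows of each group before they are sent to a node."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


W = [[1, 0], [0, 1], [1, 1], [2, 1]]
groups = combine(map_rows(W, n_nodes=2))
print(groups)   # two groups, each holding two (row number, row) pairs
```

Keeping the row number a alongside each row is what later allows the per-node results to be spliced back in their original order.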

Step 3-4: run Reduce procedure:

On each node, calculate the positive penalty factor β and the matrix norm τ of the difference between the data-independent load matrix W and the product of its decomposition matrices, where τ = ‖W − B_i L_i‖; update β and τ, and stop the iteration when β > 1000 and τ < 0.001.

The Reduce process of distributed computing is then applied: B and L are distributed to the nodes, each node computes its B_i and L_i, and these are written, together with the group number a/n, to the Reduce process of the cloud computing platform for integration.

The B_i and L_i with the same group number are spliced by row number a, yielding the complete L and B.
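The stopping rule and the final splicing can be sketched as follows. The text does not say which matrix norm τ uses, so the Frobenius norm is assumed here; each node is also assumed to hold a block of rows of B together with the full L, and the names `block_residual`, `should_stop` and `splice` are hypothetical.

```python
import math


def block_residual(W_i, B_i, L):
    """Frobenius norm of W_i - B_i L for the rows held by one node."""
    product = [[sum(b * l for b, l in zip(row, col)) for col in zip(*L)] for row in B_i]
    return math.sqrt(
        sum((w - p) ** 2 for wr, pr in zip(W_i, product) for w, p in zip(wr, pr))
    )


def should_stop(beta, tau, beta_limit=1000, tol=0.001):
    """Stopping rule of step 3-4: beta > 1000 and tau < 0.001."""
    return beta > beta_limit and tau < tol


def splice(groups):
    """Reduce: merge the per-group rows and order them by row number a."""
    tagged = [item for rows in groups.values() for item in rows]
    return [row for _, row in sorted(tagged, key=lambda item: item[0])]


tau = block_residual([[1, 2], [3, 4]], [[1, 0], [0, 1]], [[1, 2], [3, 4]])
print(tau, should_stop(beta=1024, tau=tau))
print(splice({1: [(2, [3])], 0: [(1, [2]), (0, [1])]}))
```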

Step 4: perform adaptive noise addition based on differential privacy, adding Laplace noise to the second matrix L of the decomposition result of the data-independent load matrix W and to the independent data set D pointed to by the attributes whose frequency is greater than the minimum support.

In this embodiment, the amount of added noise is not specified in advance; instead, the privacy budget ε is chosen according to a formula in combination with the user's authority. The higher the user's authority, the closer the selected ε is to the upper bound, the weaker the privacy protection, and the higher the query accuracy (data availability); the lower the user's authority, the smaller the selected ε, the stronger the privacy protection, and the lower the query accuracy (data availability).
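One possible way to map authority to ε is shown below. The upper bound is taken as an input because formula (3) is not reproduced in this text; the linear interpolation, the normalised `authority` value in [0, 1], the lower limit `eps_floor` and the function name `select_epsilon` are all assumptions made for illustration, since the text only requires that higher authority gives an ε closer to the upper bound.

```python
def select_epsilon(eps_upper, authority, eps_floor=0.01):
    """Pick a privacy budget between eps_floor and eps_upper.

    authority = 1.0 returns the upper bound (least noise);
    authority = 0.0 returns eps_floor (most noise).
    """
    if not 0.0 <= authority <= 1.0:
        raise ValueError("authority must lie in [0, 1]")
    if eps_upper < eps_floor:
        raise ValueError("eps_upper must be at least eps_floor")
    return (1.0 - authority) * eps_floor + authority * eps_upper


for level in (0.0, 0.5, 1.0):
    print(level, select_epsilon(2.0, level))
```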

Fig. 4 is a flow chart of adaptive noise addition in an embodiment of the present invention.

The self-adaptive noise adding in the step 4 comprises the following steps:

Step 4-1: calculate the upper bound of the privacy budget ε by formula (3), and select ε according to the authority of the user.

In formula (3), ε is the privacy budget, which measures the degree of privacy protection; L is the second matrix of the decomposition result of the load matrix; ρ is a correlation coefficient in the range [−1, 1]; Δq is the sensitivity.

Step 4-2: add Laplace noise satisfying ε to L and D using the Laplace noise mechanism to achieve differential privacy protection. Since the result of the batch linear query is WD = BLD, and B changes as L changes, differential privacy protection can be achieved simply by adding Laplace noise satisfying ε to L and D.

Step 4-3: restore the attributes whose frequency is not greater than the minimum support, and their data, that were discarded in step 2. Since the data removed in step 2 carry their attributes, restoration only requires adding them directly to the noisy result set WD, which itself contains attributes and data.

Step 4-4: combine the noisy WD with the restored attributes and their data to form the noisy query result data set S. S is a set composed of the noisy WD and the restored attributes, whose frequency is not greater than the minimum support, together with their data. Here W is the data-independent load matrix (a matrix containing only attributes, without duplicates); D is the independent data set pointed to by the attributes whose frequency is greater than the minimum support, which contains no attributes and is part of the initial query result set M (M contains the attributes and data of all query results); L is the second matrix of the decomposition result of W. Since W = BL and B varies with L, W varies with L.
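Steps 4-2 and 4-3 can be sketched with the standard library alone. The noise scale Δq/ε is the usual choice for the Laplace mechanism and is assumed here, as the text does not state the scale explicitly; the record layout (one dictionary per row) and the names `laplace_sample`, `add_laplace_noise` and `restore` are hypothetical.

```python
import random


def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_laplace_noise(matrix, sensitivity, epsilon, rng):
    """Step 4-2: add independent Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return [[x + laplace_sample(scale, rng) for x in row] for row in matrix]


def restore(noisy_rows, discarded_rows):
    """Step 4-3: put the attributes discarded in step 2 back, without noise."""
    return [{**noisy, **extra} for noisy, extra in zip(noisy_rows, discarded_rows)]


rng = random.Random(0)                      # fixed seed so the run is repeatable
L = [[1.0, 2.0], [3.0, 4.0]]
noisy_L = add_laplace_noise(L, sensitivity=1.0, epsilon=0.5, rng=rng)
S = restore([{"age": 31.2}], [{"job": "nurse"}])
print(noisy_L)
print(S)
```

A smaller ε gives a larger scale and therefore more noise, which matches the relationship between user authority, privacy protection and query accuracy described for step 4.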

Step 5: return the noisy query result data set S obtained in step 4-4 to the user.

Action and Effect of the Embodiments

In view of the characteristics of batch linear queries, this embodiment reduces redundant information through data independence processing and improves query performance by using a parallel gradient descent matrix decomposition algorithm for the computation. In addition, building on a differential privacy protection algorithm, the embodiment designs an adaptive noise-addition algorithm that takes user authority into account and generates a reasonable amount of noise, thereby achieving privacy protection. The batch linear query method based on differential privacy of this embodiment is therefore both an efficient linear query algorithm and a privacy protection algorithm that balances query accuracy against the degree of privacy protection.

Further, to balance query accuracy against the degree of privacy protection, the embodiment does not specify the amount of added noise in advance; instead, it selects the privacy budget ε according to a formula in combination with user authority, so that a reasonable amount of noise is added adaptively, which reduces noise interference, improves query accuracy and preserves the usefulness of the data. A different ε can be selected for users with different authority, so that query accuracy and the degree of differential privacy protection are tied directly to the authority of the user.

The above embodiments are preferred examples of the present invention, and are not intended to limit the scope of the present invention.

Claims

1. A batch linear query method based on differential privacy is characterized by comprising the following steps:

Step 1: querying an original data set R to obtain a data query result set M;

Step 2: arranging the attributes of the original data set R in descending order of frequency, setting a minimum support, screening out the attributes whose frequency is not greater than the minimum support, and discarding those attributes and their corresponding data; for the attributes whose frequency is greater than the minimum support, obtaining the associated attributes of the data with an FP-tree and then performing data independence processing, to obtain an independent data set D whose attribute frequencies are greater than the minimum support;

Step 3: building an initial load matrix from the data query result set M, building a data-independent load matrix W on the basis of the initial load matrix using the attribute correlations from step 2, and decomposing W in parallel using a parallel gradient descent matrix decomposition technique to obtain the complete first matrix B and second matrix L of the decomposition of W;

Step 4: performing adaptive noise addition based on differential privacy by adding Laplace noise to the second matrix L and to the independent data set D, and restoring the attributes whose frequency is not greater than the minimum support, together with their corresponding data, that were discarded in step 2, to obtain a noisy query result data set S;

Step 5: returning the noisy query result data set S to the user,

wherein the gradients of the first matrix B and the second matrix L of the decomposition result are calculated by the following formula:

$$B = (\beta W L^{T} + \pi L^{T})(\beta L L^{T} + I)^{-1} \qquad (1)$$

in formulas (1) and (2), T is the transpose symbol; β is a positive penalty factor that needs to be initialized; I is the identity matrix; π is the Lagrange multiplier.

2. The batch linear query method based on differential privacy of claim 1, wherein:

wherein the data independence processing based on relevance analysis in step 2 comprises the following steps:

Step 2-1: scanning the original data set R to obtain the frequency of each attribute, and sorting the attributes in descending order of frequency to obtain a descending attribute frequency list;

Step 2-2: setting a minimum support and, according to the descending attribute frequency list, removing the attributes whose frequency is not greater than the minimum support together with their corresponding data;

Step 2-3: storing the remaining data set R', that is, R with the attributes whose frequency is not greater than the minimum support and their corresponding data removed, in a prefix tree to form an FP-tree, and creating a linked list for nodes that appear for the first time;

Step 2-4: processing the FP-tree with the FP-growth algorithm to mine association patterns;

Step 2-5: determining whether the leaf node lies on a single path; if so, removing the leaf node, generating the set of prefix paths, and proceeding to step 2-6; if not, generating the set of prefix paths of each path to form a new FP-tree, and returning to step 2-4;

Step 2-6: taking the set of prefix paths generated in step 2-5 and defining it as the associated attributes of the data;

3. The batch linear query method based on differential privacy of claim 1, wherein:

wherein the parallel gradient descent matrix decomposition in step 3 comprises the following steps:

Step 3-1: generating an initial load matrix according to the user's query requirements, based on the data query result set M obtained in step 1;

Step 3-2: converting the initial load matrix into a data-independent load matrix W according to the associated attributes obtained by the relevance-analysis-based data independence processing in step 2;

Step 3-3: running the Map process:

decomposing the data-independent load matrix W as W = BL, where B is the first matrix of the decomposition result and L is the second matrix, B has m rows and n columns, L has n rows and r columns, m is the number of query records, r is the maximum query attribute scale, and n is the number of nodes,

executing the matrix decomposition algorithm in parallel: splitting B into n matrices B_1, B_2, ..., B_i, ..., B_n, where B_i is the part of B held at the i-th node; splitting L by rows into n matrices L_1, L_2, ..., L_i, ..., L_n, where L_i is the part of L held at the i-th node; that is, dividing the data-independent load matrix W into n parts, each consisting of one B_i matrix and one L_i matrix, where each part has m/n rows, m is the number of rows of W, n is the number of nodes in the distributed system, and the part of W at the i-th node is W_i = B_i L_i,

applying the Map process of distributed computing: reading the decomposed data set and traversing each row, recording its row number a, then outputting the key as the group number, a/n rounded to an integer, and the value as the corresponding m/n rows of data,

performing the Combiner process: aggregating the data in each group into the data to be processed, and distributing the resulting parts to the n nodes,

Step 3-4: running the Reduce process:

calculating, on each node, the positive penalty factor β and the matrix norm τ of the difference between the data-independent load matrix W and the product of its decomposition matrices, where τ = ‖W − B_i L_i‖, updating β and τ, and stopping the iteration when β > 1000 and τ < 0.001,

applying the Reduce process of distributed computing: distributing B and L to the nodes, each node computing its B_i and L_i, and writing these, together with the group number a/n, to the Reduce process of the cloud computing platform for integration.

4. The batch linear query method based on differential privacy of claim 1, wherein:

wherein the adaptive noise addition in step 4 comprises the following steps:

Step 4-1: calculating the upper bound of the privacy budget ε by formula (3), and selecting ε according to the authority of the user,

in formula (3), ε is the privacy budget; L is the second matrix of the decomposition result of the data-independent load matrix; ρ is a correlation coefficient in the range [−1, 1]; Δq is the sensitivity;

Step 4-2: adding Laplace noise satisfying ε to L and D using the Laplace noise mechanism;

Step 4-3: restoring the attributes whose frequency is not greater than the minimum support, together with their corresponding data, that were discarded in step 2;

Step 4-4: obtaining the noisy query result data set S.

5. The batch linear query method based on differential privacy of claim 4, wherein:

the higher the authority of the user is, the closer the selected epsilon value is to the upper bound, the smaller the privacy protection degree is, and the higher the query precision is; the lower the authority of the user, the smaller the selected epsilon value, the greater the privacy protection degree and the lower the query precision.