CN108280366B - Batch linear query method based on differential privacy - Google Patents

Info

Publication number
CN108280366B
Authority
CN
China
Prior art keywords
data
matrix
attribute
decomposition
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810042656.8A
Other languages
Chinese (zh)
Other versions
CN108280366A (en)
Inventor
王迪
袁健
申泽宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201810042656.8A
Publication of CN108280366A
Application granted
Publication of CN108280366B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A batch linear query method based on differential privacy comprises the following steps: step 1: querying an original data set R to obtain a data query result set M; step 2: sorting the attribute frequencies of R in descending order, screening out the attributes whose frequency is not greater than the minimum support, and discarding those attributes and their corresponding data; performing data independence processing on the attributes whose frequency is greater than the minimum support to obtain an irrelevant data set D whose attribute frequencies are greater than the minimum support; step 3: establishing an initial load matrix from M, establishing a data-independent load matrix W on that basis, and decomposing W in parallel using a parallel gradient descent matrix decomposition technique to obtain a first matrix B and a second matrix L of the complete decomposition result of W; step 4: performing adaptive noise addition based on differential privacy, adding Laplace noise to L and D, and restoring the discarded attributes and data to obtain a noisy query result data set S; and step 5: returning S to the user.

Description

Batch linear query method based on differential privacy
Technical Field
The invention relates to the technical field of computers, in particular to a batch linear query method based on differential privacy.
Background
With the development of the internet, society has entered the era of big data. Batch linear query is the most common operation when processing big data, but the query scale is very large, the query process is complicated, and performance is low. In addition, sensitive information is easily leaked when big data is used, and query precision (data availability) and the degree of privacy protection cannot be guaranteed at the same time.
Prior-art algorithms cannot simultaneously guarantee the algorithm performance, query precision, and degree of privacy protection of batch linear queries. In terms of algorithm performance, existing algorithms have high complexity and are not suitable for large-scale batch linear queries. In terms of query precision, existing algorithms add noise to the query results and try to reduce the amount of noise required in order to optimize query precision; however, when the query sequence is given arbitrarily by the user, the computational overhead these mechanisms need to find the optimal noise distribution is very large and grows exponentially with the data dimension, so they cannot be used on large data sets. In terms of the degree of privacy protection, existing algorithms do not relate the amount of added noise to user authority, so they cannot ensure that an appropriate amount of noise is added for users with different authority: for a high-authority user, too much noise causes strong interference and reduces query precision; for a low-authority user, too little noise gives insufficient privacy protection.
Disclosure of Invention
The present invention is made to solve the above problems, and an object of the present invention is to provide a batch linear query method based on differential privacy.
The invention provides a batch linear query method based on differential privacy, characterized by comprising the following steps:
step 1: querying an original data set R to obtain a data query result set M;
step 2: sorting the attribute frequencies of the original data set R in descending order, setting a minimum support, screening out the attributes whose frequency is not greater than the minimum support, and discarding those attributes and their corresponding data; for the attributes whose frequency is greater than the minimum support, using an FP-tree to obtain the associated attributes of the data and then performing data independence processing to obtain an irrelevant data set D whose attribute frequencies are greater than the minimum support;
step 3: establishing an initial load matrix from the data query result set M, establishing a data-independent load matrix W on the basis of the initial load matrix using the attribute relevance obtained in step 2, and decomposing the data-independent load matrix W in parallel using a parallel gradient descent matrix decomposition technique to obtain a first matrix B and a second matrix L of the complete decomposition result of W;
step 4: performing adaptive noise addition based on differential privacy, adding Laplace noise to the second matrix L of the decomposition result and to the irrelevant data set D whose attribute frequencies are greater than the minimum support, and restoring the attributes discarded in step 2, whose frequency is not greater than the minimum support, together with their corresponding data, to obtain a noisy query result data set S;
step 5: returning the noisy query result data set S to the user.
The batch linear query method based on differential privacy provided by the invention may also have the following feature, wherein the data independence processing based on relevance analysis in step 2 comprises the following steps:
step 2-1: scanning the original data set R to obtain the frequency of each attribute, and sorting by attribute frequency in descending order to obtain a descending attribute-frequency list;
step 2-2: setting a minimum support and, according to the descending attribute-frequency list, removing the attributes whose frequency is not greater than the minimum support together with their corresponding data;
step 2-3: storing the remaining data set R', from which the attributes with frequency not greater than the minimum support and their corresponding data have been removed, in a prefix tree to form an FP-tree, and creating a linked list for each node that appears for the first time;
step 2-4: processing the FP-tree with the FP-growth algorithm to mine association patterns;
step 2-5: determining whether the leaf nodes lie on a single path; if so, removing the leaf nodes, generating the set of prefix paths, and going to step 2-6; if not, generating the set of prefix paths of each path to form a new FP-tree and returning to step 2-4;
step 2-6: taking the set of prefix paths generated in step 2-5 and defining it as the associated attributes of the data;
step 2-7: performing data independence processing, removing redundant data by using the relevance of the attributes.
The batch linear query method based on differential privacy provided by the invention may also have the following feature, wherein the parallel gradient descent matrix decomposition in step 3 comprises the following steps:
step 3-1: generating an initial load matrix according to the user's query requirements, based on the data query result set M obtained in step 1;
step 3-2: converting the initial load matrix into a data-independent load matrix W according to the associated attributes of the data obtained by the data independence processing based on relevance analysis in step 2;
step 3-3: running the Map process: decomposing the data-independent load matrix W as W = BL, where B is the first matrix of the decomposition result and L is the second matrix of the decomposition result; the B matrix has m rows and n columns and the L matrix has n rows and r columns, where m is the number of query records, r is the maximum query attribute scale, and n is the number of nodes. The first matrix B of the decomposition result and the gradient of the second matrix L of the decomposition result
[equation image BDA0001549933190000041]
are calculated by the following formulas:
B = (β W L^T + π L^T)(β L L^T + I)^{-1}    (1)
[formula (2): equation image BDA0001549933190000042]
In formulas (1) and (2), T denotes the transpose; β is a positive penalty factor that must be initialized; I is an identity matrix; π is a Lagrange multiplier.
The matrix decomposition algorithm is performed in parallel: B is decomposed into n matrices B_1, B_2, …, B_i, …, B_n, where B_i is the decomposition matrix of B at the i-th node; L is decomposed by rows into n matrices L_1, L_2, …, L_i, …, L_n, where L_i is the decomposition matrix of L at the i-th node. That is, the data-independent load matrix W is divided into n parts, each comprising one B_i matrix and one L_i matrix, where each part has m/n rows, m is the number of rows of W, n is the number of nodes in the distributed system, and the decomposition matrix of W at the i-th node is W_i = B_i L_i.
The Map process of distributed computing is introduced: first, the decomposed data set is accessed, each row of data is traversed, and its row number a is recorded; then the output key is the group number obtained by rounding a/n, and the value is the corresponding m/n rows of data.
The Combiner process is carried out: the data in each group are aggregated to form the data to be processed, and the divided parts are distributed to the n nodes;
step 3-4: running the Reduce process: on each node, the positive penalty factor β and the matrix norm τ of the difference between the data-independent load matrix W and the product of the decomposition matrices are calculated, where τ = ‖W − B_i L_i‖; β and τ are updated, and iteration stops when β > 1000 and τ < 0.001.
The Reduce process of distributed computing is introduced: B and L are distributed to each node, each node calculates B_i and L_i, and the results are written together with the group number a/n into the Reduce process of cloud computing for integration; the B_i and L_i with the same group number are spliced by row number a, yielding the complete L and B.
The batch linear query method based on differential privacy provided by the invention may also have the following feature, wherein the adaptive noise addition in step 4 comprises the following steps:
step 4-1: calculating the upper bound of the privacy budget ε by formula (3),
[formula (3): equation image BDA0001549933190000051]
and selecting ε according to the authority of the user. In formula (3), ε is the privacy budget; L is the second matrix of the decomposition result of the load matrix; ρ is a correlation coefficient in the range [-1, 1]; Δq is the sensitivity;
step 4-2: adding Laplace noise satisfying ε to L and D by means of the Laplace mechanism;
step 4-3: restoring the attributes discarded in step 2, whose frequency is not greater than the minimum support, together with their corresponding data;
step 4-4: obtaining the noisy query result data set S.
The batch linear query method based on the differential privacy provided by the invention can also have the following characteristics: the higher the authority of the user is, the closer the selected epsilon value is to the upper bound, the smaller the privacy protection degree is, and the higher the query precision is; the lower the authority of the user, the smaller the selected epsilon value, the greater the privacy protection degree and the lower the query precision.
Action and Effect of the invention
In view of the characteristics of batch linear queries, the method performs data independence processing based on relevance analysis to reduce redundant information, and improves query performance by using a parallel gradient descent matrix decomposition algorithm. In addition, building on a differential privacy protection algorithm, the method designs an adaptive noise-adding algorithm that takes user authority into account and generates a reasonable amount of noise, thereby achieving privacy protection. The batch linear query method based on differential privacy is therefore both an efficient linear query algorithm and a privacy protection algorithm that balances query precision against the degree of privacy protection.
Drawings
FIG. 1 is an overall flow diagram of a batch linear query method based on differential privacy in an embodiment of the invention;
FIG. 2 is a flow diagram of data independence processing based on relevance analysis in an embodiment of the invention;
FIG. 3 is a flow chart of a parallel gradient descent matrix decomposition in an embodiment of the invention; and
FIG. 4 is a flow chart of adaptive noise addition in an embodiment of the present invention.
Detailed Description
In order to make the technical means, the creation features, the achievement purposes and the effects of the present invention easy to understand, the following embodiments specifically describe the batch linear query method based on differential privacy in conjunction with the accompanying drawings.
FIG. 1 is an overall flowchart of a batch linear query method based on differential privacy in an embodiment of the present invention.
As shown in fig. 1, the batch linear query method based on differential privacy of the present invention includes the following steps:
Step 1: querying an original data set R to obtain a data query result set M. The original data set R contains attributes and data, and the attributes contain duplicates; the data query result set M likewise contains attributes and data, and its attributes contain duplicates.
Step 2: setting a minimum support, screening out the attributes whose frequency is not greater than the minimum support, and discarding those attributes and their corresponding data. For the attributes whose frequency is greater than the minimum support, an FP-tree is used to obtain the associated attributes of the data, and data independence processing is then performed: redundant attributes are removed and the data of each redundant attribute is added to the data of its associated attribute, yielding an irrelevant data set D whose attribute frequencies are greater than the minimum support. The discarded attributes and their data are not processed further in step 2 and are restored in a later step.
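By way of illustration only, the frequency counting and minimum-support screening described in Step 2 can be sketched in Python as follows. The record layout (one dictionary per record, mapping attribute name to value) and the function name `split_by_support` are assumptions made for this sketch and do not appear in the patent text.

```python
from collections import Counter

def split_by_support(records, min_support):
    """Hypothetical helper: count how often each attribute occurs, sort the
    counts in descending order, and split every record into the part kept for
    further processing and the part set aside for later restoration."""
    # Frequency of each attribute over all records.
    freq = Counter(attr for rec in records for attr in rec)
    # Descending frequency list (ties broken by name for determinism).
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only attributes whose frequency is greater than the minimum support.
    frequent = {attr for attr, count in ordered if count > min_support}
    kept = [{a: v for a, v in rec.items() if a in frequent} for rec in records]
    discarded = [{a: v for a, v in rec.items() if a not in frequent} for rec in records]
    return ordered, kept, discarded

records = [
    {"age": 30, "zip": "200093"},
    {"age": 41, "city": "Shanghai"},
    {"age": 25, "zip": "200000"},
]
ordered, kept, discarded = split_by_support(records, min_support=1)
```

The `discarded` list is retained rather than dropped because the method restores these attributes and their data in a later step.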
FIG. 2 is a flow diagram of data independence processing based on relevance analysis in an embodiment of the invention.
The data independence processing based on the relevance analysis in the step 2 comprises the following steps:
Step 2-1: scanning the original data set R to obtain the frequency of each attribute, and sorting by attribute frequency in descending order to obtain a descending attribute-frequency list.
Step 2-2: setting the minimum support and, according to the descending attribute-frequency list, removing the attributes whose frequency is not greater than the minimum support together with their data.
Step 2-3: storing the remaining data set R', from which the attributes with frequency not greater than the minimum support have been removed, in a prefix-tree data structure to form an FP-tree, and creating a linked list for each node that appears for the first time.
Step 2-4: processing the FP-tree with the FP-growth algorithm to mine association patterns.
Step 2-5: determining whether the leaf nodes lie on a single path. If so, removing the leaf nodes, generating the set of prefix paths, and going to step 2-6; if not, generating the set of prefix paths of each path to form a new FP-tree and returning to step 2-4.
Step 2-6: taking the set of prefix paths generated in step 2-5 and defining it as the associated attributes of the data.
Step 2-7: performing data independence processing, removing redundant data by using the relevance of the attributes.
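The FP-tree construction of Step 2-3 and the prefix-path collection used in Steps 2-5 and 2-6 can be sketched as below. This is a minimal textbook-style FP-tree, not the patent's implementation: the class `FPNode`, the functions `build_fp_tree` and `prefix_paths`, and the representation of each record as a list of attribute names are all assumptions made for illustration, and the recursive FP-growth mining itself is omitted.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree; `next` chains nodes holding the same item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.next = None

def build_fp_tree(transactions, min_support):
    # Keep only items whose frequency is greater than the minimum support.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in freq.items() if c > min_support}
    root = FPNode(None, None)
    header, tails = {}, {}  # item -> first / last node in its linked list
    for t in transactions:
        # Insert items in descending frequency order so shared prefixes merge.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # First appearance starts the linked list; later ones extend it.
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header

def prefix_paths(header, item):
    """Collect (path from root, count) for every node holding `item`."""
    paths = []
    node = header.get(item)
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            paths.append((path[::-1], node.count))
        node = node.next
    return paths

root, header = build_fp_tree([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]], 1)
```

In this sketch the prefix paths of an item play the role of the "set of prefix paths" that the text defines as the associated attributes of the data.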
Step 3: establishing an initial load matrix from the query result set M, and establishing a data-independent load matrix W on the basis of the initial load matrix using the data relevance obtained in step 2. The data-independent load matrix W is then decomposed in parallel using a parallel gradient descent matrix decomposition technique to obtain a first matrix B and a second matrix L of the complete decomposition result of W.
FIG. 3 is a flow diagram of a parallel gradient descent matrix decomposition technique in an embodiment of the invention.
The parallel gradient descent matrix decomposition technology in the step 3 comprises the following steps:
Step 3-1: generating an initial load matrix according to the user's query requirements, based on the query result set M obtained in step 1. The initial load matrix is a matrix containing only attributes, and the attributes contain duplicates.
Step 3-2: eliminating the associated data according to the associated attributes of the data obtained by the data independence processing based on relevance analysis, so as to generate the data-independent load matrix W. The data-independent load matrix W is a matrix containing attributes without duplicates.
Step 3-3: running the Map process:
The data-independent load matrix W is decomposed as W = BL, where B is the first matrix of the decomposition result and L is the second matrix of the decomposition result; the B matrix has m rows and n columns and the L matrix has n rows and r columns, where m is the number of query records, r is the maximum query attribute scale, and n is the number of nodes.
The first matrix B of the decomposition result and the gradient of the second matrix L of the decomposition result
[equation image BDA0001549933190000081]
are calculated by the following formulas:
B = (β W L^T + π L^T)(β L L^T + I)^{-1}    (1)
[formula (2): equation image BDA0001549933190000082]
In formulas (1) and (2), G denotes a function of L; T denotes the transpose; β is a positive penalty factor that must be initialized; I is an identity matrix; π is the Lagrange multiplier.
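As an illustrative sketch only, the closed-form update of B in formula (1) can be evaluated with NumPy as follows. The function name `update_B` is hypothetical, and the sketch assumes that the Lagrange multiplier π is a matrix of the same shape as W (m rows, r columns); formula (2) is not covered because it is not reproduced in the text.

```python
import numpy as np

def update_B(W, L, pi, beta):
    """Evaluate B = (beta*W*L^T + pi*L^T) (beta*L*L^T + I)^{-1}.

    Assumed shapes: W and pi are (m, r), L is (n, r), result B is (m, n).
    """
    n = L.shape[0]
    lhs = beta * W @ L.T + pi @ L.T          # (m, n)
    gram = beta * L @ L.T + np.eye(n)        # (n, n), symmetric positive definite
    # Solve B @ gram = lhs without forming the inverse explicitly;
    # because gram is symmetric, B^T = gram^{-1} lhs^T.
    return np.linalg.solve(gram, lhs.T).T

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
L = rng.normal(size=(2, 4))
pi = np.zeros_like(W)
B = update_B(W, L, pi, beta=10.0)
```

Solving the linear system instead of calling an explicit inverse is a numerical design choice of this sketch; it yields the same B as formula (1) up to rounding.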
The matrix decomposition algorithm is performed in parallel: B is decomposed into n matrices B_1, B_2, …, B_i, …, B_n, where B_i is the decomposition matrix of B at the i-th node; L is decomposed by rows into n matrices L_1, L_2, …, L_i, …, L_n, where L_i is the decomposition matrix of L at the i-th node. That is, the data-independent load matrix W is divided into n parts, each comprising one B_i matrix and one L_i matrix, where each part has m/n rows, m is the number of rows of W, n is the number of nodes in the distributed system, and the decomposition matrix of W at the i-th node is W_i = B_i L_i.
The Map process of distributed computing is introduced: first, the decomposed data set is accessed, each row of data is traversed, and its row number a is recorded; then the output key is the group number obtained by rounding a/n, and the value is the corresponding m/n rows of data.
The Combiner process is then carried out: the data in each group are aggregated to form the data to be processed, and the divided parts are distributed to the n nodes.
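The Map and Combiner processes described above can be sketched in plain Python, without any MapReduce framework, as follows. The function names `map_rows` and `combine` are hypothetical. The text states only that the key is the row number a divided by n and rounded; this sketch assumes floor division and takes the divisor as a parameter rather than fixing it.

```python
from collections import defaultdict

def map_rows(rows, divisor):
    """Map: emit (group number, (row number, row)) for every row.
    The group number is the row number floor-divided by `divisor` (an assumption)."""
    for a, row in enumerate(rows):
        yield a // divisor, (a, row)

def combine(pairs):
    """Combiner: gather the rows that share a group number."""
    groups = defaultdict(list)
    for key, item in pairs:
        groups[key].append(item)
    return dict(groups)

rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
groups = combine(map_rows(rows, divisor=2))
```

Keeping the row number alongside each row is what later allows the partial results to be spliced back in their original order.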
Step 3-4: running the Reduce process:
On each node, the positive penalty factor β and the matrix norm τ of the difference between the data-independent load matrix W and the product of the decomposition matrices are calculated, where τ = ‖W − B_i L_i‖; β and τ are updated, and iteration stops when β > 1000 and τ < 0.001.
The Reduce process of distributed computing is introduced: B and L are distributed to each node, each node calculates B_i and L_i, and the results are written together with the group number a/n into the Reduce process of cloud computing for integration.
The B_i and L_i with the same group number are spliced by row number a, yielding the complete L and B.
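The per-node residual, the stopping test, and the final splicing of Step 3-4 can be sketched as below. The helper names are hypothetical; the text speaks only of "a matrix norm", so the Frobenius norm used here is an assumption, and the rule for updating β between iterations is not given in the text and is therefore left out.

```python
import numpy as np

def residual_norm(W_i, B_i, L_i):
    """tau = ||W_i - B_i L_i|| (Frobenius norm assumed)."""
    return np.linalg.norm(W_i - B_i @ L_i)

def should_stop(beta, tau, beta_max=1000.0, tau_min=0.001):
    """Stopping rule from the text: beta > 1000 and tau < 0.001."""
    return beta > beta_max and tau < tau_min

def splice(blocks):
    """Reassemble a matrix from (first row number, block) pairs by row number."""
    ordered = sorted(blocks, key=lambda pair: pair[0])
    return np.vstack([block for _, block in ordered])

B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
L = np.array([[1.0, 2.0], [3.0, 4.0]])
W = B @ L
tau = residual_norm(W[:2], B[:2], L)
B_full = splice([(2, B[2:]), (0, B[:2])])
```

In the usage lines, the row block of B is paired with the whole of L so that the shapes conform; how the patent pairs B_i with L_i on each node is not spelled out beyond W_i = B_i L_i.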
Step 4: performing adaptive noise addition based on differential privacy, adding Laplace noise to the second matrix L of the decomposition result of the data-independent load matrix W and to the irrelevant data set D corresponding to the attributes whose frequency is greater than the minimum support.
In this embodiment, the amount of added noise is not specified in advance; instead, the privacy budget ε is chosen according to a formula in combination with the user's authority. The higher the user's authority, the closer the selected ε is to the upper bound, the weaker the privacy protection, and the higher the query precision (data availability); the lower the user's authority, the smaller the selected ε, the stronger the privacy protection, and the lower the query precision (data availability).
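The authority-dependent choice of ε can be sketched as follows. The patent states only that ε increases with authority and approaches the upper bound for high-authority users; the linear mapping, the normalisation of authority to the interval (0, 1], and the name `select_epsilon` are all assumptions of this sketch. The upper bound itself is taken as an input, since formula (3) is not reproduced in the text.

```python
def select_epsilon(authority, epsilon_upper):
    """Hypothetical mapping: scale the upper bound of epsilon by a normalised
    authority level in (0, 1]. Higher authority gives a larger epsilon,
    i.e. less noise and weaker privacy protection."""
    if not 0.0 < authority <= 1.0:
        raise ValueError("authority must lie in (0, 1]")
    if epsilon_upper <= 0.0:
        raise ValueError("epsilon_upper must be positive")
    return authority * epsilon_upper

eps_high = select_epsilon(1.0, 2.0)    # highest authority reaches the upper bound
eps_low = select_epsilon(0.25, 2.0)    # lower authority gets a smaller budget
```

Any monotonically increasing mapping capped at the upper bound would be consistent with the description; the linear one is chosen here only for simplicity.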
Fig. 4 is a flow chart of adaptive noise addition in an embodiment of the present invention.
The self-adaptive noise adding in the step 4 comprises the following steps:
Step 4-1: calculating the upper bound of the privacy budget ε by formula (3),
[formula (3): equation image BDA0001549933190000101]
and selecting ε according to the authority of the user.
In formula (3), ε is the privacy budget, used to measure the degree of privacy protection; L is the second matrix of the decomposition result of the load matrix; ρ is a correlation coefficient in the range [-1, 1]; Δq is the sensitivity.
Step 4-2: adding Laplace noise satisfying ε to L and D by means of the Laplace mechanism to achieve differential privacy protection. Since the result of the batch linear query is WD = BLD, and B changes as L changes, differential privacy protection can be achieved simply by adding Laplace noise satisfying ε to L and D.
Step 4-3: restoring the attributes discarded in step 2, whose frequency is not greater than the minimum support, together with their data. Since the data removed in step 2 carries its attributes, restoration only requires adding that data directly to the noisy result set WD, which contains attributes and data.
Step 4-4: combining the noisy WD with the restored attributes and their data to form the noisy query result data set S. S is a set consisting of the noisy WD and the restored attributes with frequency not greater than the minimum support and their data, where W is the data-independent load matrix (a matrix containing only attributes, without duplicates); D is the irrelevant data set corresponding to the attributes whose frequency is greater than the minimum support, contains no attributes, and is part of the initial query result set M (which contains the attributes and data of all query results); and L is the second matrix of the decomposition result of W. Since W = BL and B varies with L, W varies with L.
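The noise addition of Step 4-2 and the noisy answer WD = BLD can be sketched with NumPy as follows. The sketch uses the standard Laplace mechanism with scale Δq/ε; the function names are hypothetical, and the text does not say how the budget is divided between L and D, so using the same scale for both is an assumption made purely for illustration and is not a privacy accounting.

```python
import numpy as np

def add_laplace_noise(X, sensitivity, epsilon, rng):
    """Return X plus i.i.d. Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return X + rng.laplace(loc=0.0, scale=scale, size=X.shape)

def noisy_batch_answer(B, L, D, sensitivity, epsilon, rng):
    """Perturb L and D, then evaluate the batch answer as B @ L @ D."""
    L_noisy = add_laplace_noise(L, sensitivity, epsilon, rng)
    D_noisy = add_laplace_noise(D, sensitivity, epsilon, rng)
    return B @ L_noisy @ D_noisy

rng = np.random.default_rng(0)
noisy = add_laplace_noise(np.zeros((1000, 10)), sensitivity=1.0, epsilon=0.5, rng=rng)
answer = noisy_batch_answer(np.ones((3, 2)), np.ones((2, 4)), np.ones((4, 1)),
                            sensitivity=1.0, epsilon=1.0, rng=rng)
```

A smaller ε gives a larger noise scale, which matches the stated relationship between low authority, stronger privacy protection, and lower query precision.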
Step 5: returning the noisy query result data set S obtained in step 4-4 to the user.
Actions and effects of the embodiment
In view of the characteristics of batch linear queries, this embodiment reduces redundant information through data independence processing and improves query performance by using a parallel gradient descent matrix decomposition algorithm for the computation. In addition, building on a differential privacy protection algorithm, the embodiment designs an adaptive noise-adding algorithm that takes user authority into account and generates a reasonable amount of noise, thereby achieving privacy protection. The batch linear query method based on differential privacy of this embodiment is therefore both an efficient linear query algorithm and a privacy protection algorithm that balances query precision against the degree of privacy protection.
Further, to balance query precision against the degree of privacy protection, the embodiment does not specify the amount of added noise in advance; instead, it selects the privacy budget ε according to a formula in combination with user authority, so that a reasonable amount of noise is added adaptively, which reduces noise interference, improves query precision, and preserves the usefulness of the data. Different values of ε can be selected for users with different authority, so that query precision and the degree of differential privacy protection are tied to the authority of the user.
The above embodiments are preferred examples of the present invention, and are not intended to limit the scope of the present invention.

Claims (5)

1. A batch linear query method based on differential privacy, characterized by comprising the following steps:
step 1: querying an original data set R to obtain a data query result set M;
step 2: sorting the attribute frequencies of the original data set R in descending order, setting a minimum support, screening out the attributes whose frequency is not greater than the minimum support, and discarding those attributes and their corresponding data; for the attributes whose frequency is greater than the minimum support, using an FP-tree to obtain the associated attributes of the data and then performing data independence processing to obtain an irrelevant data set D whose attribute frequencies are greater than the minimum support;
step 3: establishing an initial load matrix from the data query result set M, establishing a data-independent load matrix W on the basis of the initial load matrix using the attribute relevance obtained in step 2, and decomposing the data-independent load matrix W in parallel using a parallel gradient descent matrix decomposition technique to obtain a first matrix B and a second matrix L of the complete decomposition result of W;
step 4: performing adaptive noise addition based on differential privacy, adding Laplace noise to the second matrix L of the decomposition result and to the irrelevant data set D whose attribute frequencies are greater than the minimum support, and restoring the attributes discarded in step 2, whose frequency is not greater than the minimum support, together with their corresponding data, to obtain a noisy query result data set S;
step 5: returning the noisy query result data set S to the user,
wherein the first matrix B of the decomposition result and the gradient of the second matrix L of the decomposition result
[equation image FDA0003128150520000011]
are calculated by the following formulas:
B = (β W L^T + π L^T)(β L L^T + I)^{-1}    (1)
[formula (2): equation image FDA0003128150520000012]
in formulas (1) and (2), T denotes the transpose; β is a positive penalty factor that must be initialized; I is an identity matrix; π is the Lagrange multiplier.
2. The batch linear query method based on differential privacy of claim 1,
wherein the data independence processing based on relevance analysis in step 2 comprises the following steps:
step 2-1: scanning the original data set R to obtain the frequency of each attribute, and sorting by attribute frequency in descending order to obtain a descending attribute-frequency list;
step 2-2: setting a minimum support and, according to the descending attribute-frequency list, removing the attributes whose frequency is not greater than the minimum support together with their corresponding data;
step 2-3: storing the remaining data set R', from which the attributes with frequency not greater than the minimum support and their corresponding data have been removed, in a prefix tree to form an FP-tree, and creating a linked list for each node that appears for the first time;
step 2-4: processing the FP-tree with the FP-growth algorithm to mine association patterns;
step 2-5: determining whether the leaf nodes lie on a single path; if so, removing the leaf nodes, generating the set of prefix paths, and going to step 2-6; if not, generating the set of prefix paths of each path to form a new FP-tree and returning to step 2-4;
step 2-6: taking the set of prefix paths generated in step 2-5 and defining it as the associated attributes of the data;
step 2-7: performing data independence processing, removing redundant data by using the relevance of the attributes.
3. The batch linear query method based on differential privacy of claim 1, wherein:
wherein the parallel gradient descent matrix decomposition in step 3 comprises the following steps:
step 3-1: generating an initial load matrix according to the query requirement of a user based on the data query result set M obtained in the step 1;
step 3-2: converting the initial load matrix into a data-independent load matrix W according to the associated attributes of the data obtained by the data independence processing based on relevance analysis in step 2;
step 3-3: running the Map process:
decomposing the data-independent load matrix W as W = BL, B being the first matrix of the decomposition results and L being the second matrix of the decomposition results, wherein the B matrix has m rows and n columns, the L matrix has n rows and r columns, m represents the number of query records, r represents the maximum query attribute scale, and n represents the number of nodes,
performing the matrix decomposition algorithm in parallel: decomposing B into n matrices B_1, B_2, …, B_i, …, B_n, where B_i denotes the decomposition matrix of B at the i-th node; decomposing L by rows into n matrices L_1, L_2, …, L_i, …, L_n, where L_i denotes the decomposition matrix of L at the i-th node; that is, the data-independent load matrix W is divided into n parts, each part comprising one B_i matrix and one L_i matrix, where the number of rows in each part is m/n, m is the number of rows of W, n is the number of nodes in the distributed system, and the decomposition matrix of W at the i-th node is expressed as W_i = B_i L_i,
introducing the Map process of distributed computing: first accessing the decomposed data set, traversing each row of data and recording its row number a; then outputting as the key the group number obtained by rounding a/n, and as the value the m/n rows of data,
performing the Combiner process: aggregating the data in each group to form the data to be processed, and distributing the divided parts to the n nodes,
step 3-4: running the Reduce process:
calculating at each node the positive penalty factor β and the matrix norm τ of the difference between the data-independent load matrix W and the product of its decomposition matrices, τ being calculated by the formula $\tau = \lVert W - B_i L_i \rVert$, updating β and τ, and stopping the iteration when β > 1000 and τ < 0.001,
the Reduce process of distributed computing: B and L are distributed to the nodes, each node calculates B_i and L_i together with the group number a/n, and the Reduce process written for cloud computing performs the integration:
B_i and L_i with the same group number are spliced by row number a, thereby obtaining the complete L and B.
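The row grouping and splicing described in steps 3-3 and 3-4 can be illustrated without a real MapReduce framework. The sketch below is only an approximation under stated assumptions: the function names are hypothetical, a matrix is a plain list of rows, and the group key is taken as the row number a divided by the number of rows per node, so that exactly n groups result (the claim text itself writes the group number as a/n). The stopping test uses the thresholds given in step 3-4.

```python
from collections import defaultdict

def map_rows(W, n_nodes):
    """Map: emit (group number, (row number a, row)) for every row of W."""
    m = len(W)
    rows_per_group = -(-m // n_nodes)  # ceil(m / n_nodes), assumed group size
    for a, row in enumerate(W):
        yield a // rows_per_group, (a, row)

def combine(pairs):
    """Combiner: gather the emitted rows of each group into one work unit."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_splice(groups):
    """Reduce: splice the per-group rows back together by row number a."""
    numbered = [item for key in sorted(groups) for item in groups[key]]
    numbered.sort(key=lambda item: item[0])
    return [row for _, row in numbered]

def should_stop(beta, tau, beta_max=1000.0, tau_min=0.001):
    """Stopping rule of step 3-4: beta > 1000 and tau < 0.001."""
    return beta > beta_max and tau < tau_min
```

In an actual deployment each group would be processed on its own node between the combine and reduce phases; here the groups are simply passed through to show that splicing by row number restores the original row order.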
4. The batch linear query method based on differential privacy of claim 1, wherein:
wherein, the self-adaptive noise adding in the step 4 comprises the following steps:
step 4-1: calculating the upper bound of the privacy budget ε by formula (3)
Figure FDA0003128150520000041
and selecting ε according to the authority of the user,
in formula (3), ε is the privacy budget; L is the second matrix of the decomposition results of the data-independent load matrix; ρ represents a correlation coefficient in the range [-1, 1]; Δq is the sensitivity;
step 4-2: adding Laplace noise satisfying the privacy budget ε to L and D by using the Laplace noise mechanism;
step 4-3: restoring the attributes discarded in step 2, whose frequency is not greater than the minimum support, together with the data corresponding to those attributes;
step 4-4: a noisy query result dataset S is obtained.
5. The batch linear query method based on differential privacy of claim 4, wherein:
the higher the authority of the user is, the closer the selected epsilon value is to the upper bound, the smaller the privacy protection degree is, and the higher the query precision is; the lower the authority of the user, the smaller the selected epsilon value, the greater the privacy protection degree and the lower the query precision.
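The relationship between user authority, ε and the amount of noise can be sketched as follows. This is an illustrative sketch only: the upper bound of ε from formula (3) is treated as a given input because the formula is not reproduced in the text, the mapping of authority to a fraction in (0, 1] is a hypothetical choice, and the function names are invented for this example. The noise scale Δq/ε is the standard scale of the Laplace mechanism.

```python
import random

def select_epsilon(eps_upper, authority):
    """Pick epsilon as a fraction of its upper bound.

    `authority` in (0, 1] is a hypothetical encoding of the user's authority:
    higher authority gives an epsilon closer to the upper bound.
    """
    if not 0.0 < authority <= 1.0:
        raise ValueError("authority must lie in (0, 1]")
    return eps_upper * authority

def laplace_noise(scale, rng):
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_laplace_noise(matrix, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to every entry."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [[x + laplace_noise(scale, rng) for x in row] for row in matrix]
```

A smaller ε produces a larger noise scale, which corresponds to stronger privacy protection and lower query precision, as stated in claim 5.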
CN201810042656.8A 2018-01-17 2018-01-17 Batch linear query method based on differential privacy Active CN108280366B (en)

Publications (2)

Publication Number Publication Date
CN108280366A CN108280366A (en) 2018-07-13
CN108280366B true CN108280366B (en) 2021-10-01

Family

ID=62803867

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant