CN116167078A - Differential privacy synthetic data publishing method based on maximum weight matching
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The invention discloses a differential privacy synthetic data publishing method based on maximum weight matching. A trusted third-party server first preprocesses the collected user data; the attributes of the processed data are then represented with a graph model to obtain an attribute association graph; a suitable set of low-dimensional marginals is selected with a maximum weight matching algorithm; noise satisfying the differential privacy definition is added to each of the low-dimensional marginals; the noisy low-dimensional marginals are post-processed to obtain a set of normalized low-dimensional marginals; data synthesis is then performed from this set of low-dimensional marginals so that the synthetic data set is as close as possible to the original data set in its statistical information, and finally the synthetic data set is published. With this technical approach, the computational complexity can be reduced while the utility of the synthetic data set is preserved, and the method performs well on high-dimensional data.
Description
Technical Field
The invention belongs to the field of information security, and particularly relates to a differential privacy synthetic data publishing method based on maximum weight matching.
Background
With the rapid development of information technology, online registration, online shopping, online travel and the like have gradually become part of people's lives. Various organizations (such as hospitals and public transport operators) can easily obtain detailed user information, and many organizations have accumulated large amounts of user data. Statistical analysis of these data can provide effective support for subsequent tasks such as predictive analysis, so the data have great research value. To meet the needs of research and innovation, the organizations concerned may publish the data they have obtained; however, the published information often includes individuals' private information, and if this private data is not properly protected it is easily leaked, causing immeasurable loss. Research on privacy-preserving data publishing methods is therefore indispensable.
Methods for protecting data privacy are known as disclosure limitation. These techniques aim to protect sensitive information while still publishing data to the public so that researchers can perform statistical analysis on it. A common approach to protecting private data in published data is anonymization, i.e., deleting the sensitive information in the data. However, many studies have shown that deleting sensitive information alone cannot effectively achieve privacy protection: an attacker can still recover sensitive information by identifying and analysing other attributes, so anonymization cannot provide a strong privacy guarantee for the data publishing process.
A reliable approach to privacy protection at present is the differential privacy model. It does not depend on how much background knowledge an attacker has; it achieves privacy protection by adding appropriate noise to query or analysis results, thereby providing a reliable privacy guarantee for data publishing. The differential privacy model has a clear mathematical definition and rigorous proofs. Intuitively, adding or deleting a single record in adjacent data sets has little influence on the result of computations over the data set; that is, the probability of identifying personal information from data processed with the differential privacy model is kept within a small range.
In recent years there have been many research schemes for publishing private data based on the differential privacy model, such as Haar wavelet transforms, histograms, and partitioning. However, these methods are designed for specific tasks, require a certain amount of expertise from the third-party central server performing the processing, and have difficulty fully exploiting the data. Synthetic data publishing schemes based on differential privacy have therefore been proposed: the synthetic data set approximates the original data and can replace it in analysis tasks, achieving privacy protection while maintaining the accuracy of the synthetic data set. However, the computational complexity of currently proposed differentially private synthetic data publishing schemes tends to be high, and a large amount of noise is added for high-dimensional data sets, rendering the synthetic data set unusable.
Disclosure of Invention
The invention provides a differential privacy synthetic data publishing method based on maximum weight matching. A maximum weight matching algorithm is applied to a probability graph model constructed from the original data set to obtain low-dimensional distributions; appropriate noise is added to the low-dimensional distributions based on the differential privacy method; the noisy distributions are post-processed and used to synthesize data; and the synthetic data set is then published. This achieves privacy protection of user information while improving the utility of the synthetic data set.
In order to achieve the above object, the present invention adopts the following technical scheme.
A differential privacy synthetic data publishing method based on maximum weight matching obtains a low-dimensional marginal set by applying a maximum weight matching algorithm to a probability graph model constructed from the original data set, adds appropriate noise to the low-dimensional distributions based on the differential privacy method, performs data synthesis through post-processing, and then publishes the synthetic data set, thereby achieving privacy protection of user information and improving the utility of the synthetic data set.
The method comprises the following steps:
S1, a server aggregates the collected user data to obtain an initial data set, and a weighted probability graph model is built from the data set;
S2, the server applies a maximum weight matching algorithm to the generated probability graph model to obtain a set of highly correlated low-dimensional marginals;
S3, appropriate Gaussian noise is added to the low-dimensional marginal set according to the definition of the differential privacy model and a reasonable allocation of the privacy budget;
S4, after noise is added, the probability data may become negative and inconsistent, so post-processing is required;
S5, the high-dimensional data set is synthesized from the noised and post-processed low-dimensional marginal set so as to approximate the original data set in its statistical information, and finally the server publishes the synthetic data set.
In step S1, a correlation coefficient may be calculated from the relationship between attributes and used as the edge weight of the probability graph. Given the data set D, an attribute graph G(V, E) is generated, where V = {V_1, V_2, ..., V_d} denotes the attribute nodes and d is the number of attributes. The weight of the edge connecting two nodes in G(V, E) is the correlation coefficient of the attributes represented by those two nodes; the correlation coefficient is calculated from the following quantities: V_i and V_j denote the i-th and j-th attribute nodes, Pr[V_i, V_j] denotes the joint probability of attributes V_i and V_j, and Pr[V_i] and Pr[V_j] denote the probabilities of attributes V_i and V_j, respectively.
In step S2, applying the maximum weight matching algorithm to the constructed probability graph model includes the following processes:
S21, initializing the selected set M of low-dimensional marginals as an empty set;
S22, selecting the edge with the largest weight value in the probability graph model G, indicating that the two attribute nodes it connects are highly correlated, taking that attribute-node pair as a selected low-dimensional marginal m_i, adding it to the set M, and deleting all edges connected to those attribute nodes in graph G;
S23, repeating step S22 until no attribute nodes remain in the attribute graph G, yielding the maximum-weight low-dimensional marginals.
In step S3, the privacy budget is allocated and noise is added to the low-dimensional marginals as follows:
S31, allocating the privacy budget: different budget proportions are obtained by minimizing the expected squared error of the noise scales, so that appropriate noise is added to each low-dimensional distribution; the optimization is subject to
q_1 + q_2 + ... + q_k = 1 and 0 ≤ q_i, i = 1, 2, ..., k,
where the objective involves the domain sizes of the k low-dimensional marginals, and {q_1, q_2, ..., q_k} are the privacy-budget proportions of the corresponding views.
S32, (ε, δ)-differential privacy definition: let A be a random mechanism and S_A a set of possible outputs of A; for ε > 0 and δ ≥ 0, the mechanism A is said to satisfy (ε, δ)-differential privacy if, for any two adjacent data sets D and D', the following inequality holds:
Pr[A(D) ∈ S_A] ≤ e^ε · Pr[A(D') ∈ S_A] + δ
S33, zero-concentrated differential privacy (ρ-zCDP) definition: a random mechanism A is said to satisfy ρ-zCDP if, for any two adjacent data sets D and D' and all α ∈ (1, +∞), the following inequality holds:
D_α(A(D) || A(D')) ≤ ρα
where D_α(A(D) || A(D')) is the α-Rényi divergence between the distributions of A(D) and A(D'), which characterizes the privacy-loss random variable.
S34, theorem: if a random mechanism A satisfies ρ-zCDP, then for any δ > 0 the mechanism A satisfies (ρ + 2·sqrt(ρ·ln(1/δ)), δ)-differential privacy.
S35, Gaussian mechanism definition: given a sensitive query f: X^n → R, for an input data set D the Gaussian mechanism A satisfies:
A(D) = f(D) + N(0, σ²)
where σ is the noise scale, which can be derived from the above definitions, and Δ_f denotes the sensitivity of the sensitive query f.
S36, from the above definitions and the theorem, the noise scale σ_i to be added to the i-th low-dimensional marginal is determined accordingly from its allocated budget proportion q_i and the sensitivity, for i = 1, ..., k.
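As a worked illustration of the conversion in step S34 (not part of the original disclosure), the standard zCDP bound ε = ρ + 2·sqrt(ρ·ln(1/δ)) can be evaluated directly; the function name and the example numbers below are illustrative only.

```python
import math

def zcdp_to_dp(rho: float, delta: float) -> float:
    """Standard conversion: rho-zCDP implies (eps, delta)-differential privacy."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example: rho = 0.1 and delta = 1e-5 give eps of roughly 2.25.
print(zcdp_to_dp(0.1, 1e-5))
```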
In step S4, since the count values become non-integer and may be negative after noise is added, the noised low-dimensional distributions must be post-processed, specifically as follows:
S41, non-negativity processing: the count of each cell in the noisy distribution must be non-negative, so negative-count cells are corrected, which improves utility.
To preserve the noise scale, the negative counts are collected and summed as negative_sum, all negative-count cells are set to 0, the positive-count cells are sorted in ascending order, and the total negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0.
S42, normalization of the noisy count values: the counts are non-integer and their sum changes, so the sum is no longer equal to the number of records; normalization is needed to improve accuracy.
Each count value is divided by the total count to obtain its proportion of the total (the proportions sum to 1); each proportion is multiplied by the number of records in the original data to obtain the final count value; the integer part and fractional part of each count value are then separated, the fractional parts are summed, and, since fractional remainders may occur, the integer obtained from this sum is added to the cell with the largest integer part.
In step S5, the data set is synthesized from the low-dimensional noise distributions as follows:
S51, initializing the synthetic data set D_syn and synthesizing it according to the target distribution (i.e., the noisy distribution obtained in the preceding steps);
S52, for cells whose initial count value is smaller than the target distribution count value, adding min{c_t - c_s, α·c_s} records, where c_t denotes the target count value, c_s denotes the initial count value, and α denotes an attenuation factor computed from an initial value α_0, a decay rate k, the iteration number t, and a step size s;
S53, for cells whose initial count value is larger than the target distribution count value, removing min{c_s - c_t, β·c_s} records, where likewise c_t denotes the target count value, c_s denotes the initial count value, and β is computed analogously.
Beneficial effects: compared with the prior art, the method obtains a highly correlated low-dimensional marginal set by applying the maximum weight matching algorithm, which maximizes the global correlation score, ensures the utility of the synthetic data set, reduces the computational complexity, and performs well on high-dimensional data.
Drawings
FIG. 1 is a schematic flow chart of an example of the invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The invention provides a differential privacy synthetic data publishing method based on maximum weight matching; the implementation steps are described in detail below with reference to FIG. 1.
In step S1, the server aggregates the collected data to obtain an initial data set and constructs a weighted probability graph model from it. Given the data set D, an attribute graph G(V, E) is generated, where V = {V_1, V_2, ..., V_d} denotes the attribute nodes and d is the number of attributes. The weight of the edge connecting two nodes in G(V, E) is the correlation coefficient of the attributes represented by those two nodes; the correlation coefficient is calculated from the following quantities: V_i and V_j denote the i-th and j-th attribute nodes, Pr[V_i, V_j] denotes the joint probability of attributes V_i and V_j, and Pr[V_i] and Pr[V_j] denote the probabilities of attributes V_i and V_j, respectively.
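As a concrete sketch of step S1 (illustrative only, not the patented implementation), the snippet below builds the weighted attribute graph in Python. Mutual information is used here as an assumed stand-in for the correlation coefficient, since it is computed from exactly the joint probability Pr[V_i, V_j] and the marginal probabilities Pr[V_i], Pr[V_j] named above; the function name is hypothetical.

```python
import itertools
import numpy as np
import pandas as pd

def attribute_graph(data: pd.DataFrame) -> dict:
    """Return {(attr_i, attr_j): weight} for every pair of attributes.

    Mutual information (an assumed choice of correlation coefficient,
    computed from the joint and marginal probabilities) is the edge weight.
    """
    n = len(data)
    weights = {}
    for a, b in itertools.combinations(data.columns, 2):
        joint = data.groupby([a, b]).size() / n          # Pr[V_i, V_j]
        pa = data[a].value_counts(normalize=True)        # Pr[V_i]
        pb = data[b].value_counts(normalize=True)        # Pr[V_j]
        mi = 0.0
        for (va, vb), p_ab in joint.items():
            mi += p_ab * np.log(p_ab / (pa[va] * pb[vb]))
        weights[(a, b)] = mi
    return weights
```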
In step S2, the server applies a maximum weight matching algorithm to the generated probability graph model to obtain highly correlated low-dimensional marginals, including the following steps:
S21, initializing the selected set M of low-dimensional marginals as an empty set;
S22, selecting the edge with the largest weight value in the probability graph model G, indicating that the two attribute nodes it connects are highly correlated, taking that attribute-node pair as a selected low-dimensional marginal m_i, adding it to the set M, and deleting all edges connected to those attribute nodes in graph G;
S23, repeating step S22 until no attribute nodes remain in the attribute graph G, yielding the maximum-weight low-dimensional marginals.
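The following sketch mirrors the greedy selection of steps S21 to S23, assuming edge weights in the form produced by the previous sketch: it repeatedly takes the heaviest remaining edge and removes both endpoints, so each attribute appears in at most one selected marginal. (If an exact maximum weight matching is preferred over this greedy procedure, it could also be computed with, for example, networkx.max_weight_matching.)

```python
def greedy_max_weight_matching(weights: dict) -> list:
    """Greedily select disjoint attribute pairs (low-dimensional marginals).

    weights: {(attr_i, attr_j): weight} as produced by attribute_graph.
    Returns a list M of attribute pairs in which no attribute is repeated.
    """
    M, used = [], set()
    # Consider edges from heaviest to lightest (step S22).
    for (a, b), _w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if a in used or b in used:
            continue  # edges touching already-matched nodes are "deleted"
        M.append((a, b))
        used.update((a, b))
    # The loop ends once no selectable edge remains (step S23).
    return M
```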
In step S3, appropriate Gaussian noise is added to the low-dimensional marginals according to the differential privacy model and a reasonable allocation of the privacy budget, comprising the following procedures:
S31, allocating the privacy budget: different budget proportions are obtained by minimizing the expected squared error of the noise scales, so that appropriate noise is added to each low-dimensional distribution; the optimization is subject to
q_1 + q_2 + ... + q_k = 1 and 0 ≤ q_i, i = 1, 2, ..., k,
where the objective involves the domain sizes of the k low-dimensional marginals, and {q_1, q_2, ..., q_k} are the privacy-budget proportions of the corresponding views.
S32, (ε, δ)-differential privacy definition: let A be a random mechanism and S_A a set of possible outputs of A; for ε > 0 and δ ≥ 0, the mechanism A is said to satisfy (ε, δ)-differential privacy if, for any two adjacent data sets D and D', the following inequality holds:
Pr[A(D) ∈ S_A] ≤ e^ε · Pr[A(D') ∈ S_A] + δ
S33, zero-concentrated differential privacy (ρ-zCDP) definition: a random mechanism A is said to satisfy ρ-zCDP if, for any two adjacent data sets D and D' and all α ∈ (1, +∞), the following inequality holds:
D_α(A(D) || A(D')) ≤ ρα
where D_α(A(D) || A(D')) is the α-Rényi divergence between the distributions of A(D) and A(D'), which characterizes the privacy-loss random variable.
S34, theorem: if a random mechanism A satisfies ρ-zCDP, then for any δ > 0 the mechanism A satisfies (ρ + 2·sqrt(ρ·ln(1/δ)), δ)-differential privacy.
S35, Gaussian mechanism definition: given a sensitive query f: X^n → R, for an input data set D the Gaussian mechanism A satisfies:
A(D) = f(D) + N(0, σ²)
where σ is the noise scale, which can be derived from the above definitions, and Δ_f denotes the sensitivity of the sensitive query f.
S36, from the above definitions and the theorem, the noise scale σ_i to be added to the i-th low-dimensional marginal is determined accordingly from its allocated budget proportion q_i and the sensitivity, for i = 1, ..., k.
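A minimal sketch of steps S31 to S36 under explicit assumptions not taken from the disclosure: the total zCDP budget ρ is split proportionally to each marginal's domain size (a simple substitute for the closed-form solution of the optimization in S31), the count query has sensitivity 1, and the noise scale follows the standard zCDP calibration σ_i = 1 / sqrt(2·q_i·ρ). Function and variable names are illustrative.

```python
import numpy as np
import pandas as pd

def noisy_marginals(data: pd.DataFrame, pairs: list, rho: float, seed: int = 0) -> dict:
    """Add Gaussian noise to each selected 2-way marginal (steps S31-S36).

    Assumptions (not from the disclosure): budget share q_i proportional to
    the marginal's domain size, sensitivity 1, and noise scale
    sigma_i = 1 / sqrt(2 * q_i * rho), the standard zCDP calibration.
    """
    rng = np.random.default_rng(seed)
    domain = {p: data[p[0]].nunique() * data[p[1]].nunique() for p in pairs}
    total = sum(domain.values())
    out = {}
    for p in pairs:
        q_i = domain[p] / total                       # assumed budget proportion
        sigma = 1.0 / np.sqrt(2.0 * q_i * rho)        # Gaussian noise scale
        counts = data.groupby(list(p)).size()         # true marginal counts
        out[p] = counts + rng.normal(0.0, sigma, size=len(counts))
    return out
```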
In step S4, the noised low-dimensional distributions are post-processed, specifically as follows:
S41, non-negativity processing: the count of each cell in the noisy distribution must be non-negative, so negative-count cells are corrected, which improves utility.
To preserve the noise scale, the negative counts are collected and summed as negative_sum, all negative-count cells are set to 0, the positive-count cells are sorted in ascending order, and the total negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0.
S42, normalization of the noisy count values: the counts are non-integer and their sum changes, so the sum is no longer equal to the number of records; normalization is needed to improve accuracy.
Each count value is divided by the total count to obtain its proportion of the total (the proportions sum to 1); each proportion is multiplied by the number of records in the original data to obtain the final count value; the integer part and fractional part of each count value are then separated, the fractional parts are summed, and, since fractional remainders may occur, the integer obtained from this sum is added to the cell with the largest integer part.
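The following sketch implements the post-processing described in steps S41 and S42 for a single marginal's count vector: the negative mass is zeroed and consumed from the smallest positive counts upward, and the counts are then rescaled and rounded so that they sum to the original number of records. The function name is illustrative.

```python
import numpy as np

def postprocess_counts(noisy: np.ndarray, n_records: int) -> np.ndarray:
    """Non-negativity correction (S41) and normalization/rounding (S42)."""
    counts = noisy.astype(float).copy()
    # S41: collect the negative mass, zero it out, then consume it from the
    # smallest positive counts upward so the overall noise scale is preserved.
    negative_sum = -counts[counts < 0].sum()
    counts[counts < 0] = 0.0
    for idx in np.argsort(counts):
        if negative_sum <= 0:
            break
        take = min(counts[idx], negative_sum)
        counts[idx] -= take
        negative_sum -= take
    # S42: rescale so the counts sum to the record number, then round by
    # keeping integer parts and adding the leftover to the largest cell.
    total = counts.sum()
    if total > 0:
        counts = counts / total * n_records
    integer = np.floor(counts)
    leftover = int(round(counts.sum() - integer.sum()))
    integer[np.argmax(integer)] += leftover
    return integer.astype(int)
```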
In step S5, the high-dimensional data is approximately synthesized from the privacy-processed low-dimensional distributions, and the server publishes the resulting synthetic data set, as follows:
S51, initializing the synthetic data set D_syn and synthesizing it according to the target distribution (i.e., the noisy distribution obtained in the preceding steps);
S52, for cells whose initial count value is smaller than the target distribution count value, adding min{c_t - c_s, α·c_s} records, where c_t denotes the target count value, c_s denotes the initial count value, and α denotes an attenuation factor computed from an initial value α_0, a decay rate k, the iteration number t, and a step size s;
S53, for cells whose initial count value is larger than the target distribution count value, removing min{c_s - c_t, β·c_s} records, where likewise c_t denotes the target count value, c_s denotes the initial count value, and β is computed analogously.
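The sketch below illustrates the record-adjustment idea of steps S52 and S53 for a single marginal, with one loud assumption: the attenuation factors α and β are simply halved at each iteration, since the exact decay formulas involving α_0, k, t, and s from the disclosure are not reproduced here. Names and example numbers are illustrative.

```python
import numpy as np

def adjust_counts(current: np.ndarray, target: np.ndarray,
                  n_iters: int = 20, alpha0: float = 0.5) -> np.ndarray:
    """Iteratively move synthetic marginal counts toward the target (S52-S53).

    Assumption: alpha and beta share one schedule that halves each iteration;
    the disclosure defines them via alpha_0, a decay rate k, the iteration
    number t and a step size s.
    """
    counts = current.astype(float).copy()
    alpha = beta = alpha0
    for _ in range(n_iters):
        for i, (c_s, c_t) in enumerate(zip(counts, target)):
            if c_s < c_t:                     # S52: add records
                counts[i] += min(c_t - c_s, alpha * c_s)
            elif c_s > c_t:                   # S53: remove records
                counts[i] -= min(c_s - c_t, beta * c_s)
        alpha *= 0.5                          # assumed decay schedule
        beta *= 0.5
    return counts

# Example: start from a uniform guess and move toward a noisy target marginal.
start = np.full(4, 25.0)
target = np.array([40.0, 30.0, 20.0, 10.0])
print(adjust_counts(start, target))
```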
Examples
The proposed method was applied in experiments. The data set used is Adult, a data set from the UCI repository that records personal information of 45,222 users, including age, education level, and income. The invention obtains highly correlated low-dimensional distributions from the original data set and approximately synthesizes the high-dimensional data set from them. To measure the privacy-protection effect, different privacy budgets were set in the experiments, namely 0.4, 0.8, 1.2, 1.6, and 2.0; for each, the server ran the synthetic-data algorithm on the collected data to obtain a synthetic data set for publication.
The server evaluates the synthetic data set by SVM classification results and by the errors of the k-way marginals of the original and synthetic data sets. Experimental results of the maximum-weight-matching-based differential privacy synthetic data publishing method on this data set are shown in Tables 1 and 2. To ensure reliable SVM classification results and avoid the influence of randomness, five-fold cross-validation is performed on the synthetic data set and accuracy (ACC) is used as the evaluation criterion. For the k-way marginal error experiments, different values of k are verified, with k = 2, 3, 4; for each k, 400 different marginals are selected at random, the L1 errors between the corresponding marginals of the two data sets are computed, and the average is taken to obtain the k-way marginal error of the synthetic data set versus the original data set under the different privacy budgets.
TABLE 1 SVM Classification experiment results under different privacy budgets
TABLE 2 error results for k-way margin at different privacy budgets
As can be seen from Table 1, the accuracy does not vary much as the privacy budget changes, but overall it increases as the privacy budget grows and reaches more than 0.7, so good classification results are ensured. Although the results of the five experiments differ slightly, the fluctuation is small. Table 2 reports the k-way marginal errors of the synthetic and original data sets under different privacy budgets. The results show that the 2-way marginal error is within 0.1, a very small difference from the 2-way marginals of the original data set; the 3-way and 4-way marginal errors are both within 0.25, which is relatively accurate; and overall the error decreases as the privacy budget increases.
Claims (6)
1. A differential privacy synthetic data publishing method based on maximum weight matching, characterized in that the method obtains a low-dimensional marginal set by applying a maximum weight matching algorithm to a probability graph model constructed from an original data set, adds appropriate noise to the low-dimensional distributions based on the differential privacy method, performs data synthesis through post-processing, and publishes the synthetic data set, thereby achieving privacy protection of user information and improving the utility of the synthetic data set;
the method comprises the following processing steps:
S1, a server aggregates the collected user data to obtain an initial data set, and a weighted probability graph model is built from the data set;
S2, the server applies a maximum weight matching algorithm to the generated probability graph model to obtain a set of highly correlated low-dimensional marginals;
S3, appropriate Gaussian noise is added to the low-dimensional marginal set according to the definition of the differential privacy model and a reasonable allocation of the privacy budget;
S4, the noisy probability data are post-processed to handle negative counts and inconsistency;
S5, the high-dimensional data set is synthesized from the noised and post-processed low-dimensional marginal set so as to approximate the original data set in its statistical information, and finally the server publishes the synthetic data set.
2. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein the data aggregation in step S1 includes calculating a correlation coefficient according to the correlation between attributes, the calculation process being as follows:
given the data set D, an attribute graph G(V, E) is generated, where V = {V_1, V_2, ..., V_d} denotes the attribute nodes; the weight of the edge connecting two nodes in graph G(V, E) is the correlation coefficient of the attributes represented by those two nodes, and the correlation coefficient is calculated from the following quantities: V_i and V_j denote the i-th and j-th attribute nodes, Pr[V_i, V_j] denotes the joint probability of attributes V_i and V_j, and Pr[V_i] and Pr[V_j] denote the probabilities of attributes V_i and V_j, respectively.
3. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein the step S2 of solving the low-dimensional marginal set comprises the following procedures:
s21, initializing and selecting a set M of low-dimensional margins as an empty set;
S22, selecting the edge with the largest weight value in the probability graph model G, indicating that the two attribute nodes it connects are highly correlated, taking that attribute-node pair as a selected low-dimensional marginal m_i, adding it to the set M, and deleting all edges connected to those attribute nodes in graph G;
S23, repeating step S22 until no attribute nodes remain in the attribute graph G, thereby obtaining the maximum-weight low-dimensional marginals.
4. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein in step S3 appropriate noise is added to the low-dimensional marginal set according to the differential privacy definition, calculated specifically as follows:
S31, allocating the privacy budget: different budget proportions are obtained by minimizing the expected squared error of the noise scales, so that appropriate noise is added to each low-dimensional distribution; the optimization is subject to
q_1 + q_2 + ... + q_k = 1 and 0 ≤ q_i, i = 1, 2, ..., k,
where the objective involves the domain sizes of the k low-dimensional marginals, and {q_1, q_2, ..., q_k} are the privacy-budget proportions of the corresponding views;
S32, (ε, δ)-differential privacy definition: let A be a random mechanism and S_A a set of possible outputs of A; for ε > 0 and δ ≥ 0, the mechanism A is said to satisfy (ε, δ)-differential privacy if, for any two adjacent data sets D and D', the following inequality holds:
Pr[A(D) ∈ S_A] ≤ e^ε · Pr[A(D') ∈ S_A] + δ;
S33, zero-concentrated differential privacy (ρ-zCDP) definition: a random mechanism A is said to satisfy ρ-zCDP if, for any two adjacent data sets D and D' and all α ∈ (1, +∞), the following inequality holds:
D_α(A(D) || A(D')) ≤ ρα,
where D_α(A(D) || A(D')) is the α-Rényi divergence between the distributions of A(D) and A(D'), which characterizes the privacy-loss random variable;
S34, theorem: if a random mechanism A satisfies ρ-zCDP, then for any δ > 0 the mechanism A satisfies (ρ + 2·sqrt(ρ·ln(1/δ)), δ)-differential privacy;
S35, Gaussian mechanism definition: given a sensitive query f: X^n → R, for an input data set D the Gaussian mechanism A satisfies:
A(D) = f(D) + N(0, σ²),
where σ is the noise scale, which can be derived from the above definitions, and Δ_f denotes the sensitivity of the sensitive query f;
S36, from the above definitions and theorem, the noise scale σ_i added to the i-th low-dimensional marginal is determined accordingly from its allocated budget proportion q_i and the sensitivity, for i = 1, ..., k.
5. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein the post-processing of the noised low-dimensional marginal set in step S4 comprises:
S41, non-negativity processing: the count of each cell in the noisy distribution must be non-negative, and negative-count cells are corrected to improve utility;
to preserve the noise scale, the negative counts are collected and summed as negative_sum, all negative-count cells are set to 0, the positive-count cells are sorted in ascending order, and the total negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0;
S42, normalization of the noisy count values: the counts are non-integer and their sum changes, so the sum is no longer equal to the number of records; normalization is applied to the count values to improve accuracy;
each count value is divided by the total count to obtain its proportion of the total (the proportions sum to 1); each proportion is multiplied by the number of records in the original data to obtain the final count value; the integer part and fractional part of each count value are separated, the fractional parts are summed, and, since fractional remainders may occur, the integer obtained from this sum is added to the cell with the largest integer part.
6. The differential privacy synthetic data publishing method based on maximum weight matching according to claim 1, wherein step S5 comprises:
S51, initializing the synthetic data set D_syn and synthesizing it according to the target distribution, namely the noisy distribution;
S52, for cells whose initial count value is smaller than the target distribution count value, adding min{c_t - c_s, α·c_s} records, where c_t denotes the target count value, c_s denotes the initial count value, and α denotes an attenuation factor computed from an initial value α_0, a decay rate k, the iteration number t, and a step size s;
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310060144.5A | 2023-01-15 | 2023-01-15 | Differential privacy synthetic data publishing method based on maximum weight matching |

Publications (1)

Publication Number | Publication Date |
---|---|
CN116167078A | 2023-05-26 |

Family ID: 86415848

Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118094635A | 2024-04-23 | 2024-05-28 | 国网智能电网研究院有限公司 | Privacy-protected data interaction relation diagram structure calculation method and system |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |