CN116167078A - Differential privacy synthetic data publishing method based on maximum weight matching - Google Patents


Info

Publication number
CN116167078A
Authority
CN
China
Prior art keywords
dimensional
low
data
count value
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310060144.5A
Other languages
Chinese (zh)
Inventor
张淼
邓海
叶欣欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202310060144.5A
Publication of CN116167078A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a differentially private synthetic data publishing method based on maximum weight matching. A trusted third-party server first preprocesses the collected user data. The attributes of the processed data are then represented with a graph model to obtain an attribute association graph, and a suitable set of low-dimensional marginals is selected with a maximum weight matching algorithm. Noise satisfying the definition of differential privacy is added to each low-dimensional marginal, and the noisy marginals are post-processed to obtain a set of normalized low-dimensional marginals. Data are synthesized from these marginals so that the synthetic data set is as close as possible to the original data set in its statistical information, and the synthetic data set is finally published. The method reduces computational complexity while preserving the utility of the synthetic data set, and it offers better utility on high-dimensional data.

Description

Differential privacy synthetic data publishing method based on maximum weight matching
Technical Field
The invention belongs to the field of information security, and particularly relates to a differential privacy synthetic data publishing method based on maximum weight matching.
Background
With the rapid development of information technology, online registration, online shopping, online travel booking and similar services have become part of daily life. Organizations such as hospitals and bus companies can easily obtain detailed information about their users, and many of them have accumulated large amounts of user data. Statistical analysis of these data can support subsequent tasks such as predictive analysis, which gives the data great research value. To meet the needs of research and innovation, organizations may publish the data they hold. Such data often contain personal private information, and if it is not properly protected, users' private data are easily disclosed, causing immeasurable loss. Research into privacy-preserving data publishing methods is therefore indispensable.
Methods for protecting data privacy are known as disclosure limitation. These techniques aim to protect sensitive information while publishing data so that researchers can analyze it statistically. A common way to protect private data in a published data set is anonymization, that is, deleting the sensitive information. Many studies have shown, however, that deleting sensitive information alone does not protect privacy effectively: an attacker can still recover sensitive information by identifying and analyzing the other attributes, so anonymization cannot provide a strong privacy guarantee for data publishing.
A reliable approach to privacy protection today is the differential privacy model. It makes no assumption about how much background knowledge an attacker has, and it protects privacy by adding appropriate noise to query or analysis results, which gives data publishing a reliable privacy guarantee. The model has a precise mathematical definition and supports rigorous proofs. Informally, adding or deleting a single record, which yields an adjacent data set, has little effect on the output of a computation over the data set, so the probability that an individual's information is identified from data processed under differential privacy is confined to a small range.
In recent years many schemes for publishing private data under the differential privacy model have been proposed, for example schemes based on the Haar wavelet transform, on histograms and on partitioning. These methods are designed for specific tasks, require the third-party central server to have some domain expertise, and make it difficult to exploit the data fully. Synthetic data publishing schemes based on differential privacy were proposed in response: the synthetic data set approximates the original data and replaces it in analysis tasks, which protects privacy while keeping the synthetic data set accurate. However, the existing differentially private synthetic data publishing schemes tend to have high computational complexity, and they add a large amount of noise to high-dimensional data sets, which makes the synthetic data set unusable.
Disclosure of Invention
The invention provides a differentially private synthetic data publishing method based on maximum weight matching. A maximum weight matching algorithm is applied to a probabilistic graphical model built from the original data set to obtain low-dimensional distributions; appropriate noise is added to these distributions under differential privacy; the noisy distributions are post-processed and used to synthesize data; and the synthetic data set is then published. The method protects the privacy of user information and improves the utility of the synthetic data set.
In order to achieve the above object, the present invention adopts the following technical scheme.
A differentially private synthetic data publishing method based on maximum weight matching, in which a maximum weight matching algorithm is applied to a probabilistic graphical model built from the original data set to obtain a set of low-dimensional marginals, appropriate noise is added to the low-dimensional distributions under differential privacy, data are synthesized after post-processing, and the synthetic data set is then published, thereby protecting the privacy of user information and improving the utility of the synthetic data set.
The method comprises the following steps:
S1. The server aggregates the collected user data to obtain an initial data set and builds a weighted probabilistic graphical model from the data set;
S2. The server applies a maximum weight matching algorithm to the generated probabilistic graphical model to obtain a set of highly correlated low-dimensional marginals;
S3. Appropriate Gaussian noise is added to the low-dimensional marginals according to the definition of the differential privacy model and a reasonable allocation of the privacy budget;
S4. Because adding noise can produce negative values and inconsistent data in the probability data, the noisy marginals are post-processed;
S5. The high-dimensional data set is synthesized from the noised and post-processed low-dimensional marginals so that it approximates the original data set in its statistical information, and the server finally publishes the synthetic data set.
In step S1, a correlation coefficient is calculated from the relationship between attributes and used as the edge weight of the probabilistic graph. Given a data set $D$, an attribute graph $G(V, E)$ is generated, where $V = \{V_1, V_2, \dots, V_d\}$ is the set of attribute nodes and $d$ is the number of attributes. The weight of the edge connecting two nodes in $G(V, E)$ is the correlation coefficient of the attributes those nodes represent, calculated as follows:
Figure BDA0004061109530000031
where $V_i$ and $V_j$ are the $i$-th and $j$-th attribute nodes, $\Pr[V_i, V_j]$ is the joint probability of attributes $V_i$ and $V_j$, and $\Pr[V_i]$ and $\Pr[V_j]$ are the probabilities of attribute $V_i$ and attribute $V_j$, respectively.
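The weighting in step S1 can be sketched as follows. The formula image is not reproduced in this text, so the sketch assumes that the correlation coefficient is the empirical mutual information, which is built from the same three quantities (the joint probability and the two marginal probabilities); that choice, and the names `mutual_information` and `build_attribute_graph`, are illustrative assumptions rather than the method's confirmed definition.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(col_a, col_b):
    """Empirical mutual information (in nats) between two categorical columns.

    ASSUMPTION: mutual information stands in for the correlation coefficient,
    whose exact formula is not reproduced in the text.
    """
    n = len(col_a)
    joint = Counter(zip(col_a, col_b))
    pa, pb = Counter(col_a), Counter(col_b)
    return sum(
        (c / n) * log((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )


def build_attribute_graph(records, attributes):
    """Return {(attr_i, attr_j): weight} for every unordered attribute pair."""
    columns = {a: [r[a] for r in records] for a in attributes}
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(attributes, 2)
    }


# Toy data: "age" and "income" move together, "city" is independent of both.
records = [
    {"age": "young", "income": "low", "city": "A"},
    {"age": "young", "income": "low", "city": "B"},
    {"age": "old", "income": "high", "city": "A"},
    {"age": "old", "income": "high", "city": "B"},
]
graph = build_attribute_graph(records, ["age", "income", "city"])
```

In this toy data the edge between "age" and "income" receives the largest weight, while the edges involving "city" receive weight 0.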
In step S2, applying the maximum weight matching algorithm to the constructed probabilistic graphical model comprises the following steps:
S21. Initialize the set $M$ of selected low-dimensional marginals as the empty set;
S22. Select the edge with the largest weight in the probabilistic graphical model $G$; the two attribute nodes it connects are highly correlated. Take this pair of attribute nodes as a selected low-dimensional marginal $m_i$, add it to $M$, and delete from $G$ all edges connected to these attribute nodes;
S23. Repeat step S22 until no attribute nodes remain in the attribute graph $G$, which yields the maximum-weight set of low-dimensional marginals.
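Steps S21 to S23 can be sketched as a greedy selection over a dictionary of edge weights. The function name `greedy_max_weight_matching` and the example weights are hypothetical; the loop stops when no edges remain, which is how this sketch interprets the stopping condition of S23.

```python
def greedy_max_weight_matching(edge_weights):
    """Repeatedly pick the heaviest remaining edge (S22) until none is left (S23)."""
    remaining = dict(edge_weights)
    selected = []  # S21: the set M starts empty
    while remaining:
        (u, v), _ = max(remaining.items(), key=lambda kv: kv[1])
        selected.append((u, v))
        # Delete every edge that touches either endpoint of the chosen pair.
        remaining = {
            edge: w for edge, w in remaining.items()
            if u not in edge and v not in edge
        }
    return selected


# Hypothetical correlation weights between four attributes.
weights = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.5,
    ("b", "c"): 0.8,
    ("b", "d"): 0.4,
    ("c", "d"): 0.7,
}
marginals = greedy_max_weight_matching(weights)
```

A greedy pass of this kind is simple and fast, but it is not guaranteed to return the matching of largest total weight on every graph. An exact matching could instead be computed with a library routine such as NetworkX's `max_weight_matching`. An attribute left without a partner is simply not paired in this sketch.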
In step S3, allocating the privacy budget and adding noise to the low-dimensional marginals comprises the following steps:
S31. Privacy budget allocation. The share of the privacy budget given to each marginal is obtained by optimizing the expected squared error associated with the noise scale, so that appropriate noise is added to each low-dimensional distribution. The optimization problem can be expressed as:
Figure BDA0004061109530000032
s.t. $q_1 + q_2 + \dots + q_k = 1$ and $0 \le q_i$, $i = 1, 2, \dots, k$
where
Figure BDA0004061109530000033
denotes the domain sizes of the $k$ low-dimensional marginals, and $\{q_1, q_2, \dots, q_k\}$ are the privacy budget shares of the corresponding views.
S32. Definition of $(\varepsilon, \delta)$-differential privacy: let $A$ be a randomized mechanism and let $S_A$ be any set of possible outputs of $A$. For $\varepsilon > 0$ and $\delta \ge 0$, $A$ satisfies $(\varepsilon, \delta)$-differential privacy if the following inequality holds for any two adjacent data sets $D$ and $D'$:
$\Pr[A(D) \in S_A] \le e^{\varepsilon} \Pr[A(D') \in S_A] + \delta$
S33. Definition of zero-concentrated differential privacy ($\rho$-zCDP): a randomized mechanism $A$ satisfies $\rho$-zCDP if, for any two adjacent data sets $D$ and $D'$ and all $\alpha \in (1, +\infty)$, the following inequality holds:
Figure BDA0004061109530000041
where $D_\alpha(A(D) \,\|\, A(D'))$ is the $\alpha$-Rényi divergence between the distributions of $A(D)$ and $A(D')$, which characterizes the privacy loss random variable.
S34. Theorem: if a randomized mechanism $A$ satisfies $\rho$-zCDP, then for any $\delta > 0$, $A$ satisfies
Figure BDA0004061109530000042
-differential privacy.
S35. Definition of the Gaussian mechanism: given a sensitive query $f: X^n \to \mathbb{R}$ and an input data set $D$, the Gaussian mechanism $A$ satisfies:
$A(D) = f(D) + N(0, \sigma^2)$
where $\sigma$ is the noise scale, which by definition is given by
Figure BDA0004061109530000043
and $\Delta_f$ denotes the sensitivity of the sensitive query $f$.
S36. From the above definitions and theorem, the scale of the added noise is computed as:
Figure BDA0004061109530000044
for $i = 1, \dots, k$
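For reference, the standard forms of the zCDP definition, of its conversion to $(\varepsilon, \delta)$-differential privacy, and of the zCDP guarantee of the Gaussian mechanism, as they appear in the differential privacy literature, are given below. The equation images in S33 to S35 are not reproduced in this text, so whether they match these expressions term for term is an assumption.

```latex
\begin{align*}
D_\alpha\bigl(A(D)\,\|\,A(D')\bigr) &\le \rho\,\alpha
  \quad \text{for all } \alpha \in (1, +\infty)
  && \text{($\rho$-zCDP)} \\
\rho\text{-zCDP} &\implies
  \bigl(\rho + 2\sqrt{\rho \ln(1/\delta)},\ \delta\bigr)\text{-differential privacy}
  && \text{for any } \delta > 0 \\
A(D) = f(D) + N(0, \sigma^2) &\ \text{satisfies}\
  \frac{\Delta_f^2}{2\sigma^2}\text{-zCDP}
  && \text{(Gaussian mechanism)}
\end{align*}
```

Under the third line, a target of $\rho$-zCDP for a single query corresponds to a noise scale of $\sigma = \Delta_f / \sqrt{2\rho}$.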
In step S4, the counts become fractional and may be negative after noise is added, so the noisy low-dimensional distributions must be post-processed as follows:
S41. Non-negativity processing: the count of every cell in a noisy distribution must be non-negative, so cells with negative counts are corrected, which improves utility.
To preserve the noise scale, the negative counts are summed into negative_sum and every cell with a negative count is set to 0. The cells with positive counts are then sorted in ascending order, and negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0.
S42. Normalization: after noise is added the counts are no longer integers and their sum changes, so the sum no longer equals the number of records. The counts are therefore normalized to improve accuracy.
Each count is divided by the total count to obtain its proportion of the total; these normalized proportions sum to 1. Each proportion is multiplied by the number of records in the original data to obtain the final count. Because the final counts may be fractional, the integer and fractional parts of each count are separated, the fractional parts are summed, and the resulting integer is added to the largest integer part.
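The non-negativity step S41 can be sketched as below. The function name `consume_negative_mass` and the example counts are hypothetical; `negative_sum` is stored here as a positive magnitude, and each positive cell gives up at most its own count.

```python
def consume_negative_mass(noisy_counts):
    """Zero out negative cells and subtract their total from the smallest positive cells."""
    counts = list(noisy_counts)
    # Magnitude of all negative counts taken together.
    negative_sum = -sum(c for c in counts if c < 0)
    counts = [max(c, 0.0) for c in counts]
    # Visit positive cells in ascending order of their count.
    positive_cells = sorted(
        (i for i, c in enumerate(counts) if c > 0), key=lambda i: counts[i]
    )
    for i in positive_cells:
        if negative_sum <= 0:
            break
        taken = min(counts[i], negative_sum)
        counts[i] -= taken
        negative_sum -= taken
    return counts


fixed = consume_negative_mass([5.0, -1.5, 0.5, 3.0, -0.5])
```

In this example the total of the counts is the same before and after the correction (6.5), which is the property the consumption rule is designed to keep whenever the positive cells can absorb the negative mass.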
In step S5, synthesizing the data set from the noisy low-dimensional distributions comprises the following steps:
S51. Initialize a synthetic data set $D_{syn}$ and synthesize according to the target distributions (that is, the noisy distributions obtained in the preceding steps);
S52. For a cell whose initial count is smaller than its count in the target distribution, add $\min\{c_t - c_s, \alpha c_s\}$ records, where $c_t$ is the target count, $c_s$ is the initial count, and $\alpha$ is the decay factor, computed as
Figure BDA0004061109530000051
where $\alpha_0$ is the initial value, $k$ the decay rate, $t$ the iteration number, and $s$ the step size.
S53. For a cell whose initial count is larger than its count in the target distribution, remove $\min\{c_s - c_t, \beta c_s\}$ records, where again $c_t$ is the target count and $c_s$ is the initial count, and $\beta$ is computed as
Figure BDA0004061109530000052
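The update rules of S52 and S53 can be sketched on cell counts as follows. The formulas for $\alpha$ and $\beta$ are not reproduced in this text, so the sketch takes both as caller-supplied parameters. It also treats $c_s$ as the cell's current count in the synthetic data and works on counts rather than on actual records, both of which are simplifying assumptions. The names `adjust_cell` and `update_counts` are hypothetical.

```python
def adjust_cell(current, target, alpha, beta):
    """Signed number of records to add (positive) or remove (negative) for one cell."""
    if current < target:
        # S52: add min{c_t - c_s, alpha * c_s} records.
        return min(target - current, alpha * current)
    if current > target:
        # S53: remove min{c_s - c_t, beta * c_s} records.
        return -min(current - target, beta * current)
    return 0


def update_counts(current_counts, target_counts, alpha, beta):
    """Apply one round of the S52/S53 adjustment to every cell."""
    return [
        c + adjust_cell(c, t, alpha, beta)
        for c, t in zip(current_counts, target_counts)
    ]


# alpha and beta are illustrative values, not taken from the source.
step = update_counts([10, 40, 25], [30, 20, 25], alpha=0.5, beta=0.25)
```

With the rule exactly as written, a cell whose current count is 0 never grows, because $\alpha c_s = 0$; how the method handles that case is not stated in the text.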
Beneficial effects: compared with the prior art, the invention represents substantial progress and has notable features. Applying the maximum weight matching algorithm yields a set of highly correlated low-dimensional marginals that maximizes the global correlation score, which preserves the utility of the synthetic data set, reduces computational complexity, and gives better utility on high-dimensional data.
Drawings
FIG. 1 is a schematic flow chart of an example of the invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The invention provides a differentially private data publishing method based on maximum weight matching; its implementation steps are described below with reference to FIG. 1.
In step S1, the server aggregates the collected data to obtain an initial data set and builds a weighted probabilistic graphical model from it. Given a data set $D$, an attribute graph $G(V, E)$ is generated, where $V = \{V_1, V_2, \dots, V_d\}$ is the set of attribute nodes and $d$ is the number of attributes. The weight of the edge connecting two nodes in $G(V, E)$ is the correlation coefficient of the attributes those nodes represent, calculated as follows:
Figure BDA0004061109530000053
where $V_i$ and $V_j$ are the $i$-th and $j$-th attribute nodes, $\Pr[V_i, V_j]$ is the joint probability of attributes $V_i$ and $V_j$, and $\Pr[V_i]$ and $\Pr[V_j]$ are the probabilities of attribute $V_i$ and attribute $V_j$, respectively.
In step S2, the server applies the maximum weight matching algorithm to the generated probabilistic graphical model to obtain highly correlated low-dimensional marginals, as follows:
S21. Initialize the set $M$ of selected low-dimensional marginals as the empty set;
S22. Select the edge with the largest weight in the probabilistic graphical model $G$; the two attribute nodes it connects are highly correlated. Take this pair of attribute nodes as a selected low-dimensional marginal $m_i$, add it to $M$, and delete from $G$ all edges connected to these attribute nodes;
S23. Repeat step S22 until no attribute nodes remain in the attribute graph $G$, which yields the maximum-weight set of low-dimensional marginals.
In step S3, appropriate Gaussian noise is added to the low-dimensional marginals according to the differential privacy model and a reasonable allocation of the privacy budget, as follows:
S31. Privacy budget allocation. The share of the privacy budget given to each marginal is obtained by optimizing the expected squared error associated with the noise scale, so that appropriate noise is added to each low-dimensional distribution. The optimization problem can be expressed as:
Figure BDA0004061109530000061
s.t. $q_1 + q_2 + \dots + q_k = 1$ and $0 \le q_i$, $i = 1, 2, \dots, k$
where
Figure BDA0004061109530000062
denotes the domain sizes of the $k$ low-dimensional marginals, and $\{q_1, q_2, \dots, q_k\}$ are the privacy budget shares of the corresponding views.
S32. Definition of $(\varepsilon, \delta)$-differential privacy: let $A$ be a randomized mechanism and let $S_A$ be any set of possible outputs of $A$. For $\varepsilon > 0$ and $\delta \ge 0$, $A$ satisfies $(\varepsilon, \delta)$-differential privacy if the following inequality holds for any two adjacent data sets $D$ and $D'$:
$\Pr[A(D) \in S_A] \le e^{\varepsilon} \Pr[A(D') \in S_A] + \delta$
S33. Definition of zero-concentrated differential privacy ($\rho$-zCDP): a randomized mechanism $A$ satisfies $\rho$-zCDP if, for any two adjacent data sets $D$ and $D'$ and all $\alpha \in (1, +\infty)$, the following inequality holds:
Figure BDA0004061109530000063
where $D_\alpha(A(D) \,\|\, A(D'))$ is the $\alpha$-Rényi divergence between the distributions of $A(D)$ and $A(D')$, which characterizes the privacy loss random variable.
S34. Theorem: if a randomized mechanism $A$ satisfies $\rho$-zCDP, then for any $\delta > 0$, $A$ satisfies
Figure BDA0004061109530000071
-differential privacy.
S35. Definition of the Gaussian mechanism: given a sensitive query $f: X^n \to \mathbb{R}$ and an input data set $D$, the Gaussian mechanism $A$ satisfies:
$A(D) = f(D) + N(0, \sigma^2)$
where $\sigma$ is the noise scale, which by definition is given by
Figure BDA0004061109530000072
and $\Delta_f$ denotes the sensitivity of the sensitive query $f$.
S36. From the above definitions and theorem, the scale of the added noise is computed as:
Figure BDA0004061109530000073
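The noise addition of S35 and S36 can be sketched as follows. Because the noise-scale formula image is not reproduced in this text, the sketch assumes the standard zCDP calibration of the Gaussian mechanism, $\sigma = \Delta_f / \sqrt{2\rho}$, and assumes that `rho` is the portion of the total budget already assigned to one marginal. The function names, the sensitivity of 1.0 and the example counts are all hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, rho):
    """Noise scale for which the Gaussian mechanism satisfies rho-zCDP.

    ASSUMPTION: standard calibration sigma = sensitivity / sqrt(2 * rho);
    the source's own formula is not reproduced in the text.
    """
    return sensitivity / math.sqrt(2.0 * rho)


def add_gaussian_noise(counts, sensitivity, rho, rng=None):
    """Add independent N(0, sigma^2) noise to every cell count of one marginal."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(sensitivity, rho)
    return [c + rng.gauss(0.0, sigma) for c in counts]


# Seeded generator only so that the example is repeatable.
noisy = add_gaussian_noise(
    [120, 45, 0, 7], sensitivity=1.0, rho=0.5, rng=random.Random(0)
)
```

The noisy counts are real-valued and can be negative, which is why the post-processing of step S4 follows.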
In step S4, the noisy low-dimensional distributions are post-processed as follows:
S41. Non-negativity processing: the count of every cell in a noisy distribution must be non-negative, so cells with negative counts are corrected, which improves utility.
To preserve the noise scale, the negative counts are summed into negative_sum and every cell with a negative count is set to 0. The cells with positive counts are then sorted in ascending order, and negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0.
S42. Normalization: after noise is added the counts are no longer integers and their sum changes, so the sum no longer equals the number of records. The counts are therefore normalized to improve accuracy.
Each count is divided by the total count to obtain its proportion of the total; these normalized proportions sum to 1. Each proportion is multiplied by the number of records in the original data to obtain the final count. Because the final counts may be fractional, the integer and fractional parts of each count are separated, the fractional parts are summed, and the resulting integer is added to the largest integer part.
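The normalization step S42 can be sketched as below, assuming the counts are already non-negative and their total is positive. The function name `normalize_to_record_count` and the example values are hypothetical.

```python
import math


def normalize_to_record_count(counts, n_records):
    """Rescale counts to sum to n_records and return integer counts."""
    total = sum(counts)
    # Proportion of each cell times the number of original records.
    scaled = [c / total * n_records for c in counts]
    integers = [math.floor(x) for x in scaled]
    # Sum of the fractional parts, rounded to absorb floating-point error.
    leftover = round(sum(x - i for x, i in zip(scaled, integers)))
    # The leftover goes to the cell with the largest integer part.
    integers[integers.index(max(integers))] += leftover
    return integers


final = normalize_to_record_count([5.0, 0.0, 0.0, 1.5, 0.0], n_records=10)
```

Assigning the whole leftover to the largest cell keeps the relative change small, since that cell is the least sensitive to a shift of a few records.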
In step S5, the high-dimensional data are approximately synthesized from the privacy-processed low-dimensional distributions, and the server publishes the resulting synthetic data set, as follows:
S51. Initialize a synthetic data set $D_{syn}$ and synthesize according to the target distributions (that is, the noisy distributions obtained in the preceding steps);
S52. For a cell whose initial count is smaller than its count in the target distribution, add $\min\{c_t - c_s, \alpha c_s\}$ records, where $c_t$ is the target count, $c_s$ is the initial count, and $\alpha$ is the decay factor, computed as
Figure BDA0004061109530000081
where $\alpha_0$ is the initial value, $k$ the decay rate, $t$ the iteration number, and $s$ the step size.
S53. For a cell whose initial count is larger than its count in the target distribution, remove $\min\{c_s - c_t, \beta c_s\}$ records, where again $c_t$ is the target count and $c_s$ is the initial count, and $\beta$ is computed as
Figure BDA0004061109530000082
Examples
The method of the invention was applied in an experiment. The data set used is Adult, a data set from the UCI repository, which records personal information for 45222 users, including age, education level and wage. The invention derives highly correlated low-dimensional distributions from the original data set and uses them to approximately synthesize the high-dimensional data set. To measure the privacy protection effect, the experiment was run under different privacy budgets, set to 0.4, 0.8, 1.2, 1.6 and 2.0. In each run the server executes the data synthesis algorithm on the collected data set to obtain the synthetic data set for publication.
The server evaluates the synthetic data set by SVM classification results and by the errors of the k-way marginals between the original and synthetic data sets. The results of the maximum-weight-matching-based differentially private synthetic data publishing method on this data set are shown in Tables 1 and 2. To make the SVM classification results reliable and limit the influence of randomness, five-fold cross-validation is performed on the synthetic data set, with accuracy (ACC) as the evaluation metric. For the k-way marginal error experiment, k = 2, 3, 4 are used. For each k, 400 different marginals are selected at random, the L1 error between the original and synthetic marginal is computed for each, and the errors are averaged to give the k-way marginal error between the synthetic and original data sets under each privacy budget.
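The L1 error of a single k-way marginal can be sketched as follows. The sketch assumes each marginal is normalized to a probability distribution before the comparison, which the text does not state explicitly; the function names and the toy records are hypothetical.

```python
from collections import Counter


def marginal(records, attrs):
    """Normalized k-way marginal of the records over the given attributes."""
    n = len(records)
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    return {cell: c / n for cell, c in counts.items()}


def l1_error(original, synthetic, attrs):
    """Sum of absolute differences between the two marginals over all cells."""
    p, q = marginal(original, attrs), marginal(synthetic, attrs)
    return sum(abs(p.get(cell, 0.0) - q.get(cell, 0.0)) for cell in set(p) | set(q))


original = [
    {"age": "young", "income": "low"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
    {"age": "old", "income": "high"},
]
synthetic = [
    {"age": "young", "income": "low"},
    {"age": "young", "income": "low"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
]
error = l1_error(original, synthetic, ["age", "income"])
```

Averaging this quantity over the randomly chosen attribute subsets of size k gives the per-budget figure of the kind reported in Table 2.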
TABLE 1 SVM Classification experiment results under different privacy budgets
Figure BDA0004061109530000083
Figure BDA0004061109530000091
TABLE 2 error results for k-way margin at different privacy budgets
Figure BDA0004061109530000092
As Table 1 shows, accuracy changes little as the privacy budget varies, but overall it increases with the privacy budget and reaches more than 0.7, so a good classification result can be obtained. The results of the 5 experiments differ slightly, but the fluctuation is small. Table 2 reports the k-way marginal errors between the synthetic and original data sets under different privacy budgets. The 2-way marginal error is within 0.1, a very small difference from the 2-way marginals of the original data set; the 3-way and 4-way marginal errors are both within 0.25, which is relatively accurate; and overall the error decreases as the privacy budget increases.

Claims (6)

1. A differentially private synthetic data publishing method based on maximum weight matching, characterized in that a maximum weight matching algorithm is applied to a probabilistic graphical model built from an original data set to obtain a set of low-dimensional marginals, appropriate noise is added to the low-dimensional distributions under differential privacy, data are synthesized after post-processing, and the synthetic data set is published, thereby protecting the privacy of user information and improving the utility of the synthetic data set;
the method comprises the following processing steps:
S1. the server aggregates the collected user data to obtain an initial data set and builds a weighted probabilistic graphical model from the data set;
S2. the server applies a maximum weight matching algorithm to the generated probabilistic graphical model to obtain a set of highly correlated low-dimensional marginals;
S3. appropriate Gaussian noise is added to the low-dimensional marginals according to the definition of the differential privacy model and a reasonable allocation of the privacy budget;
S4. the negative values and data inconsistency that appear in the probability data after noise is added are handled by post-processing;
S5. the high-dimensional data set is synthesized from the noised and post-processed low-dimensional marginals so that it approximates the original data set in its statistical information, and the server finally publishes the synthetic data set.
2. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein the aggregation of data in step S1 includes calculating a correlation coefficient from the correlation between attributes, as follows:
given a data set $D$, an attribute graph $G(V, E)$ is generated, where $V = \{V_1, V_2, \dots, V_d\}$ is the set of attribute nodes; the weight of the edge connecting two nodes in $G(V, E)$ is the correlation coefficient of the attributes those nodes represent, and the correlation coefficient is calculated as:
Figure FDA0004061109520000011
where $V_i$ and $V_j$ are the $i$-th and $j$-th attribute nodes, $\Pr[V_i, V_j]$ is the joint probability of attributes $V_i$ and $V_j$, and $\Pr[V_i]$ and $\Pr[V_j]$ are the probabilities of attribute $V_i$ and attribute $V_j$, respectively.
3. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein obtaining the set of low-dimensional marginals in step S2 comprises the following steps:
S21. initialize the set $M$ of selected low-dimensional marginals as the empty set;
S22. select the edge with the largest weight in the probabilistic graphical model $G$, the two attribute nodes it connects being highly correlated; take this pair of attribute nodes as a selected low-dimensional marginal $m_i$, add it to $M$, and delete from $G$ all edges connected to these attribute nodes;
S23. repeat step S22 until no attribute nodes remain in the attribute graph $G$, which yields the maximum-weight set of low-dimensional marginals.
4. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein step S3 adds appropriate noise to the set of low-dimensional marginals according to the definition of differential privacy, specifically as follows:
S31. privacy budget allocation: the share of the privacy budget given to each marginal is obtained by optimizing the expected squared error associated with the noise scale, so that appropriate noise is added to each low-dimensional distribution, the optimization problem being expressed as:
Figure FDA0004061109520000021
s.t. $q_1 + q_2 + \dots + q_k = 1$ and $0 \le q_i$, $i = 1, 2, \dots, k$
where
Figure FDA0004061109520000022
denotes the domain sizes of the $k$ low-dimensional marginals, and $\{q_1, q_2, \dots, q_k\}$ are the privacy budget shares of the corresponding views;
S32. definition of $(\varepsilon, \delta)$-differential privacy: let $A$ be a randomized mechanism and let $S_A$ be any set of possible outputs of $A$; for $\varepsilon > 0$ and $\delta \ge 0$, $A$ satisfies $(\varepsilon, \delta)$-differential privacy if the following inequality holds for any two adjacent data sets $D$ and $D'$:
$\Pr[A(D) \in S_A] \le e^{\varepsilon} \Pr[A(D') \in S_A] + \delta$
S33. definition of zero-concentrated differential privacy ($\rho$-zCDP): a randomized mechanism $A$ satisfies $\rho$-zCDP if, for any two adjacent data sets $D$ and $D'$ and all $\alpha \in (1, +\infty)$, the following inequality holds:
Figure FDA0004061109520000023
where $D_\alpha(A(D) \,\|\, A(D'))$ is the $\alpha$-Rényi divergence between the distributions of $A(D)$ and $A(D')$, which characterizes the privacy loss random variable;
S34. theorem: if a randomized mechanism $A$ satisfies $\rho$-zCDP, then for any $\delta > 0$, $A$ satisfies
Figure FDA0004061109520000024
-differential privacy;
S35. definition of the Gaussian mechanism: given a sensitive query $f: X^n \to \mathbb{R}$ and an input data set $D$, the Gaussian mechanism $A$ satisfies:
$A(D) = f(D) + N(0, \sigma^2)$
where $\sigma$ is the noise scale, which by definition is given by
Figure FDA0004061109520000031
and $\Delta_f$ denotes the sensitivity of the sensitive query $f$;
S36. from the above definitions and theorem, the scale of the added noise is computed as:
Figure FDA0004061109520000032
5. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein step S4 of post-processing the noise-added low-dimensional marginal set comprises:
S41, non-negativity processing: the count of each cell in a distribution must be non-negative, so cells with negative counts in the noisy distribution are corrected to improve utility;
to preserve the noise scale, the negative counts are summed into negative_sum, the cells with negative counts are set to 0, the cells with positive counts are sorted in ascending order, and negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0;
S42, normalization: after noise is added, the count values are non-integer and their sum changes, so that the sum no longer equals the number of records; the count values are therefore normalized to improve accuracy;
each count value is divided by the total count to obtain its proportion of the total, so that the normalized proportions sum to 1; each proportion is multiplied by the number of records in the original data to obtain the final count value; because the result may still be fractional, the integer part and the fractional part of each count value are separated, the fractional parts are summed, and the resulting integer is added to the cell with the largest integer part.
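The two post-processing steps of claim 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, a marginal is represented as a flat list of cell counts, and the leftover after flooring is computed as the record count minus the sum of integer parts, which equals the sum of the fractional parts when the record count is an integer.

```python
import math


def enforce_non_negative(counts):
    """S41 sketch: zero out negative cells and absorb their total
    magnitude from the smallest positive cells upward."""
    counts = [float(c) for c in counts]
    negative_sum = -sum(c for c in counts if c < 0)
    counts = [max(c, 0.0) for c in counts]
    # Visit cells in ascending order of their (clipped) count.
    for idx in sorted(range(len(counts)), key=counts.__getitem__):
        if negative_sum <= 0:
            break
        taken = min(counts[idx], negative_sum)
        counts[idx] -= taken
        negative_sum -= taken
    return counts


def normalize_counts(counts, n_records):
    """S42 sketch: rescale counts to sum to n_records, then make them
    integers by moving the summed fractional parts to the largest cell."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    scaled = [c * n_records / total for c in counts]
    int_parts = [math.floor(v) for v in scaled]
    # Equals the sum of fractional parts, without floating-point drift.
    leftover = n_records - sum(int_parts)
    largest = max(range(len(int_parts)), key=int_parts.__getitem__)
    int_parts[largest] += leftover
    return int_parts


if __name__ == "__main__":
    noisy = [5.0, -2.0, 1.0, 3.0, -1.5]
    print(enforce_non_negative(noisy))
    print(normalize_counts([2.0, 1.0, 1.0], 10))
```

In this sketch the non-negativity step leaves the total count unchanged whenever the positive cells can absorb negative_sum, and the normalization step always returns integers that sum to the record count.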
6. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein step S5 comprises:
S51, initializing a synthetic data set D_syn, which is synthesized according to the target distributions, namely the noisy distributions;
S52, for a cell whose initial count value is smaller than the target distribution count value, adding min{c_t - c_s, α·c_s} records, wherein c_t denotes the target count value, c_s denotes the initial count value, and α denotes the attenuation factor, calculated as
Figure FDA0004061109520000041
wherein α_0 denotes the initial value, k denotes the attenuation rate, t denotes the number of iterations, and s denotes the step size;
S53, for a cell whose initial count value is larger than the target distribution count value, removing min{c_s - c_t, β·c_s} records, wherein, likewise, c_t denotes the target count value, c_s denotes the initial count value, and β is calculated as
Figure FDA0004061109520000042
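The update amounts in S52 and S53 can be illustrated with a short Python sketch. The formulas for α and β are given only as images that are not reproduced in this text, so the sketch takes both factors as plain inputs instead of computing them; the function name and the signed return convention are hypothetical, and any rounding of the result to a whole number of records is left unspecified.

```python
def signed_adjustment(c_t, c_s, alpha, beta):
    """Number of records to add (positive) or remove (negative) for one cell.

    c_t: target count value, c_s: current count value in the synthetic data,
    alpha / beta: factors supplied by the caller (their formulas are assumed
    to be computed elsewhere).
    """
    if c_s < c_t:
        # S52: under-represented cell, add min{c_t - c_s, alpha * c_s} records.
        return min(c_t - c_s, alpha * c_s)
    if c_s > c_t:
        # S53: over-represented cell, remove min{c_s - c_t, beta * c_s} records.
        return -min(c_s - c_t, beta * c_s)
    return 0


if __name__ == "__main__":
    print(signed_adjustment(100, 60, 0.5, 0.5))
    print(signed_adjustment(60, 100, 0.5, 0.5))
```

The min{...} caps mean a cell never overshoots its target in a single step, while α and β limit how much of the current count may change per iteration.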
CN202310060144.5A 2023-01-15 2023-01-15 Differential privacy synthetic data publishing method based on maximum weight matching Pending CN116167078A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310060144.5A CN116167078A (en) 2023-01-15 2023-01-15 Differential privacy synthetic data publishing method based on maximum weight matching

Publications (1)

Publication Number Publication Date
CN116167078A true CN116167078A (en) 2023-05-26

Family

ID=86415848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310060144.5A Pending CN116167078A (en) 2023-01-15 2023-01-15 Differential privacy synthetic data publishing method based on maximum weight matching

Country Status (1)

Country Link
CN (1) CN116167078A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118094635A (en) * 2024-04-23 2024-05-28 国网智能电网研究院有限公司 Privacy-protected data interaction relation diagram structure calculation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination