CN116167078A

CN116167078A - Differential privacy synthetic data publishing method based on maximum weight matching

Info

Publication number: CN116167078A
Application number: CN202310060144.5A
Authority: CN
Inventors: 张淼; 邓海; 叶欣欣
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2023-01-15
Filing date: 2023-01-15
Publication date: 2023-05-26

Abstract

The invention discloses a differentially private synthetic data publishing method based on maximum weight matching. A trusted third-party server first preprocesses the collected user data. The attributes of the processed data are then represented with a graph model to obtain an attribute association graph, and a suitable set of low-dimensional marginals is selected with a maximum weight matching algorithm. Noise satisfying the definition of differential privacy is added to each low-dimensional marginal, and the noisy marginals are post-processed to obtain a set of normalized low-dimensional marginals. A data set is then synthesized from these marginals so that its statistical information is as close as possible to that of the original data set, and the synthesized data set is finally released. The method reduces computational complexity while preserving the utility of the synthesized data set, and offers better utility for high-dimensional data.

Description

Differential Privacy Synthetic Data Publishing Method Based on Maximum Weight Matching

Technical Field

The invention belongs to the field of information security, and in particular relates to a differentially private synthetic data publishing method based on maximum weight matching.

Background

With the rapid development of information technology, online appointment registration, online shopping, online travel services and the like have gradually become part of daily life. Organizations such as hospitals and bus terminals can easily obtain detailed information about their users and have accumulated large volumes of user data. Statistical analysis of this data provides effective support for downstream tasks such as predictive analysis, so the data has great research value. To meet the needs of research and innovation, organizations publish the data they hold, which often contains private information about individuals. If that private data is not properly protected, it can easily be leaked, causing incalculable losses. Research on privacy-preserving data publishing methods is therefore essential.

Methods for protecting data privacy are known as disclosure limitation. These techniques aim to protect sensitive information while releasing data to the public so that researchers can analyze it statistically. A common approach is anonymization, that is, deleting the sensitive information from the data. Many studies have shown, however, that deleting sensitive information alone does not protect privacy effectively: an attacker can still recover sensitive information by identifying and analyzing other attributes, so anonymization cannot provide a strong privacy guarantee for data publishing.

A reliable scheme for privacy protection today is the differential privacy model. The model makes no assumption about how much background knowledge the attacker has; it achieves privacy protection by adding an appropriate amount of noise to query or analysis results, which gives data publishing a reliable privacy guarantee. Differential privacy has a precise mathematical definition and rigorous proofs. Informally, the result computed on a data set is insensitive to adding or deleting any single record (that is, to switching to an adjacent data set), so after processing under the differential privacy model, the probability that an individual's information can be identified stays within a small bound.

In recent years many schemes for publishing private data under differential privacy have been proposed, for example methods based on the Haar wavelet transform, on histograms, and on partitioning. These methods are designed for specific tasks, however, and the third-party central server needs specialist knowledge to apply them, which makes it difficult to exploit the data fully. Synthetic data publishing schemes based on differential privacy were therefore proposed: the synthetic data set approximates the original data and replaces it in analysis tasks, which protects privacy while preserving accuracy. Current differentially private synthetic data publishing schemes, however, tend to have high computational complexity, and for high-dimensional data sets they add so much noise that the synthesized data set becomes unusable.

Summary of the Invention

The present invention provides a method for publishing data as differentially private synthetic data based on maximum weight matching. A maximum weight matching algorithm is applied to a probabilistic graphical model constructed from the original data set to obtain low-dimensional distributions; suitable noise is added to these distributions under differential privacy; the data is synthesized after post-processing; and the synthetic data set is then released. This protects the privacy of user information and improves the utility of the synthetic data set.

To achieve the above object, the present invention adopts the following technical solution.

A differentially private synthetic data publishing method based on maximum weight matching: a set of low-dimensional marginals is obtained by applying a maximum weight matching algorithm to a probabilistic graphical model constructed from the original data set; suitable noise is added to the low-dimensional distributions under differential privacy; the data is synthesized after post-processing; and the synthetic data set is then released, which protects the privacy of user information and improves the utility of the synthetic data set.

The method comprises the following steps:

S1. The server aggregates the collected user data to obtain an initial data set, and constructs a weighted probabilistic graphical model from the data set;

S2. The server applies the maximum weight matching algorithm to the generated probabilistic graphical model to obtain a set of highly correlated low-dimensional marginals;

S3. Suitable Gaussian noise is added to the set of low-dimensional marginals according to the definition of differential privacy and a reasonable allocation of the privacy budget;

S4. Because adding noise produces negative values and inconsistencies in the probability data, the noisy marginals are post-processed;

S5. The noised and post-processed low-dimensional marginals are used to synthesize a high-dimensional data set that approximates the original data set in its statistical information; finally, the server releases the synthetic data set.

In step S1, a correlation coefficient is calculated from the relationship between attributes and used as the edge weight of the probabilistic graph. Given a data set D, an attribute graph G(V,E) is generated, where V = {V₁, V₂, …, V_d} denotes the attribute nodes for the d attributes. The weight of the edge connecting two nodes in G(V,E) is the correlation coefficient of the attributes the two nodes represent, calculated as follows:

[Formula image not reproduced.]

where V_i and V_j denote the i-th and j-th attribute nodes, Pr[V_i, V_j] denotes the joint probability of attributes V_i and V_j, and Pr[V_i] and Pr[V_j] denote the probabilities of attribute V_i and attribute V_j respectively.
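As an illustration of how such a weighted attribute graph could be built, the following sketch computes one edge weight per attribute pair from the empirical joint and marginal probabilities. The correlation formula itself is not reproduced in this text, so the sketch assumes mutual information as the coefficient; that choice, and the function names `mutual_information` and `build_attribute_graph`, are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(records, i, j):
    """Empirical mutual information (in nats) between columns i and j.

    ASSUMPTION: mutual information stands in for the correlation
    coefficient, whose formula is not reproduced in the source text.
    """
    n = len(records)
    joint = Counter((r[i], r[j]) for r in records)   # Pr[V_i, V_j] * n
    marg_i = Counter(r[i] for r in records)          # Pr[V_i] * n
    marg_j = Counter(r[j] for r in records)          # Pr[V_j] * n
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((marg_i[a] / n) * (marg_j[b] / n)))
    return mi


def build_attribute_graph(records, d):
    """Return {(i, j): weight} for every attribute pair with i < j."""
    return {(i, j): mutual_information(records, i, j)
            for i, j in combinations(range(d), 2)}


# Columns 0 and 1 are identical; column 2 is independent of both.
records = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
graph = build_attribute_graph(records, d=3)
```

In this toy data the edge (0, 1) receives the largest weight, so it would be the first pair considered by the matching step.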

In step S2, applying the maximum weight matching algorithm to the constructed probabilistic graphical model comprises the following process:

S21. Initialize the set M of selected low-dimensional marginals as the empty set;

S22. Select the edge with the largest weight in the probabilistic graphical model G, which indicates that the two attribute nodes it connects are highly correlated; add this pair of attribute nodes to M as the selected low-dimensional marginal m_i, and delete the two attribute nodes and all edges incident to them from G;

S23. Repeat step S22 until no attribute nodes remain in the attribute graph G; the resulting set M gives the low-dimensional marginals of maximum weight.
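The selection loop in S21 to S23 can be sketched as follows. This is a minimal sketch that follows the steps literally; the function name `greedy_max_weight_matching` and the edge-dictionary input format are assumptions for illustration. The loop stops when no edges remain, and any attribute left unpaired (for example when d is odd) is returned separately.

```python
def greedy_max_weight_matching(nodes, weights):
    """Greedy pair selection following S21 to S23.

    nodes:   iterable of attribute identifiers
    weights: {(u, v): weight} for the edges of the attribute graph
    Returns (selected_pairs, unpaired_nodes).
    """
    remaining_nodes = set(nodes)
    remaining_edges = dict(weights)
    selected = []                                   # S21: M starts empty
    while remaining_edges:
        # S22: take the heaviest remaining edge as the next marginal.
        u, v = max(remaining_edges, key=remaining_edges.get)
        selected.append((u, v))
        remaining_nodes -= {u, v}
        # Delete both endpoints and every edge incident to them.
        remaining_edges = {e: w for e, w in remaining_edges.items()
                           if u not in e and v not in e}
    return selected, sorted(remaining_nodes)        # S23: loop ends


weights = {(0, 1): 0.9, (0, 2): 0.5, (1, 2): 0.4,
           (2, 3): 0.3, (1, 3): 0.2, (0, 3): 0.1}
pairs, unpaired = greedy_max_weight_matching(range(4), weights)
```

Picking the heaviest edge first is simple and fast. In general a greedy rule of this kind yields a matching whose total weight is at least half of the optimum rather than a guaranteed optimum, so an exact matching algorithm could be substituted if the strict maximum is required.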

In step S3, allocating the privacy budget and adding noise to the low-dimensional marginals comprises the following process:

S31. Allocate the privacy budget. The budget shares are obtained by minimizing the expected squared error induced by the noise scale, so that a suitable amount of noise is added to each low-dimensional distribution. The optimization problem can be expressed as:

[Objective function image not reproduced.]

s.t. q₁ + q₂ + … + q_k = 1 and 0 ≤ q_i, i = 1, 2, …, k

where [symbol image not reproduced] denotes the domain sizes of the k low-dimensional marginals, and {q₁, q₂, …, q_k} denotes the privacy budget shares of the corresponding views.

S32. Definition of (ε,δ)-differential privacy: let A be a randomized mechanism and let S_A be any subset of the set of all possible outputs of A. For any two adjacent data sets D and D', with ε > 0 and δ ≥ 0, the mechanism A satisfies (ε,δ)-differential privacy if the following inequality holds:

Pr[A(D) ∈ S_A] ≤ e^ε · Pr[A(D') ∈ S_A] + δ

S33. Definition of zero-concentrated differential privacy (ρ-zCDP): let A be a randomized mechanism. For any two adjacent data sets D and D' and for all α ∈ (1,∞), the mechanism A satisfies ρ-zCDP if the following inequality holds:

D_α(A(D) || A(D')) ≤ ρα

where D_α(A(D) || A(D')) is the α-Rényi divergence between the two distributions A(D) and A(D'), which characterizes the privacy loss random variable.

S34. Theorem: if a randomized mechanism A satisfies ρ-zCDP, then for any δ > 0, A satisfies (ρ + 2√(ρ·ln(1/δ)), δ)-differential privacy.

S35. Definition of the Gaussian mechanism: given a sensitive query f: Xⁿ → R, for an input data set D the Gaussian mechanism A satisfies the following equation:

A(D) = f(D) + N(0, σ²)

where σ is the noise scale, which by definition gives

[Formula image not reproduced.]

and Δ_f denotes the sensitivity of the sensitive query f.

S36. From the above definitions and theorem, the scale of the added noise is calculated as follows:

[Formula image not reproduced.]

for i = 1, …, k
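The noise step can be illustrated with standard zCDP facts rather than the source's own noise-scale formula, which is not reproduced in this text. The sketch below relies on three well-known results: ρ-zCDP implies (ρ + 2√(ρ·ln(1/δ)), δ)-differential privacy, a Gaussian mechanism with scale σ satisfies (Δ_f²/(2σ²))-zCDP, and zCDP budgets add under composition. It assumes each marginal is a list of cell counts with L2 sensitivity 1 (one record changes one cell by 1) and that the shares q_i are already given; all function names are hypothetical.

```python
import math
import random


def epsilon_to_rho(epsilon, delta):
    """Largest rho such that rho + 2*sqrt(rho*ln(1/delta)) <= epsilon."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(log_term + epsilon) - math.sqrt(log_term)) ** 2


def rho_to_epsilon(rho, delta):
    """Standard conversion from rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def noise_scale(sensitivity, rho_i):
    """Gaussian sigma for which one release satisfies rho_i-zCDP."""
    return sensitivity / math.sqrt(2.0 * rho_i)


def add_gaussian_noise(marginals, shares, rho, sensitivity=1.0, rng=None):
    """Add N(0, sigma_i^2) noise to every cell of marginal i.

    ASSUMPTION: marginal i receives the budget rho_i = q_i * rho, with
    the shares q_i summing to 1 so that the total cost is rho.
    """
    rng = rng or random.Random()
    noisy = []
    for counts, q_i in zip(marginals, shares):
        sigma_i = noise_scale(sensitivity, q_i * rho)
        noisy.append([c + rng.gauss(0.0, sigma_i) for c in counts])
    return noisy


rho = epsilon_to_rho(epsilon=1.2, delta=1e-5)
noisy = add_gaussian_noise([[10, 20], [5, 5, 5]], [0.5, 0.5], rho,
                           rng=random.Random(0))
```

Working in ρ rather than ε keeps the bookkeeping simple, because the per-marginal budgets sum directly to the total.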

In step S4, because adding noise turns the statistics into fractional values and may produce negative values, the noisy low-dimensional distributions must be post-processed. This comprises the following process:

S41. Non-negativity. The count of every cell in a noisy distribution must be non-negative; correcting the cells with negative counts improves utility.

To preserve the noise scale, the negative counts are collected as negative_sum and every cell with a negative count is set to 0. The cells with positive counts are sorted in ascending order, and the total negative count negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0.

S42. Normalization. After noise is added, the counts are no longer integers and their sum changes, so it no longer equals the number of records; normalization is therefore required to improve accuracy.

Each count is divided by the total count to obtain its proportion of the total, giving normalized values that sum to 1. Each proportion is then multiplied by the number of records in the original data to obtain the final count. Because the results may be fractional, the integer and fractional parts of each count are separated, the fractional parts are summed, and the resulting integer is added to the cell with the largest integer part.
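The two post-processing steps can be sketched as follows, under the assumption that a noisy marginal is a flat list of cell counts. The function names `enforce_non_negativity` and `normalize_to_record_count` are hypothetical. The sum of the fractional parts is computed as the record count minus the sum of the integer parts, which is the same quantity without floating-point drift.

```python
import math


def enforce_non_negativity(counts):
    """S41: zero out negative cells and absorb their mass elsewhere."""
    out = list(counts)
    negative_sum = -sum(c for c in out if c < 0)    # total deficit (>= 0)
    out = [0.0 if c < 0 else c for c in out]
    # Consume the deficit from the smallest positive cells first.
    positives = sorted((i for i, c in enumerate(out) if c > 0),
                       key=lambda i: out[i])
    for i in positives:
        if negative_sum <= 0:
            break
        taken = min(out[i], negative_sum)
        out[i] -= taken
        negative_sum -= taken
    return out


def normalize_to_record_count(counts, n_records):
    """S42: rescale to n_records and return integer counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    scaled = [c / total * n_records for c in counts]
    integer_parts = [math.floor(s) for s in scaled]
    # Equals the sum of the fractional parts of the scaled counts.
    remainder = n_records - sum(integer_parts)
    largest = max(range(len(integer_parts)), key=integer_parts.__getitem__)
    integer_parts[largest] += remainder
    return integer_parts


cleaned = enforce_non_negativity([-2.0, 1.0, 3.0, 5.0])
final = normalize_to_record_count(cleaned, n_records=10)
```

In this example the deficit of 2 is taken from the cells holding 1 and 3, so the total of 7 is preserved before rescaling to 10 records.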

In step S5, synthesizing the data set from the low-dimensional noisy distributions comprises the following process:

S51. Initialize a synthetic data set D_syn and synthesize toward the target distribution (that is, the noisy distribution obtained in the steps above);

S52. For a cell whose initial count is smaller than the target count, add min{c_t − c_s, αc_s} records, where c_t denotes the target count, c_s denotes the initial count, and α denotes the decay factor, calculated as

[Formula image not reproduced.]

where α₀ denotes the initial value, k the decay rate, t the number of iterations, and s the step size.

S53. For a cell whose initial count is larger than the target count, remove min{c_s − c_t, βc_s} records, where c_t again denotes the target count and c_s the initial count, and β is calculated as

[Formula image not reproduced.]
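The per-cell update rule in S52 and S53 can be sketched as below. The formulas for α and β are not reproduced in this text, so the sketch assumes a step-decay schedule α = α₀ · k^⌊t/s⌋ built from the named parameters and treats β as a value supplied by the caller; both choices, and the function names, are assumptions for illustration.

```python
def step_decay(alpha0, k, t, s):
    """ASSUMED schedule: alpha0 * k ** floor(t / s)."""
    return alpha0 * k ** (t // s)


def cell_adjustments(current, target, alpha, beta):
    """Signed number of records to add (+) or remove (-) per cell.

    current: counts c_s in the synthetic data set for one marginal
    target:  counts c_t in the noisy target marginal
    """
    deltas = []
    for c_s, c_t in zip(current, target):
        if c_s < c_t:
            # S52: add min{c_t - c_s, alpha * c_s} records.
            # Note: with c_s == 0 the rule as stated adds nothing.
            deltas.append(min(c_t - c_s, alpha * c_s))
        elif c_s > c_t:
            # S53: remove min{c_s - c_t, beta * c_s} records.
            deltas.append(-min(c_s - c_t, beta * c_s))
        else:
            deltas.append(0.0)
    return deltas


alpha = step_decay(alpha0=1.0, k=0.5, t=10, s=10)   # one decay step
deltas = cell_adjustments([10, 20, 5], [18, 10, 5], alpha=alpha, beta=0.25)
```

Capping each change at a fraction of the current count keeps one marginal from being fitted so aggressively in a single iteration that it undoes progress on the others.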

Beneficial effects: compared with the prior art, the substantive progress and distinguishing feature of the present invention is that a set of highly correlated low-dimensional marginals is obtained by applying the maximum weight matching algorithm, which maximizes the global correlation score. This reduces computational complexity while preserving the utility of the synthetic data set, and gives better utility for high-dimensional data.

Description of the Drawings

Figure 1 is a schematic flow diagram of an embodiment of the present invention.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawing.

The implementation steps of the differentially private data publishing method based on maximum weight matching provided by the present invention are set out below with reference to Figure 1.

In step S1, the server aggregates the collected data to obtain the initial data set and constructs a weighted probabilistic graphical model from it. Given a data set D, an attribute graph G(V,E) is generated, where V = {V₁, V₂, …, V_d} denotes the attribute nodes for the d attributes, and the weight of the edge connecting two nodes in G(V,E) is the correlation coefficient of the attributes the two nodes represent, calculated as described for step S1 above.

In step S2, the server applies the maximum weight matching algorithm to the generated probabilistic graphical model to obtain highly correlated low-dimensional marginals, following sub-steps S21 to S23 above.

In step S3, suitable Gaussian noise is added to the low-dimensional marginals according to the differential privacy model and a reasonable allocation of the privacy budget, following sub-steps S31 to S36 above.

In step S4, the noisy low-dimensional distributions are post-processed, following sub-steps S41 and S42 above.

In step S5, the privacy-processed low-dimensional distributions are used to approximately synthesize the high-dimensional data, and the server releases the resulting synthetic data set, following sub-steps S51 to S53 above.

Example

The method provided by the present invention was evaluated experimentally on Adult, a data set from the UCI repository that records personal information for 45,222 users, including age, education level, and salary. The invention derives highly correlated low-dimensional distributions from the original data set and uses them to approximately synthesize the high-dimensional data set. To measure the effect of privacy protection, the experiments were run with privacy budgets of 0.4, 0.8, 1.2, 1.6, and 2.0. In each experiment the server runs the synthesis algorithm on the collected data set and obtains a synthetic data set for release.

The server evaluates the synthetic data set by its SVM classification results and by the error between the k-way marginals of the original and synthetic data sets. The results of the method on this data set are shown in Table 1 and Table 2. To make the SVM classification results reliable and keep randomness from affecting them, the synthetic data set is evaluated with five-fold cross-validation, using accuracy (ACC) as the evaluation metric. For the k-way marginal error, k = 2, 3, and 4 are used. For each k, 400 different marginals are selected at random, their L1 errors are computed and averaged, giving the k-way marginal error between the synthetic and original data sets under each privacy budget.
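The k-way marginal error described above can be sketched as follows. This is a minimal sketch that assumes the L1 error is taken between normalized frequency tables (the source does not state the normalization) and that records are tuples of attribute values; the function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def marginal(records, attrs):
    """Normalized frequency table of the given attribute indices."""
    n = len(records)
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    return {cell: c / n for cell, c in counts.items()}


def l1_error(original, synthetic, attrs):
    """L1 distance between the two marginals over attrs."""
    p = marginal(original, attrs)
    q = marginal(synthetic, attrs)
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


def mean_kway_error(original, synthetic, d, k, n_samples=400, seed=0):
    """Average L1 error over randomly chosen distinct k-way marginals."""
    all_sets = list(combinations(range(d), k))
    rng = random.Random(seed)
    chosen = rng.sample(all_sets, min(n_samples, len(all_sets)))
    return sum(l1_error(original, synthetic, a) for a in chosen) / len(chosen)


original = [(0, 0), (1, 1)]
synthetic = [(0, 1), (1, 0)]
error_2way = mean_kway_error(original, synthetic, d=2, k=2)
```

The toy data sets agree on every 1-way marginal but share no 2-way cell, which is the kind of discrepancy the k-way metric is meant to expose.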

Table 1. SVM classification results under different privacy budgets

Table 2. k-way marginal error under different privacy budgets

Table 1 shows that the accuracy changes little as the privacy budget varies, although overall it increases with the privacy budget, and that it exceeds 0.7, which is sufficient for a good classification result. The five experimental results differ slightly, but the fluctuation is small. Table 2 reports the k-way marginal error between the synthetic and original data sets under different privacy budgets. The 2-way marginal error is within 0.1, so the synthetic 2-way marginals differ very little from those of the original data set, and the 3-way and 4-way marginal errors are both within 0.25. The accuracy is therefore relatively high, and overall the error decreases as the privacy budget increases.

Claims

1. A differentially private synthetic data publishing method based on maximum weight matching, characterized in that a set of low-dimensional marginals is obtained by applying a maximum weight matching algorithm to a probabilistic graphical model constructed from an original data set, suitable noise is added to the low-dimensional distributions based on differential privacy, data synthesis is performed after post-processing, and the synthetic data set is then released, thereby protecting the privacy of user information and improving the utility of the synthetic data set;

the method comprises the following steps:

S1, a server aggregates the collected user data to obtain an initial data set, and constructs a weighted probabilistic graphical model from the data set;

S2, the server applies a maximum weight matching algorithm to the generated probabilistic graphical model to obtain a set of highly correlated low-dimensional marginals;

S3, suitable Gaussian noise is added to the set of low-dimensional marginals according to the definition of differential privacy and a reasonable allocation of the privacy budget;

S4, the negative values and data inconsistencies that appear in the probability data after noise is added are handled by post-processing;

S5, the noised and post-processed set of low-dimensional marginals is used to synthesize a high-dimensional data set that approximates the original data set in its statistical information, and finally the server releases the synthetic data set.

2. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein the aggregation of data in step S1 includes calculating a correlation coefficient from the relationship between attributes, as follows:

given a data set D, an attribute graph G(V,E) is generated, where V = {V₁, V₂, …, V_d} denotes the attribute nodes for the d attributes, the weight of the edge connecting two nodes in G(V,E) is the correlation coefficient of the attributes the two nodes represent, and the correlation coefficient is calculated as:

[Formula image not reproduced.]

wherein V_i and V_j denote the i-th and j-th attribute nodes, Pr[V_i, V_j] denotes the joint probability of attributes V_i and V_j, and Pr[V_i] and Pr[V_j] denote the probabilities of attribute V_i and attribute V_j respectively.

3. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein obtaining the set of low-dimensional marginals in step S2 comprises the following process:

S21, initializing the set M of selected low-dimensional marginals as the empty set;

S22, selecting the edge with the largest weight in the probabilistic graphical model G, which indicates that the two attribute nodes it connects are highly correlated, adding this pair of attribute nodes to M as the selected low-dimensional marginal m_i, and deleting the two attribute nodes and all edges incident to them from G;

S23, repeating step S22 until no attribute nodes remain in the attribute graph G, thereby obtaining the low-dimensional marginals of maximum weight.

4. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein in step S3 suitable noise is added to the set of low-dimensional marginals according to the definition of differential privacy, specifically as follows:

S31, allocating the privacy budget: the budget shares are obtained by minimizing the expected squared error induced by the noise scale, so that suitable noise is added to each low-dimensional distribution, the optimization problem being expressed as:

[Objective function image not reproduced.]

s.t. q₁ + q₂ + … + q_k = 1 and 0 ≤ q_i, i = 1, 2, …, k

wherein [symbol image not reproduced] denotes the domain sizes of the k low-dimensional marginals, and {q₁, q₂, …, q_k} denotes the privacy budget shares of the corresponding views;

S32, definition of (ε,δ)-differential privacy: let A be a randomized mechanism and let S_A be any subset of the set of all possible outputs of A; for any two adjacent data sets D and D', with ε > 0 and δ ≥ 0, the mechanism A satisfies (ε,δ)-differential privacy if the following inequality holds:

Pr[A(D) ∈ S_A] ≤ e^ε · Pr[A(D') ∈ S_A] + δ

S33, definition of zero-concentrated differential privacy (ρ-zCDP): let A be a randomized mechanism; for any two adjacent data sets D and D' and for all α ∈ (1,∞), the mechanism A satisfies ρ-zCDP if the following inequality holds:

D_α(A(D) || A(D')) ≤ ρα

wherein D_α(A(D) || A(D')) is the α-Rényi divergence between the two distributions A(D) and A(D'), which characterizes the privacy loss random variable;

S34, theorem: if a randomized mechanism A satisfies ρ-zCDP, then for any δ > 0, A satisfies (ρ + 2√(ρ·ln(1/δ)), δ)-differential privacy;

S35, definition of the Gaussian mechanism: given a sensitive query f: Xⁿ → R, for an input data set D the Gaussian mechanism A satisfies the following equation:

A(D) = f(D) + N(0, σ²)

wherein σ is the noise scale, which by definition gives

[Formula image not reproduced.]

and Δ_f denotes the sensitivity of the sensitive query f;

S36, from the above definitions and theorem, the scale of the added noise is calculated as follows:

[Formula image not reproduced.]

5. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein post-processing the noisy set of low-dimensional marginals in step S4 comprises:

S41, non-negativity: the count of every cell in a noisy distribution must be non-negative, and the cells with negative counts are corrected to improve utility;

to preserve the noise scale, the negative counts are collected as negative_sum, every cell with a negative count is set to 0, the cells with positive counts are sorted in ascending order, and the total negative count negative_sum is consumed starting from the smallest positive count until negative_sum reaches 0;

S42, normalization: after noise is added the counts are no longer integers and their sum changes, so it no longer equals the number of records, and the counts are normalized to improve accuracy;

each count is divided by the total count to obtain its proportion of the total, giving normalized values that sum to 1; each proportion is multiplied by the number of records in the original data to obtain the final count; and, because the results may be fractional, the integer and fractional parts of each count are separated, the fractional parts are summed, and the resulting integer is added to the cell with the largest integer part.

6. The differentially private synthetic data publishing method based on maximum weight matching according to claim 1, wherein step S5 comprises:

S51, initializing a synthetic data set D_syn and synthesizing toward the target distribution, that is, the noisy distribution;

S52, for a cell whose initial count is smaller than the target count, adding min{c_t − c_s, αc_s} records, wherein c_t denotes the target count, c_s denotes the initial count, and α denotes the decay factor, calculated as

[Formula image not reproduced.]

wherein α₀ denotes the initial value, k the decay rate, t the number of iterations, and s the step size;

S53, for a cell whose initial count is larger than the target count, removing min{c_s − c_t, βc_s} records, wherein c_t again denotes the target count and c_s the initial count, and β is calculated as

[Formula image not reproduced.]