CN115098882A - Local differential privacy multidimensional data publishing method and system based on incremental learning - Google Patents
- Publication number
- CN115098882A CN115098882A CN202210699743.7A CN202210699743A CN115098882A CN 115098882 A CN115098882 A CN 115098882A CN 202210699743 A CN202210699743 A CN 202210699743A CN 115098882 A CN115098882 A CN 115098882A
- Authority
- CN
- China
- Prior art keywords
- attribute
- distribution
- data
- correlation
- junction tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The invention belongs to the field of data security and privacy protection, and provides an incremental-learning-based local differential privacy multidimensional data publishing method and system. The correlations of all attribute pairs are learned by aggregating a first batch of user perturbation data; a dependency graph model is constructed according to the attribute-pair correlations, and the constructed dependency graph model is converted by the junction tree algorithm into a junction tree model consisting of a plurality of cliques; based on a second batch of user data, the distribution of each clique is estimated with an estimation method chosen according to the number of attributes the clique contains and the clique size, obtaining the joint distribution of each clique in the junction tree model; and, according to the junction tree model and the joint distribution of each clique in it, a synthetic data set containing the same number of records is generated by a sampling-based data generation method and published.
Description
Technical Field
The invention belongs to the field of data security and privacy protection, and particularly relates to a local differential privacy multidimensional data publishing method and system based on incremental learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the multidimensional data publishing problem under localized differential privacy, each user individual holds a record containing multiple discrete attributes (continuous attributes are discretized into a fixed number of equal-width ranges), such as census data. In practice, a data analyst wants to be able to perform any type of data analysis or mining on a data set to extract the vast amount of underlying information behind the data and provide accurate and reliable predictions for populations and individuals. Therefore, the aggregation server needs to collect and publish the data owned by all user individuals.
However, such data often contains sensitive information about the individual user, and users do not want to share their actual data with any third-party data collector. Therefore, a multidimensional data publishing approach that satisfies localized differential privacy is needed.
Localized differential privacy is a strict and quantifiable privacy protection model. It does not rely on any third-party entity that declares itself trustworthy: privacy protection is provided for the real data of each user from the perspective of the user individual, so that even if the third-party aggregation server is malicious, the privacy of the user individual is guaranteed not to be revealed. In this model, a user locally perturbs his or her real data by adding noise of a certain scale and then uploads the perturbed data to the aggregation server. After receiving the perturbation data uploaded by all users, the aggregation server can only obtain some statistical information through calculation and cannot infer any personal sensitive information about the users from it.
Based on this model, existing work has proposed some solutions to this problem. In existing work, the aggregation server first collects the complete data of all users at once. In order to support all subsequent calculations while satisfying localized differential privacy, each user perturbs his or her whole record and sends the perturbed result to the aggregator for aggregation. The aggregation server aggregates the perturbation data uploaded by the users and, using an Expectation Maximization (EM) algorithm, provides any distribution information required for constructing the probabilistic graphical model and generating data, namely the joint distribution information of all attribute pairs and the distribution information of each clique in the junction tree.
However, the above method has the following technical problems:
(i) when constructing the probabilistic graph model, it is necessary to calculate the correlation of all the paired attributes to determine the structure of the dependency graph. However, for multi-dimensional data, there are a large number of attribute pairs. To satisfy localized differential privacy, directly computing the correlation of all these pairwise attributes results in a large amount of noise being injected into the results, which severely degrades the accuracy of the dependency graph structure and the synthesized data.
(ii) When the distribution of the cliques of the junction tree is estimated, some big cliques may contain too many attributes, so the high-dimensionality problem is still faced when calculating their distribution, and this case is not handled well.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the present invention provides an incremental-learning-based local differential privacy multidimensional data publishing method and system. A probabilistic graphical model is constructed by an incremental-learning-based method; the model is then used to generate a set of noisy low-dimensional distributions, which in turn approximate the overall distribution of the input data set so as to generate a synthetic data set. Compared with existing methods, the method can provide privacy protection for each user individual while significantly improving the accuracy of data publication.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a multidimensional data publishing method based on local differential privacy of incremental learning, which comprises the following steps:
learning the correlation of all attribute pairs by aggregating the first batch of user disturbance data;
constructing a dependency graph model according to the attribute-pair correlations, and converting the constructed dependency graph model into a junction tree model consisting of a plurality of cliques through the junction tree algorithm;
based on the second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes the clique contains and the clique size, to obtain the joint distribution of each clique in the junction tree model;
and generating, according to the junction tree model and the joint distribution of each clique in it, a synthetic data set containing the same number of records by a sampling-based data generation method, and publishing the data set.
A second aspect of the present invention provides a multidimensional data distribution system for local differential privacy based on incremental learning, comprising:
the correlation learning module is used for learning the correlation of all attribute pairs by aggregating the disturbance data of the first batch of users;
the junction tree model building module is used for constructing a dependency graph model according to the attribute-pair correlations and converting the constructed dependency graph model into a junction tree model consisting of a plurality of cliques through the junction tree algorithm;
the clique distribution calculation module is used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method chosen according to the number of attributes the clique contains and the clique size, to obtain the joint distribution of each clique in the junction tree model;
and the data publishing module is used for generating, according to the junction tree model and the joint distribution of each clique in it, a synthetic data set containing the same number of records by a sampling-based data generation method, and publishing the data set.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for multi-dimensional data distribution of local differential privacy based on incremental learning as described above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for multi-dimensional data distribution based on incremental learning-based local differential privacy as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the method adopts an incremental learning-based method to construct a probability graph model, gradually prunes weak-correlation attribute pairs, allocates more data and privacy budgets to useful attribute pairs to correctly identify the correlation among all attribute pairs in an attribute set, thereby being capable of more effectively constructing the probability graph model, then utilizes the probability graph model to generate a group of low-dimensional distributions with noise, and then uses the low-dimensional distributions to approximate the overall distribution of an input data set so as to generate a synthetic data set.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flowchart illustrating a multidimensional data publishing method based on incremental learning local differential privacy according to an embodiment of the present invention;
FIG. 2 is a Markov network according to an embodiment of the present invention;
FIG. 3 is a junction tree of the Markov network according to an embodiment of the present invention;
FIG. 4 is a graph comparing the algorithmic effects of the method of the present invention and existing methods.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In a localization scenario, each user individual has a piece of data containing multiple attributes. In order to complete the data distribution task, data of all user individuals needs to be collected for data distribution. However, the data often contains sensitive information about the individual user. Therefore, there is a need to address the issue of multidimensional data distribution that satisfies localized differential privacy.
In order to solve the first technical problem in the background art of the application, the invention gradually prunes the attribute pairs with weak correlation, and allocates more data and privacy budgets to the useful attribute pairs so as to correctly identify the correlation among all the attribute pairs in the attribute set, thereby being capable of more effectively constructing a probability graph model and remarkably improving the quality of a synthetic data set.
In order to solve the second technical problem in the background art of the present application, the present invention further provides a new large clique distribution calculation method based on joint distribution decomposition and redundancy elimination, which effectively solves the distribution estimation of the large cliques.
Interpretation of terms:
The invention relates to two key elements: localized differential privacy and the junction tree model.
These two elements are first described below; and then, a formalized definition of a multidimensional data publishing problem under a localized differential privacy scene is given.
1. Localized differential privacy
In the local setting of differential privacy, an untrusted aggregator wishes to collect users' personal information to accomplish corresponding data analysis tasks. Localized differential privacy (local differential privacy, LDP) provides randomized response algorithms that satisfy the data analysis needs of the aggregation server while protecting the privacy of users. Localized differential privacy is defined as follows:
localized differential privacy: a randomized algorithm M whose domain and range are D and R respectively satisfies ε-localized differential privacy, with ε > 0, if for any output y ∈ R and any two inputs x_1, x_2 ∈ D the following condition holds:

Pr[M(x_1) = y] ≤ e^ε · Pr[M(x_2) = y]
the privacy parameter epsilon measures the privacy protection strength of individual sensitive information, and smaller epsilon represents greater privacy protection strength.
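As an illustration of the definition above (illustrative only, not part of the claimed method), the classical binary randomized response mechanism satisfies ε-localized differential privacy; the sketch below checks that the worst-case ratio of its output probabilities equals e^ε:

```python
import math
import random

def randomized_response(x: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit x in {0, 1} with probability e^eps / (e^eps + 1),
    otherwise report the flipped bit."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return x if rng.random() < p else 1 - x

def worst_case_ratio(epsilon: float) -> float:
    """Pr[M(x) = y] / Pr[M(x') = y] for the worst pair of inputs; this must
    not exceed e^eps for the mechanism to satisfy eps-LDP."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return p / (1.0 - p)

rng = random.Random(0)
report = randomized_response(1, 1.0, rng)
```

A smaller ε pushes p toward 1/2, making the two inputs less distinguishable, which matches the statement that smaller ε gives stronger protection.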
2. Junction tree model
To overcome the curse of dimensionality, it is critical to find conditional independences between attributes in the real data set, so as to break the joint probability distribution into modular components. Probabilistic graphical models are elegant tools for identifying such modular structure, and Markov networks are the most widely used graphical models based on undirected graphs. The junction tree algorithm provides a feasible method for exactly inferring the joint probability distribution from a Markov network. The junction tree is defined as follows:
a junction tree model: for a given Markov network G, let T be a tree transformed from G whose nodes are cliques of G. T is called a junction tree if and only if, for any two cliques C_i, C_j of T, their intersection C_i ∩ C_j is contained in every node on the unique path connecting C_i and C_j. The sets C_1, …, C_m are known as the cliques of the junction tree, and S_ij = C_i ∩ C_j denotes the separator between two adjacent cliques.
Example: FIG. 3 shows a junction tree constructed from the Markov network of FIG. 2, where the elliptical nodes are the cliques of the junction tree and the rectangular nodes represent the separators. Note that the edges AD and AE, connected by dashed lines in the Markov network of FIG. 2, are introduced by the junction tree algorithm and do not belong to the original network; we call them chords.
For a data set D with attribute set A = {A_1, …, A_d}, given the structure of the junction tree T and the joint distributions Pr(C_i) and Pr(S_ij), the joint distribution of D may be expressed as:

Pr(A_1, …, A_d) = ∏_i Pr(C_i) / ∏_{(i,j)} Pr(S_ij)
3. problem definition
The formalization of the multidimensional data distribution problem to satisfy localized differential privacy is described as follows:
assume that there are N data owners and 1 aggregation server. Each data owner U k (1. ltoreq. k. ltoreq.N) has a chain containing d discrete attributes { A ≦ 1 ,…,A d Recording of } ofWhereinRepresenting a user U k The ith attribute A of i The value of (c). The data of all users form a data setThe aggregation server wants to know about the data setCombined distribution of all attributes Pr (A) 1 ,…, d ) To generate a new composite data set for publication. In order to protect the privacy of individual users, the aggregation server collects the records of all N users on the premise of meeting the localized differential privacy and generates a data set which is identical to the original data setSynthetic datasets with approximate distributionsNamely:
Example one
As shown in fig. 1, the present embodiment provides a multidimensional data publishing method based on local differential privacy of incremental learning, including the following steps:
step 1: learning the correlation of all attribute pairs by aggregating disturbance data uploaded by a first batch of users;
Step 2: constructing a dependency graph model according to the attribute-pair correlations, and converting the constructed dependency graph model into a junction tree model consisting of a plurality of cliques through the junction tree algorithm;
Step 3: based on the second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes the clique contains and the clique size, to obtain the distribution of each clique in the junction tree model;
Step 4: generating and publishing a synthetic data set containing the same number of records by a sampling-based data generation method, according to the junction tree model and the distribution of each clique in it.
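The sampling-based generation of Step 4 can be sketched as follows for an assumed junction tree ({A,B}) - [B] - ({B,C}) with assumed clique and separator distributions: sample the root clique from its joint distribution, then sample the remaining attribute from the conditional distribution given the separator value:

```python
import random

# Assumed clique distributions Pr(A,B), Pr(B,C) and separator distribution Pr(B)
pAB = {(0, 0): 0.42, (0, 1): 0.18, (1, 0): 0.08, (1, 1): 0.32}
pBC = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.25, (1, 1): 0.25}
pB = {0: 0.50, 1: 0.50}

def sample_record(rng: random.Random):
    # Sample the root clique {A, B} from its joint distribution ...
    keys = list(pAB)
    a, b = rng.choices(keys, weights=[pAB[k] for k in keys])[0]
    # ... then sample C from the conditional Pr(C | B) = Pr(B, C) / Pr(B)
    c = rng.choices([0, 1],
                    weights=[pBC[(b, 0)] / pB[b], pBC[(b, 1)] / pB[b]])[0]
    return a, b, c

rng = random.Random(7)
synthetic = [sample_record(rng) for _ in range(10000)]
```

Repeating `sample_record` once per original record yields a synthetic data set of the same size whose distribution follows the junction-tree factorization.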
The present embodiment involves two types of entities: N data owners and 1 aggregation server. Each data owner U_k (1 ≤ k ≤ N) has a record containing d discrete attributes. The aggregation server uses the original data set D composed of the data of the N data owners to generate a synthetic data set D* with a distribution similar to that of D, publishes it, and guarantees the privacy protection requirement of each data owner.
In order to more clearly understand the technical contents of the present invention, the following detailed description is given;
step 1: and collecting the first batch of user data and calculating to generate a probability graph model.
The method specifically comprises the following substeps:
step 1.1: and grouping the first group of users.
The first batch of users U^(1) is partitioned into T disjoint groups, namely U^(1) = U_1 ∪ … ∪ U_T, where group U_t is used for the t-th iteration in step 1.3.
Step 1.2: the dependency graph model is initialized.
A dependency graph G is initialized, wherein G comprises d vertices and every two vertices are connected by an edge. Let E = {(i, j) | 0 ≤ i ≠ j < d} be the set of edges of G (i.e., the set of attribute pairs).
Step 1.3: and constructing a dependency graph model.
In order to better construct the dependency graph model, the embodiment uses an incremental learning-based dependency graph model construction method, which consists of T iterations. In each iteration, some new data is collected for each attribute pair remaining in the set E of attribute pairs, respectively, and the correlations of these attribute pairs are re-estimated to eliminate attribute pairs with weaker correlations.
Taking the t-th iteration as an example, the method specifically comprises the following sub-steps:
step 1.3.1: and grouping the users participating in the iteration according to the residual attribute pairs.
The users are grouped according to the edge set E of the current dependency graph G: U_t = {U_t^e | e ∈ E}, where the users in group U_t^e are used only for data collection on attribute pair e in the t-th iteration.
Step 1.3.2: user data is collected. In consideration of privacy protection, a user needs to add a proper amount of disturbance to own real data and report the disturbance result to the aggregation server.
Taking group U_t^e as an example, and assuming that e = (i, j), the collection specifically includes the following sub-steps:
Step 1.3.2.1: the user first converts his or her real data on attributes A_i and A_j into a single attribute, and encodes the converted attribute value using the one-hot encoding rule to obtain an encoding result S.
Step 1.3.2.2: a value of the parameter ε representing the privacy protection strength is set, and the OUE algorithm is applied to the user's encoded value S obtained in step 1.3.2.1, outputting each bit S'_k of the perturbation result S' with the following probabilities:

Pr[S'_k = 1] = 1/2 if S_k = 1, and Pr[S'_k = 1] = 1/(e^ε + 1) if S_k = 0
Step 1.3.3: the aggregation server calculates, by an unbiased estimation method, the joint distributions {P_t(A_i, A_j)} of all attribute pairs remaining in the current E from the perturbation data uploaded by the users, where the subscript t denotes that the distribution is the result calculated from the data collected in the t-th iteration.
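A minimal sketch of OUE perturbation and the aggregation server's unbiased frequency estimation follows; the parameter choices p = 1/2 and q = 1/(e^ε + 1) are those of the standard OUE mechanism, while the simulation setup (domain size, user count, true frequencies) is assumed for illustration:

```python
import math
import random

def oue_perturb(value: int, domain: int, epsilon: float, rng: random.Random):
    """Optimized Unary Encoding: one-hot encode `value`, keep the 1-bit with
    probability p = 1/2, and flip each 0-bit to 1 with probability
    q = 1 / (e^eps + 1)."""
    q = 1.0 / (math.exp(epsilon) + 1.0)
    return [(1 if rng.random() < 0.5 else 0) if i == value
            else (1 if rng.random() < q else 0)
            for i in range(domain)]

def oue_estimate(reports, epsilon: float):
    """Unbiased frequency estimate from aggregated OUE reports:
    f_i = (c_i / n - q) / (p - q), with p = 1/2 and q = 1/(e^eps + 1)."""
    n = len(reports)
    p, q = 0.5, 1.0 / (math.exp(epsilon) + 1.0)
    counts = [sum(r[i] for r in reports) for i in range(len(reports[0]))]
    return [(c / n - q) / (p - q) for c in counts]

# Simulate 20000 users whose true values follow the assumed frequencies
rng = random.Random(42)
true_freq = [0.5, 0.3, 0.2]
values = rng.choices(range(3), weights=true_freq, k=20000)
reports = [oue_perturb(v, 3, 2.0, rng) for v in values]
est = oue_estimate(reports, 2.0)
```

With enough users the estimates converge to the true frequencies, which is what makes the aggregated joint distributions usable despite per-user perturbation.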
Step 1.3.4: data accumulation. In each iteration, data for the attribute pairs that have not been pruned must be collected continuously and their joint distributions recalculated. Accumulating the data yields more accurate distribution information and improves the accuracy of judging the correlation of the unpruned attribute pairs.
Meanwhile, to improve the efficiency of data accumulation, this embodiment provides a more efficient accumulation method: the accumulation result of the t-th round can be obtained directly from the accumulation result of the (t-1)-th round and the calculation result of the t-th round.
Specifically, for an arbitrary attribute combination e remaining in E, assume that in the t-th collection n_t^e users upload data for e, and let M_t^e denote the marginal table computed from the data of round t. Writing N_{t-1}^e = Σ_{r=1}^{t-1} n_r^e for the total number of users accumulated over the first t-1 rounds, the accumulated marginal table of round t can be expressed as the user-count-weighted combination:

M̃_t^e = (N_{t-1}^e · M̃_{t-1}^e + n_t^e · M_t^e) / (N_{t-1}^e + n_t^e)

Through this calculation, the accumulated joint distributions of all attribute pairs remaining in E, combining the round-t data with the results of the first t-1 rounds, can be obtained.
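One plausible implementation of this recursive accumulation is sketched below, assuming the natural weighting by the number of contributing users (which keeps the combined table unbiased whenever each round's table is unbiased):

```python
def accumulate(prev_table, prev_n, new_table, new_n):
    """Combine the accumulated marginal table of rounds 1..t-1 with the table
    estimated in round t, weighting each by the number of users behind it
    (an assumed user-count weighting)."""
    total = prev_n + new_n
    keys = set(prev_table) | set(new_table)
    acc = {k: (prev_n * prev_table.get(k, 0.0) + new_n * new_table.get(k, 0.0)) / total
           for k in keys}
    return acc, total

# Round t-1 aggregate (300 users) combined with round t's estimate (100 users)
acc, n = accumulate({("a", "x"): 0.40, ("a", "y"): 0.60}, 300,
                    {("a", "x"): 0.60, ("a", "y"): 0.40}, 100)
```

Only the running table and the running user count need to be stored, so each round's update costs a single pass over the marginal table.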
Step 1.3.5: the mutual information is recalculated.
In this embodiment, the correlation between two attributes is measured by their mutual information. For an attribute pair A_i, A_j, the mutual information I(A_i; A_j) is calculated as:

I(A_i; A_j) = Σ_{a_m ∈ Ω_{A_i}} Σ_{a_n ∈ Ω_{A_j}} Pr(a_m, a_n) · log( Pr(a_m, a_n) / (Pr(a_m) · Pr(a_n)) )

where (A_i, A_j) is the attribute pair, Ω_{A_i} and Ω_{A_j} are the domains of A_i and A_j, Pr(a_m) and Pr(a_n) respectively denote the marginal distributions of the m-th value a_m of Ω_{A_i} and the n-th value a_n of Ω_{A_j}, and Pr(a_m, a_n) denotes the joint distribution of a_m and a_n.
Here the joint distribution Pr(A_i, A_j) is the accumulated joint distribution obtained after the data accumulation of step 1.3.4, and Pr(a_m, a_n) is an abbreviation of the term Pr(A_i = a_m, A_j = a_n) in that joint distribution.
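The mutual information computation can be sketched as follows (natural logarithm assumed; the dictionary encodes the joint distribution with keys (a_m, a_n)):

```python
import math

def mutual_information(joint):
    """I(Ai; Aj) in nats from a joint distribution {(a_m, a_n): Pr(a_m, a_n)}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p   # marginal Pr(a_m)
        pb[b] = pb.get(b, 0.0) + p   # marginal Pr(a_n)
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Independent attributes give I = 0; perfectly correlated ones give ln(2)
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
corr = {(0, 0): 0.5, (1, 1): 0.5}
```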
Step 1.3.6: trimming the invalid edge.
In general, for any attribute pair A_i, A_j, a predetermined correlation threshold is compared with the mutual information I(A_i; A_j) to determine whether correlation exists between the attribute pair. However, because only part of the data is used to estimate the strength of the attribute-pair correlation, sampling error results; meanwhile, under LDP, the smaller the data amount, the larger the perturbation error. Directly using the correlation threshold may therefore cause otherwise highly correlated attribute pairs to be pruned erroneously.
In order to reduce the errors caused by sampling and perturbation and improve the accuracy of edge selection, this embodiment uses an edge pruning method based on threshold relaxation, which specifically includes the following sub-steps:
Step 1.3.6.2: the relaxed correlation threshold l is calculated.
Assume that among the currently collected records, a total of n records contain attributes A_i and A_j, and denote by Î(A_i; A_j) the mutual information of A_i and A_j calculated from these n records. Due to the randomness of sampling and perturbation, Î(A_i; A_j) is regarded as a random variable concentrating around the true mutual information I(A_i; A_j), and the deviation between the two satisfies a confidence bound.
Thus, given a confidence level of 1 − α, the relaxed correlation threshold l is determined from this bound, and l can then be calculated accordingly.
Step 1.3.6.3: re-estimate the correlation and prune the edges. If Î(A_i; A_j) ≥ l, the attribute pair A_i, A_j still has a high probability of being strongly correlated, so the edge e = (i, j) is kept in the dependency graph G; otherwise, the edge is deleted from G, i.e., E = E − {(i, j)}.
Step 1.4: junction tree construction. The dependency graph G constructed in step 1.3 is converted, through the junction tree algorithm, into a junction tree T consisting of a set of cliques C = {C_1, …, C_m}, where each clique C_i contains at least one attribute, |C_i| denotes the number of attributes in the clique, and ∏_{A ∈ C_i} |Ω_A| is the size of the clique.
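A common way to realize this conversion, given the maximal cliques of a (triangulated) dependency graph, is to connect the cliques by a maximum-weight spanning tree whose edge weights are the sizes of the clique intersections; the shared attributes on each retained edge form the separator. The sketch below assumes the maximal cliques are already available:

```python
def junction_tree(cliques):
    """Build a junction tree from the maximal cliques of a triangulated
    dependency graph: connect cliques with a maximum-weight spanning tree
    (Kruskal), weight = |C_i ∩ C_j|; each chosen edge's shared attributes
    form the separator S_ij."""
    edges = sorted(
        ((len(set(cliques[i]) & set(cliques[j])), i, j)
         for i in range(len(cliques)) for j in range(i + 1, len(cliques))),
        reverse=True)
    parent = list(range(len(cliques)))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, i, j in edges:
        if w > 0 and find(i) != find(j):
            parent[find(i)] = find(j)
            tree.append((i, j, tuple(sorted(set(cliques[i]) & set(cliques[j])))))
    return tree

cliques = [("A", "B", "C"), ("B", "C", "D"), ("C", "D", "E")]
tree = junction_tree(cliques)
```

For these example cliques the result links clique 0 to 1 through separator (B, C) and clique 1 to 2 through separator (C, D), satisfying the running-intersection property of the definition above.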
Step 2: a second batch of user data is collected and the distribution of cliques of the junction tree is calculated.
After the junction tree T is obtained, in order to generate a synthetic data set, the joint distribution of each clique in the clique set C needs to be estimated. Since the cliques differ in the number of attributes they contain and in size, this embodiment estimates the distributions of different types of cliques by different methods, specifically including the following sub-steps:
step 2.1: and (4) classifying the clusters.
The cliques are divided into two groups according to clique size. In detail, given an empirical clique size threshold σ, for each clique C_i: if the size of C_i does not exceed σ, C_i is called a small clique and placed in the small-clique group; otherwise, C_i is called a large clique and placed in the large-clique group.
Step 2.2: decomposing the large cliques.
Since large cliques tend to contain more attributes, directly estimating their distribution still faces the curse of dimensionality. For a large clique, according to the chain rule, its joint distribution can be decomposed into a product of conditional probabilities, namely:

Pr(C_i) = ∏_h Pr(A_h | Π_h)

where Π_h denotes the conditioning set of attribute A_h.
Π_h may contain redundancy: when some attributes in Π_h are given, the remaining attributes of Π_h are conditionally independent of A_h. By deleting the redundant attributes from Π_h to obtain Π̃_h, the dimension can be reduced, because Π̃_h contains fewer attributes than Π_h. Intuitively, the order of decomposition affects how much redundancy can be deleted.
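The redundancy deletion rests on conditional independence: in the toy chain below (assumed numbers), C is conditionally independent of A given B, so A can be removed from the conditioning set of C without changing the distribution:

```python
# Chain A -> B -> C: given B, attribute C is conditionally independent of A,
# so A is redundant in the conditioning set of C and Pr(C|A,B) = Pr(C|B).
pA = {0: 0.3, 1: 0.7}
pB_given_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}
pC_given_B = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    return pA[a] * pB_given_A[a][b] * pC_given_B[b][c]

def cond_c_given_ab(a, b, c):
    # Pr(C=c | A=a, B=b) computed from the full joint distribution
    return joint(a, b, c) / sum(joint(a, b, cc) for cc in (0, 1))

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            # dropping A from the condition leaves the probability unchanged
            assert abs(cond_c_given_ab(a, b, c) - pC_given_B[b][c]) < 1e-12
```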
Therefore, the present embodiment uses a heuristic based on a forward search strategy to determine a better decomposition order.
Order attribute setThe method comprisesAnd (4) secondary circulation, wherein each circulation selects one attribute A capable of deleting more redundancies from the rest attributes in the attribute set Q h As a result of the target attribute and having the redundancy deletedAs a condition, key value pairForm of (2) is stored in a dictionaryIn (1).
The joint distribution of clique C_i is then expressed as:

Pr(C_i) = ∏_h Pr(A_h | Π̃_h)
Step 2.2.1: if the domain size |Q| = ∏_{A∈Q} |Ω_A| of the current attribute set Q is at most the clique size threshold σ, i.e. |Q| ≤ σ, randomly select one attribute A_h ∈ Q as the target attribute and take all the other attributes Q\{A_h} as the condition; that is, the h-th term of the factorization is the conditional probability Pr(A_h | Q\{A_h}). Then delete A_h from Q, i.e. Q = Q\{A_h};
Step 2.2.2: conversely, if |Q| > σ, then for each attribute A_j ∈ Q, the embodiment takes A_j as the target attribute and applies the minimum-Redundancy-Maximum-Relevance (mRMR) feature selection method to the attribute set Q\{A_j} to delete redundancy, obtaining the reduced attribute set Π̃_j.
For each candidate attribute A ∈ Q\{A_j}, an mRMR score balancing relevance to the target against redundancy with the already-selected attributes is calculated, of the form:

score(A) = I(A; A_j) − (1/|Π̃_j|) · Σ_{A'∈Π̃_j} I(A; A')

where the mutual information I(·;·) of any two attributes is taken from the calculation results of step 1.
Based on the calculated redundancy-deleted sets, the attribute A_h ∈ Q whose reduced condition set Π̃_h has the minimum domain size is selected as the target attribute, with Π̃_h as its condition; the h-th term of the factorization is then Pr(A_h | Π̃_h), and Q = Q\{A_h}.
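The decomposition loop of step 2.2 can be sketched as below. The mRMR-based redundancy deletion is abstracted into a caller-supplied `reduce_conditions` placeholder (here a trivial stub), so this is a structural sketch under that assumption rather than the patent's exact procedure; all names are illustrative:

```python
from math import prod

def decompose(clique, domain_sizes, sigma, reduce_conditions):
    """Return a factorization order [(target, condition_tuple), ...] (step 2.2 sketch).

    reduce_conditions(target, candidates) stands in for the mRMR-based
    redundancy deletion and returns the reduced condition set."""
    Q = list(clique)
    factors = []
    while Q:
        if prod(domain_sizes[a] for a in Q) <= sigma:
            # Step 2.2.1: any attribute may serve as target; pick the first for determinism.
            target = Q[0]
            cond = tuple(a for a in Q if a != target)
        else:
            # Step 2.2.2: pick the target whose reduced condition set is smallest.
            best = None
            for t in Q:
                cond_t = reduce_conditions(t, tuple(a for a in Q if a != t))
                size = prod(domain_sizes[a] for a in cond_t) if cond_t else 1
                if best is None or size < best[0]:
                    best = (size, t, cond_t)
            _, target, cond = best
        factors.append((target, cond))
        Q.remove(target)
    return factors

# Toy run: the stub simply drops the last candidate as "redundant".
order = decompose(
    clique=("A", "B", "C"),
    domain_sizes={"A": 4, "B": 4, "C": 4},
    sigma=16,
    reduce_conditions=lambda t, cands: cands[:-1],
)
```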
Step 2.3: a distribution estimation is performed using the second batch of user data.
In order to obtain the distribution of each clique, the joint distribution of some attribute combinations needs to be calculated, and the method specifically comprises the following sub-steps:
Step 2.3.1: determine the attribute combinations.
Two sets are maintained, one storing the attribute combinations whose distributions need to be estimated and one storing the attribute combinations whose distributions are already known. Specifically: for each clique C_i, if |C_i| ≥ 3, C_i is placed in the to-be-estimated set; otherwise, C_i is placed in the known set, since its distribution is available from step 1.
Step 2.3.2: group the users. The second batch of users is divided into disjoint groups, one group per attribute combination whose distribution needs to be estimated; each group's data is used to calculate the distribution of its assigned attribute combination.
Step 2.3.3: user data is collected and joint distributions are estimated.
For the attribute combinations whose distributions need to be estimated, the joint distribution estimation follows the same calculation steps used for the first batch of users in steps 1.3.3 and 1.3.4, yielding the required joint distributions. For the attribute combinations whose distributions are already known, the results are taken directly from the calculation of step 1.
Step 2.4: calculate the distribution of the cliques. For each large clique, this embodiment estimates the joint distribution Pr(A_h, Π̃_h) of every factor, from which the conditional distribution Pr(A_h | Π̃_h) is obtained. Then, according to the decomposition formula of the joint distribution in step 2.2, the joint distribution Pr(C_i) is calculated.
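The recomposition of step 2.4, multiplying the conditional factors back into the clique's joint distribution, can be sketched as follows; the tables are toy values standing in for the privately estimated distributions, and all names are hypothetical:

```python
from itertools import product

def joint_from_factors(attrs, domains, factors):
    """Multiply conditional factors into a joint table (step 2.4 sketch).

    factors: list of (target, cond_tuple, table), where table maps
    (target_value, cond_value_tuple) -> conditional probability."""
    joint = {}
    for assignment in product(*(domains[a] for a in attrs)):
        val = dict(zip(attrs, assignment))
        p = 1.0
        for target, cond, table in factors:
            p *= table[(val[target], tuple(val[c] for c in cond))]
        joint[assignment] = p
    return joint

# Toy clique {A, B}: Pr(A, B) = Pr(A) * Pr(B | A), values made up.
joint = joint_from_factors(
    attrs=("A", "B"),
    domains={"A": (0, 1), "B": (0, 1)},
    factors=[
        ("A", (), {(0, ()): 0.4, (1, ()): 0.6}),
        ("B", ("A",), {(0, (0,)): 0.5, (1, (0,)): 0.5,
                       (0, (1,)): 0.2, (1, (1,)): 0.8}),
    ],
)
```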
Step 3: generate the synthetic dataset.
Using the junction tree obtained in step 1 and the distribution of each clique obtained in step 2, the aggregation server generates a synthetic dataset containing N records by a sampling-based data generation method.
Taking the generation of one record as an example, the method specifically comprises the following sub-steps:
Step 3.1: randomly select a clique C_i and sample from its distribution Pr(C_i) to obtain values for all attributes in C_i; then select all cliques associated with C_i.
Step 3.2: for each selected clique C_r, sample its unsampled attributes from their conditional distribution given the attributes already sampled; this conditional distribution can be obtained from Pr(C_r). Then insert all cliques that are connected to C_r and have not yet been visited at the end of the queue.
Step 3.3: and (3.2) repeatedly executing the step until the sampling results X of all the attributes are obtained, and finishing the generation.
To verify the effect of the proposed incremental-learning-based multidimensional data publishing method satisfying local differential privacy, referring to fig. 4 (where PrivIncorr denotes the proposed method), the method is compared with existing methods on two public datasets, Adult and TPC-E. The comparison methods include the non-incremental method NoIncremental, the non-private version NoPrivJTree, and two multidimensional k-way marginal publishing methods satisfying localized differential privacy, CALM and FT. Experimental results show that the synthetic dataset generated by the proposed method provides better utility.
Example two
This embodiment provides an incremental-learning-based local differential privacy multidimensional data publishing system, which comprises:
the correlation learning module, used for learning the correlation of all attribute pairs by aggregating perturbed data uploaded by the first batch of users;
the junction tree model building module, used for building a dependency graph model according to the correlation of the attribute pairs and converting the built dependency graph model into a junction tree model consisting of a plurality of cliques through a junction tree algorithm;
the clique distribution calculation module, used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method suited to the number and domain sizes of its attributes, to obtain the joint distribution of each clique in the junction tree model;
and the data publishing module, used for generating, by a sampling-based data generation method and according to the junction tree model and the joint distribution of each clique in the junction tree model, a synthetic dataset containing the same number of records, and publishing the dataset.
Example three
This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method described above.
Example four
This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, it implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. The local differential privacy multidimensional data publishing method based on incremental learning is characterized by comprising the following steps:
learning the correlation of all attribute pairs by aggregating the perturbed data of the first batch of users;
constructing a dependency graph model according to the correlation of the attribute pairs, and converting the constructed dependency graph model into a junction tree model consisting of a plurality of groups through a junction tree algorithm;
based on the second batch of user data, estimating the distribution of the cliques by adopting a corresponding estimation method according to the number and size types of the attributes contained in each clique to obtain the joint distribution of each clique in the junction tree model;
and generating, by a sampling-based data generation method and according to the junction tree model and the joint distribution of each clique in the junction tree model, a synthetic dataset containing the same number of records, and publishing the dataset.
2. The incremental learning-based multi-dimensional data publishing method for local differential privacy according to claim 1, wherein the building of the dependency graph model according to the correlation of attribute pairs comprises:
constructing the dependency graph by an incremental-learning-based dependency graph model construction method according to the edge set of the current dependency graph, comprising: performing T rounds of iteration; in each iteration, collecting new data for each attribute pair remaining in the attribute pair set, re-estimating the correlation of those attribute pairs, and removing weakly correlated attribute pairs by a threshold-relaxation-based edge pruning method to obtain the pruned edges.
3. The incremental learning-based local differential privacy multidimensional data publishing method as claimed in claim 2, wherein the correlation of an attribute pair is measured by the mutual information of the two attributes, calculated by the formula:

I(A_i; A_j) = Σ_{a_m ∈ Ω_{A_i}} Σ_{a_n ∈ Ω_{A_j}} Pr(a_m, a_n) · log( Pr(a_m, a_n) / ( Pr(a_m) · Pr(a_n) ) )

where A_i, A_j are the attribute pair, Ω_{A_i} and Ω_{A_j} are the domains of attributes A_i and A_j respectively, Pr(a_m) and Pr(a_n) respectively denote the marginal distributions of the m-th value a_m of Ω_{A_i} and the n-th value a_n of Ω_{A_j}, and Pr(a_m, a_n) denotes the joint distribution of a_m and a_n.
4. The incremental learning-based local differential privacy multidimensional data publishing method as claimed in claim 2, wherein removing the weakly correlated attribute pairs by the threshold-relaxation-based edge pruning method comprises:
calculating a correlation threshold value based on the set dependency parameter;
calculating a scaled correlation threshold in combination with the correlation threshold, the given confidence level and mutual information of the recorded attribute pairs;
comparing the re-estimated correlation against the scaled correlation threshold: if the correlation of an attribute pair is greater than or equal to the scaled correlation threshold, the attributes are strongly correlated and the corresponding edge is kept in the dependency graph; otherwise, the edge is removed from the dependency graph.
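The pruning decision of claim 4 can be sketched as follows, assuming the scaled (relaxed) correlation threshold has already been computed from the dependency parameter and confidence level; the exact scaling is not reproduced here, and all names are illustrative:

```python
def prune_edges(edges, correlation, scaled_threshold):
    """Keep an edge only when its re-estimated correlation reaches the scaled
    (relaxed) threshold; otherwise remove it from the dependency graph."""
    kept = [e for e in edges if correlation[e] >= scaled_threshold]
    removed = [e for e in edges if correlation[e] < scaled_threshold]
    return kept, removed

# Toy run with made-up mutual-information values as the correlation measure.
kept, removed = prune_edges(
    edges=[("A", "B"), ("B", "C")],
    correlation={("A", "B"): 0.30, ("B", "C"): 0.02},
    scaled_threshold=0.10,
)
```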
5. The incremental learning-based local differential privacy multidimensional data publishing method according to claim 1, wherein estimating the distribution of the cliques by a corresponding estimation method according to the number and sizes of the attributes contained in each clique, to obtain the joint distribution of each clique in the junction tree model, comprises:
dividing all cliques into two groups, large cliques and small cliques, according to the number of attributes they contain and their sizes;
determining an optimal decomposition order by a heuristic based on a forward search strategy, and decomposing the large cliques in that order to obtain conditional distributions;
and obtaining the joint distribution of each clique in the junction tree model from the second batch of user data and the conditional distributions, according to the joint distribution formula.
6. The incremental learning-based local differential privacy multidimensional data publishing method as claimed in claim 1, wherein determining the optimal decomposition order by the forward-search heuristic and decomposing the cliques in that order comprises:
if |Q| ≤ σ, randomly selecting an attribute A_h ∈ Q as the target attribute with Q\{A_h} as the condition, i.e. the h-th term of the factorization is the conditional distribution Pr(A_h | Q\{A_h}), and letting Q = Q\{A_h};
if |Q| > σ, for each A_j ∈ Q, taking A_j as the target attribute and applying the minimum-redundancy-maximum-relevance feature selection method to the attribute set Q\{A_j} to delete redundancy, obtaining the reduced attribute set Π̃_j;
selecting from the calculated reduced sets the attribute A_h ∈ Q whose reduced condition set Π̃_h has the minimum domain size as the target attribute with Π̃_h as the condition, i.e. the h-th term of the factorization is the conditional distribution Pr(A_h | Π̃_h), and letting Q = Q\{A_h}.
7. The incremental learning-based local differential privacy multidimensional data publishing method as recited in claim 1, wherein the record generation process comprises:
randomly selecting a clique, sampling from its distribution to obtain values for all of its attributes, and then selecting all cliques associated with it;
for each selected clique C_r, sampling its unsampled attributes from their conditional distribution given the attributes already sampled, wherein the conditional distribution is obtained from Pr(C_r), and then inserting all cliques connected to C_r and not yet visited at the end of the queue;
and repeating the sampling until values for all attributes are obtained, completing the generation.
8. The multidimensional data publishing system of the local differential privacy based on the incremental learning is characterized by comprising the following components:
a correlation learning module for learning the correlation of all attribute pairs by aggregating the perturbed data of the first batch of users;
a junction tree model building module for building a dependency graph model according to the correlation of the attribute pairs and converting the built dependency graph model into a junction tree model consisting of a plurality of cliques through a junction tree algorithm;
a clique distribution calculation module for estimating, based on the second batch of user data, the distribution of each clique with an estimation method suited to the number and domain sizes of its attributes, to obtain the joint distribution of each clique in the junction tree model;
and a data publishing module for generating, by a sampling-based data generation method and according to the junction tree model and the joint distribution of each clique in the junction tree model, a synthetic dataset containing the same number of records, and publishing the dataset.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210699743.7A CN115098882B (en) | 2022-06-20 | 2022-06-20 | Multi-dimensional data release method and system based on local differential privacy of incremental learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115098882A true CN115098882A (en) | 2022-09-23 |
CN115098882B CN115098882B (en) | 2024-08-06 |
Family
ID=83292213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210699743.7A Active CN115098882B (en) | 2022-06-20 | 2022-06-20 | Multi-dimensional data release method and system based on local differential privacy of incremental learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115098882B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368888A (en) * | 2020-02-25 | 2020-07-03 | 重庆邮电大学 | Service function chain fault diagnosis method based on deep dynamic Bayesian network |
CN113094746A (en) * | 2021-03-31 | 2021-07-09 | 北京邮电大学 | High-dimensional data publishing method based on localized differential privacy and related equipment |
US20210216902A1 (en) * | 2020-01-09 | 2021-07-15 | International Business Machines Corporation | Hyperparameter determination for a differentially private federated learning process |
CN113569286A (en) * | 2021-03-26 | 2021-10-29 | 东南大学 | Frequent item set mining method based on localized differential privacy |
Non-Patent Citations (1)
Title |
---|
SU Weihang; CHENG Xiang: "A High-Dimensional Data Publishing Algorithm Satisfying Differential Privacy Based on a Latent Tree Model", Journal of Chinese Computer Systems, no. 04, 15 April 2018 (2018-04-15) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115329898A (en) * | 2022-10-10 | 2022-11-11 | 国网浙江省电力有限公司杭州供电公司 | Distributed machine learning method and system based on differential privacy policy |
CN116795850A (en) * | 2023-05-31 | 2023-09-22 | 山东大学 | Method, device and storage medium for concurrent execution of massive transactions of alliance chains |
CN116795850B (en) * | 2023-05-31 | 2024-04-12 | 山东大学 | Method, device and storage medium for concurrent execution of massive transactions of alliance chains |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |