CN115098882A - Local differential privacy multidimensional data publishing method and system based on incremental learning - Google Patents

Local differential privacy multidimensional data publishing method and system based on incremental learning Download PDF

Info

Publication number
CN115098882A
CN115098882A
Authority
CN
China
Prior art keywords
attribute
distribution
data
correlation
junction tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210699743.7A
Other languages
Chinese (zh)
Other versions
CN115098882B (en)
Inventor
郭山清
唐朋
胡程瑜
刘高源
金崇实
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210699743.7A priority Critical patent/CN115098882B/en
Publication of CN115098882A publication Critical patent/CN115098882A/en
Application granted granted Critical
Publication of CN115098882B publication Critical patent/CN115098882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data security and privacy protection, and provides an incremental-learning-based multidimensional data publishing method and system under local differential privacy. The correlations of all attribute pairs are learned by aggregating the perturbed data of a first batch of users; a dependency graph model is constructed from the correlations of the attribute pairs and converted, via the junction tree algorithm, into a junction tree model composed of a plurality of cliques; based on a second batch of user data, the distribution of each clique is estimated with an estimation method chosen according to the number of attributes it contains and its size, yielding the joint distribution of every clique in the junction tree model; and, from the junction tree model and the joint distributions of its cliques, a synthetic data set containing the same number of records is generated by a sampling-based data generation method and published.

Description

Local differential privacy multidimensional data publishing method and system based on incremental learning
Technical Field
The invention belongs to the field of data security and privacy protection, and particularly relates to a local differential privacy multidimensional data publishing method and system based on incremental learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the multidimensional data publishing problem under localized differential privacy, each individual user has a record containing a plurality of discrete attributes (continuous attributes are converted into discrete ones by discretizing their domains into a fixed number of equal-width ranges), such as census data. In practice, a data analyst wishes to be able to perform any type of data analysis or mining on the data set, so as to extract the large amount of underlying information behind the data and provide accurate and reliable predictions for populations and individuals. Therefore, the aggregation server needs to collect and publish the data owned by all individual users.
However, the data often contain sensitive personal information, and users do not want to share their actual data with any third-party data collector. Therefore, there is a need for multidimensional data publishing approaches that satisfy localized differential privacy.
Localized differential privacy is a strict and quantifiable privacy protection model. It does not rely on any third-party entity that merely claims to be trustworthy, and it provides privacy protection for each user's real data from the perspective of the individual user: even if the third-party aggregation server is malicious, the privacy of individual users is guaranteed not to be revealed. In this model, a user locally perturbs his or her own real data by adding noise of a certain scale and then uploads the perturbed data to the aggregation server. After receiving the perturbed data uploaded by all users, the aggregation server can only obtain some statistical information by computation and cannot infer any sensitive personal information about the users from it.
Based on this model, existing work has proposed some solutions to this problem. In existing work, the aggregation server first collects the complete data of all users at once. In order to support all subsequent computations while satisfying localized differential privacy, each user needs to perturb his or her entire record and send the perturbed result to the aggregator. The aggregation server aggregates the perturbed data uploaded by the users and, using an expectation maximization algorithm, provides any distribution information required for constructing the probabilistic graphical model and generating the data, namely the joint distributions of all attribute pairs and the distribution of each clique in the junction tree.
However, the above method has the following technical problems:
(i) when constructing the probabilistic graph model, it is necessary to calculate the correlation of all the paired attributes to determine the structure of the dependency graph. However, for multi-dimensional data, there are a large number of attribute pairs. To satisfy localized differential privacy, directly computing the correlation of all these pairwise attributes results in a large amount of noise being injected into the results, which severely degrades the accuracy of the dependency graph structure and the synthesized data.
(ii) When the distributions of the cliques of the junction tree are estimated, some large cliques may contain too many attributes, so the high-dimensionality problem is still faced when their distributions are computed and cannot be handled well.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the present invention provides an incremental-learning-based multidimensional data publishing method and system under local differential privacy, which constructs a probability graph model by an incremental-learning-based method, then uses the probability graph model to generate a set of noisy low-dimensional distributions, and then uses them to approximate the overall distribution of the input data set to generate a synthetic data set. Compared with existing methods, the method can provide privacy protection for each individual user while significantly improving the accuracy of the published data.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a multidimensional data publishing method based on local differential privacy of incremental learning, which comprises the following steps:
learning the correlations of all attribute pairs by aggregating the perturbed data of a first batch of users;
constructing a dependency graph model according to the correlations of the attribute pairs, and converting the constructed dependency graph model, via a junction tree algorithm, into a junction tree model composed of a plurality of cliques;
based on a second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes it contains and its size, obtaining the joint distribution of every clique in the junction tree model;
and generating, by a sampling-based data generation method and according to the junction tree model and the joint distributions of its cliques, a synthetic data set containing the same number of records, and publishing the data set.
A second aspect of the present invention provides a multidimensional data distribution system for local differential privacy based on incremental learning, comprising:
the correlation learning module is used for learning the correlations of all attribute pairs by aggregating the perturbed data of the first batch of users;
the junction tree model building module is used for constructing a dependency graph model according to the correlations of the attribute pairs and converting the constructed dependency graph model, via a junction tree algorithm, into a junction tree model composed of a plurality of cliques;
the clique distribution calculation module is used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method chosen according to the number of attributes it contains and its size, obtaining the joint distribution of every clique in the junction tree model;
and the data publishing module is used for generating, by a sampling-based data generation method and according to the junction tree model and the joint distributions of its cliques, a synthetic data set containing the same number of records, and publishing the data set.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for multi-dimensional data distribution of local differential privacy based on incremental learning as described above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for multi-dimensional data distribution based on incremental learning-based local differential privacy as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the method adopts an incremental learning-based method to construct a probability graph model, gradually prunes weak-correlation attribute pairs, allocates more data and privacy budgets to useful attribute pairs to correctly identify the correlation among all attribute pairs in an attribute set, thereby being capable of more effectively constructing the probability graph model, then utilizes the probability graph model to generate a group of low-dimensional distributions with noise, and then uses the low-dimensional distributions to approximate the overall distribution of an input data set so as to generate a synthetic data set.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flowchart illustrating a multidimensional data publishing method based on incremental learning local differential privacy according to an embodiment of the present invention;
FIG. 2 is a Markov network according to an embodiment of the present invention;
FIG. 3 is a junction tree of the Markov network according to an embodiment of the present invention;
FIG. 4 is a graph comparing the algorithmic effects of the method of the present invention and existing methods.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In a localization scenario, each user individual has a piece of data containing multiple attributes. In order to complete the data distribution task, data of all user individuals needs to be collected for data distribution. However, the data often contains sensitive information about the individual user. Therefore, there is a need to address the issue of multidimensional data distribution that satisfies localized differential privacy.
In order to solve the first technical problem in the background art of the application, the invention gradually prunes the attribute pairs with weak correlation, and allocates more data and privacy budgets to the useful attribute pairs so as to correctly identify the correlation among all the attribute pairs in the attribute set, thereby being capable of more effectively constructing a probability graph model and remarkably improving the quality of a synthetic data set.
In order to solve the second technical problem in the background art of the present application, the present invention further provides a new large clique distribution calculation method based on joint distribution decomposition and redundancy elimination, which effectively solves the distribution estimation of the large cliques.
Interpretation of terms:
The invention relates to two key elements, namely localized differential privacy and the junction tree model.
These two elements are first described below; and then, a formalized definition of a multidimensional data publishing problem under a localized differential privacy scene is given.
1. Localized differential privacy
In a local setting of differential privacy, an untrusted aggregator wishes to collect the users' personal information to accomplish the corresponding data analysis task. Localized differential privacy (local differential privacy) provides a randomized response algorithm M that satisfies the data analysis task of the aggregation server while protecting the privacy of the users. Localized differential privacy is defined as follows:
Localized differential privacy: a randomized algorithm M with domain D and range R satisfies ε-localized differential privacy if, for ε > 0, any output y in R and any two inputs x_1, x_2 in D satisfy:

$$\Pr[M(x_1) = y] \le e^{\varepsilon} \cdot \Pr[M(x_2) = y].$$
the privacy parameter epsilon measures the privacy protection strength of individual sensitive information, and smaller epsilon represents greater privacy protection strength.
2. Junction tree model
To overcome the curse of dimensionality, it is critical to find conditional independences between attributes in the real data set so as to break the joint probability distribution into modular components. Probabilistic graphical models are elegant tools for identifying such modular structures, and Markov networks are the most widely used graphical models based on undirected graphs. The junction tree algorithm provides a feasible way to accurately infer the joint probability distribution from a Markov network. The junction tree is defined as follows:
a junction tree model: for a given Markov network G, a tree is transformed from G
Figure BDA0003703933790000067
If and only if
Figure BDA0003703933790000071
Of (2) any two clusters C i ,
Figure BDA0003703933790000072
Of (2) intersection C i ∩C j Present in connection C i And C j In each node on a unique path between, call
Figure BDA0003703933790000073
Is a junction tree. Wherein:
Figure BDA0003703933790000074
known as junction trees
Figure BDA0003703933790000075
In the form of a mass of (a),
Figure BDA0003703933790000076
indicating the intersection of two adjacent blobs.
Examples are: FIG. 3 shows a junction tree constructed from the Markov network of FIG. 2, where the elliptical nodes are the cliques of the junction tree and the rectangular nodes represent separators (the concrete clique set and separator set are given in FIG. 3).
Note that the edges AD and AE in the Markov network of FIG. 2, drawn with dashed lines, are introduced by the junction tree algorithm and are not part of the original network; we call them chords.
For a data set D with attribute set A, given the structure of a junction tree T and the joint distributions Pr(C_i) and Pr(S_ij), the joint distribution of D may be expressed as:

$$\Pr(\mathcal{A}) = \frac{\prod_{C_i \in \mathcal{C}} \Pr(C_i)}{\prod_{S_{ij} \in \mathcal{S}} \Pr(S_{ij})}.$$
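For intuition, the following is a minimal numeric sketch (not taken from the patent; all probability values are made up) showing how the junction-tree factorization above reconstructs a joint distribution from clique and separator marginals, here with three binary attributes, cliques {A, B} and {B, C}, and separator {B}:

```python
# Toy sketch: reconstruct a 3-attribute joint distribution from junction-tree
# cliques C1={A,B}, C2={B,C} with separator S={B}, using
# Pr(A,B,C) = Pr(A,B) * Pr(B,C) / Pr(B).  All numbers are made up.
import numpy as np

pr_ab = np.array([[0.30, 0.10],   # Pr(A=a, B=b), rows: A, cols: B
                  [0.20, 0.40]])
pr_bc = np.array([[0.25, 0.25],   # Pr(B=b, C=c), rows: B, cols: C
                  [0.10, 0.40]])
pr_b = pr_ab.sum(axis=0)          # Pr(B); consistent with pr_bc.sum(axis=1)

joint = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            joint[a, b, c] = pr_ab[a, b] * pr_bc[b, c] / pr_b[b]

print(np.isclose(joint.sum(), 1.0))           # True: a valid joint distribution
print(np.allclose(joint.sum(axis=2), pr_ab))  # marginalizing C recovers Pr(A,B)
```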
3. Problem definition
The multidimensional data publishing problem under localized differential privacy is formally described as follows:
Assume that there are N data owners and 1 aggregation server. Each data owner U_k (1 ≤ k ≤ N) has a record x^k = (x_1^k, …, x_d^k) containing d discrete attributes {A_1, …, A_d}, where x_i^k represents user U_k's value of the i-th attribute A_i. The data of all users form a data set D. The aggregation server wants to learn the joint distribution Pr(A_1, …, A_d) of all attributes of D in order to generate a new synthetic data set for publication. To protect the privacy of individual users, the aggregation server collects the records of all N users under localized differential privacy and generates a synthetic data set D~ whose distribution approximates that of the original data set D, namely:

$$\Pr_{\widetilde{\mathcal{D}}}(A_1, \dots, A_d) \approx \Pr_{\mathcal{D}}(A_1, \dots, A_d),$$

where $\Pr_{\mathcal{D}}(A_1, \dots, A_d)$ denotes the d-dimensional joint distribution of data set D.
Example one
As shown in FIG. 1, the present embodiment provides a multidimensional data publishing method under local differential privacy based on incremental learning, including the following steps:
Step 1: learning the correlations of all attribute pairs by aggregating the perturbed data uploaded by a first batch of users;
Step 2: constructing a dependency graph model according to the correlations of the attribute pairs, and converting the constructed dependency graph model, via the junction tree algorithm, into a junction tree model composed of a plurality of cliques;
Step 3: based on a second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes it contains and its size, obtaining the distribution of every clique in the junction tree model;
Step 4: generating and publishing, by a sampling-based data generation method and according to the junction tree model and the distributions of its cliques, a synthetic data set containing the same number of records.
The present embodiment involves two types of entities: N data owners and 1 aggregation server. Each data owner U_k (1 ≤ k ≤ N) has a record x^k containing d discrete attributes. The aggregation server uses the original data set D composed of the data of the N data owners to generate a synthetic data set D~ with a similar distribution and publishes it, while guaranteeing the privacy protection requirement of each data owner.
In order to more clearly understand the technical contents of the present invention, the following detailed description is given;
step 1: and collecting the first batch of user data and calculating to generate a probability graph model.
The method specifically comprises the following substeps:
step 1.1: and grouping the first group of users.
The first batch of users U^1 is partitioned into T disjoint groups, namely U^1 = U_1 ∪ … ∪ U_T; each group U_t is used for the t-th iteration in step 1.3.
Step 1.2: the dependency graph model is initialized.
A dependency graph G is initialized, comprising d vertexes with a connecting edge between any two vertexes. Let E = {(i, j) | 0 ≤ i ≠ j < d} be the set of edges of G (i.e., the set of attribute pairs).
Step 1.3: constructing the dependency graph model.
In order to better construct the dependency graph model, the embodiment uses an incremental learning-based dependency graph model construction method, which consists of T iterations. In each iteration, some new data is collected for each attribute pair remaining in the set E of attribute pairs, respectively, and the correlations of these attribute pairs are re-estimated to eliminate attribute pairs with weaker correlations.
Taking the t-th iteration as an example, the method specifically comprises the following sub-steps:
step 1.3.1: and grouping the users participating in the iteration according to the residual attribute pairs.
The users U_t are grouped according to the edge set E of the current dependency graph G, namely U_t = ∪_{e∈E} U_t^e, where the users of group U_t^e are only used for data collection on attribute pair e in the t-th iteration.
Step 1.3.2: user data is collected. In consideration of privacy protection, a user needs to add an appropriate amount of perturbation to his or her own real data and report the perturbed result to the aggregation server.
Taking group U_t^e as an example and assuming e = (i, j), the collection specifically includes the following sub-steps:
Step 1.3.2.1: the user first converts his or her real values of attributes A_i and A_j into a single (combined) attribute value, and encodes the converted value with the one-hot encoding rule to obtain an encoding result S.
Step 1.3.2.2: a value of the parameter ε representing the privacy protection strength is set, and for the encoded value S obtained in step 1.3.2.1 the OUE algorithm outputs a perturbed result S' bit by bit with the following probabilities:

$$\Pr[S'[v] = 1] = \begin{cases} \dfrac{1}{2}, & \text{if } S[v] = 1,\\[4pt] \dfrac{1}{e^{\varepsilon} + 1}, & \text{if } S[v] = 0, \end{cases}$$

where S[v] denotes the v-th bit of the encoded vector.
after obtaining the perturbation result S', the user reports it to the aggregation server.
Step 1.3.3: from the perturbed data uploaded by the users, the aggregation server computes the joint distributions {P_t(A_i, A_j)} of all attribute pairs remaining in the current E using an unbiased estimation method, where the subscript t denotes that the distribution is the result computed from the data collected in the t-th iteration.
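The following is a minimal Python sketch of steps 1.3.2–1.3.3 using the standard OUE mechanism from the LDP literature; the function names, the flattening of an attribute pair into one categorical value, and the simple clip-and-normalize post-processing are illustrative assumptions of this sketch, not the patent's exact implementation:

```python
import numpy as np

def oue_perturb(value, domain_size, epsilon, rng):
    """One-hot encode `value`, then perturb each bit with the OUE probabilities."""
    p = 0.5                                # Pr[output bit = 1 | true bit = 1]
    q = 1.0 / (np.exp(epsilon) + 1.0)      # Pr[output bit = 1 | true bit = 0]
    onehot = np.zeros(domain_size, dtype=int)
    onehot[value] = 1
    probs = np.where(onehot == 1, p, q)
    return (rng.random(domain_size) < probs).astype(int)

def oue_estimate(reports, epsilon):
    """Unbiased frequency estimates from aggregated OUE reports (step 1.3.3)."""
    n = len(reports)
    p, q = 0.5, 1.0 / (np.exp(epsilon) + 1.0)
    est = (np.sum(reports, axis=0) / n - q) / (p - q)   # debiased counts
    est = np.clip(est, 0.0, None)                       # simple consistency step
    return est / est.sum() if est.sum() > 0 else est

rng = np.random.default_rng(0)
# A pair (A_i, A_j) with |Omega_i| = 4 and |Omega_j| = 3 is flattened into one
# attribute with 12 values before encoding, as described in step 1.3.2.1.
true_values = rng.integers(0, 12, size=20000)
reports = np.array([oue_perturb(v, 12, epsilon=1.0, rng=rng) for v in true_values])
joint_estimate = oue_estimate(reports, epsilon=1.0).reshape(4, 3)  # ~ Pr(A_i, A_j)
```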
Step 1.3.4: data accumulation. In each iteration, data for the attribute pairs that have not been pruned must be collected continuously and their joint distributions recalculated. By accumulating the data, more accurate distribution information is obtained, which improves the accuracy of judging the correlation of the non-pruned attribute pairs.
Meanwhile, in order to improve the efficiency of data accumulation, this embodiment provides an efficient accumulation method in which the accumulated result of the t-th round is obtained directly from the accumulated result of the (t-1)-th round and the computation result of the t-th round.
Specifically, for an arbitrary attribute combination 𝒜, assume that the t-th collection includes i_t attribute combinations 𝒜_1, …, 𝒜_{i_t} that contain 𝒜; let n_j denote the number of users uploading combination 𝒜_j, and let T_t^{𝒜_j} denote the marginal table of 𝒜 computed from 𝒜_j, 1 ≤ j ≤ i_t. Then the accumulated marginal table of 𝒜 after the t-th round, T_{≤t}^{𝒜}, is the user-count-weighted combination of the accumulated result of the first t-1 rounds and the results of the t-th round:

$$T_{\le t}^{\mathcal{A}} = \frac{n_{\le t-1}\, T_{\le t-1}^{\mathcal{A}} + \sum_{j=1}^{i_t} n_j\, T_t^{\mathcal{A}_j}}{n_{\le t-1} + \sum_{j=1}^{i_t} n_j},$$

where n_{≤t-1} denotes the total number of users accumulated in the first t-1 rounds. Through this calculation, the joint distributions P_{≤t}(A_i, A_j) of all attribute pairs remaining in E, accumulated with the results of the first t-1 rounds of data collection, are obtained.
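A small sketch of the accumulation rule of step 1.3.4, read as the user-count-weighted average above (the patent's exact bookkeeping over attribute combinations may differ; the tables below are made-up examples):

```python
import numpy as np

def accumulate(table_prev, n_prev, table_t, n_t):
    """Combine the table accumulated from n_prev users with round t's table."""
    table_new = (n_prev * table_prev + n_t * table_t) / (n_prev + n_t)
    return table_new, n_prev + n_t

# Three rounds of (made-up) noisy estimates of a 2x2 joint distribution.
acc, n_seen = np.zeros((2, 2)), 0
rounds = [(1000, np.array([[0.28, 0.22], [0.24, 0.26]])),
          (1000, np.array([[0.31, 0.19], [0.23, 0.27]])),
          (2000, np.array([[0.29, 0.21], [0.25, 0.25]]))]
for n_t, table_t in rounds:
    acc, n_seen = accumulate(acc, n_seen, table_t, n_t)
print(acc, n_seen)   # weighted average of all rounds, 4000 users in total
```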
Step 1.3.5: the mutual information is recalculated.
In this embodiment, the correlation between two attributes is measured by their mutual information. For every (i, j) in E, the mutual information I(A_i; A_j) of attribute pair A_i, A_j is computed as:

$$I(A_i; A_j) = \sum_{a_m \in \Omega_i} \sum_{a_n \in \Omega_j} \Pr(a_m, a_n) \log \frac{\Pr(a_m, a_n)}{\Pr(a_m)\,\Pr(a_n)},$$

where (A_i, A_j) is the attribute pair, Ω_i and Ω_j are the domains of attributes A_i and A_j respectively, Pr(a_m) and Pr(a_n) denote the marginal probabilities of the m-th value a_m of Ω_i and the n-th value a_n of Ω_j respectively, and Pr(a_m, a_n) denotes their joint probability. Here the joint distribution Pr(A_i, A_j) is the accumulated joint distribution P_{≤t}(A_i, A_j) obtained after the data accumulation of step 1.3.4, and Pr(a_m, a_n) is shorthand for its term Pr(A_i = a_m, A_j = a_n).
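A minimal sketch of the mutual-information computation of step 1.3.5 applied to an accumulated joint distribution table (illustrative only; the patent does not prescribe a particular implementation):

```python
import numpy as np

def mutual_information(joint):
    """I(A_i; A_j) from a 2-D joint probability table Pr(A_i, A_j)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # make sure it is normalized
    p_i = joint.sum(axis=1, keepdims=True)      # marginal Pr(A_i)
    p_j = joint.sum(axis=0, keepdims=True)      # marginal Pr(A_j)
    mask = joint > 0                            # treat 0 * log 0 as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_i @ p_j)[mask])))

print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))       # strongly correlated pair
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # independent pair -> 0.0
```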
Step 1.3.6: pruning invalid edges.
In general, for any attribute pair A_i, A_j, a predetermined correlation threshold θ is compared with its mutual information I(A_i; A_j) to decide whether the pair is correlated. However, because only part of the data is used to estimate the strength of the correlation, sampling error arises; moreover, in LDP, the smaller the amount of data, the larger the perturbation error. Directly using the correlation threshold θ may therefore erroneously prune attribute pairs that are in fact highly correlated.
In order to reduce the errors caused by sampling and perturbation and improve the accuracy of edge selection, this embodiment uses an edge pruning method based on threshold relaxation. For every (i, j) in E, the method includes the following sub-steps:
Step 1.3.6.1: calculate the correlation threshold θ from the given dependency parameter φ.
step 1.3.6.2: a scaled correlation threshold/is calculated.
Assume that a total of n records among the currently collected records contain attribute A i And A j A calculated from the n records i And A j Is represented by
Figure BDA00037039337900001113
Due to the randomness of the sampling and perturbation, will
Figure BDA00037039337900001114
Considered as a random variable. For true mutual information
Figure BDA00037039337900001115
And mutual experience information
Figure BDA00037039337900001116
Satisfies the following conditions:
Figure BDA00037039337900001117
wherein:
Figure BDA0003703933790000121
when in use
Figure BDA0003703933790000122
When Δ I (η) is log (M) i ) (ii) a Otherwise, the reverse is carried out
Figure BDA0003703933790000123
Figure BDA0003703933790000124
Thus, given a confidence level of 1- α, the scaled correlation threshold, l, satisfies:
Figure BDA0003703933790000126
then l can be calculated as:
Figure BDA0003703933790000128
step 1.3.6.3: re-estimating the correlation and pruning the edges. If it is not
Figure BDA0003703933790000129
Represents attribute pair A i ,A j There is still a high probability of having a strong correlation, so the edge e is kept (i, j) in the dependency graph G; instead, we delete the edge from the dependency graph G, i.e., E ═ E- { (i, j) }.
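Putting steps 1.3.1–1.3.6 together, the following high-level sketch shows the incremental pruning loop; `collect_pair_distribution` and `relaxed_threshold` stand for the sub-steps described above (data collection with OUE, and the threshold-relaxation rule), while `mutual_information` and `accumulate` may be the sketches given earlier. All of these names are assumptions of this sketch, not names used by the patent:

```python
from itertools import combinations

def build_dependency_graph(d, T, collect_pair_distribution, relaxed_threshold,
                           mutual_information, accumulate):
    edges = set(combinations(range(d), 2))      # step 1.2: complete graph on d vertexes
    acc = {e: None for e in edges}              # accumulated joint tables per pair
    users_seen = {e: 0 for e in edges}
    for t in range(T):                          # step 1.3: T iterations
        for e in list(edges):                   # steps 1.3.1-1.3.3: new noisy data
            table_t, n_t = collect_pair_distribution(e, t)
            if acc[e] is None:
                acc[e], users_seen[e] = table_t, n_t
            else:                               # step 1.3.4: accumulate with old rounds
                acc[e], users_seen[e] = accumulate(acc[e], users_seen[e], table_t, n_t)
        for e in list(edges):                   # steps 1.3.5-1.3.6: re-estimate and prune
            if mutual_information(acc[e]) < relaxed_threshold(e, users_seen[e]):
                edges.discard(e)                # drop weakly correlated attribute pair
    return edges, acc
```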
Step 1.4: junction tree construction. The dependency graph G constructed in step 1.3 is converted, via the junction tree algorithm, into a junction tree T composed of a set of cliques C = {C_1, …, C_m}, where clique C_i contains at least one attribute, |C_i| denotes the number of attributes in the clique, and ‖C_i‖ = ∏_{A∈C_i} |Ω_A| is the (domain) size of the clique.
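One possible way to implement step 1.4 is the classic construction sketched below: triangulate the dependency graph with a greedy min-degree elimination order, collect the maximal cliques, and connect them by a maximum spanning tree weighted by separator size. The patent does not prescribe a particular triangulation heuristic, so this is only an illustrative sketch:

```python
from itertools import combinations

def build_junction_tree(num_vertices, edges):
    adj = {v: set() for v in range(num_vertices)}
    for i, j in edges:
        adj[i].add(j); adj[j].add(i)
    # Greedy min-degree elimination; each eliminated vertex forms a clique.
    remaining, elim_cliques = set(range(num_vertices)), []
    while remaining:
        v = min(remaining, key=lambda u: len(adj[u] & remaining))
        nbrs = adj[v] & remaining
        for a, b in combinations(nbrs, 2):      # fill-in edges ("chords")
            adj[a].add(b); adj[b].add(a)
        elim_cliques.append(frozenset(nbrs | {v}))
        remaining.discard(v)
    cliques = [c for c in elim_cliques
               if not any(c < other for other in elim_cliques)]  # keep maximal cliques
    # Maximum spanning tree over the cliques, edge weight = separator size.
    tree, connected = [], {0}
    while len(connected) < len(cliques):
        i, j = max(((a, b) for a in connected for b in range(len(cliques))
                    if b not in connected),
                   key=lambda ab: len(cliques[ab[0]] & cliques[ab[1]]))
        tree.append((cliques[i], cliques[j], cliques[i] & cliques[j]))  # clique, clique, separator
        connected.add(j)
    return cliques, tree

# Example dependency graph over 5 attributes (0..4).
cliques, tree = build_junction_tree(5, [(0, 1), (1, 2), (2, 3), (1, 3), (3, 4)])
```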
Step 2: a second batch of user data is collected and the distribution of cliques of the junction tree is calculated.
After obtaining the junction tree T, in order to generate a synthetic data set, the joint distribution of every clique in the clique set C needs to be estimated.
The joint distribution of each clique. Since the number of attributes included in each clique and the size of the clique are different, for different types of cliques, the present embodiment estimates the distribution of the cliques by using different methods. The method specifically comprises the following substeps:
step 2.1: and (4) classifying the clusters.
The cliques are divided into two groups, C_S and C_L, according to their size. In detail, given an empirical clique size threshold σ, for each C_i in C: if ‖C_i‖ ≤ σ, clique C_i is called a small clique and is put into C_S; otherwise, if ‖C_i‖ > σ, clique C_i is called a large clique and is put into C_L.
Step 2.2: decomposing the large cliques.
Since large cliques tend to contain more attributes, directly estimating their distributions still faces the curse of dimensionality. For each C_i in C_L, by the chain rule the joint distribution can be decomposed into a product of conditional probabilities, namely:

$$\Pr(C_i) = \prod_{A_h \in C_i} \Pr(A_h \mid \Pi_h),$$

where Π_h denotes the conditioning set of A_h (the attributes preceding A_h in the decomposition order). Π_h may contain redundancy: given some of the attributes in Π_h, the remaining attributes of Π_h are conditionally independent of A_h. By deleting the redundancy in Π_h to obtain Π̃_h, the dimension can be reduced, because Π̃_h contains fewer attributes than Π_h. Intuitively, the order of decomposition affects how much redundancy can be deleted.
Therefore, the present embodiment uses a heuristic based on a forward search strategy to determine a better decomposition order.
Let the attribute set Q = C_i. The method comprises |C_i| loops; in each loop, one attribute A_h for which more redundancy can be deleted is selected from the remaining attributes of Q as the target attribute, the redundancy-deleted set Π̃_h is taken as its condition, and the key-value pair (A_h, Π̃_h) is stored in a dictionary. The joint distribution of clique C_i is then expressed as:

$$\Pr(C_i) = \prod_{(A_h, \tilde{\Pi}_h)} \Pr(A_h \mid \tilde{\Pi}_h),$$

where the product runs over all key-value pairs stored in the dictionary.
In detail, each of the |C_i| loops specifically includes the following sub-steps:
Step 2.2.1: if the domain size ‖Q‖ = ∏_{A∈Q} |Ω_A| of the current attribute set Q is less than or equal to the clique size threshold σ, i.e., ‖Q‖ ≤ σ, randomly select one attribute A_h ∈ Q as the target attribute and take all other attributes of Q, Q\A_h, as its condition; that is, the h-th term of the factorization is the conditional probability Pr(A_h | Q\A_h). Then delete A_h from Q, i.e., Q = Q\A_h.
Step 2.2.2: conversely, if | Q->σ for
Figure BDA0003703933790000141
The embodiment uses the attribute A j For the target attribute, a characteristic selection method of maximum-Redundancy-minimum-correlation (minimum-Redundancy-maximum-Relevance) is used for performing Q \ A on the attribute set j Performing redundancy elimination to obtain an attribute set after the redundancy elimination
Figure BDA0003703933790000142
The following results were calculated:
Figure BDA0003703933790000143
wherein: the mutual information values of any two attributes are the calculation results in step 1.
Based on the calculated redundancy elimination result
Figure BDA0003703933790000144
Selecting the order variance from the attribute set Q
Figure BDA0003703933790000145
Minimum attribute A h E.g. Q as target attribute, and making redundant elimination result
Figure BDA0003703933790000146
As a condition, then the h term of the factorization is
Figure BDA0003703933790000147
And let Q ═ Q \ A h
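A hedged sketch of the forward-search decomposition of steps 2.2.1–2.2.2; the mRMR scoring rule, the fixed size of the reduced conditioning set (`keep`), and the tie-breaking by smallest conditioning domain are assumptions of this sketch rather than the patent's exact rules, and `mi` is assumed to hold the pairwise mutual information values from step 1:

```python
from itertools import combinations

def domain_size(attrs, domains):
    size = 1
    for a in attrs:
        size *= domains[a]
    return size

def mrmr_reduce(target, candidates, mi, keep):
    """Keep `keep` candidates maximizing relevance to `target` minus redundancy."""
    chosen = []
    for _ in range(min(keep, len(candidates))):
        def score(a):
            relevance = mi[frozenset((target, a))]
            redundancy = (sum(mi[frozenset((a, b))] for b in chosen) / len(chosen)
                          if chosen else 0.0)
            return relevance - redundancy
        chosen.append(max((a for a in candidates if a not in chosen), key=score))
    return chosen

def decompose_clique(clique, domains, mi, sigma, keep=2):
    """Return (target, condition) factors whose product approximates Pr(clique)."""
    Q, factors = list(clique), []
    while Q:
        if domain_size(Q, domains) <= sigma:            # step 2.2.1: small enough
            target, condition = Q[0], [a for a in Q if a != Q[0]]
        else:                                           # step 2.2.2: reduce first
            best = None
            for a in Q:
                reduced = mrmr_reduce(a, [b for b in Q if b != a], mi, keep)
                if best is None or domain_size(reduced, domains) < domain_size(best[1], domains):
                    best = (a, reduced)
            target, condition = best
        factors.append((target, condition))
        Q.remove(target)
    return factors

# Toy usage: a 4-attribute clique, all domains of size 8, uniform pairwise MI.
mi = {frozenset(p): 0.1 for p in combinations(range(4), 2)}
print(decompose_clique({0, 1, 2, 3}, {a: 8 for a in range(4)}, mi, sigma=64))
```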
Step 2.3: a distribution estimation is performed using the second batch of user data.
In order to obtain the distribution of each clique, the joint distribution of some attribute combinations needs to be calculated, and the method specifically comprises the following sub-steps:
step 2.3.1: a combination of attributes is determined.
Set of settings
Figure BDA0003703933790000148
And
Figure BDA0003703933790000149
the method is used for storing the attribute combination needing to be distributed and the attribute combination with known distribution, and specifically comprises the following steps:
to pair
Figure BDA00037039337900001410
If it is
Figure BDA00037039337900001411
Then will be
Figure BDA00037039337900001412
Is put into
Figure BDA00037039337900001413
Performing the following steps; otherwise, it will
Figure BDA00037039337900001414
To pair
Figure BDA00037039337900001415
If | C i If | ≧ 3, then C is i Is put into
Figure BDA00037039337900001416
Performing the following steps; otherwise, store in
Figure BDA00037039337900001417
In (1).
Step 2.3.2: group the users. The second batch of users U^2 is divided into |M| disjoint groups, namely U^2 = ∪_{𝒜∈M} U^2_𝒜; each group U^2_𝒜 is used for the distribution calculation of one attribute combination 𝒜 in M.
Step 2.3.3: user data is collected and joint distributions are estimated.
For each attribute combination in M, the joint distribution is estimated with the same calculation steps used for the first batch of users in steps 1.3.3 and 1.3.4, which yields its distribution. For each attribute combination in K, the distribution is obtained directly from the calculation results of step 1.
Step 2.4: calculate the distributions of the cliques. For each large clique C_i in C_L, this embodiment estimates the joint distribution Pr(A_h, Π̃_h) of every factor; the conditional distribution Pr(A_h | Π̃_h) is then obtained from the joint distribution Pr(A_h, Π̃_h). Thus, according to the decomposition formula of the joint distribution in step 2.2, the joint distribution Pr(C_i) of each C_i in C_L is computed.
Step 3: a synthetic data set is generated.
According to the junction tree T obtained in step 1 and the distribution Pr(C_i) of each clique obtained in step 2, the aggregation server generates a synthetic data set D~ containing N records by a sampling-based data generation method.
Taking the generation of one record as an example, the method specifically comprises the following sub-steps:
step 3.1: randomly selecting a clique
Figure BDA00037039337900001512
According to its distribution Pr (C) i ) Sampling to obtain C i Sampling results of all attributes in
Figure BDA00037039337900001513
Then all and clusters C are selected i Associated groups
Figure BDA00037039337900001514
Step 3.2: to pair
Figure BDA00037039337900001515
We derived from the conditional distribution
Figure BDA00037039337900001516
Middle pair attribute
Figure BDA00037039337900001517
Sampling is performed, wherein:
Figure BDA00037039337900001518
representing an unsampled set of attributes, the conditional distribution can be from Pr (C) r ) And (4) obtaining. Then all are reacted with C r Insertion of connected and unaccessed blobs into
Figure BDA00037039337900001519
To the end of (c).
Step 3.3: step 3.2 is executed repeatedly until the sampled values of all attributes are obtained, completing the generation of one record.
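A minimal sketch of the sampling procedure of steps 3.1–3.3 on a toy junction tree with cliques C1 = {A, B} and C2 = {B, C}; the clique tables stand in for the (noisy) distributions estimated in step 2 and their values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
pr_c1 = np.array([[0.30, 0.10], [0.20, 0.40]])   # Pr(A, B), rows: A, cols: B
pr_c2 = np.array([[0.25, 0.25], [0.10, 0.40]])   # Pr(B, C), rows: B, cols: C

def sample_record():
    # Step 3.1: sample all attributes of a starting clique from Pr(C1).
    flat = rng.choice(4, p=pr_c1.ravel())
    a, b = divmod(flat, 2)
    # Step 3.2: sample the unsampled attribute C from Pr(C | B = b),
    # obtained by conditioning the adjacent clique's table Pr(B, C).
    cond = pr_c2[b] / pr_c2[b].sum()
    c = rng.choice(2, p=cond)
    return a, b, c

synthetic = [sample_record() for _ in range(10000)]   # N synthetic records
```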
In order to verify the effect of the proposed multidimensional data publishing method based on incremental learning under local differential privacy, referring to FIG. 4 (where PrivIncorr denotes the proposed method), the method of the invention is compared with existing methods on two public data sets, Adult and TPC-E. The comparison methods include a non-incremental variant NoIncremental, a non-private version NoPrivJTree, and two multidimensional k-way marginal publishing methods satisfying localized differential privacy, CALM and FT. Experimental results show that the synthetic data sets generated by the proposed method provide better utility.
Example two
The embodiment provides a multidimensional data distribution system based on local differential privacy of incremental learning, which comprises:
the correlation learning module is used for learning the correlation of all attribute pairs by aggregating disturbance data uploaded by a first group of users;
the junction tree model building module is used for constructing a dependency graph model according to the correlations of the attribute pairs and converting the constructed dependency graph model, via a junction tree algorithm, into a junction tree model composed of a plurality of cliques;
the clique distribution calculation module is used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method chosen according to the number of attributes it contains and its size, obtaining the joint distribution of every clique in the junction tree model;
and the data publishing module is used for generating, by a sampling-based data generation method and according to the junction tree model and the joint distributions of its cliques, a synthetic data set containing the same number of records, and publishing the data set.
EXAMPLE III
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in the local differential privacy based multidimensional data distribution method based on incremental learning as described above.
Example four
The present embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements the steps in the incremental learning-based local differential privacy multidimensional data distribution method as described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A local differential privacy multidimensional data publishing method based on incremental learning, characterized by comprising the following steps:
learning the correlations of all attribute pairs by aggregating the perturbed data of a first batch of users;
constructing a dependency graph model according to the correlations of the attribute pairs, and converting the constructed dependency graph model, via a junction tree algorithm, into a junction tree model composed of a plurality of cliques;
based on a second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes it contains and its size, obtaining the joint distribution of every clique in the junction tree model;
and generating, by a sampling-based data generation method and according to the junction tree model and the joint distributions of its cliques, a synthetic data set containing the same number of records, and publishing the data set.
2. The incremental learning-based multi-dimensional data publishing method for local differential privacy according to claim 1, wherein the building of the dependency graph model according to the correlation of attribute pairs comprises:
constructing by adopting a dependency graph model construction method based on incremental learning according to the edge set of the current dependency graph, wherein the method comprises the following steps: and performing T-round iteration, respectively collecting new data for each attribute pair remaining in the attribute pair set in each iteration, re-estimating the correlation between the attribute pairs, and removing the attribute pairs with weak correlation by adopting an edge pruning method based on threshold relaxation to obtain a pruned edge.
3. The incremental learning-based local differential privacy multidimensional data publishing method according to claim 2, wherein the correlation of an attribute pair is measured by the mutual information of the two attributes, and the mutual information of the attribute pair is calculated as:

$$I(A_i; A_j) = \sum_{a_m \in \Omega_i} \sum_{a_n \in \Omega_j} \Pr(a_m, a_n) \log \frac{\Pr(a_m, a_n)}{\Pr(a_m)\,\Pr(a_n)},$$

where A_i, A_j is the attribute pair, Ω_i and Ω_j are respectively the domains of attributes A_i and A_j, Pr(a_m) and Pr(a_n) respectively denote the marginal distributions of the m-th value a_m of Ω_i and the n-th value a_n of Ω_j, and Pr(a_m, a_n) denotes the joint distribution of a_m and a_n.
4. The incremental learning-based multi-dimensional data distribution method for local differential privacy as claimed in claim 2, wherein the step of removing the attribute pair with weak correlation by using an edge pruning method based on threshold relaxation comprises:
calculating a correlation threshold value based on the set dependency parameter;
calculating a scaled correlation threshold in combination with the correlation threshold, the given confidence level and mutual information of the recorded attribute pairs;
re-estimating the correlation; if the correlation of the attribute pair is greater than or equal to the scaled correlation threshold, the attributes are regarded as strongly correlated and the edge is kept in the dependency graph; otherwise, the edge is removed from the dependency graph.
5. The incremental learning-based local differential privacy multidimensional data distribution method according to claim 1, wherein the estimating the distribution of the cliques by using a corresponding estimation method according to the number and size types of the attributes included in each clique to obtain the joint distribution of each clique in the junction tree model comprises:
dividing all the cliques into two groups, large cliques and small cliques, according to the number of attributes they contain and their sizes;
confirming an optimal decomposition order by a heuristic method based on a forward search strategy, and decomposing the large cliques according to that order to obtain conditional distributions;
and obtaining the joint distribution of each clique in the junction tree model from the second batch of user data and the conditional distributions, based on the joint distribution formula.
6. The incremental learning-based local differential privacy multidimensional data publishing method according to claim 1, wherein confirming an optimal decomposition order by the heuristic method based on the forward search strategy and decomposing the cliques according to the optimal order comprises:
if |Q| ≤ σ, randomly selecting an attribute A_h ∈ Q as the target attribute and taking Q\A_h as its condition, i.e., the h-th term of the factorization is the conditional distribution Pr(A_h | Q\A_h);
if |Q| > σ, for each A_j ∈ Q, taking A_j as the target attribute, performing redundancy elimination on the attribute set Q\A_j with the minimum-redundancy-maximum-relevance feature selection method to obtain the redundancy-eliminated attribute set Π̃_j; selecting from the calculated redundancy-elimination results the attribute A_h ∈ Q with the minimum redundancy-eliminated result as the target attribute and taking Π̃_h as its condition, so that the h-th term of the factorization is the conditional distribution Pr(A_h | Π̃_h), and letting Q = Q\A_h;
where |Q| denotes the domain size of the current attribute set Q, σ is the clique size threshold, and Q\A_h denotes all attributes of the attribute set Q other than A_h.
7. The incremental learning-based local differential privacy multidimensional data distribution method as recited in claim 1, wherein the record generation process comprises:
randomly selecting one clique, sampling from its distribution to obtain the sampled values of all of its attributes, and then putting all cliques related to that clique into a queue;
for each clique C_r in the queue, sampling its unsampled attributes C_r^u from the conditional distribution Pr(C_r^u | C_r \ C_r^u), where C_r^u denotes the set of unsampled attributes and the conditional distribution is obtained from Pr(C_r), and then inserting all cliques connected to C_r that have not been visited at the end of the queue;
and repeatedly executing the sampling until the sampled values of all attributes are obtained, completing the generation.
8. The multidimensional data publishing system of the local differential privacy based on the incremental learning is characterized by comprising the following components:
the correlation learning module is used for learning the correlation of all attribute pairs by aggregating the disturbance data of the first batch of users;
the junction tree model building module is used for constructing a dependency graph model according to the correlations of the attribute pairs and converting the constructed dependency graph model, via a junction tree algorithm, into a junction tree model composed of a plurality of cliques;
the clique distribution calculation module is used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method chosen according to the number of attributes it contains and its size, obtaining the joint distribution of every clique in the junction tree model;
and the data publishing module is used for generating, by a sampling-based data generation method and according to the junction tree model and the joint distributions of its cliques, a synthetic data set containing the same number of records, and publishing the data set.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for multi-dimensional data distribution for local differential privacy based on incremental learning according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps in the method for multi-dimensional data distribution of local differential privacy based on incremental learning according to any one of claims 1 to 7.
CN202210699743.7A 2022-06-20 2022-06-20 Multi-dimensional data release method and system based on local differential privacy of incremental learning Active CN115098882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210699743.7A CN115098882B (en) 2022-06-20 2022-06-20 Multi-dimensional data release method and system based on local differential privacy of incremental learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210699743.7A CN115098882B (en) 2022-06-20 2022-06-20 Multi-dimensional data release method and system based on local differential privacy of incremental learning

Publications (2)

Publication Number Publication Date
CN115098882A true CN115098882A (en) 2022-09-23
CN115098882B CN115098882B (en) 2024-08-06

Family

ID=83292213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210699743.7A Active CN115098882B (en) 2022-06-20 2022-06-20 Multi-dimensional data release method and system based on local differential privacy of incremental learning

Country Status (1)

Country Link
CN (1) CN115098882B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329898A (en) * 2022-10-10 2022-11-11 国网浙江省电力有限公司杭州供电公司 Distributed machine learning method and system based on differential privacy policy
CN116795850A (en) * 2023-05-31 2023-09-22 山东大学 Method, device and storage medium for concurrent execution of massive transactions of alliance chains

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368888A (en) * 2020-02-25 2020-07-03 重庆邮电大学 Service function chain fault diagnosis method based on deep dynamic Bayesian network
CN113094746A (en) * 2021-03-31 2021-07-09 北京邮电大学 High-dimensional data publishing method based on localized differential privacy and related equipment
US20210216902A1 (en) * 2020-01-09 2021-07-15 International Business Machines Corporation Hyperparameter determination for a differentially private federated learning process
CN113569286A (en) * 2021-03-26 2021-10-29 东南大学 Frequent item set mining method based on localized differential privacy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210216902A1 (en) * 2020-01-09 2021-07-15 International Business Machines Corporation Hyperparameter determination for a differentially private federated learning process
CN111368888A (en) * 2020-02-25 2020-07-03 重庆邮电大学 Service function chain fault diagnosis method based on deep dynamic Bayesian network
CN113569286A (en) * 2021-03-26 2021-10-29 东南大学 Frequent item set mining method based on localized differential privacy
CN113094746A (en) * 2021-03-31 2021-07-09 北京邮电大学 High-dimensional data publishing method based on localized differential privacy and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SU Weihang; CHENG Xiang: "A high-dimensional data publishing algorithm based on latent tree models satisfying differential privacy", Journal of Chinese Computer Systems, no. 04, 15 April 2018 (2018-04-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329898A (en) * 2022-10-10 2022-11-11 国网浙江省电力有限公司杭州供电公司 Distributed machine learning method and system based on differential privacy policy
CN116795850A (en) * 2023-05-31 2023-09-22 山东大学 Method, device and storage medium for concurrent execution of massive transactions of alliance chains
CN116795850B (en) * 2023-05-31 2024-04-12 山东大学 Method, device and storage medium for concurrent execution of massive transactions of alliance chains

Also Published As

Publication number Publication date
CN115098882B (en) 2024-08-06

Similar Documents

Publication Publication Date Title
Zhu et al. Differential privacy and applications
CN115098882B (en) Multi-dimensional data release method and system based on local differential privacy of incremental learning
Han et al. Statistical analysis with linked data
Ding et al. A multiway p-spectral clustering algorithm
CN108334580A (en) A kind of community discovery method of combination link and attribute information
Liseo et al. Bayesian estimation of population size via linkage of multivariate normal data sets
Livi et al. Graph ambiguity
Tortora et al. Model-based clustering, classification, and discriminant analysis using the generalized hyperbolic distribution: MixGHD R package
Vengatesan et al. Improved T-Cluster based scheme for combination gene scale expression data
Kocayusufoglu et al. Summarizing network processes with network-constrained Boolean matrix factorization
Souravlas et al. Probabilistic community detection in social networks
Huang et al. Statistical inference of diffusion networks
Salinas-Gutiérrez et al. D-vine EDA: a new estimation of distribution algorithm based on regular vines
Liu et al. Variational inference for latent space models for dynamic networks
Ting et al. DEMass: a new density estimator for big data
Ibrahim et al. Mixed membership graph clustering via systematic edge query
Yu et al. Fragmentation coagulation based mixed membership stochastic blockmodel
Nabney et al. Semisupervised learning of hierarchical latent trait models for data visualization
Truong et al. Boolean matrix factorization via nonnegative auxiliary optimization
Poli et al. A genetic algorithm for graphical model selection
Maillart et al. Tail index partition-based rules extraction with application to tornado damage insurance
Mirtaheri et al. Tensor-based method for temporal geopolitical event forecasting
Hu et al. An improved possibilistic clustering based on differential algorithm
Karwa et al. Monte Carlo goodness-of-fit tests for degree corrected and related stochastic blockmodels
Murari et al. A Practical Utility-Based but Objective Approach to Model Selection for Scientific Applications in the Age of Big Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant