CN115098882A - Local differential privacy multidimensional data publishing method and system based on incremental learning - Google Patents
- Publication number
- CN115098882A CN115098882A CN202210699743.7A CN202210699743A CN115098882A CN 115098882 A CN115098882 A CN 115098882A CN 202210699743 A CN202210699743 A CN 202210699743A CN 115098882 A CN115098882 A CN 115098882A
- Authority
- CN
- China
- Prior art keywords
- attribute
- distribution
- data
- correlation
- junction tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The invention belongs to the field of data security and privacy protection, and provides an incremental-learning-based local differential privacy multidimensional data publishing method and system. The correlations of all attribute pairs are learned by aggregating a first batch of user perturbation data; a dependency graph model is constructed according to the attribute-pair correlations, and the constructed dependency graph model is converted by the junction tree algorithm into a junction tree model consisting of a plurality of cliques; based on a second batch of user data, the distribution of each clique is estimated with an estimation method chosen according to the number of attributes the clique contains and the clique size, obtaining the joint distribution of each clique in the junction tree model; and, according to the junction tree model and the joint distribution of each clique in it, a synthetic data set containing the same number of records is generated by a sampling-based data generation method and published.
Description
Technical Field
The invention belongs to the field of data security and privacy protection, and particularly relates to a local differential privacy multidimensional data publishing method and system based on incremental learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the multidimensional data publishing problem under localized differential privacy, each user individual holds a record containing multiple discrete attributes (continuous attributes are discretized into a fixed number of equal-width ranges), such as census data. In practice, a data analyst wants to be able to perform any type of data analysis or mining on a data set to extract the vast amount of underlying information behind the data and provide accurate and reliable predictions for populations and individuals. Therefore, the aggregation server needs to collect and publish the data owned by all user individuals.
However, such data often contains sensitive information about the individual user, and users do not want to share their actual data with any third-party data collector. Therefore, a multidimensional data publishing approach that satisfies localized differential privacy is needed.
Localized differential privacy is a strict and quantifiable privacy protection model. It does not rely on any third-party entity that declares itself trustworthy: privacy protection is provided for the real data of each user from the perspective of the user individual, so that even if the third-party aggregation server is malicious, the privacy of the user individual is guaranteed not to be revealed. In this model, a user locally perturbs his or her real data by adding noise of a certain scale and then uploads the perturbed data to the aggregation server. After receiving the perturbation data uploaded by all users, the aggregation server can only obtain some statistical information through calculation and cannot infer any personal sensitive information about the users from it.
Based on this model, existing work has proposed some solutions to this problem. In existing work, the aggregation server first collects the complete data of all users at once. In order to support all subsequent calculations while satisfying localized differential privacy, each user perturbs his or her whole record and sends the perturbed result to the aggregator for aggregation. The aggregation server aggregates the perturbation data uploaded by the users and, using an Expectation Maximization (EM) algorithm, provides any distribution information required for constructing the probabilistic graphical model and generating data, namely the joint distribution information of all attribute pairs and the distribution information of each clique in the junction tree.
However, the above method has the following technical problems:
(i) when constructing the probabilistic graph model, it is necessary to calculate the correlation of all the paired attributes to determine the structure of the dependency graph. However, for multi-dimensional data, there are a large number of attribute pairs. To satisfy localized differential privacy, directly computing the correlation of all these pairwise attributes results in a large amount of noise being injected into the results, which severely degrades the accuracy of the dependency graph structure and the synthesized data.
(ii) When the distribution of the cliques of the junction tree is estimated, some big cliques may contain too many attributes, so the high-dimensionality problem is still faced when calculating their distribution, and this case is not handled well.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the present invention provides an incremental-learning-based local differential privacy multidimensional data publishing method and system. A probabilistic graphical model is constructed by an incremental-learning-based method; the model is then used to generate a set of noisy low-dimensional distributions, which in turn approximate the overall distribution of the input data set so as to generate a synthetic data set. Compared with existing methods, the method can provide privacy protection for each user individual while significantly improving the accuracy of data publication.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a multidimensional data publishing method based on local differential privacy of incremental learning, which comprises the following steps:
learning the correlation of all attribute pairs by aggregating the first batch of user disturbance data;
constructing a dependency graph model according to the attribute-pair correlations, and converting the constructed dependency graph model into a junction tree model consisting of a plurality of cliques through the junction tree algorithm;
based on the second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes the clique contains and the clique size, to obtain the joint distribution of each clique in the junction tree model;
and generating, according to the junction tree model and the joint distribution of each clique in it, a synthetic data set containing the same number of records by a sampling-based data generation method, and publishing the data set.
A second aspect of the present invention provides a multidimensional data distribution system for local differential privacy based on incremental learning, comprising:
the correlation learning module is used for learning the correlation of all attribute pairs by aggregating the disturbance data of the first batch of users;
the junction tree model building module is used for constructing a dependency graph model according to the attribute-pair correlations and converting the constructed dependency graph model into a junction tree model consisting of a plurality of cliques through the junction tree algorithm;
the clique distribution calculation module is used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method chosen according to the number of attributes the clique contains and the clique size, to obtain the joint distribution of each clique in the junction tree model;
and the data publishing module is used for generating, according to the junction tree model and the joint distribution of each clique in it, a synthetic data set containing the same number of records by a sampling-based data generation method, and publishing the data set.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for multi-dimensional data distribution of local differential privacy based on incremental learning as described above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the method for multi-dimensional data distribution based on incremental learning-based local differential privacy as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the method adopts an incremental learning-based method to construct a probability graph model, gradually prunes weak-correlation attribute pairs, allocates more data and privacy budgets to useful attribute pairs to correctly identify the correlation among all attribute pairs in an attribute set, thereby being capable of more effectively constructing the probability graph model, then utilizes the probability graph model to generate a group of low-dimensional distributions with noise, and then uses the low-dimensional distributions to approximate the overall distribution of an input data set so as to generate a synthetic data set.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flowchart illustrating a multidimensional data publishing method based on incremental learning local differential privacy according to an embodiment of the present invention;
FIG. 2 is a Markov network according to an embodiment of the present invention;
FIG. 3 is a junction tree of the Markov network according to an embodiment of the present invention;
FIG. 4 is a graph comparing the algorithmic effects of the method of the present invention and existing methods.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In a localization scenario, each user individual has a piece of data containing multiple attributes. In order to complete the data distribution task, data of all user individuals needs to be collected for data distribution. However, the data often contains sensitive information about the individual user. Therefore, there is a need to address the issue of multidimensional data distribution that satisfies localized differential privacy.
In order to solve the first technical problem in the background art of the application, the invention gradually prunes the attribute pairs with weak correlation, and allocates more data and privacy budgets to the useful attribute pairs so as to correctly identify the correlation among all the attribute pairs in the attribute set, thereby being capable of more effectively constructing a probability graph model and remarkably improving the quality of a synthetic data set.
In order to solve the second technical problem in the background art of the present application, the present invention further provides a new large clique distribution calculation method based on joint distribution decomposition and redundancy elimination, which effectively solves the distribution estimation of the large cliques.
Interpretation of terms:
The invention relates to two key elements: localized differential privacy and the junction tree model.
These two elements are first described below; and then, a formalized definition of a multidimensional data publishing problem under a localized differential privacy scene is given.
1. Localized differential privacy
In the local setting of differential privacy, an untrusted aggregator wishes to collect users' personal information to accomplish corresponding data analysis tasks. Localized differential privacy (local differential privacy, LDP) provides randomized response algorithms that satisfy the data analysis needs of the aggregation server while protecting the privacy of users. Localized differential privacy is defined as follows:
localized differential privacy: a randomized algorithm M whose domain and range are D and R respectively satisfies ε-localized differential privacy, with ε > 0, if for any output y ∈ R and any two inputs x_1, x_2 ∈ D the following condition holds:

Pr[M(x_1) = y] ≤ e^ε · Pr[M(x_2) = y]
the privacy parameter epsilon measures the privacy protection strength of individual sensitive information, and smaller epsilon represents greater privacy protection strength.
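As an illustration of the definition above (illustrative only, not part of the claimed method), the classical binary randomized response mechanism satisfies ε-localized differential privacy; the sketch below checks that the worst-case ratio of its output probabilities equals e^ε:

```python
import math
import random

def randomized_response(x: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit x in {0, 1} with probability e^eps / (e^eps + 1),
    otherwise report the flipped bit."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return x if rng.random() < p else 1 - x

def worst_case_ratio(epsilon: float) -> float:
    """Pr[M(x) = y] / Pr[M(x') = y] for the worst pair of inputs; this must
    not exceed e^eps for the mechanism to satisfy eps-LDP."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return p / (1.0 - p)

rng = random.Random(0)
report = randomized_response(1, 1.0, rng)
```

A smaller ε pushes p toward 1/2, making the two inputs less distinguishable, which matches the statement that smaller ε gives stronger protection.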
2. Junction tree model
To overcome the curse of dimensionality, it is critical to find conditional independences between attributes in the real data set, so as to break the joint probability distribution into modular components. Probabilistic graphical models are elegant tools for identifying such modular structure, and Markov networks are the most widely used graphical models based on undirected graphs. The junction tree algorithm provides a feasible method for exactly inferring the joint probability distribution from a Markov network. The junction tree is defined as follows:
a junction tree model: for a given Markov network G, let T be a tree transformed from G whose nodes are cliques of G. T is called a junction tree if and only if, for any two cliques C_i, C_j of T, their intersection C_i ∩ C_j is contained in every node on the unique path connecting C_i and C_j. The sets C_1, …, C_m are known as the cliques of the junction tree, and S_ij = C_i ∩ C_j denotes the separator between two adjacent cliques.
Example: FIG. 3 shows a junction tree constructed from the Markov network of FIG. 2, where the elliptical nodes are the cliques of the junction tree and the rectangular nodes represent the separators. Note that the edges AD and AE, connected by dashed lines in the Markov network of FIG. 2, are introduced by the junction tree algorithm and do not belong to the original network; we call them chords.
For a data set D with attribute set A = {A_1, …, A_d}, given the structure of the junction tree T and the joint distributions Pr(C_i) and Pr(S_ij), the joint distribution of D may be expressed as:

Pr(A_1, …, A_d) = ∏_i Pr(C_i) / ∏_{(i,j)} Pr(S_ij)
3. problem definition
The formalization of the multidimensional data distribution problem to satisfy localized differential privacy is described as follows:
assume that there are N data owners and 1 aggregation server. Each data owner U k (1. ltoreq. k. ltoreq.N) has a chain containing d discrete attributes { A ≦ 1 ,…,A d Recording of } ofWhereinRepresenting a user U k The ith attribute A of i The value of (c). The data of all users form a data setThe aggregation server wants to know about the data setCombined distribution of all attributes Pr (A) 1 ,…, d ) To generate a new composite data set for publication. In order to protect the privacy of individual users, the aggregation server collects the records of all N users on the premise of meeting the localized differential privacy and generates a data set which is identical to the original data setSynthetic datasets with approximate distributionsNamely:
Example one
As shown in fig. 1, the present embodiment provides a multidimensional data publishing method based on local differential privacy of incremental learning, including the following steps:
step 1: learning the correlation of all attribute pairs by aggregating disturbance data uploaded by a first batch of users;
Step 2: constructing a dependency graph model according to the attribute-pair correlations, and converting the constructed dependency graph model into a junction tree model consisting of a plurality of cliques through the junction tree algorithm;
Step 3: based on the second batch of user data, estimating the distribution of each clique with an estimation method chosen according to the number of attributes the clique contains and the clique size, to obtain the distribution of each clique in the junction tree model;
Step 4: generating and publishing a synthetic data set containing the same number of records by a sampling-based data generation method, according to the junction tree model and the distribution of each clique in it.
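The sampling-based generation of Step 4 can be sketched as follows for an assumed junction tree ({A,B}) - [B] - ({B,C}) with assumed clique and separator distributions: sample the root clique from its joint distribution, then sample the remaining attribute from the conditional distribution given the separator value:

```python
import random

# Assumed clique distributions Pr(A,B), Pr(B,C) and separator distribution Pr(B)
pAB = {(0, 0): 0.42, (0, 1): 0.18, (1, 0): 0.08, (1, 1): 0.32}
pBC = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.25, (1, 1): 0.25}
pB = {0: 0.50, 1: 0.50}

def sample_record(rng: random.Random):
    # Sample the root clique {A, B} from its joint distribution ...
    keys = list(pAB)
    a, b = rng.choices(keys, weights=[pAB[k] for k in keys])[0]
    # ... then sample C from the conditional Pr(C | B) = Pr(B, C) / Pr(B)
    c = rng.choices([0, 1],
                    weights=[pBC[(b, 0)] / pB[b], pBC[(b, 1)] / pB[b]])[0]
    return a, b, c

rng = random.Random(7)
synthetic = [sample_record(rng) for _ in range(10000)]
```

Repeating `sample_record` once per original record yields a synthetic data set of the same size whose distribution follows the junction-tree factorization.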
The present embodiment involves two types of entities: N data owners and 1 aggregation server. Each data owner U_k (1 ≤ k ≤ N) has a record containing d discrete attributes. The aggregation server uses the original data set D composed of the data of the N data owners to generate a synthetic data set D* with a distribution similar to that of D, publishes it, and guarantees the privacy protection requirement of each data owner.
In order to more clearly understand the technical contents of the present invention, the following detailed description is given;
step 1: and collecting the first batch of user data and calculating to generate a probability graph model.
The method specifically comprises the following substeps:
step 1.1: and grouping the first group of users.
The first batch of users U^(1) is partitioned into T disjoint groups, namely U^(1) = U_1 ∪ … ∪ U_T, where group U_t is used for the t-th iteration in step 1.3.
Step 1.2: the dependency graph model is initialized.
A dependency graph G is initialized, wherein G comprises d vertices and every two vertices are connected by an edge. Let E = {(i, j) | 0 ≤ i ≠ j < d} be the set of edges of G (i.e., the set of attribute pairs).
Step 1.3: and constructing a dependency graph model.
In order to better construct the dependency graph model, the embodiment uses an incremental learning-based dependency graph model construction method, which consists of T iterations. In each iteration, some new data is collected for each attribute pair remaining in the set E of attribute pairs, respectively, and the correlations of these attribute pairs are re-estimated to eliminate attribute pairs with weaker correlations.
Taking the t-th iteration as an example, the method specifically comprises the following sub-steps:
step 1.3.1: and grouping the users participating in the iteration according to the residual attribute pairs.
The users are grouped according to the edge set E of the current dependency graph G: U_t = {U_t^e | e ∈ E}, where the users in group U_t^e are used only for data collection on attribute pair e in the t-th iteration.
Step 1.3.2: user data is collected. In consideration of privacy protection, a user needs to add a proper amount of disturbance to own real data and report the disturbance result to the aggregation server.
Taking group U_t^e as an example, and assuming that e = (i, j), the collection specifically includes the following sub-steps:
Step 1.3.2.1: the user first converts his or her real data on attributes A_i and A_j into a single attribute, and encodes the converted attribute value using the one-hot encoding rule to obtain an encoding result S.
Step 1.3.2.2: a value of the parameter ε representing the privacy protection strength is set, and the OUE algorithm is applied to the user's encoded value S obtained in step 1.3.2.1, outputting each bit S'_k of the perturbation result S' with the following probabilities:

Pr[S'_k = 1] = 1/2 if S_k = 1, and Pr[S'_k = 1] = 1/(e^ε + 1) if S_k = 0
Step 1.3.3: the aggregation server calculates, by an unbiased estimation method, the joint distributions {P_t(A_i, A_j)} of all attribute pairs remaining in the current E from the perturbation data uploaded by the users, where the subscript t denotes that the distribution is the result calculated from the data collected in the t-th iteration.
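A minimal sketch of OUE perturbation and the aggregation server's unbiased frequency estimation follows; the parameter choices p = 1/2 and q = 1/(e^ε + 1) are those of the standard OUE mechanism, while the simulation setup (domain size, user count, true frequencies) is assumed for illustration:

```python
import math
import random

def oue_perturb(value: int, domain: int, epsilon: float, rng: random.Random):
    """Optimized Unary Encoding: one-hot encode `value`, keep the 1-bit with
    probability p = 1/2, and flip each 0-bit to 1 with probability
    q = 1 / (e^eps + 1)."""
    q = 1.0 / (math.exp(epsilon) + 1.0)
    return [(1 if rng.random() < 0.5 else 0) if i == value
            else (1 if rng.random() < q else 0)
            for i in range(domain)]

def oue_estimate(reports, epsilon: float):
    """Unbiased frequency estimate from aggregated OUE reports:
    f_i = (c_i / n - q) / (p - q), with p = 1/2 and q = 1/(e^eps + 1)."""
    n = len(reports)
    p, q = 0.5, 1.0 / (math.exp(epsilon) + 1.0)
    counts = [sum(r[i] for r in reports) for i in range(len(reports[0]))]
    return [(c / n - q) / (p - q) for c in counts]

# Simulate 20000 users whose true values follow the assumed frequencies
rng = random.Random(42)
true_freq = [0.5, 0.3, 0.2]
values = rng.choices(range(3), weights=true_freq, k=20000)
reports = [oue_perturb(v, 3, 2.0, rng) for v in values]
est = oue_estimate(reports, 2.0)
```

With enough users the estimates converge to the true frequencies, which is what makes the aggregated joint distributions usable despite per-user perturbation.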
Step 1.3.4: data accumulation. In each iteration, data for the attribute pairs that have not been pruned must be collected continuously and their joint distributions recalculated. Accumulating the data yields more accurate distribution information and improves the accuracy of judging the correlation of the unpruned attribute pairs.
Meanwhile, to improve the efficiency of data accumulation, this embodiment provides a more efficient accumulation method: the accumulation result of the t-th round can be obtained directly from the accumulation result of the (t-1)-th round and the calculation result of the t-th round.
Specifically, for an arbitrary attribute combination e remaining in E, assume that in the t-th collection n_t^e users upload data for e, and let M_t^e denote the marginal table computed from the data of round t. Writing N_{t-1}^e = Σ_{r=1}^{t-1} n_r^e for the total number of users accumulated over the first t-1 rounds, the accumulated marginal table of round t can be expressed as the user-count-weighted combination:

M̃_t^e = (N_{t-1}^e · M̃_{t-1}^e + n_t^e · M_t^e) / (N_{t-1}^e + n_t^e)

Through this calculation, the accumulated joint distributions of all attribute pairs remaining in E, combining the round-t data with the results of the first t-1 rounds, can be obtained.
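One plausible implementation of this recursive accumulation is sketched below, assuming the natural weighting by the number of contributing users (which keeps the combined table unbiased whenever each round's table is unbiased):

```python
def accumulate(prev_table, prev_n, new_table, new_n):
    """Combine the accumulated marginal table of rounds 1..t-1 with the table
    estimated in round t, weighting each by the number of users behind it
    (an assumed user-count weighting)."""
    total = prev_n + new_n
    keys = set(prev_table) | set(new_table)
    acc = {k: (prev_n * prev_table.get(k, 0.0) + new_n * new_table.get(k, 0.0)) / total
           for k in keys}
    return acc, total

# Round t-1 aggregate (300 users) combined with round t's estimate (100 users)
acc, n = accumulate({("a", "x"): 0.40, ("a", "y"): 0.60}, 300,
                    {("a", "x"): 0.60, ("a", "y"): 0.40}, 100)
```

Only the running table and the running user count need to be stored, so each round's update costs a single pass over the marginal table.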
Step 1.3.5: the mutual information is recalculated.
In this embodiment, the correlation between two attributes is measured by their mutual information. For an attribute pair A_i, A_j, the mutual information I(A_i; A_j) is calculated as:

I(A_i; A_j) = Σ_{a_m ∈ Ω_{A_i}} Σ_{a_n ∈ Ω_{A_j}} Pr(a_m, a_n) · log( Pr(a_m, a_n) / (Pr(a_m) · Pr(a_n)) )

where (A_i, A_j) is the attribute pair, Ω_{A_i} and Ω_{A_j} are the domains of A_i and A_j, Pr(a_m) and Pr(a_n) respectively denote the marginal distributions of the m-th value a_m of Ω_{A_i} and the n-th value a_n of Ω_{A_j}, and Pr(a_m, a_n) denotes the joint distribution of a_m and a_n.
Here the joint distribution Pr(A_i, A_j) is the accumulated joint distribution obtained after the data accumulation of step 1.3.4, and Pr(a_m, a_n) is an abbreviation of the term Pr(A_i = a_m, A_j = a_n) in that joint distribution.
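The mutual information computation can be sketched as follows (natural logarithm assumed; the dictionary encodes the joint distribution with keys (a_m, a_n)):

```python
import math

def mutual_information(joint):
    """I(Ai; Aj) in nats from a joint distribution {(a_m, a_n): Pr(a_m, a_n)}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p   # marginal Pr(a_m)
        pb[b] = pb.get(b, 0.0) + p   # marginal Pr(a_n)
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Independent attributes give I = 0; perfectly correlated ones give ln(2)
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
corr = {(0, 0): 0.5, (1, 1): 0.5}
```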
Step 1.3.6: trimming the invalid edge.
In general, for any attribute pair A_i, A_j, a predetermined correlation threshold is compared with the mutual information I(A_i; A_j) to determine whether correlation exists between the attribute pair. However, because only part of the data is used to estimate the strength of the attribute-pair correlation, sampling error results; meanwhile, under LDP, the smaller the data amount, the larger the perturbation error. Directly using the correlation threshold may therefore cause otherwise highly correlated attribute pairs to be pruned erroneously.
In order to reduce the errors caused by sampling and perturbation and improve the accuracy of edge selection, this embodiment uses an edge pruning method based on threshold relaxation, which specifically includes the following sub-steps:
Step 1.3.6.2: the relaxed correlation threshold l is calculated.
Assume that among the currently collected records, a total of n records contain attributes A_i and A_j, and denote by Î(A_i; A_j) the mutual information of A_i and A_j calculated from these n records. Due to the randomness of sampling and perturbation, Î(A_i; A_j) is regarded as a random variable concentrating around the true mutual information I(A_i; A_j), and the deviation between the two satisfies a confidence bound.
Thus, given a confidence level of 1 − α, the relaxed correlation threshold l is determined from this bound, and l can then be calculated accordingly.
Step 1.3.6.3: re-estimate the correlation and prune the edges. If Î(A_i; A_j) ≥ l, the attribute pair A_i, A_j still has a high probability of being strongly correlated, so the edge e = (i, j) is kept in the dependency graph G; otherwise, the edge is deleted from G, i.e., E = E − {(i, j)}.
Step 1.4: junction tree construction. The dependency graph G constructed in step 1.3 is converted, through the junction tree algorithm, into a junction tree T consisting of a set of cliques C = {C_1, …, C_m}, where each clique C_i contains at least one attribute, |C_i| denotes the number of attributes in the clique, and ∏_{A ∈ C_i} |Ω_A| is the size of the clique.
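A common way to realize this conversion, given the maximal cliques of a (triangulated) dependency graph, is to connect the cliques by a maximum-weight spanning tree whose edge weights are the sizes of the clique intersections; the shared attributes on each retained edge form the separator. The sketch below assumes the maximal cliques are already available:

```python
def junction_tree(cliques):
    """Build a junction tree from the maximal cliques of a triangulated
    dependency graph: connect cliques with a maximum-weight spanning tree
    (Kruskal), weight = |C_i ∩ C_j|; each chosen edge's shared attributes
    form the separator S_ij."""
    edges = sorted(
        ((len(set(cliques[i]) & set(cliques[j])), i, j)
         for i in range(len(cliques)) for j in range(i + 1, len(cliques))),
        reverse=True)
    parent = list(range(len(cliques)))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, i, j in edges:
        if w > 0 and find(i) != find(j):
            parent[find(i)] = find(j)
            tree.append((i, j, tuple(sorted(set(cliques[i]) & set(cliques[j])))))
    return tree

cliques = [("A", "B", "C"), ("B", "C", "D"), ("C", "D", "E")]
tree = junction_tree(cliques)
```

For these example cliques the result links clique 0 to 1 through separator (B, C) and clique 1 to 2 through separator (C, D), satisfying the running-intersection property of the definition above.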
Step 2: a second batch of user data is collected and the distribution of cliques of the junction tree is calculated.
After the junction tree T is obtained, in order to generate a synthetic data set, the joint distribution of each clique in the clique set C needs to be estimated. Since the cliques differ in the number of attributes they contain and in size, this embodiment estimates the distributions of different types of cliques by different methods, specifically including the following sub-steps:
step 2.1: and (4) classifying the clusters.
The cliques are divided into two groups according to clique size. In detail, given an empirical clique size threshold σ, for each clique C_i: if the size of C_i does not exceed σ, C_i is called a small clique and placed in the small-clique group; otherwise, C_i is called a large clique and placed in the large-clique group.
Step 2.2: decomposing the large cliques.
Since large cliques tend to contain more attributes, directly estimating their distribution still faces the curse of dimensionality. For a large clique, according to the chain rule, its joint distribution can be decomposed into a product of conditional probabilities, namely:

Pr(C_i) = ∏_h Pr(A_h | Π_h)

where Π_h denotes the conditioning set of attribute A_h.
Π_h may contain redundancy: when some attributes in Π_h are given, the remaining attributes of Π_h are conditionally independent of A_h. By deleting the redundant attributes from Π_h to obtain Π̃_h, the dimension can be reduced, because Π̃_h contains fewer attributes than Π_h. Intuitively, the order of decomposition affects how much redundancy can be deleted.
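The redundancy deletion rests on conditional independence: in the toy chain below (assumed numbers), C is conditionally independent of A given B, so A can be removed from the conditioning set of C without changing the distribution:

```python
# Chain A -> B -> C: given B, attribute C is conditionally independent of A,
# so A is redundant in the conditioning set of C and Pr(C|A,B) = Pr(C|B).
pA = {0: 0.3, 1: 0.7}
pB_given_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}
pC_given_B = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    return pA[a] * pB_given_A[a][b] * pC_given_B[b][c]

def cond_c_given_ab(a, b, c):
    # Pr(C=c | A=a, B=b) computed from the full joint distribution
    return joint(a, b, c) / sum(joint(a, b, cc) for cc in (0, 1))

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            # dropping A from the condition leaves the probability unchanged
            assert abs(cond_c_given_ab(a, b, c) - pC_given_B[b][c]) < 1e-12
```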
Therefore, the present embodiment uses a heuristic based on a forward search strategy to determine a better decomposition order.
Order attribute setThe method comprisesAnd (4) secondary circulation, wherein each circulation selects one attribute A capable of deleting more redundancies from the rest attributes in the attribute set Q h As a result of the target attribute and having the redundancy deletedAs a condition, key value pairForm of (2) is stored in a dictionaryIn (1).
The joint distribution of clique C_i is then expressed as:

Pr(C_i) = ∏_h Pr(A_h | Π̃_h)
Step 2.2.1: if the domain size |Q| = ∏_{A∈Q} |Ω_A| of the current attribute set Q is at most the clique size threshold σ, i.e. |Q| ≤ σ, randomly select one attribute A_h ∈ Q as the target attribute and take all the other attributes Q\{A_h} as the condition; that is, the h-th term of the factorization is the conditional probability Pr(A_h | Q\{A_h}). Then delete A_h from Q, i.e. Q = Q\{A_h};
Step 2.2.2: conversely, if |Q| > σ, then for each attribute A_j ∈ Q, the embodiment takes A_j as the target attribute and applies the minimum-Redundancy-Maximum-Relevance (mRMR) feature selection method to the attribute set Q\{A_j} to delete redundancy, obtaining the reduced attribute set Π̃_j.
For each candidate attribute A ∈ Q\{A_j}, an mRMR score balancing relevance to the target against redundancy with the already-selected attributes is calculated, of the form:

score(A) = I(A; A_j) − (1/|Π̃_j|) · Σ_{A'∈Π̃_j} I(A; A')

where the mutual information I(·;·) of any two attributes is taken from the calculation results of step 1.
Based on the calculated redundancy-deleted sets, the attribute A_h ∈ Q whose reduced condition set Π̃_h has the minimum domain size is selected as the target attribute, with Π̃_h as its condition; the h-th term of the factorization is then Pr(A_h | Π̃_h), and Q = Q\{A_h}.
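The decomposition loop of step 2.2 can be sketched as below. The mRMR-based redundancy deletion is abstracted into a caller-supplied `reduce_conditions` placeholder (here a trivial stub), so this is a structural sketch under that assumption rather than the patent's exact procedure; all names are illustrative:

```python
from math import prod

def decompose(clique, domain_sizes, sigma, reduce_conditions):
    """Return a factorization order [(target, condition_tuple), ...] (step 2.2 sketch).

    reduce_conditions(target, candidates) stands in for the mRMR-based
    redundancy deletion and returns the reduced condition set."""
    Q = list(clique)
    factors = []
    while Q:
        if prod(domain_sizes[a] for a in Q) <= sigma:
            # Step 2.2.1: any attribute may serve as target; pick the first for determinism.
            target = Q[0]
            cond = tuple(a for a in Q if a != target)
        else:
            # Step 2.2.2: pick the target whose reduced condition set is smallest.
            best = None
            for t in Q:
                cond_t = reduce_conditions(t, tuple(a for a in Q if a != t))
                size = prod(domain_sizes[a] for a in cond_t) if cond_t else 1
                if best is None or size < best[0]:
                    best = (size, t, cond_t)
            _, target, cond = best
        factors.append((target, cond))
        Q.remove(target)
    return factors

# Toy run: the stub simply drops the last candidate as "redundant".
order = decompose(
    clique=("A", "B", "C"),
    domain_sizes={"A": 4, "B": 4, "C": 4},
    sigma=16,
    reduce_conditions=lambda t, cands: cands[:-1],
)
```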
Step 2.3: a distribution estimation is performed using the second batch of user data.
In order to obtain the distribution of each clique, the joint distribution of some attribute combinations needs to be calculated, and the method specifically comprises the following sub-steps:
Step 2.3.1: determine the attribute combinations.
Two sets are maintained, one storing the attribute combinations whose distributions need to be estimated and one storing the attribute combinations whose distributions are already known. Specifically: for each clique C_i, if |C_i| ≥ 3, C_i is placed in the to-be-estimated set; otherwise, C_i is placed in the known set, since its distribution is available from step 1.
Step 2.3.2: group the users. The second batch of users is divided into disjoint groups, one group per attribute combination whose distribution needs to be estimated; each group's data is used to calculate the distribution of its assigned attribute combination.
Step 2.3.3: user data is collected and joint distributions are estimated.
For the attribute combinations whose distributions need to be estimated, the joint distribution estimation follows the same calculation steps used for the first batch of users in steps 1.3.3 and 1.3.4, yielding the required joint distributions. For the attribute combinations whose distributions are already known, the results are taken directly from the calculation of step 1.
Step 2.4: calculate the distribution of the cliques. For each large clique, this embodiment estimates the joint distribution Pr(A_h, Π̃_h) of every factor, from which the conditional distribution Pr(A_h | Π̃_h) is obtained. Then, according to the decomposition formula of the joint distribution in step 2.2, the joint distribution Pr(C_i) is calculated.
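The recomposition of step 2.4, multiplying the conditional factors back into the clique's joint distribution, can be sketched as follows; the tables are toy values standing in for the privately estimated distributions, and all names are hypothetical:

```python
from itertools import product

def joint_from_factors(attrs, domains, factors):
    """Multiply conditional factors into a joint table (step 2.4 sketch).

    factors: list of (target, cond_tuple, table), where table maps
    (target_value, cond_value_tuple) -> conditional probability."""
    joint = {}
    for assignment in product(*(domains[a] for a in attrs)):
        val = dict(zip(attrs, assignment))
        p = 1.0
        for target, cond, table in factors:
            p *= table[(val[target], tuple(val[c] for c in cond))]
        joint[assignment] = p
    return joint

# Toy clique {A, B}: Pr(A, B) = Pr(A) * Pr(B | A), values made up.
joint = joint_from_factors(
    attrs=("A", "B"),
    domains={"A": (0, 1), "B": (0, 1)},
    factors=[
        ("A", (), {(0, ()): 0.4, (1, ()): 0.6}),
        ("B", ("A",), {(0, (0,)): 0.5, (1, (0,)): 0.5,
                       (0, (1,)): 0.2, (1, (1,)): 0.8}),
    ],
)
```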
Step 3: generate the synthetic dataset.
Using the junction tree obtained in step 1 and the distribution of each clique obtained in step 2, the aggregation server generates a synthetic dataset containing N records by a sampling-based data generation method.
Taking the generation of one record as an example, the method specifically comprises the following sub-steps:
Step 3.1: randomly select a clique C_i and sample from its distribution Pr(C_i) to obtain values for all attributes in C_i; then select all cliques associated with C_i.
Step 3.2: for each selected clique C_r, sample its unsampled attributes from their conditional distribution given the attributes already sampled; this conditional distribution can be obtained from Pr(C_r). Then insert all cliques that are connected to C_r and have not yet been visited at the end of the queue.
Step 3.3: and (3.2) repeatedly executing the step until the sampling results X of all the attributes are obtained, and finishing the generation.
To verify the effect of the proposed incremental-learning-based multidimensional data publishing method satisfying local differential privacy, referring to fig. 4 (where PrivIncorr denotes the proposed method), the method is compared with existing methods on two public datasets, Adult and TPC-E. The comparison methods include the non-incremental method NoIncremental, the non-private version NoPrivJTree, and two multidimensional k-way marginal publishing methods satisfying localized differential privacy, CALM and FT. Experimental results show that the synthetic dataset generated by the proposed method provides better utility.
Example two
This embodiment provides an incremental-learning-based local differential privacy multidimensional data publishing system, which comprises:
the correlation learning module, used for learning the correlation of all attribute pairs by aggregating perturbed data uploaded by the first batch of users;
the junction tree model building module, used for building a dependency graph model according to the correlation of the attribute pairs and converting the built dependency graph model into a junction tree model consisting of a plurality of cliques through a junction tree algorithm;
the clique distribution calculation module, used for estimating, based on the second batch of user data, the distribution of each clique with an estimation method suited to the number and domain sizes of its attributes, to obtain the joint distribution of each clique in the junction tree model;
and the data publishing module, used for generating, by a sampling-based data generation method and according to the junction tree model and the joint distribution of each clique in the junction tree model, a synthetic dataset containing the same number of records, and publishing the dataset.
Example three
This embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method described above.
Example four
This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, it implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. The local differential privacy multidimensional data publishing method based on incremental learning is characterized by comprising the following steps:
learning the correlation of all attribute pairs by aggregating the perturbed data of the first batch of users;
constructing a dependency graph model according to the correlation of the attribute pairs, and converting the constructed dependency graph model into a junction tree model consisting of a plurality of groups through a junction tree algorithm;
based on the second batch of user data, estimating the distribution of the cliques by adopting a corresponding estimation method according to the number and size types of the attributes contained in each clique to obtain the joint distribution of each clique in the junction tree model;
and generating, by a sampling-based data generation method and according to the junction tree model and the joint distribution of each clique in the junction tree model, a synthetic dataset containing the same number of records, and publishing the dataset.
2. The incremental learning-based multi-dimensional data publishing method for local differential privacy according to claim 1, wherein the building of the dependency graph model according to the correlation of attribute pairs comprises:
constructing the dependency graph by an incremental-learning-based dependency graph model construction method according to the edge set of the current dependency graph, comprising: performing T rounds of iteration; in each iteration, collecting new data for each attribute pair remaining in the attribute pair set, re-estimating the correlation of those attribute pairs, and removing weakly correlated attribute pairs by a threshold-relaxation-based edge pruning method to obtain the pruned edges.
3. The incremental learning-based local differential privacy multidimensional data publishing method as claimed in claim 2, wherein the correlation of an attribute pair is measured by the mutual information of the two attributes, calculated by the formula:

I(A_i; A_j) = Σ_{a_m ∈ Ω_{A_i}} Σ_{a_n ∈ Ω_{A_j}} Pr(a_m, a_n) · log( Pr(a_m, a_n) / ( Pr(a_m) · Pr(a_n) ) )

where A_i, A_j are the attribute pair, Ω_{A_i} and Ω_{A_j} are the domains of attributes A_i and A_j respectively, Pr(a_m) and Pr(a_n) respectively denote the marginal distributions of the m-th value a_m of Ω_{A_i} and the n-th value a_n of Ω_{A_j}, and Pr(a_m, a_n) denotes the joint distribution of a_m and a_n.
4. The incremental learning-based local differential privacy multidimensional data publishing method as claimed in claim 2, wherein removing the weakly correlated attribute pairs by the threshold-relaxation-based edge pruning method comprises:
calculating a correlation threshold value based on the set dependency parameter;
calculating a scaled correlation threshold in combination with the correlation threshold, the given confidence level and mutual information of the recorded attribute pairs;
comparing the re-estimated correlation against the scaled correlation threshold: if the correlation of an attribute pair is greater than or equal to the scaled correlation threshold, the attributes are strongly correlated and the corresponding edge is kept in the dependency graph; otherwise, the edge is removed from the dependency graph.
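The pruning decision of claim 4 can be sketched as follows, assuming the scaled (relaxed) correlation threshold has already been computed from the dependency parameter and confidence level; the exact scaling is not reproduced here, and all names are illustrative:

```python
def prune_edges(edges, correlation, scaled_threshold):
    """Keep an edge only when its re-estimated correlation reaches the scaled
    (relaxed) threshold; otherwise remove it from the dependency graph."""
    kept = [e for e in edges if correlation[e] >= scaled_threshold]
    removed = [e for e in edges if correlation[e] < scaled_threshold]
    return kept, removed

# Toy run with made-up mutual-information values as the correlation measure.
kept, removed = prune_edges(
    edges=[("A", "B"), ("B", "C")],
    correlation={("A", "B"): 0.30, ("B", "C"): 0.02},
    scaled_threshold=0.10,
)
```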
5. The incremental learning-based local differential privacy multidimensional data publishing method according to claim 1, wherein estimating the distribution of the cliques by a corresponding estimation method according to the number and sizes of the attributes contained in each clique, to obtain the joint distribution of each clique in the junction tree model, comprises:
dividing all cliques into two groups, large cliques and small cliques, according to the number of attributes they contain and their sizes;
determining an optimal decomposition order by a heuristic based on a forward search strategy, and decomposing the large cliques in that order to obtain conditional distributions;
and obtaining the joint distribution of each clique in the junction tree model from the second batch of user data and the conditional distributions, according to the joint distribution formula.
6. The incremental learning-based local differential privacy multidimensional data publishing method as claimed in claim 1, wherein determining the optimal decomposition order by the forward-search heuristic and decomposing the cliques in that order comprises:
if |Q| ≤ σ, randomly selecting an attribute A_h ∈ Q as the target attribute with Q\{A_h} as the condition, i.e. the h-th term of the factorization is the conditional distribution Pr(A_h | Q\{A_h}), and letting Q = Q\{A_h};
if |Q| > σ, for each A_j ∈ Q, taking A_j as the target attribute and applying the minimum-redundancy-maximum-relevance feature selection method to the attribute set Q\{A_j} to delete redundancy, obtaining the reduced attribute set Π̃_j;
selecting from the calculated reduced sets the attribute A_h ∈ Q whose reduced condition set Π̃_h has the minimum domain size as the target attribute with Π̃_h as the condition, i.e. the h-th term of the factorization is the conditional distribution Pr(A_h | Π̃_h), and letting Q = Q\{A_h}.
7. The incremental learning-based local differential privacy multidimensional data publishing method as recited in claim 1, wherein the record generation process comprises:
randomly selecting a clique, sampling from its distribution to obtain values for all of its attributes, and then selecting all cliques associated with it;
for each selected clique C_r, sampling its unsampled attributes from their conditional distribution given the attributes already sampled, wherein the conditional distribution is obtained from Pr(C_r), and then inserting all cliques connected to C_r and not yet visited at the end of the queue;
and repeating the sampling until values for all attributes are obtained, completing the generation.
8. The multidimensional data publishing system of the local differential privacy based on the incremental learning is characterized by comprising the following components:
a correlation learning module for learning the correlation of all attribute pairs by aggregating the perturbed data of the first batch of users;
a junction tree model building module for building a dependency graph model according to the correlation of the attribute pairs and converting the built dependency graph model into a junction tree model consisting of a plurality of cliques through a junction tree algorithm;
a clique distribution calculation module for estimating, based on the second batch of user data, the distribution of each clique with an estimation method suited to the number and domain sizes of its attributes, to obtain the joint distribution of each clique in the junction tree model;
and a data publishing module for generating, by a sampling-based data generation method and according to the junction tree model and the joint distribution of each clique in the junction tree model, a synthetic dataset containing the same number of records, and publishing the dataset.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the incremental-learning-based local differential privacy multidimensional data publishing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210699743.7A CN115098882B (en) | 2022-06-20 | 2022-06-20 | Multi-dimensional data release method and system based on local differential privacy of incremental learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115098882A true CN115098882A (en) | 2022-09-23 |
CN115098882B CN115098882B (en) | 2024-08-06 |
Family
ID=83292213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210699743.7A Active CN115098882B (en) | 2022-06-20 | 2022-06-20 | Multi-dimensional data release method and system based on local differential privacy of incremental learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115098882B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368888A (en) * | 2020-02-25 | 2020-07-03 | 重庆邮电大学 | Service function chain fault diagnosis method based on deep dynamic Bayesian network |
CN113094746A (en) * | 2021-03-31 | 2021-07-09 | 北京邮电大学 | High-dimensional data publishing method based on localized differential privacy and related equipment |
US20210216902A1 (en) * | 2020-01-09 | 2021-07-15 | International Business Machines Corporation | Hyperparameter determination for a differentially private federated learning process |
CN113569286A (en) * | 2021-03-26 | 2021-10-29 | 东南大学 | Frequent item set mining method based on localized differential privacy |
Non-Patent Citations (1)
Title |
---|
SU Weihang; CHENG Xiang: "A High-Dimensional Data Publishing Algorithm Satisfying Differential Privacy Based on a Latent Tree Model", Journal of Chinese Computer Systems, no. 04, 15 April 2018 (2018-04-15) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115329898A (en) * | 2022-10-10 | 2022-11-11 | 国网浙江省电力有限公司杭州供电公司 | Distributed machine learning method and system based on differential privacy policy |
CN116795850A (en) * | 2023-05-31 | 2023-09-22 | 山东大学 | Method, device and storage medium for concurrent execution of massive transactions of alliance chains |
CN116795850B (en) * | 2023-05-31 | 2024-04-12 | 山东大学 | Method, device and storage medium for concurrent execution of massive transactions of alliance chains |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |