CN112667712B

CN112667712B - Grouped accurate histogram data publishing method based on differential privacy

Info

Publication number: CN112667712B
Application number: CN202011637291.7A
Authority: CN
Inventors: 陶陶; 李思文
Original assignee: Anhui University of Technology AHUT
Current assignee: Anhui University of Technology AHUT
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2023-03-17
Anticipated expiration: 2040-12-31
Also published as: CN112667712A

Abstract

The invention discloses a grouped accurate histogram data publishing method based on differential privacy (AGHP), belonging to the technical field of data privacy protection. First, following the idea of smoothing by grouping, the exponential mechanism is used to perform a global approximate sort of the original histogram buckets by frequency. Second, a dynamic programming algorithm performs a global grouping of the sorted histogram that balances the grouping reconstruction error against the noise error. Finally, Laplace noise is added to the grouped histogram, which is then published. While satisfying differential privacy, the algorithm effectively reduces the error of histogram data publishing, improves the usability of the published histogram data, and extends the practical application of differential privacy theory.

Description

Grouped accurate histogram data publishing method based on differential privacy

Technical Field

The invention relates to the technical field of data privacy protection, and in particular to a grouped accurate histogram data publishing method based on differential privacy.

Background

In the big data era, a large amount of personal information is generated every day, and digitization allows organizations to easily collect large amounts of data, publish statistical results in various forms, and carry out data analysis. Although the results of data analysis and mining are valuable for research, private information may be leaked in the course of publishing them.

A histogram is a common technique for displaying the distribution of data visually and is often used to publish statistical data. It partitions the data into disjoint buckets according to some attribute and represents the data by the frequency of each bucket. If a statistical histogram is published directly, without privacy protection, an attacker can infer user data by combining background knowledge with the true counts of the histogram buckets, and the users' privacy is disclosed.

Differential privacy, a relatively new privacy protection model, is now widely applied to histogram publishing. It protects privacy by transforming the original data and adding noise to the statistical results. Most existing differential-privacy histogram publishing techniques add noise and reconstruct the histogram, where reconstruction usually merges positionally adjacent buckets and replaces them with their average in order to reduce the global sensitivity. However, this approach cannot identify buckets with similar frequencies across the whole histogram, which may lead to a large reconstruction error when the groups are reconstructed. The bucket counts therefore need to be sorted before reconstruction. In addition, the most common grouping methods are fixed-length grouping and greedy clustering, neither of which balances the reconstruction error against the noise error well, so the usability of the published histogram is reduced. Global grouping should therefore balance the reconstruction error and the noise error, so that the usability of the published data is improved while differential privacy is satisfied.

A search of the prior art finds Chinese patent No. ZL201811273045.0, filed on October 30, 2018 and entitled "A privacy protection method for data release". In that application, batch data are obtained from a database according to a batch query request that a user submits to an open data platform, random noise satisfying the given differential privacy requirement is added to the batch data, and the perturbed result is returned to the user as a published histogram. However, the method adds noise to the data twice, which produces a large data error, and it does not filter the data after adding noise; although data privacy is ensured, data usability is not considered.

Another example is Chinese patent application No. ZL202010573117.4, filed on June 22, 2020 and entitled "A data publishing method based on differential privacy". In that application, Laplace noise is added to the input histogram data, the noisy data are filtered, the noisy histogram is reordered by frequency value, and finally the grouping with the minimum SSE is found by a clustering strategy based on dynamic programming. However, the privacy cost of this method is high because excessive noise is added. Moreover, when it reconstructs the sorted histogram by dynamic programming, it considers only the reconstruction error of the grouping and not the balance between the reconstruction error and the noise error that the grouping causes.

Based on the above analysis, a histogram data publishing method is needed that satisfies differential privacy and introduces less error during publishing.

Disclosure of Invention

1. Technical problem to be solved by the invention

In view of the problem that existing histogram publishing methods cannot achieve both differential privacy and histogram usability, the invention provides a grouped accurate histogram data publishing method based on differential privacy. The method reduces the noise error introduced during histogram publishing, effectively balances the grouping reconstruction error against the noise error, and thereby improves the usability of published statistical data without disclosing private information.

2. Technical scheme

To achieve the above purpose, the invention provides the following technical scheme:

The grouped accurate histogram data publishing method based on differential privacy according to the invention comprises the following steps:

step one, obtaining a numerical histogram statistical data field and inputting the histogram frequencies into a histogram data set $H = (H_1, H_2, \ldots, H_n)$; the privacy budget $\varepsilon$ and $\Delta f$ are given at the same time, where $\Delta f$ is the $L_1$ distance between the data set $H$ and its neighboring data set;

step two, taking the first histogram bucket $H_1$ as the base bucket $H_i$, adding it to the ordered histogram sequence $H^*$, and deleting the bucket from $H$;

step three, calculating the neighbor bucket set $L(H_i)$ of the base bucket $H_i$ and the exponential-mechanism scoring function $u(H_i)$, and selecting $H_j$ from $L(H_i)$ with probability proportional to $\exp\left(\frac{\varepsilon_1\, u(H_i)}{2\Delta u}\right)$, where the privacy budget $\varepsilon_1 = \varepsilon/2$ and $\Delta u$ is the sensitivity of the scoring function; adding $H_j$ to the ordered histogram sequence $H^*$ and then taking $H_j$ as the base bucket;

step four, repeating step three until the original histogram data set $H$ is empty;

step five, performing dynamic programming grouping on the ordered histogram sequence $H^*$ according to the global error Err, and selecting the histogram grouping structure $H^G$ with the minimum global error;

step six, representing the frequency of each bucket in a group by the group mean, adding Laplace noise $\mathrm{Lap}(1/\varepsilon_2)$ to each bucket frequency to obtain the noise-added histogram sequence, and publishing it.

Further, in step one, each $H_i$ is the frequency of a unit interval, and the privacy budget $\varepsilon$ is less than 1.

Furthermore, in step three, buckets whose frequencies are close to that of the base bucket are selected from the neighbor bucket set $L(H_i)$ of the base bucket $H_i$ according to the scoring function $u(H_i)$, where $L(H_i)$ and $u(H_i)$ are calculated according to formula (1) and formula (2) respectively:

$$L(H_i) = \{H_j : |H_j - H_i| \le \delta\} \tag{1}$$

$$u(H_i) = -\left(|H_j - H_i| + |j - i|\right) \tag{2}$$

where $\delta$ is a threshold that controls the number of buckets in the neighbor bucket set.
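As an illustration of formulas (1) and (2), the following minimal Python sketch computes the neighbor set and the scores for one base bucket. The function names, the 0-based indexing, the sample frequencies, and the `remaining` argument (the indices still left in H) are assumptions made for the example and are not part of the described method.

```python
def neighbor_set(H, i, delta, remaining):
    """Formula (1): indices j still in H with |H_j - H_i| <= delta."""
    return [j for j in remaining if j != i and abs(H[j] - H[i]) <= delta]


def score(H, i, j):
    """Formula (2): u = -(|H_j - H_i| + |j - i|); closer buckets score higher."""
    return -(abs(H[j] - H[i]) + abs(j - i))


H = [100, 130, 210, 95]          # illustrative bucket frequencies
neighbors = neighbor_set(H, 0, delta=50, remaining=[1, 2, 3])
scores = {j: score(H, 0, j) for j in neighbors}
```

In this example the bucket with frequency 210 falls outside the threshold, and the bucket with frequency 95 scores higher than the one with frequency 130 even though it is farther away in position.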

Further, in step five, the error evaluation function of the dynamic programming is set to the global error Err of a group $G_i$ that spans the buckets $H_l$ to $H_r$, as shown in formula (3) [formula not reproduced in this text]. Formula (3) is expressed in terms of the frequency mean of the group and of $|G_i|$, the number of buckets in the group, and it is composed of two parts: the reconstruction error AE and the noise error. The privacy budget is $\varepsilon_2 = \varepsilon/2$, which determines the magnitude of the Laplace noise added to the group mean: the Laplace noise added to the group mean is $\mathrm{Lap}(1/\varepsilon_2)/|G_i|$.

Furthermore, in step five, the histogram is grouped by dynamic programming, and the minimum global error $T_{Err}$ of each grouping structure is recorded [formula not reproduced in this text]. The grouping structure $H^G$ with the lowest $T_{Err}$ is selected and the optimal number of groups $k$ is recorded, as shown in formula (6) [formula not reproduced in this text], where $n$ is the number of histogram buckets, $k$ ranges over all possible numbers of groups, and $1 \le k \le n$.

Furthermore, in step six, for the grouped histogram $H^G$, the frequencies of the buckets in each group are replaced by the group mean, so that the frequency of every histogram bucket in group $G_i$ equals the sum of the frequencies of the buckets in $G_i$ divided by $|G_i|$. Laplace noise $\mathrm{Lap}(b)$ with $b = 1/\varepsilon_2$ is then added to each bucket frequency to obtain the noise-added histogram sequence.

Furthermore, in step six, noise is added to the data set as follows: a probability density function obeying the Laplace distribution is constructed, the inverse cumulative distribution function is derived from it, and uniformly distributed random variables are fed into that function to obtain Laplace noise.

Further, the specific steps for obtaining the Laplace noise are:

S1, constructing the Laplace distribution $\mathrm{Lap}(b)$ with location parameter $\mu = 0$ and scale parameter $b$, whose probability density function $p(x)$ is shown in formula (7):

$$p(x) = \frac{1}{2b}\exp\left(-\frac{|x - \mu|}{b}\right) \tag{7}$$

S2, substituting a uniformly distributed random variable $\alpha \sim U(0, 1)$ into the inverse of the Laplace cumulative distribution function to obtain a noise value satisfying the condition, as shown in formula (8):

$$F^{-1}(\alpha) = \begin{cases} \mu + b\ln(2\alpha), & \alpha < 0.5 \\ \mu - b\ln\bigl(2(1 - \alpha)\bigr), & \alpha \ge 0.5 \end{cases} \tag{8}$$

S3, taking $\alpha \sim U(-0.5, 0.5)$ instead, so that the piecewise function of formula (8) merges into formula (9):

$$F^{-1}(\alpha) = 0 - b \cdot \operatorname{sign}(\alpha) \cdot \ln\bigl(1 - 2\operatorname{abs}(\alpha)\bigr) \tag{9}$$

where the sign function returns the sign of its argument and the abs function returns its absolute value.
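The merge of the two branches in step S3 can be checked with a short derivation. It uses only the standard inverse cumulative distribution function of the Laplace distribution; the symbol $p$ for the $U(0,1)$ variable of step S2 is introduced here for the illustration, with $\alpha = p - \tfrac{1}{2}$.

```latex
\begin{aligned}
p < \tfrac12 \;(\alpha < 0):\quad
  & 2p = 1 + 2\alpha = 1 - 2|\alpha|
  && \Rightarrow\; F^{-1} = \mu + b\ln\bigl(1 - 2|\alpha|\bigr), \\
p \ge \tfrac12 \;(\alpha \ge 0):\quad
  & 2(1 - p) = 1 - 2\alpha = 1 - 2|\alpha|
  && \Rightarrow\; F^{-1} = \mu - b\ln\bigl(1 - 2|\alpha|\bigr), \\
\text{both cases:}\quad
  & F^{-1} = \mu - b\,\operatorname{sign}(\alpha)\,\ln\bigl(1 - 2|\alpha|\bigr),
  && \mu = 0 .
\end{aligned}
```

The factors combine by multiplication, so a negative $\alpha$ yields negative noise and a positive $\alpha$ yields positive noise.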

3. Advantageous effects

Compared with the prior art, the technical scheme provided by the invention has the following remarkable effects:

(1) Traditional differential-privacy histogram publishing methods consider only positionally adjacent buckets with similar counts when grouping the histogram and cannot identify buckets with similar counts across the whole histogram, so grouping produces a large reconstruction error. The grouped accurate histogram data publishing method of the invention uses an approximate sorting algorithm based on the exponential mechanism: according to the differences between bucket counts, the exponential mechanism performs a global approximate sort of the original histogram buckets by frequency, which improves grouping accuracy.

(2) Traditional differential-privacy histogram publishing methods group the original histogram by fixed-length grouping or greedy clustering. These easily fall into a local optimum and cannot balance the approximation error against the Laplace error well, so the usability of the published histogram is reduced. The invention instead groups the histogram adaptively with an optimized dynamic programming technique, so the number of groups does not need to be fixed in advance. The global error Err serves as the error evaluation function; adaptive grouping follows the dynamic programming recurrence, and the grouping scheme $H^G$ with the minimum global error is obtained from all possible groupings. Buckets with similar counts are merged into one group, which improves the accuracy of the final published histogram. Because the global error is composed of the approximation error and the Laplace error, a global grouping with the best error balance is achieved on the sorted histogram.

Drawings

FIG. 1 is a theoretical architecture diagram of the method of the present invention;

FIG. 2 is a block flow diagram of the method of the present invention.

Detailed Description

For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples.

Existing differential-privacy histogram publishing methods add Laplace noise directly to each bucket of the original histogram to protect privacy. Although adding noise directly protects private data effectively, the excessive noise easily reduces the usability of the histogram and can produce a large cumulative error in long-range count queries.

In general, there are two common strategies for improving the accuracy of histogram publishing, reducing the noise error, and improving data usability. Under strategy 1, Laplace noise is added directly to the count of each bucket to perturb the true counts; such methods have a high privacy cost because of the excessive noise added. Under strategy 2, the order is reversed: the original histogram is first reconstructed to reduce the global sensitivity, and noise is then added to the reconstructed counts. Reconstruction introduces a reconstruction error, but the amount of added noise is reduced. Strategy 2 generally answers queries more accurately; the open problem is how to balance the reconstruction error against the noise error so that histogram publishing remains private while usability improves.
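The noise reduction obtained by grouping can be made concrete with a short calculation. It relies only on the standard variance of the Laplace distribution ($2b^2$ for scale $b$) and on the per-group noise $\mathrm{Lap}(1/\varepsilon_2)/|G_i|$ stated in this description; the symbol $m = |G_i|$ is introduced here for brevity.

```latex
\operatorname{Var}\!\left[\mathrm{Lap}\!\left(\tfrac{1}{\varepsilon_2}\right)\right]
  = \frac{2}{\varepsilon_2^{2}},
\qquad
\operatorname{Var}\!\left[\frac{\mathrm{Lap}(1/\varepsilon_2)}{m}\right]
  = \frac{2}{\varepsilon_2^{2}\, m^{2}} .
```

For example, with $\varepsilon_2 = 0.5$ and a group of $m = 4$ buckets, the noise variance per bucket falls from 8 to 0.5, at the price of the reconstruction error incurred by replacing each bucket with the group mean.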

The invention adopts strategy 2. The input histogram data are first sorted and then grouped for reconstruction. Sorting places buckets with similar frequencies next to one another, which reduces the error of group reconstruction. Dynamic programming grouping based on the minimum global error then obtains, from all possible groupings, the grouping scheme with the minimum global error, achieving a global grouping with the best error balance on the sorted histogram and balancing the approximation error against the Laplace error. Finally, Laplace noise is added to the grouped histogram, which is published in its original order. This markedly reduces the amount of added noise and effectively improves the usability of the published histogram data.

In addition, the histogram is grouped adaptively with an optimized dynamic programming technique, so the number of groups does not need to be fixed in advance. The global error Err serves as the error evaluation function; adaptive grouping follows the dynamic programming recurrence, and the grouping scheme $H^G$ with the minimum global error is obtained from all possible groupings. Buckets with similar counts are merged into one group, which improves the accuracy of the final published histogram. Because the global error is composed of the approximation error and the Laplace error, a global grouping with the best error balance is achieved on the sorted histogram.

The invention improves on traditional differential-privacy histogram data publishing methods and achieves higher usability while protecting private data.

The invention is further described below with reference to the accompanying drawings and an embodiment.

Example 1

With reference to FIG. 1, the grouped accurate histogram data publishing method based on differential privacy of this embodiment comprises the following steps:

Step one, obtaining a numerical histogram statistical data field and inputting the histogram frequencies into a histogram data set $H = (H_1, H_2, \ldots, H_n)$; the privacy budget $\varepsilon$ and $\Delta f$ are given at the same time, where $\Delta f$ is the $L_1$ distance between the data set $H$ and its neighboring data set:

First, the numerical histogram statistical data field to be published is read from a data source such as a database or a CSV file, and the statistic of each interval (that is, the histogram frequency) is entered into the histogram data set $H$, which completes the input of the original histogram data set $H$; the privacy budget $\varepsilon$ and $\Delta f$ are then set. Here $\varepsilon$ is specified manually and is generally smaller than 1: the smaller $\varepsilon$ is, the stronger the privacy protection and the lower the data usability. $\Delta f$ is the $L_1$ distance between the data set $H$ and its neighboring data set; a larger $\Delta f$ means that more noise must be added. For each bucket of the histogram the $L_1$ distance is 1.
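Step one might be prepared along the following lines. This is a minimal sketch, assuming the records have already been read into a list of numbers and that buckets are unit intervals over an integer range; `build_histogram`, `split_budget`, and the sample values are hypothetical names and data chosen for the illustration.

```python
import math


def build_histogram(values, low, high):
    """Count records into unit-interval buckets [low, low+1), ..., [high-1, high)."""
    H = [0] * (high - low)
    for v in values:
        if low <= v < high:                  # records outside the range are ignored
            H[math.floor(v) - low] += 1
    return H


def split_budget(eps):
    """Split the total budget evenly: eps_1 for sorting, eps_2 for noise."""
    return eps / 2, eps / 2


H = build_histogram([0, 0.2, 1, 3, 3.9, 3, 7], low=0, high=4)
eps1, eps2 = split_budget(0.5)
```

The even split mirrors the budgets $\varepsilon_1 = \varepsilon/2$ and $\varepsilon_2 = \varepsilon/2$ used in steps three and five.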

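Step two, taking the first histogram bucket $H_1$ as the base bucket $H_i$, adding it to the ordered histogram sequence $H^*$, and deleting the bucket from $H$.

Step three, calculating the neighbor bucket set $L(H_i)$ of the base bucket $H_i$ and the exponential-mechanism scoring function $u(H_i)$, and selecting $H_j$ from $L(H_i)$ with probability proportional to $\exp\left(\frac{\varepsilon_1\, u(H_i)}{2\Delta u}\right)$, where the privacy budget $\varepsilon_1 = \varepsilon/2$; adding $H_j$ to the ordered histogram sequence $H^*$ and then taking $H_j$ as the base bucket: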

To obtain a better grouping result, the bucket frequencies are first sorted by the exponential mechanism so that a more accurate sequence is obtained. Buckets whose frequencies are close to that of the base bucket are selected from the neighbor bucket set $L(H_i)$ of the base bucket $H_i$ according to the scoring function $u(H_i)$, where $L(H_i)$ and $u(H_i)$ are calculated according to formula (1) and formula (2) respectively:

$$L(H_i) = \{H_j : |H_j - H_i| \le \delta\} \tag{1}$$

$$u(H_i) = -\left(|H_j - H_i| + |j - i|\right) \tag{2}$$

where $\delta$ is a threshold that controls the number of buckets in the neighbor bucket set and can be adjusted according to the overall bucket counts; in this embodiment $\delta = 50$. If the difference between the frequency of bucket $H_j$ and that of the base bucket $H_i$ is within the threshold $\delta$, bucket $H_j$ belongs to the neighbor bucket set $L(H_i)$ of the base bucket. The scoring function $u(H_i)$ is the negative of the sum of the absolute frequency difference and the absolute position difference between $H_j$ and $H_i$. The exponential mechanism is defined as follows:

Let the randomized algorithm $M$ take the data set $H$ as input and output an entity object $H_j \in R$, let $u(H)$ be the scoring function of the exponential mechanism, and let $\Delta u$ be the sensitivity of $u(H)$. If $M$ selects $H_j$ from $R$ and outputs it with probability proportional to $\exp\left(\frac{\varepsilon\, u(H)}{2\Delta u}\right)$, then the algorithm $M$ provides $\varepsilon$-differential privacy.

As formula (2) and the definition of the exponential mechanism show, the exponential mechanism scores each candidate output with the scoring function $u$ and assigns a higher selection probability to outputs with higher scores; that is, the higher the value of the scoring function, the more likely the output is to be selected. With the scoring function $u(H_i) = -(|H_j - H_i| + |j - i|)$, the exponential mechanism can therefore repeatedly select from the neighbor bucket set $L(H_i)$ a bucket whose frequency is close to that of the previous base bucket $H_i$, forming the ordered histogram $H^*$.

Step four, repeating step three until the original histogram data set $H$ is empty.
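Steps two to four can be sketched in Python as below. This is an illustrative sketch rather than the embodiment's implementation: the function name, the 0-based indices, the default sensitivity `delta_u = 1.0`, and the fallback to all remaining buckets when the neighbor set is empty are assumptions, since the description does not specify them.

```python
import math
import random


def approx_sort(H, eps1, delta, delta_u=1.0, rng=None):
    """Return bucket indices in approximately frequency-sorted order."""
    rng = rng or random.Random()
    remaining = list(range(1, len(H)))       # step two: H_1 is the first base bucket
    order, base = [0], 0
    while remaining:                         # step four: loop until H is empty
        # Formula (1): neighbors of the base bucket among the remaining buckets.
        neighbors = [j for j in remaining if abs(H[j] - H[base]) <= delta]
        candidates = neighbors or remaining  # assumption: fall back if no neighbor
        # Formula (2): score of each candidate relative to the base bucket.
        scores = [-(abs(H[j] - H[base]) + abs(j - base)) for j in candidates]
        top = max(scores)                    # shift scores for numerical stability
        weights = [math.exp(eps1 * (s - top) / (2 * delta_u)) for s in scores]
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
        order.append(chosen)
        remaining.remove(chosen)
        base = chosen                        # the selected bucket becomes the base
    return order


order = approx_sort([10, 50, 11, 49], eps1=0.25, delta=5, rng=random.Random(0))
```

Subtracting the maximum score before exponentiation leaves the selection probabilities unchanged, because they depend only on the ratios of the weights.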

Step five, performing dynamic programming grouping on the ordered histogram sequence $H^*$ according to the global error Err, and selecting the histogram grouping structure $H^G$ with the minimum global error:

The error evaluation function of the dynamic programming is set to the global error Err of a group $G_i$ that spans the buckets $H_l$ to $H_r$, as shown in formula (3) [formula not reproduced in this text]. Formula (3) is composed of two parts: the reconstruction error AE and the noise error. The privacy budget is $\varepsilon_2 = \varepsilon/2$, which determines the magnitude of the Laplace noise added to the group mean: the Laplace noise added to the group mean is $\mathrm{Lap}(1/\varepsilon_2)/|G_i|$.
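Because formula (3) is not legible in this text, the following is only one plausible instantiation, offered to make the trade-off concrete. It assumes that AE is measured by absolute deviations from the group mean and that the noise error is the expected absolute noise summed over the buckets of the group; the symbols $\bar{h}$ and $m$ are introduced here and do not come from the description.

```latex
\mathrm{Err}(H_l, H_r)
  = \underbrace{\sum_{j=l}^{r} \bigl| H_j - \bar{h} \bigr|}_{\text{reconstruction error AE}}
  + \underbrace{m \cdot \mathbb{E}\!\left[\left|\frac{\mathrm{Lap}(1/\varepsilon_2)}{m}\right|\right]}_{\text{noise error}}
  = \sum_{j=l}^{r} \bigl| H_j - \bar{h} \bigr| + \frac{1}{\varepsilon_2},
\qquad
\bar{h} = \frac{1}{m}\sum_{j=l}^{r} H_j, \quad m = r - l + 1 .
```

Under this assumption every group contributes a fixed noise cost of $1/\varepsilon_2$, so using more groups lowers the reconstruction error but raises the total noise error, which is the balance the dynamic programming has to strike.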

The sorted histogram $H^*$ is then grouped by dynamic programming according to the global error Err:

1) When the number of groups $k = 1$, the global error Err of placing the first $i$ ($1 \le i \le n$) items of $H^*$ into a single group, that is, the group spanning $H_1$ to $H_i$, is computed as shown in formula (4) [formula not reproduced in this text], in which the mean is the mean of the counts from the 1st bucket to the $i$-th bucket of $H^*$;

2) When $k > 1$, the minimum global error of dividing the first $i$ items of $H^*$ into $k$ groups is calculated by dynamic programming; the state transition equation is shown in formula (5) [formula not reproduced in this text];

3) To reduce the amount of computation and improve efficiency, $H^*$ is reconstructed with the grouping strategy based on dynamic programming: the $n$ buckets are divided into 1 group, 2 groups, ..., $k$ groups, and the minimum global error $T_{Err}$ of each grouping is recorded. The grouping that minimizes $T_{Err}$ is selected, and the optimal partition structure and the optimal number of groups $k$ are recorded, as shown in formula (6) [formula not reproduced in this text], where $n$ is the number of original histogram buckets and $k$ ranges over all possible numbers of groups, $1 \le k \le n$.
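The recurrence described in 1) to 3) can be sketched as follows. Since formulas (3) to (6) are not legible here, the sketch assumes a stand-in error function (sum of absolute deviations from the group mean plus a fixed noise cost of $1/\varepsilon_2$ per group) and a standard interval-partition recurrence; `group_error`, `dp_grouping`, and the table names are hypothetical.

```python
def group_error(freqs, eps2):
    """Assumed global error of one group: reconstruction error + noise error."""
    mean = sum(freqs) / len(freqs)
    return sum(abs(h - mean) for h in freqs) + 1.0 / eps2


def dp_grouping(H_sorted, eps2):
    """Return (groups, error): half-open index ranges with minimum total error."""
    n = len(H_sorted)
    INF = float("inf")
    # T[k][i]: minimum error of splitting the first i buckets into k groups.
    T = [[INF] * (n + 1) for _ in range(n + 1)]
    cut = [[0] * (n + 1) for _ in range(n + 1)]
    T[0][0] = 0.0
    for k in range(1, n + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):        # last group covers buckets j..i-1
                if T[k - 1][j] == INF:
                    continue
                cand = T[k - 1][j] + group_error(H_sorted[j:i], eps2)
                if cand < T[k][i]:
                    T[k][i], cut[k][i] = cand, j
    best_k = min(range(1, n + 1), key=lambda k: T[k][n])
    groups, i = [], n
    for k in range(best_k, 0, -1):           # backtrack the optimal cut points
        j = cut[k][i]
        groups.append((j, i))
        i = j
    groups.reverse()
    return groups, T[best_k][n]


groups, total = dp_grouping([10, 11, 12, 50, 51, 52], eps2=0.5)
```

The number of groups is not fixed in advance: the sketch fills the table for every $k$ from 1 to $n$ and keeps the $k$ with the smallest total error. The triple loop is written for clarity, not efficiency.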

Step six, for the grouped histogram $H^G$, the frequencies of the buckets in each group are replaced by the group mean, so that the frequency of every histogram bucket in group $G_i$ equals the sum of the frequencies of the buckets in $G_i$ divided by $|G_i|$. Laplace noise $\mathrm{Lap}(b)$ with $b = 1/\varepsilon_2$ is then added to each bucket frequency, and the resulting noise-added histogram sequence is published.
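Step six might look like the following sketch. It assumes one noise draw per group, scaled as $\mathrm{Lap}(1/\varepsilon_2)/|G_i|$ and shared by all buckets of the group, and it writes the result back in the original bucket order; the description leaves open whether the draw is per group or per bucket. The names `publish`, `order` (original indices in sorted order), and `groups` (half-open ranges over `order`) are hypothetical.

```python
import math
import random


def laplace(b, rng):
    """Lap(b) sample via the inverse CDF with alpha in (-0.5, 0.5)."""
    u = rng.random()
    while u == 0.0:                          # avoid log(0) at the interval edge
        u = rng.random()
    a = u - 0.5
    return -b * math.copysign(1.0, a) * math.log(1.0 - 2.0 * abs(a))


def publish(H, order, groups, eps2, rng=None):
    """Replace each bucket by its noisy group mean, in original bucket order."""
    rng = rng or random.Random()
    out = [0.0] * len(H)
    for start, end in groups:
        members = order[start:end]           # original indices in this group
        mean = sum(H[j] for j in members) / len(members)
        noisy = mean + laplace(1.0 / eps2, rng) / len(members)
        for j in members:
            out[j] = noisy
    return out


released = publish([10, 50, 12, 52], order=[0, 2, 1, 3],
                   groups=[(0, 2), (2, 4)], eps2=0.25, rng=random.Random(1))
```

Buckets that share a group receive the same released value, so `released[0]` equals `released[2]` and `released[1]` equals `released[3]` in this example.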

Noise is added to the data set as follows: a probability density function obeying the Laplace distribution is constructed, the inverse cumulative distribution function is derived from it, and uniformly distributed random variables are then fed into that function to obtain Laplace noise. The specific steps are:

S1, constructing the Laplace distribution $\mathrm{Lap}(b)$ with location parameter $\mu = 0$ and scale parameter $b$, whose probability density function $p(x)$ is shown in formula (7):

$$p(x) = \frac{1}{2b}\exp\left(-\frac{|x - \mu|}{b}\right) \tag{7}$$

S2, substituting a uniformly distributed random variable $\alpha \sim U(0, 1)$ into the inverse of the Laplace cumulative distribution function to obtain a noise value satisfying the condition, as shown in formula (8):

$$F^{-1}(\alpha) = \begin{cases} \mu + b\ln(2\alpha), & \alpha < 0.5 \\ \mu - b\ln\bigl(2(1 - \alpha)\bigr), & \alpha \ge 0.5 \end{cases} \tag{8}$$

S3, taking $\alpha \sim U(-0.5, 0.5)$ instead, so that the piecewise function of formula (8) merges into formula (9):

$$F^{-1}(\alpha) = 0 - b \cdot \operatorname{sign}(\alpha) \cdot \ln\bigl(1 - 2\operatorname{abs}(\alpha)\bigr) \tag{9}$$

where the sign function returns the sign of its argument and the abs function returns its absolute value. A computer generates pseudo-random numbers $\alpha \sim U(-0.5, 0.5)$ and substitutes them for $\alpha$ in formula (9) to obtain Laplace noise, which is added to the bucket frequencies to obtain the noise-added data.
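Formula (9) translates directly into a sampler. The sketch below uses Python's standard `random` module; the function name, the fixed seed, the scale `b = 2.0`, and the rejection of an exact zero draw (to keep the logarithm finite) are choices made for the illustration.

```python
import math
import random


def laplace_noise(b, rng):
    """Draw one Lap(b) sample with location 0 via formula (9)."""
    u = rng.random()
    while u == 0.0:          # keep alpha strictly inside (-0.5, 0.5)
        u = rng.random()
    alpha = u - 0.5          # alpha ~ U(-0.5, 0.5)
    sign = (alpha > 0) - (alpha < 0)
    return 0.0 - b * sign * math.log(1.0 - 2.0 * abs(alpha))


rng = random.Random(42)      # fixed seed only to make the sketch repeatable
samples = [laplace_noise(2.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
mean_abs = sum(abs(s) for s in samples) / len(samples)
```

For a Laplace distribution with location 0 and scale $b$, the mean is 0 and the expected absolute value is $b$, so `mean` should be close to 0 and `mean_abs` close to 2.0 for a sample of this size.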

With reference to FIG. 2, after the exponential-mechanism sort, the histogram is reconstructed into groups, noise is added to the means of the sorted and grouped data, and once appropriate Laplace noise has been added the final histogram can be published.

The invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited to it. Therefore, structural modes and embodiments similar to this technical solution that a person skilled in the art, enlightened by this teaching, devises without inventive effort and without departing from the spirit of the invention shall fall within the protection scope of the invention.

Claims

1. A grouped accurate histogram data publishing method based on differential privacy is characterized by comprising the following steps:

step one, obtaining a numerical histogram statistical data field and inputting the histogram frequencies into a histogram data set $H = (H_1, H_2, \ldots, H_n)$; the privacy budget $\varepsilon$ and $\Delta f$ are given at the same time, where $\Delta f$ is the $L_1$ distance between the data set $H$ and its neighboring data set;

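step two, taking the first histogram bucket $H_1$ as the base bucket $H_i$, adding it to the ordered histogram sequence $H^*$, and deleting the bucket from $H$;

step three, calculating the neighbor bucket set $L(H_i)$ of the base bucket $H_i$ and the exponential-mechanism scoring function $u(H_i)$, and selecting $H_j$ from $L(H_i)$ with probability proportional to $\exp\left(\frac{\varepsilon_1\, u(H_i)}{2\Delta u}\right)$, where the privacy budget $\varepsilon_1 = \varepsilon/2$ and $\Delta u$ is the sensitivity of the scoring function; adding $H_j$ to the ordered histogram sequence $H^*$ and then taking $H_j$ as the base bucket;

step four, repeating step three until the original histogram data set $H$ is empty;

step five, performing dynamic programming grouping on the ordered histogram sequence $H^*$ according to the global error Err, and selecting the histogram grouping structure $H^G$ with the minimum global error;

step six, representing the frequency of each bucket in a group by the group mean, adding Laplace noise $\mathrm{Lap}(1/\varepsilon_2)$ to each bucket frequency to obtain the noise-added histogram sequence, and publishing it;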

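in step five, the error evaluation function of the dynamic programming is set to the global error Err of a group $G_i$ that spans the buckets $H_l$ to $H_r$, as shown in formula (3) [formula not reproduced in this text], which is expressed in terms of the frequency mean of the group and of $|G_i|$, the number of buckets in the group, and is composed of the reconstruction error AE and the noise error; the privacy budget $\varepsilon_2 = \varepsilon/2$ determines the magnitude of the Laplace noise added to the group mean, and the Laplace noise added to the group mean is $\mathrm{Lap}(1/\varepsilon_2)/|G_i|$;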

in step five, the histogram is grouped by dynamic programming, and the minimum global error $T_{Err}$ of each grouping structure is recorded [formula not reproduced in this text]; the grouping structure $H^G$ with the lowest $T_{Err}$ is selected and the optimal number of groups $k$ is recorded, as shown in formula (6) [formula not reproduced in this text], where $n$ is the number of histogram buckets, $k$ ranges over all possible numbers of groups, and $1 \le k \le n$;

in step three, buckets whose frequencies are close to that of the base bucket are selected from the neighbor bucket set $L(H_i)$ of the base bucket $H_i$ according to the scoring function $u(H_i)$, where $L(H_i)$ and $u(H_i)$ are calculated according to formula (1) and formula (2) respectively:

$$L(H_i) = \{H_j : |H_j - H_i| \le \delta\} \tag{1}$$

$$u(H_i) = -\left(|H_j - H_i| + |j - i|\right) \tag{2}$$

where $\delta$ is a threshold that controls the number of buckets in the neighbor bucket set;

in step six, for the grouped histogram $H^G$, the frequencies of the buckets in each group are replaced by the group mean, so that the frequency of every histogram bucket in group $G_i$ equals the sum of the frequencies of the buckets in $G_i$ divided by $|G_i|$;

in step six, noise is added to the data set as follows: a probability density function obeying the Laplace distribution is constructed, the inverse cumulative distribution function is derived from it, and uniformly distributed random variables are fed into that function to obtain Laplace noise;

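the specific steps for obtaining the Laplace noise are:

S1, constructing the Laplace distribution $\mathrm{Lap}(b)$ with location parameter $\mu = 0$ and scale parameter $b$, whose probability density function $p(x)$ is shown in formula (7):

$$p(x) = \frac{1}{2b}\exp\left(-\frac{|x - \mu|}{b}\right) \tag{7}$$

S2, substituting a uniformly distributed random variable $\alpha \sim U(0, 1)$ into the inverse of the Laplace cumulative distribution function to obtain a noise value satisfying the condition, as shown in formula (8):

$$F^{-1}(\alpha) = \begin{cases} \mu + b\ln(2\alpha), & \alpha < 0.5 \\ \mu - b\ln\bigl(2(1 - \alpha)\bigr), & \alpha \ge 0.5 \end{cases} \tag{8}$$

S3, taking $\alpha \sim U(-0.5, 0.5)$ instead, so that the piecewise function of formula (8) merges into formula (9):

$$F^{-1}(\alpha) = 0 - b \cdot \operatorname{sign}(\alpha) \cdot \ln\bigl(1 - 2\operatorname{abs}(\alpha)\bigr) \tag{9}$$

where the sign function returns the sign of its argument and the abs function returns its absolute value.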

2. The grouped accurate histogram data publishing method based on differential privacy as claimed in claim 1, wherein: in step one, each $H_i$ is the frequency of a unit interval, and the privacy budget $\varepsilon$ is less than 1.