CN112329056A - Government affair data sharing-oriented localized differential privacy method - Google Patents


Info

Publication number
CN112329056A
Authority
CN
China
Prior art keywords: data, algorithm, vector, value, statistical
Prior art date
Legal status: Granted
Application number
CN202011211693.0A
Other languages: Chinese (zh)
Other versions: CN112329056B (en)
Inventor
朴春慧
郝玉蓉
蒋学红
郑丽娟
赵永斌
张云佐
Current Assignee: Guangzhou Chick Information Technology Co ltd
Original Assignee
Shijiazhuang Tiedao University
Priority date
Filing date
Publication date
Application filed by Shijiazhuang Tiedao University filed Critical Shijiazhuang Tiedao University
Priority to CN202011211693.0A
Publication of CN112329056A
Application granted
Publication of CN112329056B
Legal status: Active
Anticipated expiration

Classifications

    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis


Abstract

The invention relates to a localized differential privacy method for government affair data sharing. The method introduces a data-binning idea on the basis of the CMS algorithm: data records are partitioned through equal-width binning into smaller data-domain ranges, forming the BCS algorithm, which addresses the large statistical error that current privacy-protection algorithms exhibit when the data domain is large and the amount of data is small. The binning idea is then extended to the GRR algorithm to form the BRR algorithm, with a marked improvement. To verify the effectiveness of the BCS and BRR algorithms, the improved algorithms are compared with the CMS and GRR algorithms on a synthetic data set and a real data set respectively. Experimental results show that the proposed method effectively reduces statistical error, maintains data utility under different distributions and data-domain sizes, and has high popularization and application value.

Description

Government affair data sharing-oriented localized differential privacy method
Technical Field
The patent application belongs to the technical field of privacy protection, and particularly relates to a localized differential privacy method for government affair data sharing.
Background
Intelligent government relies on data to make decisions more accurately and objectively. Among the various kinds of data, statistics are widely used across fields because of their generality, timeliness and serviceability, so sharing statistical data is a natural business requirement in the construction of intelligent government. To make more rational decisions, a government department will generally share certain data records with other departments according to business needs, providing an auxiliary reference for those departments' management or service decisions. Government affair data sharing refers to the process of transferring information or data from one government department to another, via an information-sharing platform or other technical means and in accordance with laws and regulations, in order to break down data barriers and fully realize the value of the data. Government data typically involves a large amount of personal and business-related information. If such data were shared without proper precautions, sensitive information could easily be inferred, as evidenced by published data-leakage events over the past two decades, including the de-anonymization of Massachusetts public health records, of Netflix users, and of individuals participating in genome-wide association studies.
Existing work has studied the sensitive-information leaks that may exist in government data using privacy-protection techniques. Current data privacy-protection technologies fall broadly into three categories: anonymity-based, encryption-based, and differential-privacy-based. Anonymity-based techniques can be further divided, by private-data type and application scenario, into privacy protection for relational data, for social-graph data, and for location and trajectory data. Anonymity-based methods, however, typically lack strict privacy guarantees and are therefore suited only to small-scale data. Encryption-based methods offer better security guarantees, but their cryptographic operations impose heavy computational overhead, making them hard to apply in resource-limited settings. Differential Privacy (DP) has developed vigorously over the past decade; it can theoretically and precisely bound the upper limit of privacy disclosure, which is its main advantage over traditional privacy-protection schemes. The DP model has been used by the U.S. Census Bureau for demographic data. The traditional DP model is deployed on a central server, but in practice it is difficult to find a truly trusted third party, which limits its application to some extent. Localized Differential Privacy (LDP) arose on the basis of DP; it can resist not only attacks based on arbitrary background knowledge but also untrusted third parties. Currently, companies such as Google and Apple use the LDP model to collect information on users' default browser homepages and search-engine settings.
However, relatively little research applies the LDP model to privacy-preserving government data sharing, since these leading-edge methods must be adapted to concrete application scenarios and changing demands. In a government data-sharing scenario, the LDP model suffers from low accuracy and large statistical error, which is especially significant when the data domain is large and the amount of data is small.
Localized differential privacy provides a stronger privacy guarantee than centralized differential privacy and is formally defined as follows:
Definition 1 (ε-localized differential privacy, ε-LDP). Let x be private information taking a value in a set X of k elements ([k] = {0, 1, 2, …, k−1}, x ∈ X). A privatization mechanism Q is a random mapping from [k] to an output set Z that maps x ∈ X to z ∈ Z with probability Q(z|x); the output z is called a privatized sample. If for all x, x' ∈ X, all z ∈ Z, and any ε > 0,
Q(z|x) ≤ e^ε · Q(z|x')    (1)
then Q is deemed to satisfy ε-localized differential privacy. As equation (1) shows, a smaller privacy budget ε guarantees a higher privacy level, and the output distributions of any pair x, x' are similar, so no particular input can be inferred from the output.
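The inequality in Definition 1 can be checked numerically. Below is a minimal illustrative sketch (not part of the patent) that verifies ε-LDP for binary randomized response, the simplest two-value mechanism Q:

```python
import math

eps = 1.0
p = math.exp(eps) / (math.exp(eps) + 1)  # probability of reporting the true bit

# Q[z][x]: probability that input x is reported as output z
Q = [[p, 1 - p],
     [1 - p, p]]

# Definition 1: for every output z and every pair of inputs x, x',
# Q(z|x) <= e^eps * Q(z|x') must hold.
worst = max(Q[z][x] / Q[z][x2] for z in (0, 1) for x in (0, 1) for x2 in (0, 1))
assert worst <= math.exp(eps) + 1e-12
```

For binary randomized response the worst-case ratio p/(1−p) equals e^ε exactly, so the privacy budget is fully spent.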
Government statistical data sharing is an important component of electronic government affairs, but privacy disclosure existing in the sharing process greatly influences the strength and transparency of government statistical information disclosure. In localized differential privacy, the query results of a statistical database are not affected by any single private data, which protects personal information from being revealed under the requirement of ensuring that processed statistical information is available. Therefore, the LDP can be applied to a scene of sharing statistical data among government departments, and the safety of sensitive information in the data sharing process is ensured.
Such an inter-department data-sharing scenario arises naturally in government business: one department continuously obtains data records in an incremental manner, while another department needs the statistical information of those records as an auxiliary reference for its management or service decisions. In combination with this scenario, a solution is needed that realizes privacy-preserving data sharing among government departments using a localized differential privacy method, with its effectiveness analyzed through experiments.
Traditional localized differential privacy system architecture:
Specifically, the data-providing department performs privacy processing on the value of the sensitive-attribute column of each record to be shared, so that after receiving the shared data, the data-demanding department cannot deduce the exact sensitive-attribute value of any target object. The data-demanding department then applies a statistical correction to the received shared data, using the privacy parameters supplied by the provider, to obtain the required statistical results. As shown in fig. 1, the encoding and perturbation operations are performed by the data-providing department: each encoded record generates a perturbation report through a random function, thereby satisfying the ε-LDP definition. Finally, the provider sends the perturbed reports to the demander, which executes the decoding and aggregation operations and, after statistical correction, produces usable statistical data.
Common LDP method
The following describes currently common localized differential privacy algorithms and briefly analyzes them to find a method applicable to the inter-government statistical-data-sharing scenario.
1. Generalized Randomized Response (GRR)
The randomized response algorithm (RR) is the most typical LDP algorithm, first proposed by Warner in 1965 for privacy protection. RR, however, only handles discrete data with two values and is not applicable to data with more than two values. To this end, the more general GRR algorithm was proposed. For each piece of collected private data x ∈ X, with X = Z = [k], the user sends the true value of x with probability p and sends a value x' selected uniformly at random from X\{x} with probability 1−p. The perturbation function is shown in equation (2):
Pr[Q(x) = z] = e^ε / (e^ε + k − 1) if z = x;  1 / (e^ε + k − 1) if z ≠ x.    (2)
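A minimal sketch of GRR perturbation together with the standard unbiased frequency correction used to aggregate GRR reports (function and variable names are illustrative assumptions, not taken from the patent):

```python
import math
import random
from collections import Counter

def grr_perturb(x, k, eps, rng=random):
    """Send x truthfully with p = e^eps/(e^eps+k-1); else a uniform other value."""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p:
        return x
    z = rng.randrange(k - 1)
    return z if z < x else z + 1  # uniform over [k] \ {x}

def grr_estimate(reports, k, eps):
    """Debias observed counts: E[obs_v] = n*q + true_v*(p - q)."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = 1.0 / (math.exp(eps) + k - 1)
    obs = Counter(reports)
    return {v: (obs.get(v, 0) - n * q) / (p - q) for v in range(k)}

rng = random.Random(0)
data = [rng.randrange(4) for _ in range(20000)]
est = grr_estimate([grr_perturb(x, 4, 2.0, rng) for x in data], 4, 2.0)
```

The debiased estimates always sum to the total report count, and each per-value estimate is unbiased, though its variance grows with k, which is exactly the instability discussed below.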
2. Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR)
RAPPOR is a general LDP algorithm proposed by Google. In the RAPPOR mechanism, the user's real value x is encoded into a bit vector B. When there are many category attributes, RAPPOR suffers from high communication cost and low accuracy; for this reason it uses Bloom filters for encoding: the value x is mapped to positions in the bit vector B by k hash functions, the mapped positions are set to 1, and the remaining positions are set to 0. The privatized output B' is the bit vector obtained by flipping each bit of B independently with probability 1/(e^(ε/2) + 1).
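A sketch of the RAPPOR-style bit flip under the probability stated above (a one-hot encoding is used here for simplicity instead of a Bloom filter; names are illustrative):

```python
import math
import random

def rappor_perturb(B, eps, rng=random):
    """Flip each bit of the encoded vector B independently with prob 1/(e^(eps/2)+1)."""
    f = 1.0 / (math.exp(eps / 2) + 1)
    return [b ^ 1 if rng.random() < f else b for b in B]

# one-hot encoding of x = 2 over a domain of 6 categories (illustrative)
B = [0, 0, 1, 0, 0, 0]
B_priv = rappor_perturb(B, 2.0, random.Random(1))
```

Because every bit is flipped with the same probability, the aggregator can later debias the observed 1-counts per position to recover frequency estimates.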
3. Count Mean Sketch (CMS)
CMS was introduced by the Apple differential privacy team in 2016 and consists of a client-side algorithm and a server-side algorithm. The client-side algorithm maps an element d of the domain with one of k hash functions, ensuring that the size of each transmission is m. At the server side, the data structure for aggregating the private data is a sketch matrix M of dimensions k × m. To estimate the frequency of a domain element d ∈ D, the server-side algorithm averages the counts corresponding to the k hash functions in the matrix M to obtain the final frequency estimate.
All three localized differential privacy algorithms above are based on randomized response, whose inherent randomness makes the accuracy of the estimation results unstable. This is particularly significant when the data domain is large and the amount of data is small: the frequency corresponding to a low-frequency attribute value may even be estimated as negative, which reduces the reference value of the data. One of the innovations of the present invention is therefore an LDP algorithm with high data utility, improving the accuracy of estimation when the data domain is large and the data amount is small, and thereby providing government departments with usable statistical information while protecting data privacy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a localized differential privacy method for government affair data sharing, so as to overcome the defects of existing LDP algorithms and improve the accuracy and stability of estimation results.
To solve this problem, the technical scheme adopted by the invention is as follows:
A localized differential privacy method for government affair data sharing introduces a data-binning idea on the basis of the CMS (Count Mean Sketch) algorithm: data records are sorted through equal-width binning into data-domain ranges smaller than the original domain, and a count sketch matrix is constructed for aggregation to reduce space-time complexity, thereby overcoming the large statistical error of current privacy-protection algorithms where data are sparsely distributed (the data domain is large and the amount of data is small).
The technical scheme of the invention is further improved as follows: a local perturber is designed at the data provider to perturb the original data. First, the data are binned according to the domain size of the sensitive-attribute column value (the sensitive attribute value d). For each piece of data in a bin, the local perturber selects a random hash function to encode it into a vector and perturbs that vector. A report containing the selected hash-function index and the perturbed vector is then sent to the data demander. Since the provider-side algorithm satisfies the LDP definition, even a potential attacker with relevant background knowledge cannot learn accurate information about the target's sensitive attribute;
an aggregator is designed at the data demander. After receiving all perturbation reports and related parameters from the data provider, the aggregator aggregates them; the data structure of the aggregated privatized data is a count sketch matrix of size k × m. The demander averages the counts corresponding to the k hash functions in the matrix to obtain a frequency estimate for each attribute value, and finally generates usable statistical data after statistical correction.
The technical scheme of the invention is further improved as follows: the specific operation process is as follows:
S1. The original record is first encoded with a randomly selected hash function. A set of hash functions H = {h1, h2, …, hk} is designed at the data-provider side, where each function in H outputs a value no greater than m for any input, m being the length of the initialization vector for the sensitive attribute value d in each data record; the hash functions are then shared between the data provider and the data demander;
S2. The domain interval of the sensitive attribute value is divided into intervals Z1, Z2, …, according to the equal-width binning idea; each Zi is a divided domain interval smaller than the original data domain;
S3. A set V is initialized for storing the perturbation reports obtained subsequently, where Vi stores the perturbation reports of the data records belonging to interval Zi;
S4. The data provider perturbs, in order, the sensitive attribute value d(i) in each shared data record;
S5. The data demander calculates the frequency statistics for each attribute value from the received perturbation reports and related parameters.
The technical scheme of the invention is further improved as follows: the perturbation process at the data provider prevents user-data leakage by ensuring that the perturbed data obeys localized differential privacy, specifically:
S41. For the sensitive attribute value d in each data record, initialize a vector v of length m, denoted v = {−1}^m;
S42. Randomly select a value j in the range [k] as the index of the j-th hash function, where hj(d) denotes hashing the sensitive attribute value d with the j-th function; for example, if hj(d) = 134, bit 134 of v is set to 1;
S43. Flip each bit of the vector v independently with probability 1/(e^(ε/2) + 1) to generate a new vector ṽ;
S44. Send the flipped vector ṽ, the hash-function index j, and the values of the parameters k, m and ε to the data demander.
The technical scheme of the invention is further improved as follows: in S5, after obtaining the perturbation reports and related parameters from the data provider, the data demander constructs a count sketch matrix with the same parameters and estimates the count of each sensitive attribute value d through this matrix.
The technical scheme of the invention is further improved as follows: the data demander operates as follows:
S51. Initialize an all-zero matrix M of size k × m as the sketch matrix, where k is the number of hash functions and m is the length of the vector v;
S52. Process the vector v in each perturbation report and convert it into a vector x;
S53. For the perturbation result (x, j) of each sensitive attribute value, add x bitwise to the j-th row of the matrix M, which accumulates the entries for which the j-th hash function was selected; as the data demander obtains more records, the counts in M grow accordingly;
S54. The data demander reads the entry M[j, hj(d)] of each row and calculates the mean of these estimates, thereby obtaining an unbiased estimate.
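Steps S51-S54 can be sketched end to end as follows. The per-bit debiasing constant and the final collision correction follow the published CMS estimator; they are an assumption here, since the patent's pseudocode figures are not reproduced in the text, and all names and the hash family are illustrative:

```python
import math
import random

def perturb(d, hashes, m, eps, rng):
    """Minimal provider step: one-hot vector in {-1,1}^m, bits flipped w.p. 1/(e^(eps/2)+1)."""
    j = rng.randrange(len(hashes))
    v = [-1] * m
    v[hashes[j](d)] = 1
    f = 1.0 / (math.exp(eps / 2) + 1)
    return [-b if rng.random() < f else b for b in v], j

def aggregate(reports, k, m, eps):
    """S51-S53: accumulate debiased bit contributions into a k x m sketch matrix M."""
    c = (math.exp(eps / 2) + 1) / (math.exp(eps / 2) - 1)  # debiasing constant
    M = [[0.0] * m for _ in range(k)]
    for v, j in reports:
        for i, b in enumerate(v):
            M[j][i] += k * (c * b + 1) / 2  # transform so row sums are unbiased
    return M

def estimate(M, n, hashes, d):
    """S54: average the rows at positions h_j(d) and correct for hash collisions."""
    k, m = len(M), len(M[0])
    avg = sum(M[j][hashes[j](d)] for j in range(k)) / k
    return (m / (m - 1)) * (avg - n / m)

rng = random.Random(0)
m, k, eps = 32, 8, 3.0
hashes = [lambda d, s=s: hash((s, d)) % m for s in range(k)]  # illustrative family
data = ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000
reports = [perturb(d, hashes, m, eps, rng) for d in data]
M = aggregate(reports, k, m, eps)
est_A = estimate(M, len(data), hashes, "A")  # should be near the true count 6000
```

In expectation each report of d contributes k to its own hashed cell and 0 elsewhere, so the row average equals the true count plus a uniform collision term, which the final correction removes.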
Due to the adoption of the above technical scheme, the invention has the following beneficial effects:
The invention reduces statistical error by using data binning together with a count sketch matrix, achieving higher data utility under different distributions. The binning idea relaxes the strict data-volume requirement that current privacy-protection algorithms impose when the data domain is large, since data records can be binned into smaller data-domain ranges. During aggregation, the data of different sub-domains are aggregated within their respective domains, which prevents data from one sub-domain being counted in another and thereby improves the reliability of data aggregation to a certain extent.
The main beneficial effects of the technical scheme are as follows:
(1) For the scenario of sharing statistical data among government departments, a government affair data sharing method based on localized differential privacy is provided, which offers controllable privacy protection for sensitive information while promoting data sharing.
(2) The method relaxes the strict data-volume requirement of current privacy-protection algorithms when the data domain is large, effectively reduces statistical error, and can provide usable statistical information while protecting the privacy of government affair data.
(3) The method maintains good data utility across different data distributions and can adapt to various privacy-protection tasks under those distributions.
Drawings
FIG. 1 is a block diagram of a conventional localized differential privacy system framework;
FIG. 2 is a schematic diagram of the operation of a data provider of the present invention;
FIG. 3 is a schematic diagram of the operation of a data requestor according to the present invention;
FIG. 4 is a comparison graph of statistical results after BCS algorithm processing and original data statistical results under a simulation data set 1 (satisfying geometric distribution) according to the present invention;
FIG. 5 is a comparison graph of the statistical result after the BCS algorithm processing under the simulation data set 2 (satisfying uniform distribution) and the statistical result of the original data according to the present invention;
fig. 6 is a frequency histogram corresponding to a real data set (philippine family income and expenditure data set);
FIG. 7 is a frequency histogram obtained after CMS algorithm processing under a real data set;
FIG. 8 is a frequency histogram obtained after processing with the BCS algorithm of the present invention under a real data set;
FIG. 9 is a frequency histogram obtained after processing by GRR algorithm under a real data set;
FIG. 10 is a frequency histogram of a real data set processed by the BRR algorithm of the present invention;
FIG. 11 is a comparison graph of the average absolute percentage error of the statistics of each attribute value processed by CMS and BCS algorithms under a real data set;
FIG. 12 is a graph comparing the average absolute percentage error of the statistical measures of the attribute values processed by the GRR and BRR algorithms under the actual data set;
FIG. 13 is a graph of the effect of bin counts on mean absolute percentage error;
FIG. 14 shows the effect of data size on statistical error (CMS versus BCS comparison);
FIG. 15 shows the effect of data size on statistical error (GRR versus BRR comparison);
FIG. 16 shows the effect of privacy budget on statistical error (CMS versus BCS comparison);
FIG. 17 shows the effect of privacy budget on statistical error (GRR versus BRR comparison);
FIG. 18 shows the effect of bin count on statistical error under the same privacy budget (BCS with fxs = 4/8/16);
FIG. 19 shows the effect of bin count on statistical error under the same privacy budget (BRR with fxs = 4/8/16);
FIG. 20 shows the effect of data-domain size on statistical error (data set satisfying a Zipf distribution, CMS versus BCS comparison);
FIG. 21 shows the effect of data-domain size on statistical error (data set satisfying a uniform distribution, CMS versus BCS comparison).
Detailed Description
The present invention will be described in further detail with reference to examples.
The invention discloses a localized differential privacy method for government affair data sharing; the specific process is detailed below with reference to figs. 1-21.
1.1 localized differential privacy based data sharing scheme
To address the defects of existing LDP algorithms, a data-binning idea is adopted on the basis of the CMS algorithm to relax the strict data-volume requirement that current privacy-protection algorithms impose when the data domain is large, and a count sketch matrix is constructed for aggregation to reduce space-time complexity. The advantage is that data records can be sorted into smaller data domains using the binning idea; when data are aggregated, the data of different sub-domains are aggregated within their respective domains, which prevents data from one sub-domain being counted in another and improves the reliability of data aggregation to a certain extent.
The local perturber is designed at the data provider to perturb the raw data. The data are first binned according to the domain size of the sensitive-attribute column value. For each piece of data in a bin, the local perturber selects a random hash function, encodes the data into a vector, and perturbs the vector. A report containing the selected hash-function index and the perturbed vector is then sent to the data demander, as shown in FIG. 2. Since the data-provider algorithm satisfies the LDP definition, even a potential attacker with relevant background knowledge cannot learn accurate information about the target's sensitive attribute.
The aggregator is designed on the data-demander side. When all perturbation reports and related parameters have been received from the data provider, the demander aggregates them through the aggregator. The data structure of the aggregated privatized data is a count sketch matrix of size k × m. The demander obtains a frequency estimate for each attribute value by averaging the counts corresponding to the k hash functions in the matrix; a final statistical correction yields usable statistical data, as shown in FIG. 3.
1.2 data sharing algorithm design based on localized differential privacy
Common data-binning schemes are equal-width binning and equal-frequency binning. Equal-width binning divides the domain of the sensitive attribute value into intervals of the same width, so the amount of data in each bin is not fixed. Equal-frequency binning sorts the sensitive attribute values from small to large and divides them into x parts with equal record counts, so each bin holds the same amount of data. The effect of equal-frequency binning is susceptible to the data distribution, particularly when the records concentrate on a few attribute values, as in the Zipf and geometric distributions. This patent therefore divides the data records using equal-width bins. For brevity, the improved algorithm is denoted BCS (Binning Count Sketch).
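The equal-width binning described above can be sketched as follows (function names are illustrative; fxs denotes the bin count, the parameter name used in the patent's experiments):

```python
def equal_width_bins(domain_size, fxs):
    """Split the domain {0, ..., domain_size-1} into fxs equal-width sub-intervals.

    Returns half-open (lo, hi) interval bounds; the last bin may be narrower
    when fxs does not divide domain_size evenly.
    """
    width = -(-domain_size // fxs)  # ceiling division
    return [(lo, min(lo + width, domain_size)) for lo in range(0, domain_size, width)]

def bin_index(value, domain_size, fxs):
    """Which sub-interval a sensitive attribute value falls into."""
    width = -(-domain_size // fxs)
    return value // width

bins = equal_width_bins(100, 8)  # e.g. a domain of 100 attribute values, 8 bins
```

Each bin then runs the perturbation and aggregation over its own (much smaller) sub-domain, which is what reduces the statistical error.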
The original record is first encoded by a randomly selected hash function, so a set of hash functions H = {h1, h2, …, hk} must be designed at the data provider, with each function in H outputting a value no greater than m for any input. This set of hash functions is then shared between the data provider and the data demander. The complete BCS algorithm is given in Algorithm 1. Line 2 divides the domain interval of the sensitive attribute value, according to the equal-width binning idea, into smaller intervals Zi. Line 3 initializes a set V for storing the perturbation reports obtained thereafter, where Vi stores the perturbation reports of the data records belonging to interval Zi. In lines 4-8, the data provider perturbs, in turn, the sensitive attribute value d(i) in each shared data record. In lines 9-11, the data demander calculates the frequency statistics for each attribute value from the received perturbation reports and related parameters.
[Algorithm 1: the BCS algorithm (pseudocode figure in the original document)]
1.2.1 data provider Algorithm design
Algorithm 2 describes the perturbation process of the data-providing department. User data leakage is prevented by ensuring that the perturbed data satisfies localized differential privacy. In line 1, for the sensitive attribute value d in each data record, the algorithm initializes a vector v of length m, v = {-1}^m. In lines 2-3, a value j is chosen uniformly at random in [1, k] as the index of the selected hash function, where h_j(d) denotes hashing the sensitive attribute value d with the jth function. If, for example, h_j(d) = 134, then bit 134 of v is assigned the value 1. Then, in lines 5-6, each bit of the vector v is flipped independently at random with probability 1/(1 + e^(ε/2)), producing the perturbed vector ṽ; that is, Pr[ṽ_i = v_i] = e^(ε/2)/(1 + e^(ε/2)). Finally, in line 8, the flipped vector ṽ, the hash function index j, and the values of the parameters k, m and ε are sent to the data-demanding department.
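The client-side perturbation just described (lines 1-8 of Algorithm 2) can be sketched as follows. The SHA-256-based hash and the parameter values are illustrative assumptions; the flip probability 1/(1 + e^(ε/2)) is the standard CMS choice the description relies on:

```python
import hashlib
import math
import random

def perturb(d, k, m, epsilon):
    """Client-side perturbation: encode the sensitive value d with one
    randomly chosen hash function, then flip every bit independently
    with probability 1 / (1 + e^(epsilon/2))."""
    j = random.randrange(k)                       # index of the chosen hash function
    pos = int.from_bytes(hashlib.sha256(f"{j}:{d}".encode()).digest(), "big") % m
    v = [-1] * m
    v[pos] = 1                                    # one-hot encoding in {-1, 1}^m
    p_flip = 1.0 / (1.0 + math.exp(epsilon / 2))
    v_tilde = [-b if random.random() < p_flip else b for b in v]
    return v_tilde, j                             # report: perturbed vector + hash index
```

Each report reveals only ṽ and j; the provider also communicates k, m and ε once, as stated above.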
[Algorithm 2 is presented as an image in the original document.]
1.2.2 data requirement end algorithm design
After obtaining the perturbation reports and related parameters from the data-providing department, the data-demanding department constructs a count sketch matrix using the same parameters, as shown in Fig. 3. With the sketch matrix, the last step is to estimate the count of the sensitive attribute value d by debiasing the counts and averaging the corresponding hash entries in M. Here D denotes the data domain of the sensitive attribute column, and Z_i the sub-domains obtained after binning the sensitive attribute column. This process is described in Algorithm 3.
[Algorithm 3 is presented as an image in the original document.]
First, in line 1 an all-zero matrix M of size k × m is initialized to construct the sketch matrix, where k denotes the number of hash functions and m the length of the vector v. Then, in lines 2-4, the vector v in each perturbation report is processed and converted into a vector x. In lines 5-6, for each perturbation result (x, j) of the sensitive attribute value d, x is added bit-wise to row j of the matrix M; row j thus represents the sum over the records for which the jth hash function was selected. The counts accumulated in M become large enough once the data demander has obtained sufficiently many records. Finally, in lines 7-8, the data demander reads each row's entry M[j, h_j(d)] and calculates the mean of these estimates, thereby obtaining an unbiased estimate.
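The aggregation step can be sketched end-to-end as below. The debiasing transform x = k·(½·c_ε·ṽ + ½) with c_ε = (e^(ε/2)+1)/(e^(ε/2)−1), and the collision correction (m/(m−1))·(mean − n/m), follow the Count Mean Sketch construction that the patent builds on; the hash construction, parameters and toy data are illustrative assumptions:

```python
import hashlib
import math
import random

def h(j, d, m):
    """jth hash function: maps value d to a bucket in [0, m)."""
    return int.from_bytes(hashlib.sha256(f"{j}:{d}".encode()).digest(), "big") % m

def perturb(d, k, m, eps):
    """Client side: one-hot encode d under a random hash, then flip bits."""
    j = random.randrange(k)
    v = [-1] * m
    v[h(j, d, m)] = 1
    p_flip = 1.0 / (1.0 + math.exp(eps / 2))
    return [-b if random.random() < p_flip else b for b in v], j

def aggregate(reports, k, m, eps):
    """Server side: debias each report into x and add it to row j of M."""
    c = (math.exp(eps / 2) + 1) / (math.exp(eps / 2) - 1)  # debiasing constant
    M = [[0.0] * m for _ in range(k)]
    for v_tilde, j in reports:
        for i, b in enumerate(v_tilde):
            M[j][i] += k * (0.5 * c * b + 0.5)
    return M

def estimate(M, d, n, k, m):
    """Average M[j, h_j(d)] over the k rows, then correct for hash
    collisions to obtain an (approximately) unbiased count."""
    row_mean = sum(M[j][h(j, d, m)] for j in range(k)) / k
    return (m / (m - 1)) * (row_mean - n / m)

random.seed(7)
data = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
k, m, eps = 16, 256, 4.0
reports = [perturb(d, k, m, eps) for d in data]
M = aggregate(reports, k, m, eps)
est_a = estimate(M, "a", len(data), k, m)
est_b = estimate(M, "b", len(data), k, m)
```

With these toy parameters the estimates land near the true counts (600 and 300) up to the noise the privacy mechanism injects.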
To verify the feasibility of the BCS algorithm, this patent first generates two simulation datasets of 100,000 records each, following a geometric distribution and a uniform distribution respectively. The frequency statistics of the original datasets and of the data processed by the BCS algorithm (fxs = 8) are shown in Figs. 4-5. It is easy to see that the statistics after BCS processing differ only slightly from the original statistics, so the processed data remain statistically meaningful.
Figs. 4-5 compare the statistical results of simulation datasets 1 and 2 after BCS processing with the statistics of the original data; Fig. 4 shows the Geometric dataset and Fig. 5 the Uniform dataset.
1.3 method extension
In addition, this patent extends the data binning idea to the GRR algorithm mentioned in the background; the improved algorithm is denoted BRR (Binning Randomized Response). BRR is an improvement on GRR, and GRR is a special case of BRR under a particular parameter choice: when fxs = 1, BRR reduces to the GRR algorithm. Reviewing equation (2) shows that values of p close to 0 or 1 are undesirable. In the extreme case p = 1 we have K = 1; the sensitive attribute then has only one value, so the probability that the data is not flipped is 1, and sharing such data puts the sensitive information at risk. When p = 0, the flipping probability is largest and the perturbed data deviates far from the original data; although the protection strength increases, this defeats the original purpose of sharing the data. In the BRR algorithm, the number of values the sensitive attribute can take in each sub-domain is determined by the data domain of the sensitive attribute column and the bin count fxs; ensuring that the bin count does not exceed the size of the data domain keeps more than one possible value in each sub-domain, thereby protecting the privacy of the shared data.
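For reference, GRR and the binning variant described above can be sketched as follows. This is a hedged illustration: the patent gives no code, and the unbiased estimator f̂_v = (n_v − n·q)/(p − q) with p = e^ε/(e^ε + K − 1) and q = 1/(e^ε + K − 1) is the standard GRR frequency estimator:

```python
import math
import random

def grr_perturb(value, domain, eps):
    """Generalized Randomized Response: keep the true value with
    probability p = e^eps / (e^eps + K - 1), otherwise report one of
    the other K - 1 values uniformly at random."""
    K = len(domain)
    p = math.exp(eps) / (math.exp(eps) + K - 1)
    if random.random() < p:
        return value
    return random.choice([v for v in domain if v != value])

def grr_estimate(reports, domain, eps):
    """Unbiased frequency estimates: f_v = (n_v - n*q) / (p - q)."""
    K = len(domain)
    p = math.exp(eps) / (math.exp(eps) + K - 1)
    q = 1.0 / (math.exp(eps) + K - 1)
    n = len(reports)
    return {v: (sum(r == v for r in reports) - n * q) / (p - q) for v in domain}

def brr_perturb(value, domain, fxs, eps):
    """BRR sketch: split the domain into fxs equal-width sub-domains and
    run GRR inside the sub-domain containing the record; fxs = 1
    recovers plain GRR."""
    width = math.ceil(len(domain) / fxs)
    i = domain.index(value) // width
    return grr_perturb(value, domain[i * width:(i + 1) * width], eps)
```

Note how `brr_perturb` with fxs = len(domain) would leave K = 1 per sub-domain, i.e. no perturbation at all, which is exactly the degenerate case the text warns against.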
2 experimental part
2.1 Experimental data set
The experiments use three datasets in total: two simulation datasets and one real dataset. The simulation datasets follow a Zipf distribution and a uniform distribution respectively, and each contains 100,000 records. The real dataset is the Family Income and Expenditure dataset provided by Kaggle. Treating households with an annual income greater than 200,000 pesos as high-income households, 16,685 records were screened out for subsequent analysis. Since the dataset contains many attribute fields, the data-providing department can share statistical data on high-income households along dimensions such as region, age, and marital status. This patent takes the age dimension as an example and applies privacy protection to the frequency data of high-income households at each age, in order to demonstrate the characteristics and advantages of the BCS and BRR algorithms.
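The patent does not state how the simulation datasets were generated; a plausible NumPy sketch producing 100,000 Zipf-distributed and 100,000 uniformly distributed records over a finite domain might look like this (the generator parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Zipf-distributed values, capped at 100 so the domain is finite;
# records concentrate heavily on a few small values.
zipf_data = np.minimum(rng.zipf(a=2.0, size=100_000), 100)

# Uniformly distributed values over the same nominal domain [1, 100].
uniform_data = rng.integers(1, 101, size=100_000)
```

The Zipf dataset reproduces the "records concentrated on a few attribute values" situation in which equal-frequency binning degrades, while the uniform dataset serves as the benign baseline.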
2.2 Experimental indices
The utility of privacy-protected data is typically evaluated by how much the privacy-processed dataset differs from the original dataset. Common error metrics include relative error, absolute error, mean absolute percentage error, and Euclidean distance. Here the mean absolute percentage error (MAPE) is used as the evaluation index of data utility, and the effectiveness of the BCS algorithm is demonstrated by comparison with the CMS algorithm. For each attribute value, we compute the absolute difference between the estimated frequency and the true frequency, divide it by the true frequency, sum these values, and divide by the size of the data domain. MAPE [42,43] is defined as follows:
MAPE = (1/|D|) * Σ_{i=1}^{|D|} |x_i − y_i| / y_i
where |D| is the category domain size of the sensitive attribute column, y_i is the true frequency of the ith attribute value, and x_i is the estimated frequency of the ith attribute value. The smaller the MAPE value, the closer the estimated distribution is to the true distribution and the better the utility of the data.
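The MAPE computation described above is straightforward to express in Python (the toy frequencies are illustrative):

```python
def mape(true_freqs, est_freqs):
    """MAPE = (1/|D|) * sum_i |x_i - y_i| / y_i, where y_i is the true
    frequency of the ith attribute value and x_i its estimate."""
    assert len(true_freqs) == len(est_freqs) and len(true_freqs) > 0
    return sum(abs(x - y) / y for x, y in zip(est_freqs, true_freqs)) / len(true_freqs)

score = mape([100, 50, 10], [110, 45, 12])  # (0.1 + 0.1 + 0.2) / 3
```

Note that MAPE weights each attribute value equally, so a small absolute error on a low-frequency value (small y_i) contributes a large relative term, which is why sparse regions dominate the error analysis later in the text.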
2.3 analysis of the results
2.3.1 frequency estimation
To show the frequency distribution trend clearly and facilitate observation, we calculate the estimated frequency of each attribute value and plot age-distribution histograms of high-income households for the selected data, as shown in Figs. 6-10. Fig. 6 is the frequency histogram of the original real dataset; Fig. 7 is the histogram after CMS processing with ε = 2; Fig. 8 after BCS processing with ε = 2, fxs = 16; Fig. 9 after GRR processing with ε = 2; and Fig. 10 after BRR processing with ε = 2, fxs = 16. Observing Figs. 6-10, although the specific recorded values change, the overall trend of the data preserves the distribution properties of the original data and still has strong reference value. Compared with the CMS algorithm, the estimated frequencies of the data processed by the BCS algorithm are closer to the true values, particularly at the two ends where the data volume is small.
2.3.2 error analysis
Using CMS and GRR as controls, we calculated the MAPE of the statistics of the number of high-income households at each age after privacy protection. Figs. 11-12 show the mean absolute percentage error of the statistics for each age after privacy protection; Fig. 11 compares CMS with BCS, and Fig. 12 compares GRR with BRR.
As shown in Figs. 11-12, we set the parameters to m = 1024, k = 2048 and ε = 2. Analysis of the experimental data shows that the statistical errors after CMS privacy protection are largest when the householder's age is 15, 17, 93, 97 or 99. Similarly, the statistical errors after GRR privacy protection are largest when the householder's age is 15, 17, 89, 96 or 98. It is easy to see that the household counts at these ages are all cases of very small data volume, respectively 2, 4, 3, 2, 4, 5, 3 and 4. Compared with the CMS and GRR algorithms, the BCS and BRR algorithms give smaller statistical errors when the data volume is small; the errors are lower than before the improvement and the overall trend is more stable.
2.3.3 Effect of bin count on statistical data error
To verify the influence of the bin count on the statistical error, the mean absolute percentage error of the high-income household statistics for each householder age was calculated under different bin counts, with parameters m = 1024, k = 2048 and ε = 2. Fig. 13 shows the effect of the bin count on the statistical error: as the bin count increases, MAPE decreases gradually. This is because binning limits, to a certain extent, the error introduced by the random response technique, ensuring that a low-frequency age value still lies close to the original age value after privacy processing. In addition, it should be noted that the maximum bin count in the BCS and BRR algorithms should not exceed the size of the data domain, especially in the BRR algorithm. If the bin count exceeds the size of the data domain, the number of values the sensitive attribute can take in the sub-domain Z_i approaches 1, i.e., the probability that the data is not flipped approaches 1. When the sub-domain satisfies 1 <= |Z_i| < 1.5, corresponding to 56 <= fxs < 84 in Figs. 11-12, the error of the BRR algorithm gradually approaches 0. Sharing such data directly would put the sensitive information at risk, which is undesirable.
2.3.4 Effect of data Scale on statistical data error
To confirm the utility of the BCS and BRR algorithms on data of different scales, we duplicated the screened Philippine high-income household dataset (16,685 records) to generate datasets of 4 times (66,740 records), 8 times, 12 times, 16 times and 20 times (333,700 records) the original size. The parameters were set to m = 1024, k = 2048, ε = 2 and fxs = 4. Using CMS and GRR as controls, we observed the change in the statistical error at the different data scales.
Figs. 14-15 show the influence of data scale on the statistical error. As the data scale increases, the overall error of the statistics produced by each privacy protection algorithm gradually decreases. Compared with CMS and GRR, BCS and BRR maintain better data utility at every scale, and their advantage is most obvious when the data scale is small. Once the dataset reaches a certain scale, the overall error of the BCS and BRR statistics fluctuates within a small range. The experimental results also verify the statement of Qingqing et al. in the literature that the size of the data volume used for statistics determines the degree of data utility.
2.3.5 influence of privacy budget on statistical data errors
Under different privacy budget values (0-5), privacy protection was applied to the original Philippine high-income household dataset to verify the influence of the privacy budget parameter on the data utility of BCS and BRR. As before, the parameters m = 1024 and k = 2048 were used, with CMS and GRR as controls. Taking fxs = 8 as an example, Figs. 16-17 show that as the privacy budget increases, the MAPE of both algorithms decreases; that is, a larger privacy budget yields estimated data closer to the original data. Second, at the same privacy budget the error values of the BCS and BRR algorithms are smaller; in particular, when ε < 2.5 (resp. ε < 3.5), the data utility of BCS (resp. BRR) is significantly better than that of CMS (resp. GRR). In addition, under the same parameter settings, the MAPE was calculated for fxs = 4, fxs = 8 and fxs = 16; as shown in Figs. 18-19, the larger the bin count fxs, the smaller the statistical error MAPE at the same privacy budget.
Figs. 16-17 show the effect of the privacy budget on the statistical error; Fig. 16 compares CMS with BCS, and Fig. 17 compares GRR with BRR.
Fig. 18-19 show the effect of bin count fxs on statistical data error for the same privacy budget.
2.3.6 Effect of data field size on statistical data error
The BCS algorithm encodes data with hash functions in the encoding stage, so hash collisions can easily occur: different values may be mapped to the same position by the same hash function, reducing the estimation accuracy. We generated datasets following Zipf and uniform distributions with different data domain sizes, each of size 100,000, and tested algorithm utility under the different domain sizes. The parameters were set to m = 256, k = 1024, ε = 2 and fxs = 8. As shown in Figs. 20-21, the errors of the BCS and CMS algorithms under both the Zipf and the uniform distributions increase as the data domain grows, significantly so when |D| > m. Second, for the same data domain, the utility of the BCS algorithm is significantly better than that of the CMS algorithm under both distributions.
Figs. 20-21 show the effect of data domain size on the statistical error for the Zipf and Uniform datasets, respectively.
2.4 algorithmic characterization
We analyzed BCS and BRR algorithm characteristics from two aspects:
(1) Better data utility. From the error profile of the frequency histograms, the error of the CMS and GRR algorithms mainly originates where the sensitive attribute values are sparse. This patent proposes the binning-based privacy protection algorithms BCS and BRR, which obtain a smaller error by dividing the data into smaller data domains. The discussion and analysis of data scale, privacy budget and data domain size show that BCS and BRR have better data utility.
(2) Slightly higher time complexity. Compared with the CMS and GRR algorithms, the extra time required by the BCS and BRR algorithms is only the time needed to bin the data. Since the personal statistical data shared by governments is usually not of a huge order of magnitude, this slightly higher time complexity does not become a limiting factor for practical application of the algorithms in the government field.
3. Conclusion
To help smart government use data to make more accurate and objective decisions, data sharing is an essential task in the construction of smart government. Since government affairs data contain a large amount of personal sensitive information, directly sharing the data, or analyzing the shared data, may leak private information. This patent therefore studies how the government can protect private information from leakage while promoting government data sharing. First, for the scenario of sharing statistical data among government departments, the patent discusses existing privacy protection technology and proposes a government data sharing algorithm BCS (Binning Count Sketch) based on localized differential privacy. Comparison with the CMS algorithm shows that the proposed algorithm has higher utility, maintained under different distributions and data domain sizes. However, the algorithm currently only suits single-valued sensitive attributes; future work will consider how to guarantee both the privacy and the utility of the data for multi-valued sensitive attributes.
Smart government, as an extension of service-oriented government, needs data to help it make decisions more accurately and objectively. To make more rational decisions, a government department generally shares certain data records with other departments according to business requirements, providing auxiliary reference for those departments' management or service decisions. However, if government data are shared without appropriate precautions, private information is easily leaked. Therefore, how to protect personal private information from leakage while promoting trustworthy government data sharing and building a secure smart government is a key problem for governments to solve. For the scenario of sharing statistical data among government departments, this patent proposes a government data sharing method based on localized differential privacy. The method introduces the idea of data binning on the basis of the Count Mean Sketch (CMS) algorithm and splits the data records into smaller data domain ranges through equal-width binning, thereby alleviating the problem that current privacy protection algorithms suffer large statistical errors when the data domain is large and the data volume is small. The data binning idea is then extended to the Generalized Randomized Response (GRR) algorithm, with obvious effect. To verify their validity, the improved Binning Count Sketch (BCS) and Binning Randomized Response (BRR) algorithms of the present invention were compared with the CMS and GRR algorithms on synthetic and real datasets, respectively. Experimental results show that the proposed method effectively reduces statistical errors, improves utility under different distributions and data domain sizes, and has high value for popularization and application.

Claims (6)

1. A localized differential privacy method for government data sharing, characterized in that the idea of data binning is introduced on the basis of the CMS algorithm; data records are divided by equal-width binning into data domain ranges smaller than the original data domain, and a count sketch matrix used for aggregation is constructed to reduce the space-time complexity, thereby overcoming the problem that current privacy protection algorithms have large statistical errors where the data distribution is sparse.
2. The localized differential privacy method for government data sharing according to claim 1, characterized in that:
a local perturbator is designed at the data provider to perturb the original data: first, the data is binned according to the domain size of the sensitive attribute column values; for each piece of data in a bin, the local perturbator selects a random hash function to encode it into a vector and perturbs the vector; a report containing the selected hash function index and the perturbed vector is then sent to the data demander;
an aggregator is designed at the data demander; after receiving all perturbation reports and related parameters from the data provider, the aggregator aggregates them, the data structure of the aggregated privatized data being a count sketch matrix of size k × m; the data demander averages the counts corresponding to the k hash functions in the matrix to obtain a frequency estimate of each attribute value, and finally generates usable statistical data after statistical correction.
3. The localized differential privacy method for government data sharing according to claim 2, characterized in that the specific operation process is as follows:
S1, the original record is first encoded by a randomly selected hash function; a set of hash functions H = {h1, h2, …, hk} is designed at the data provider side, with the requirement that each function in H outputs a value no greater than m for any input, where m is the length of the initialization vector for the sensitive attribute value d in each data record; the hash functions are then shared between the data provider and the data demander;
S2, dividing the domain interval Z of the sensitive attribute values according to the equal-width binning idea, the divided domain intervals Z_i being smaller than the original data domain;
S3, initializing a set V for storing the perturbation reports obtained subsequently, wherein V_i stores the perturbation reports of the data records belonging to interval Z_i;
S4, the data provider sequentially perturbing the sensitive attribute value d in each shared data record;
S5, the data demander calculating the frequency statistics of each attribute value from the received perturbation reports and the related parameters.
4. The localized differential privacy method for government data sharing according to claim 3, characterized in that the perturbation process of the data provider prevents user data leakage by ensuring that the perturbed data satisfies localized differential privacy, specifically:
S41, initializing a vector v of length m for the sensitive attribute value d in each data record, v = {-1}^m;
S42, randomly selecting a value j in the range [1, k] as the index of the selected hash function, wherein h_j(d) denotes hashing the sensitive attribute value d with the jth function; if, for example, h_j(d) = 134, bit 134 of the vector v is assigned the value 1;
S43, flipping each bit of the vector v independently at random with probability 1/(1 + e^(ε/2)) to generate the perturbed vector ṽ, i.e., Pr[ṽ_i = v_i] = e^(ε/2)/(1 + e^(ε/2));
S44, sending the flipped vector ṽ, the hash function index j, and the values of the parameters k, m and ε to the data demander.
5. The localized differential privacy method for government data sharing according to claim 4, wherein in step S5, after obtaining the perturbation reports and related parameters from the data provider, the data demander constructs a count sketch matrix using the same parameters and estimates the count of the sensitive attribute value d from the count sketch matrix.
6. The localized differential privacy method for government data sharing according to claim 5, wherein the specific operation of the data demander is:
S51, initializing an all-zero matrix M of size k × m to construct the count sketch matrix, where k denotes the number of hash functions and m denotes the length of the vector v;
S52, processing the vector v in each perturbation report and converting it into a vector x;
S53, for the perturbation result (x, j) of each sensitive attribute value, adding the vector x bit-wise to the jth row of the matrix M, the jth row representing the sum over the record entries for which the jth hash function was selected;
S54, the data demander reading each row's entry M[j, h_j(d)] and calculating the mean of these estimates, thereby obtaining an unbiased estimate.
CN202011211693.0A 2020-11-03 2020-11-03 Government affair data sharing-oriented localized differential privacy method Active CN112329056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011211693.0A CN112329056B (en) 2020-11-03 2020-11-03 Government affair data sharing-oriented localized differential privacy method

Publications (2)

Publication Number Publication Date
CN112329056A true CN112329056A (en) 2021-02-05
CN112329056B CN112329056B (en) 2021-11-02

Family

ID=74323220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011211693.0A Active CN112329056B (en) 2020-11-03 2020-11-03 Government affair data sharing-oriented localized differential privacy method

Country Status (1)

Country Link
CN (1) CN112329056B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299436A (en) * 2018-09-17 2019-02-01 北京邮电大学 A kind of ordering of optimization preference method of data capture meeting local difference privacy
US20190370334A1 (en) * 2018-06-02 2019-12-05 Apple Inc. Privatized apriori algorithm for sequential data discovery
CN110651449A (en) * 2017-06-04 2020-01-03 苹果公司 Differential privacy using count mean sketch
CN110874488A (en) * 2019-11-15 2020-03-10 哈尔滨工业大学(深圳) Stream data frequency counting method, device and system based on mixed differential privacy and storage medium
CN111737744A (en) * 2020-06-22 2020-10-02 安徽工业大学 Data publishing method based on differential privacy

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110651449A (en) * 2017-06-04 2020-01-03 苹果公司 Differential privacy using count mean sketch
EP3607695A1 (en) * 2017-06-04 2020-02-12 Apple Inc. Differential privacy using a count mean sketch
US20190370334A1 (en) * 2018-06-02 2019-12-05 Apple Inc. Privatized apriori algorithm for sequential data discovery
CN109299436A (en) * 2018-09-17 2019-02-01 北京邮电大学 A kind of ordering of optimization preference method of data capture meeting local difference privacy
CN110874488A (en) * 2019-11-15 2020-03-10 哈尔滨工业大学(深圳) Stream data frequency counting method, device and system based on mixed differential privacy and storage medium
CN111737744A (en) * 2020-06-22 2020-10-02 安徽工业大学 Data publishing method based on differential privacy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHRISTOFIDES T C 等: "A generalized randomized response technique", 《METRIKA》 *
JAYADEV ACHARYA 等: "Hadamard Response: Estimating Distributions Privately,Efficiently, and with Little Communication", 《COMPUTER SCIENCE》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158244A (en) * 2021-04-25 2021-07-23 Oppo广东移动通信有限公司 Data privacy protection method and device, storage medium and electronic equipment
CN113158244B (en) * 2021-04-25 2024-05-17 深圳市与飞科技有限公司 Data privacy protection method and device, storage medium and electronic equipment
CN113660263A (en) * 2021-08-16 2021-11-16 Oppo广东移动通信有限公司 Data processing method and device, storage medium, user equipment and server
CN113656272A (en) * 2021-08-16 2021-11-16 Oppo广东移动通信有限公司 Data processing method and device, storage medium, user equipment and server
CN114154202A (en) * 2022-02-09 2022-03-08 支付宝(杭州)信息技术有限公司 Wind control data exploration method and system based on differential privacy
CN114154202B (en) * 2022-02-09 2022-06-24 支付宝(杭州)信息技术有限公司 Wind control data exploration method and system based on differential privacy
CN115455483A (en) * 2022-09-21 2022-12-09 广州大学 Local differential privacy-based large data frequency estimation method
CN115455483B (en) * 2022-09-21 2023-12-26 广州大学 Big data frequency number estimation method based on local differential privacy
CN116308963A (en) * 2023-05-19 2023-06-23 北京十环信息有限公司 Government affair data analysis method and system
CN116308963B (en) * 2023-05-19 2023-07-18 北京十环信息有限公司 Government affair data analysis method and system

Also Published As

Publication number Publication date
CN112329056B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN112329056B (en) Government affair data sharing-oriented localized differential privacy method
Jiang et al. Achieving high performance and privacy-preserving query over encrypted multidimensional big metering data
Salamatian et al. How to hide the elephant-or the donkey-in the room: Practical privacy against statistical inference for large data
Verykios et al. State-of-the-art in privacy preserving data mining
Piao et al. Privacy-preserving governmental data publishing: A fog-computing-based differential privacy approach
US8595837B2 (en) Security event management apparatus, systems, and methods
Wang et al. Improving utility and security of the shuffler-based differential privacy
CN113206831B (en) Data acquisition privacy protection method facing edge calculation
Zhao et al. Differentially private linear sketches: Efficient implementations and applications
Yamamoto et al. eFL-Boost: Efficient federated learning for gradient boosting decision trees
Freudiger et al. Privacy preserving data quality assessment for high-fidelity data sharing
Yang et al. A novel temporal perturbation based privacy-preserving scheme for real-time monitoring systems
Liang et al. A pufferfish privacy mechanism for monitoring web browsing behavior under temporal correlations
Piao et al. Privacy protection in government data sharing: An improved LDP-based approach
Yang et al. Differentially Private Distributed Frequency Estimation
Hou et al. Wdt-SCAN: Clustering decentralized social graphs with local differential privacy
Li et al. A movie recommendation system based on differential privacy protection
Raja et al. Contemporary PCA and NBA based Hybrid Cloud Intrusion Detection System
Li A personalized privacy-preserving scheme for federated learning
Zhao et al. Local differentially private federated learning with homomorphic encryption
Gao et al. APRS: a privacy-preserving location-aware recommender system based on differentially private histogram.
Zhang et al. Frequency Estimation Mechanisms under $(\epsilon,\delta) $-Utility-optimized Local Differential Privacy
Voronov et al. A Framework for Anomaly Detection in Blockchain Networks With Sketches
Takabi et al. Differentially private distributed data analysis
Lang et al. Compressed Private Aggregation for Scalable and Robust Federated Learning over Massive Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240516

Address after: No. 3 Juquan Road, Science City, Huangpu District, Guangzhou City, Guangdong Province, 510700. Office card slot A090, Jian'an Co Creation Space, Area A, Guangzhou International Business Incubator, No. A701

Patentee after: Guangzhou chick Information Technology Co.,Ltd.

Country or region after: China

Address before: Shijiazhuang Railway University, No.17, North 2nd Ring Road East, Shijiazhuang, Hebei Province

Patentee before: SHIJIAZHUANG TIEDAO University

Country or region before: China