CN110795758A - Non-equidistant histogram publishing method based on differential privacy - Google Patents

Non-equidistant histogram publishing method based on differential privacy

Info

Publication number
CN110795758A
Authority
CN
China
Prior art keywords
value
key
group
histogram
privacy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910961197.8A
Other languages
Chinese (zh)
Other versions
CN110795758B (en)
Inventor
郑啸
杨磊
陈启航
梁越永
童琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xiangyun Technology Co Ltd
Maanshan Health Information Center
Anhui University of Technology AHUT
Original Assignee
Anhui Xiangyun Technology Co Ltd
Maanshan Health Information Center
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xiangyun Technology Co Ltd, Maanshan Health Information Center, Anhui University of Technology AHUT
Priority to CN201910961197.8A
Publication of CN110795758A
Application granted
Publication of CN110795758B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2107 File encryption

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a differential privacy-based non-equidistant histogram publishing method, which relates to the technical field of data privacy protection and comprises two main steps: 1) existing histogram publishing methods under differential privacy divide the data into equidistant histograms without considering how sparse the data distribution is, so the histograms cannot adequately reflect the distribution characteristics of the sample data and zero buckets may occur; to address these problems, a non-equidistant histogram is generated by dividing the ordinate equally and using that division to determine the demarcation points of each group on the abscissa; 2) a privacy budget is allocated to each group of the non-equidistant histogram according to its group distance, and random noise following a Laplace distribution is added to each group, which improves the privacy of the data in the non-equidistant histogram while preserving the accuracy of some long-range query results. The invention thus balances the privacy and usability of the data and preserves the distribution characteristics of the data.

Description

Non-equidistant histogram publishing method based on differential privacy
Technical Field
The invention relates to the technical field of data privacy protection, in particular to a non-equidistant histogram publishing method based on differential privacy.
Background
With the advent of the big data era, organizations collect all kinds of information, generating massive data, and may publish statistical results in various forms, including histograms that reflect data distribution characteristics. A histogram is a technique for intuitively estimating the distribution characteristics of data; its key parameters are the group distance and the number of groups. Most histogram publishing techniques consider only how to select the number of groups, and most of the generated histograms are equal-width histograms. A good histogram must consider both aspects: not only must the number of groups be chosen reasonably, but the group distances must also be divided reasonably according to the sparsity of the data. Taking hospitals as an example, most use electronic medical record systems and therefore generate a large amount of medical data. To reflect the health status of the population, medical institutions may publish statistics of medical health data in various forms, such as the number of patients in each age interval.
However, the data held by medical institutions contain much sensitive information and need privacy protection. Many privacy protection technologies for data exist; differential privacy, proposed by Cynthia Dwork in 2006, is a privacy protection technology that resists attacks with arbitrary background knowledge and allows relatively accurate statistical data to be published while protecting each single record in a database.
Although existing differential privacy research addresses histogram publishing, most of the generated histograms are equal-width histograms, which hide the distribution characteristics of the data to some extent; privacy protection for non-equidistant histograms therefore needs to be considered.
Combining the non-equidistant histogram with differential privacy further optimizes histogram publishing under differential privacy: the published histogram better reflects the characteristics of the data distribution while still meeting the privacy protection requirement.
Disclosure of Invention
The invention aims to provide a differential privacy-based non-equidistant histogram publishing method that combines differential privacy protection with a non-equidistant histogram and sets the privacy budget according to the sparsity of the data distribution, so as to ensure the privacy of the data, retain the distribution characteristics of the data as far as possible, and ensure the accuracy of long-range queries to a certain extent.
In order to achieve the above purpose, the invention provides the following technical scheme: a non-equidistant histogram publishing method based on differential privacy, comprising the following steps:
1) selecting an original database table that contains at least one sensitive attribute column needing privacy protection, and setting the total privacy budget ε of the non-equidistant histogram to be published;
2) reading the N records of the original database table as key-value pairs <key, value> to obtain N key-value pairs, where key is the attribute value of a certain column of the database table and value is the value of a certain sensitive attribute column of the database table;
3) preprocessing the N key-value pairs: merging key-value pairs with the same key and accumulating their values, to generate n key-value pairs with distinct keys;
4) sorting the n key-value pairs by key in ascending order, denoted <key_(1), value_(1)>, <key_(2), value_(2)>, …, <key_(n), value_(n)>, where key_(1) < key_(2) < … < key_(n);
5) taking the key as the abscissa of the non-equidistant histogram, setting the upper bound of the key to Max and the lower bound to Min, and dividing the value range [Min, Max] of the key into k groups;
6) determining the set of boundary points on the abscissa using an empirical distribution function and its generalized inverse function, and recording the left and right boundary points and the group distance of each of the k groups on the abscissa;
7) counting, group by group, the frequency sum of the keys falling into each abscissa group, calculating the frequency and group height of each group, and constructing the non-equidistant histogram h;
8) according to the definitions of global sensitivity and privacy budget in differential privacy, adding Laplace noise to the group height of each group of the non-equidistant histogram h, and publishing the differentially private non-equidistant histogram h'.
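Steps 2) to 4) above can be illustrated with a minimal Python sketch. The function name `merge_and_sort` and the sample records are hypothetical and serve only as an illustration; the sketch assumes each record has already been reduced to a (key, value) tuple.

```python
from collections import defaultdict

def merge_and_sort(pairs):
    """Merge (key, value) pairs that share a key by summing their values,
    then return the merged pairs sorted by key in ascending order."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical records: key = age, value = 1 if the sensitive condition holds, else 0.
records = [(30, 1), (24, 1), (30, 0), (24, 1), (45, 1)]
merged = merge_and_sort(records)
print(merged)  # [(24, 2), (30, 1), (45, 1)]
```

After merging, every key appears once, so the sorted keys are strictly increasing, which is what step 4) requires.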
Further, the empirical distribution function in step 6) is denoted $F_n(x)$, has range [0,1], and is defined as:
Figure BDA0002228963020000031
where 1 ≤ m ≤ n-1, j = 0,1,…,n, and $key_{(m)}$, $key_{(m+1)}$ denote the m-th and (m+1)-th records in the sorted key-value sequence;
the range [0,1] of $F_n(x)$ is divided equally into k groups, with group distance
$$\frac{1}{k}$$
and the interval of the g-th group is
$$\left[\frac{g-1}{k},\ \frac{g}{k}\right],\quad g=1,2,\dots,k$$
The generalized inverse function of the empirical distribution function $F_n(x)$ is denoted
$$F_n^{-1}(x)=\inf\{\,y : F_n(y)\ge x\,\}$$
that is, the greatest lower bound of the y satisfying $F_n(y)\ge x$, where $key_{(1)}\le y\le key_{(n)}$.
The set of all boundary points on the abscissa of the non-equidistant histogram is defined as $B_q$:
$$B_q=F_n^{-1}\!\left(\frac{q}{k}\right),\quad q=0,1,\dots,k$$
For any group $b_g$ on the abscissa, its left boundary is denoted $B_{gL}$ and its right boundary $B_{gR}$; then:
$B_{gL}=B_{g-1}$, $B_{gR}=B_g$
The group distance of group $b_g$ is denoted $\Delta B_g$, giving:
$\Delta B_g=B_g-B_{g-1}$, g = 1,2,…,k.
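The boundary-point construction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes that the empirical distribution function is the cumulative sum of values with key ≤ x divided by the total, that interior boundaries are the generalized inverse evaluated at q/k, and that the first and last boundaries are pinned to Min and Max. The function name `boundary_points` and the sample data are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def boundary_points(sorted_pairs, k, lo, hi):
    """Return the k+1 abscissa boundaries B_0..B_k.

    sorted_pairs: (key, value) pairs sorted by key, values >= 0.
    Assumption: F_n(x) = (sum of values with key <= x) / total, and
    B_q = inf{y : F_n(y) >= q/k} for 0 < q < k, with B_0 = lo and B_k = hi.
    """
    keys = [key for key, _ in sorted_pairs]
    cum = list(accumulate(value for _, value in sorted_pairs))
    total = cum[-1]
    # Compare cum * k against q * total in integers to avoid float rounding.
    scaled = [c * k for c in cum]
    bounds = [lo]
    for q in range(1, k):
        idx = bisect_left(scaled, q * total)  # first key whose F_n reaches q/k
        bounds.append(keys[idx])
    bounds.append(hi)
    return bounds

# Hypothetical data: ages 0..9 with most of the mass at ages 4 and 5.
pairs = list(zip(range(10), [1, 1, 1, 1, 8, 8, 1, 1, 1, 1]))
print(boundary_points(pairs, 3, 0, 9))  # [0, 4, 5, 9]
```

The dense region around keys 4 and 5 receives a narrow group while the sparse tails receive wide ones. If a single key holds more than 1/k of the total, two boundaries can coincide; a caller would need to decide how to treat such zero-width groups.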
Further, the formula for adding Laplace noise to the group height $h_g$ of group $b_g$ of the non-equidistant histogram h in step 8) is:
$$h'_g=h_g+\mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon_g}\right)$$
where $h'_g$ is the group height after Laplace noise is added, $\varepsilon_g$ is the privacy budget of group $b_g$, and Δf is the sensitivity;
$\varepsilon_g$ is calculated as:
Figure BDA0002228963020000038
Further, the privacy budget of group $b_g$ satisfies $\varepsilon_g\le\varepsilon$.
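A minimal Python sketch of the noise-addition step is given below. It assumes the standard Laplace mechanism, i.e. noise drawn from a Laplace distribution with scale Δf/ε_g, and takes the per-group budgets and the sensitivity as inputs rather than computing them, since their formulas are given as figures. The function names `laplace_noise` and `add_noise` and the sample values are hypothetical.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def add_noise(heights, budgets, sensitivity, rng=random):
    """Return the noisy heights h'_g = h_g + Lap(sensitivity / eps_g)."""
    if len(heights) != len(budgets):
        raise ValueError("one privacy budget per group is required")
    return [h + laplace_noise(sensitivity / eps, rng)
            for h, eps in zip(heights, budgets)]

# Hypothetical usage: three groups, budgets supplied by the caller.
rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = add_noise([0.02, 0.03, 0.01], [0.5, 1.0, 0.25], sensitivity=0.001, rng=rng)
```

A larger ε_g gives a smaller noise scale, so groups that receive more of the budget are perturbed less.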
Further, each value is either 1 or 0 and indicates whether the sensitive attribute value satisfies the condition of the statistical query function: the value is 1 when the condition is satisfied and 0 when it is not.
According to the above technical scheme, the differential privacy-based non-equidistant histogram publishing method has the following beneficial effects:
The differential privacy-based non-equidistant histogram publishing method disclosed by the invention comprises two main steps. First, existing histogram publishing methods under differential privacy divide the data into equidistant histograms without considering the sparsity of the data distribution, so the histograms cannot adequately reflect the distribution characteristics of the sample data and zero buckets may occur; to address these problems, the non-equidistant histogram is generated by dividing the ordinate equally to determine the demarcation points of each group on the abscissa. Second, a privacy budget ε_g is allocated to each group of the non-equidistant histogram according to its group distance, and random noise obeying a Laplace distribution is added to each group, which improves the privacy of the data in the non-equidistant histogram while preserving the accuracy of some long-range query results. The invention thus balances the privacy and usability of the data and preserves its distribution characteristics: the sparsity of the data distribution is fully considered when the histogram is published under differential privacy, so the published histogram accurately reflects the data distribution characteristics; Laplace noise is added to each group by applying differential privacy, so privacy is protected during data publishing; and a privacy budget is set for each group according to its group distance, so the accuracy of long-range queries is ensured to a certain extent.
It should be understood that all combinations of the foregoing concepts and additional concepts described in greater detail below can be considered as part of the inventive subject matter of this disclosure unless such concepts are mutually inconsistent.
The foregoing and other aspects, embodiments and features of the present teachings can be more fully understood from the following description taken in conjunction with the accompanying drawings. Additional aspects of the present invention, such as features and/or advantages of exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of specific embodiments in accordance with the teachings of the present invention.
Drawings
The drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Embodiments of various aspects of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a basic flow chart of the differential privacy-based non-equidistant histogram publishing method according to the present invention;
FIG. 2 is a non-equidistant histogram before noise is added, plotted according to the method of the present invention in an embodiment;
FIG. 3 is an equidistant histogram generated by a conventional method in an embodiment;
FIG. 4 is a non-equidistant noisy histogram of an embodiment, plotted by the method of the present invention with noise added using different privacy budgets;
FIG. 5 is a non-equidistant noisy histogram of an embodiment, plotted with noise added using the same privacy budget.
Detailed Description
In order to better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.
In this disclosure, aspects of the present invention are described with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. Embodiments of the present disclosure are not intended to include all aspects of the present invention. It should be appreciated that the various concepts and embodiments described above, as well as those described in greater detail below, may be implemented in any of numerous ways, as the disclosed concepts and embodiments are not limited to any one implementation. In addition, some aspects of the present disclosure may be used alone, or in any suitable combination with other aspects of the present disclosure.
Research on the prior art shows that when existing differential privacy protection techniques are applied to histogram publishing, most of the resulting histograms are equal-width histograms, which hide the distribution characteristics of the data to a certain extent and cannot accurately reflect them. The invention therefore provides a differential privacy-based non-equidistant histogram publishing method that combines the non-equidistant histogram with differential privacy to further optimize histogram publishing under differential privacy, so that the published histogram both meets the privacy protection requirement and accurately reflects the data distribution characteristics.
The invention discloses a differential privacy-based non-equidistant histogram release method, which specifically comprises the following steps:
1) selecting an original database table that contains at least one sensitive attribute column needing privacy protection, and setting the total privacy budget ε of the non-equidistant histogram to be published;
2) reading the N records of the original database table as key-value pairs <key, value> to obtain N key-value pairs, where key is the attribute value of a certain column of the database table and value is the value of a certain sensitive attribute column of the database table;
3) preprocessing the N key-value pairs: merging key-value pairs with the same key and accumulating their values, to generate n key-value pairs with distinct keys;
4) sorting the n key-value pairs by key in ascending order, denoted <key_(1), value_(1)>, <key_(2), value_(2)>, …, <key_(n), value_(n)>, where key_(1) < key_(2) < … < key_(n);
5) taking the key as the abscissa of the non-equidistant histogram, setting the upper bound of the key to Max and the lower bound to Min, and dividing the value range [Min, Max] of the key into k groups;
6) determining the set of boundary points on the abscissa using an empirical distribution function and its generalized inverse function, and recording the left and right boundary points and the group distance of each of the k groups on the abscissa;
7) counting, group by group, the frequency sum of the keys falling into each abscissa group, calculating the frequency and group height of each group, and constructing the non-equidistant histogram h;
8) according to the definitions of global sensitivity and privacy budget in differential privacy, adding Laplace noise to the group height of each group of the non-equidistant histogram h, and publishing the differentially private non-equidistant histogram h'.
The value values of the sensitive attribute column in the step 2) are 1 and 0, which indicates whether the sensitive attribute value meets the condition of the statistical query function; when the sensitive attribute value meets the condition of the statistical query function, the value is 1; and when the sensitive attribute value does not accord with the condition of the statistical query function, the value is 0.
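The 0/1 encoding of the sensitive attribute can be sketched as below. The row layout, the column names `age` and `diagnosis`, and the predicate are hypothetical; the sketch only shows how a query condition turns each record into a (key, value) pair with value 1 or 0.

```python
def to_key_value_pairs(rows, key_column, predicate):
    """Map each row (a dict) to (key, value), where value is 1 if the row
    satisfies the query condition `predicate` and 0 otherwise."""
    return [(row[key_column], 1 if predicate(row) else 0) for row in rows]

# Hypothetical rows of a patient table.
rows = [
    {"age": 31, "diagnosis": "positive"},
    {"age": 31, "diagnosis": "negative"},
    {"age": 47, "diagnosis": "positive"},
]
pairs = to_key_value_pairs(rows, "age", lambda r: r["diagnosis"] == "positive")
print(pairs)  # [(31, 1), (31, 0), (47, 1)]
```

Summing the values per key then gives, for each age, the number of records that satisfy the query condition.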
In step 5), the Moore formula is used to divide the n key values into k groups, namely:
The parameter C takes the value 1, 2 or 3 and influences the number of groups: when C is 1 the number of groups is relatively small, and when C is 2 the grouping is more reasonable. The upper and lower bounds of the key are set so that the data used to draw the non-equidistant histogram lie in a reasonable range, taking into account data with special characteristics, for example age data being non-negative. Whether a given value of the constant C is reasonable also depends on the characteristics of the specific data. For example, for the distribution of the number of HIV cases by age in medical data, age is limited in size, so the upper bound generally does not exceed 100 and the lower bound is 0.
The boundary points on the abscissa of the histogram in step 6) are determined by the empirical distribution function, its generalized inverse function and the overall distribution characteristics of the data, and are directly related to the sparsity of the data. The empirical distribution function is denoted $F_n(x)$, has range [0,1], and is defined as:
Figure BDA0002228963020000071
where 1 ≤ m ≤ n-1, j = 0,1,…,n, and $key_{(m)}$, $key_{(m+1)}$ denote the m-th and (m+1)-th records in the sorted key-value sequence; the range [0,1] of $F_n(x)$ is divided equally into k groups, with group distance
$$\frac{1}{k}$$
and the interval of the g-th group is
$$\left[\frac{g-1}{k},\ \frac{g}{k}\right],\quad g=1,2,\dots,k$$
The generalized inverse function of the empirical distribution function $F_n(x)$ is denoted
$$F_n^{-1}(x)=\inf\{\,y : F_n(y)\ge x\,\}$$
that is, the greatest lower bound of the y satisfying $F_n(y)\ge x$, where $key_{(1)}\le y\le key_{(n)}$.
The set of all boundary points on the abscissa of the non-equidistant histogram is defined as $B_q$:
$$B_q=F_n^{-1}\!\left(\frac{q}{k}\right),\quad q=0,1,\dots,k$$
For any group $b_g$ on the abscissa, its left boundary is denoted $B_{gL}$ and its right boundary $B_{gR}$; then $B_{gL}=B_{g-1}$ and $B_{gR}=B_g$. The group distance of group $b_g$ on the abscissa is denoted $\Delta B_g$, with $\Delta B_g=B_g-B_{g-1}$.
Thus, in step 7), for any abscissa group $b_g$ with interval $[B_{gL}, B_{gR})$, g = 1,2,…,k, the frequency sum of the keys falling into the interval, i.e. the cumulative sum of the values corresponding to all keys falling into the interval, is denoted $n_g$, the frequency is denoted $f_g$, and the group height is denoted $h_g$; then
$$n_g=\sum_{key_{(i)}\in[B_{gL},\,B_{gR})} value_{(i)} \qquad (1\text{-}4)$$
$$f_g=\frac{n_g}{N} \qquad (1\text{-}5)$$
$$h_g=\frac{f_g}{\Delta B_g} \qquad (1\text{-}6)$$
According to the definitions of sensitivity and privacy budget in differential privacy, $\Delta f=\max_{D,D'}\lVert f(D)-f(D')\rVert_1$, where D denotes the database table with all of its records and D′ denotes the database table with one record removed; in the present invention D contains N records and D′ contains N-1 records, and the sensitivity is
Figure BDA0002228963020000078
Wherein, the detailed calculation process is as follows:
Figure BDA0002228963020000082
Given the total privacy budget ε, the formula for adding Laplace noise to the group height of any group $b_g$ of the non-equidistant histogram h is:
$$h'_g=h_g+\mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon_g}\right)$$
where $h'_g$ is the group height after Laplace noise is added and $\varepsilon_g$ is the privacy budget of group $b_g$. The magnitude of the added Laplace noise is determined by the group distance $\Delta B_g$, which reflects the sparsity of the data: groups with a small group distance contain denser data. In long-range queries the accuracy of the result degrades as noise accumulates across groups, so less noise should be added to groups with a smaller group distance than to groups with a larger one.
By the parallel composition property of differential privacy, since the data of the groups are mutually disjoint, the privacy budget $\varepsilon_g$ of any group on the abscissa satisfies $\varepsilon_g\le\varepsilon$; $\varepsilon_g$ is therefore calculated as:
Figure BDA0002228963020000084
The technical scheme disclosed by the invention is better suited to database tables with a large data volume and pronounced data stratification than to tables with a small data volume: when the total number of records is in the tens of thousands or more and the grouping attribute can take hundreds or more distinct values, the groups are distinct and the data distribution characteristics are reflected more reasonably.
The differential privacy-based non-equidistant histogram publishing method is described in detail below with reference to the specific embodiment shown in the drawings, in which whether a patient in the original database table is infected with AIDS is taken as the sensitive attribute column.
The technical scheme of the invention is explained concretely by taking as an example the number of AIDS cases in the 0-99 age range in the results of the 2010 Chinese medical and health survey. For ease of illustration, a patient information database table of the form shown in Table 1 is given:
Table 1 Patient information database table
Figure BDA0002228963020000091
According to table 1, the specific implementation steps are as follows:
First, the data are preprocessed: the real data table, which has the form of Table 1, is read as key-value pairs <key, value>; the total number of patients aged 0 to 99 is N = 15982. Health data always contain a large amount of private information, and if a data table of the form of Table 1 were published directly, an attacker could obtain the specific information of a record by joining and correlating multiple tables. Based on the 2010 statistical table of AIDS case numbers by age group published by the public health science data center, random AIDS case numbers are generated within each age group on the premise that the whole approximately follows a normal distribution; after merging and sorting, the order statistics give the key-value pair sorting table shown in Table 2:
Table 2 Key-value pair sorting table
Figure BDA0002228963020000092
Second, the parameters are set: in this embodiment the upper bound of the key is Max = 99, the lower bound is Min = 0, n = 100 and C = 2:
Figure BDA0002228963020000101
Third, the ordinate grouping points are determined: the ordinate is divided into groups on the interval [0,1]; in the present embodiment it is divided equally into 14 groups, with a group distance of
$$\frac{1}{14}\approx 0.071$$
The ordinate grouping points are: 0,0.071,0.142,0.213,0.284,0.355,0.426,0.497,0.568,0.639,0.710,0.781,0.852,0.923,0.994,1.
Fourth, the abscissa boundary points are determined by taking the key as the abscissa of the histogram and applying the empirical distribution function and its generalized inverse function. In the present embodiment, the abscissa grouping points obtained according to formulas (1-2) and (1-3) are: 0, 24, 28, 30, 33, 35, 38, 40, 42, 45, 49, 53, 59, 66, 99; the group distances are: 24, 4, 2, 3, 2, 3, 2, 2, 3, 4, 4, 6, 7, 33; and the intervals of the groups on the abscissa are: [0,24), [24,28), [28,30), [30,33), [33,35), [35,38), [38,40), [40,42), [42,45), [45,49), [49,53), [53,59), [59,66), [66,99). According to formula (1-4), the frequency sums of the group intervals are: 940, 1264, 810, 1300, 913, 1487, 1059, 983, 1222, 1362, 1008, 1348, 1139, 1153. According to formula (1-5), the frequencies of the groups are: 0.0588, 0.0791, 0.0507, 0.0813, 0.0571, 0.0930, 0.0663, 0.0615, 0.0765, 0.0852, 0.0631, 0.0843, 0.0713, 0.0721. According to formula (1-6), the group heights are: 0.00245, 0.019775, 0.02535, 0.02710, 0.02855, 0.03100, 0.03315, 0.03075, 0.0255, 0.0213, 0.015775, 0.01405, 0.01019, 0.00218.
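The figures of this embodiment can be checked with a short Python sketch. It uses only the boundary points, frequency sums and N stated above, and assumes that the frequency is the frequency sum divided by N and that the group height is the frequency divided by the group distance, which is what the listed numbers imply. The variable names are illustrative.

```python
# Values taken from the embodiment above.
bounds = [0, 24, 28, 30, 33, 35, 38, 40, 42, 45, 49, 53, 59, 66, 99]
counts = [940, 1264, 810, 1300, 913, 1487, 1059, 983, 1222, 1362, 1008, 1348, 1139, 1153]
N = 15982

# Group distance: difference of consecutive boundary points.
widths = [right - left for left, right in zip(bounds, bounds[1:])]
# Frequency: frequency sum divided by the total number of records.
freqs = [n / N for n in counts]
# Group height: frequency divided by group distance.
heights = [f / w for f, w in zip(freqs, widths)]

print(widths)               # [24, 4, 2, 3, 2, 3, 2, 2, 3, 4, 4, 6, 7, 33]
print(round(freqs[0], 4))   # 0.0588
peak = heights.index(max(heights))
print(bounds[peak], bounds[peak + 1])  # 38 40
```

The heights computed here differ from the listed ones in the last digits because the listed heights appear to be derived from frequencies already rounded to four decimals.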
In this embodiment, to illustrate the feasibility of the invention, the total privacy budget ε is set to 1, 0.1, 0.01, ln 2, ln 3 and ln 4 in turn. To show that the invention preserves a certain usability while improving the privacy of each group of data, the experiment takes the privacy budget of each group to be a fraction of the total budget according to the weight of that group; the weights of the group privacy budgets ε_g are:
According to formula (1-7), the group heights before Laplace noise is added are, in order: 0.00245, 0.019775, 0.02535, 0.02710, 0.02855, 0.03100, 0.03315, 0.03075, 0.0255, 0.0213, 0.015775, 0.01405, 0.01019, 0.00218.
According to formula (1-8), the average group heights over 10 additions of Laplace noise for each ε are shown in Table 3.
Table 3 group heights after laplacian noise addition using different privacy budget values
Figure BDA0002228963020000111
The noise-free non-equidistant histogram drawn by the method is shown in FIG. 2. For comparison, and to highlight the advantages of the method, the equidistant histogram generated by the conventional method is shown in FIG. 3. Comparing FIG. 2 with FIG. 3 shows that the conventional equidistant histogram can exhibit "heavy tailing" and "zero buckets": in the present embodiment the number of patients aged 80-88, 88-96 and 96-99 is negligible relative to the total number of patients, so, as the histogram shows, several intervals have a sample frequency of 0, i.e. there are multiple "zero buckets". The division in FIG. 2 is more reasonable than that in FIG. 3 and better reflects the distribution characteristics of the data. In this embodiment the number of patients aged 24-49 is large and differs from one age to the next; FIG. 2 shows that patients are most concentrated at ages 38-40, where the histogram attains its maximum, which agrees with the distribution of the original data. In FIG. 3, by contrast, the histogram for ages 24-48 is too flat to reflect how the number of patients varies within that range. The method therefore reflects the trend and distribution characteristics of the data more accurately.
Because a histogram that accurately reflects the trend and distribution characteristics of the data can leak privacy, the invention draws the non-equidistant noisy histogram for ε = ln 3 shown in FIG. 4. To illustrate the usability of the data after noise is added, the similarity
Figure BDA0002228963020000121
Performing correlation comparison between the histogram h before adding noise and the histogram h' after adding noise, whereinTo add the high average of the set of histograms before the noise,as the mean value, h, of the noise-added histogram groupgIs the g-th group height, h 'of the noise-front histogram'gFor the g-th group height of the histogram after noise addition, g is 1,2, … k.
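The similarity formula itself is available here only as an image. The terms it is defined over (the two group heights and their two means) are consistent with a Pearson-type correlation coefficient, so the sketch below assumes that form. This is an assumption and may differ from the formula in the figure; the function name and the sample heights are hypothetical.

```python
import math

def similarity(h, h_noisy):
    """Pearson-type correlation between two lists of group heights (assumed form)."""
    k = len(h)
    mean_h = sum(h) / k
    mean_n = sum(h_noisy) / k
    cov = sum((a - mean_h) * (b - mean_n) for a, b in zip(h, h_noisy))
    var_h = sum((a - mean_h) ** 2 for a in h)
    var_n = sum((b - mean_n) ** 2 for b in h_noisy)
    return cov / math.sqrt(var_h * var_n)

h = [12.0, 40.0, 7.5, 3.0]         # hypothetical heights before noise
h_noisy = [12.6, 39.1, 8.4, 2.2]   # hypothetical heights after noise
print(similarity(h, h_noisy))
```

Under this assumed form, a value close to 1 means the noisy histogram preserves the shape of the original one.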
In this embodiment, groups with different group distances use different privacy budgets, and the similarity between the resulting noisy histogram and the noise-free histogram varies markedly with ε. The similarity d(h, h') between h and h' for each value of ε is shown in Table 4.
Table 4 Similarity between the noisy and noise-free histograms generated with different privacy budgets ε
Figure BDA0002228963020000124
Table 5 Group heights after Laplace noise addition using the same privacy budget value
Figure BDA0002228963020000125
Figure BDA0002228963020000131
Table 6 Similarity between the histogram h'' generated using the same privacy budget value for groups of different group distances and the noise-free histogram h
Figure BDA0002228963020000132
In addition, to demonstrate that the invention preserves the usability of each group while improving its privacy, the experiments also added noise to every group using the same privacy budget value; the results are shown in Table 5. The similarity d(h, h'') between the histogram h'' generated using the same privacy budget value for groups of different group distances and the noise-free histogram h is shown in Table 6.
Fig. 4 and Fig. 5 show clearly that the histogram generated by adding noise with a different privacy budget to each group according to its group distance, and the histogram generated by adding noise with the same privacy budget, differ little in their similarity to the histogram before noise addition. This indicates that adding noise with a different privacy budget to each group not only protects the privacy of grouped data of different sparsity but also preserves the overall distribution characteristics of the data.
Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims (6)

1. A non-equidistant histogram publishing method based on differential privacy, characterized by comprising the following steps:
1) selecting an original database table that contains at least one sensitive attribute column requiring privacy protection, and setting the total privacy budget ε of the non-equidistant histogram to be published;
2) reading the N records of the original database table in the form of key-value pairs <key, value> to obtain N key-value pairs, wherein key represents the attribute value of a column in the database table and value represents the value of a sensitive attribute column in the database table;
3) processing the N key-value pairs: merging the key-value pairs that have the same key and accumulating their values, to generate n key-value pairs with distinct keys;
4) sorting the n key-value pairs in ascending order of key, denoted $\langle key_{(1)}, value_{(1)}\rangle, \langle key_{(2)}, value_{(2)}\rangle, \dots, \langle key_{(n)}, value_{(n)}\rangle$, where $key_{(1)} < key_{(2)} < \dots < key_{(n)}$;
5) taking the key as the abscissa of the non-equidistant histogram, setting the upper bound of the key to Max and the lower bound to Min, and dividing the value range [Min, Max] of the key into k groups;
6) determining the set of boundary points on the abscissa using an empirical distribution function and its generalized inverse function, and recording the left and right boundary points and the corresponding group distance of any one of the k groups on the abscissa;
7) counting, in order, the sum of the frequencies of the keys falling into each group on the abscissa, calculating the frequency and group height of each group, and constructing the non-equidistant histogram h;
8) adding Laplace noise to the group height of any group in the non-equidistant histogram h according to the definitions of global sensitivity and privacy budget in differential privacy, and publishing the differentially private non-equidistant histogram h'.
2. The differential privacy-based non-equidistant histogram publishing method according to claim 1, wherein the empirical distribution function in step 6) is denoted $F_n(x)$, has value range [0, 1], and is defined as:
Figure FDA0002228963010000011
wherein $1 \le m \le n-1$, $j = 0, 1, \dots, n$, and $key_{(m)}$ and $key_{(m+1)}$ denote the m-th and (m+1)-th records, respectively, in the sorted sequence of key-value pairs;
the value range [0, 1] of $F_n(x)$ is divided equally into k groups, the group distance of each group being
Figure FDA0002228963010000021
and any grouping interval is given by
Figure FDA0002228963010000022
Figure FDA0002228963010000023
The generalized inverse function of the empirical distribution function $F_n(x)$ is denoted
Figure FDA0002228963010000024
which denotes the greatest lower bound of the values y satisfying $F_n(y) \ge x$, where $key_{(1)} \le y \le key_{(n)}$;
the set of all boundary points on the abscissa of the non-equidistant histogram is defined as $B_g$:
Figure FDA0002228963010000026
For any grouping $b_g$ on the abscissa, its left boundary is denoted $B_{gL}$ and its right boundary $B_{gR}$, so that:
$B_{gL} = B_{g-1}, \quad B_{gR} = B_g$;
the group distance of grouping $b_g$ is denoted $\Delta B_g$, giving:
$\Delta B_g = B_g - B_{g-1}, \quad g = 1, 2, \dots, k$.
3. The differential privacy-based non-equidistant histogram publishing method according to claim 2, wherein in step 8) the formula for adding Laplace noise to the group height $h_g$ of grouping $b_g$ of the non-equidistant histogram h is:
Figure FDA0002228963010000027
wherein $h'_g$ denotes the group height after Laplace noise is added, $\varepsilon_g$ denotes the privacy budget of grouping $b_g$, and $\Delta f$ is the sensitivity.
4. The differential privacy-based non-equidistant histogram publishing method according to claim 3, wherein the privacy budget $\varepsilon_g$ is calculated as:
Figure FDA0002228963010000028
5. The differential privacy-based non-equidistant histogram publishing method according to claim 3, wherein the privacy budget of grouping $b_g$ satisfies $\varepsilon_g \le \varepsilon$.
6. The differential privacy-based non-equidistant histogram publishing method according to claim 1, wherein value takes the values 1 and 0, indicating whether the sensitive attribute value satisfies the condition of the statistical query function: value is 1 when the sensitive attribute value satisfies the condition, and 0 when it does not.
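Steps 2) to 7) of claim 1, together with the boundary construction of claim 2 and the 1/0 values of claim 6, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the formula images are not reproduced here, so the empirical distribution function is assumed to be weighted by each key's accumulated value, the group height is assumed to be the relative frequency divided by the group distance, and all boundaries are assumed to be distinct. The function names and the records are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def merge_and_sort(pairs):
    """Steps 3)-4): accumulate the values of equal keys, then sort by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def ecdf(table, x):
    """Assumed value-weighted F_n(x): share of the total value on keys <= x."""
    total = sum(v for _, v in table)
    return Fraction(sum(v for key, v in table if key <= x), total)

def generalized_inverse(table, p):
    """Smallest key y with F_n(y) >= p, searched over the sorted keys."""
    for key, _ in table:
        if ecdf(table, key) >= p:
            return key
    return table[-1][0]

def boundaries(table, k):
    """B_0 = Min and B_g = inverse of F_n at g/k, for g = 1..k."""
    first_key = table[0][0]
    return [first_key] + [generalized_inverse(table, Fraction(g, k))
                          for g in range(1, k + 1)]

def group_heights(table, bounds):
    """Step 7): assumed height = relative frequency / group distance."""
    total = sum(v for _, v in table)
    heights = []
    for g in range(1, len(bounds)):
        left, right = bounds[g - 1], bounds[g]
        # Groups are (left, right]; the first group also includes Min itself.
        freq = sum(v for key, v in table
                   if left < key <= right or (g == 1 and key == left))
        heights.append(freq / total / (right - left))
    return heights

# Hypothetical records: (age, 1 if the record satisfies the query, else 0)
records = [(25, 1), (38, 1), (38, 1), (39, 1), (39, 1),
           (40, 1), (40, 0), (47, 1), (60, 1), (99, 1)]
table = merge_and_sort(records)
bounds = boundaries(table, 3)
heights = group_heights(table, bounds)
print(bounds, heights)
```

With these records the three groups each hold a third of the total value but have group distances 13, 2 and 59, so the group over 38-40 is by far the tallest.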
CN201910961197.8A 2019-10-11 2019-10-11 Non-equidistant histogram publishing method based on differential privacy Active CN110795758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910961197.8A CN110795758B (en) 2019-10-11 2019-10-11 Non-equidistant histogram publishing method based on differential privacy

Publications (2)

Publication Number Publication Date
CN110795758A true CN110795758A (en) 2020-02-14
CN110795758B CN110795758B (en) 2021-07-30

Family

ID=69440286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910961197.8A Active CN110795758B (en) 2019-10-11 2019-10-11 Non-equidistant histogram publishing method based on differential privacy

Country Status (1)

Country Link
CN (1) CN110795758B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218837A (en) * 2013-04-22 2013-07-24 北京航空航天大学 Unequal class interval histogram rendering method based on empirical distribution function
CN104809408A (en) * 2015-05-08 2015-07-29 中国科学技术大学 Histogram release method based on difference privacy
CN105046160A (en) * 2015-07-21 2015-11-11 东华大学 Histogram-based data flow-oriented differential privacy publishing method
US20180349384A1 (en) * 2015-11-02 2018-12-06 LeapYear Technologies, Inc. Differentially private database queries involving rank statistics
CN109492047A (en) * 2018-11-22 2019-03-19 河南财经政法大学 A kind of dissemination method of the accurate histogram based on difference privacy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
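张浩铭 et al.: "优化结构下的差分隐私直方图发布" [Differential privacy histogram publishing under an optimized structure], 《万方数据库》 (Wanfang Database) *
邵波: "差分隐私直方图发布方法的研究" [Research on differential privacy histogram publishing methods], 《万方数据库》 (Wanfang Database) *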

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111628974A (en) * 2020-05-12 2020-09-04 Oppo广东移动通信有限公司 Differential privacy protection method and device, electronic equipment and storage medium
CN111737744A (en) * 2020-06-22 2020-10-02 安徽工业大学 Data publishing method based on differential privacy
CN112182638A (en) * 2020-08-20 2021-01-05 中国海洋大学 Histogram data publishing method and system based on localized differential privacy model
CN112182638B (en) * 2020-08-20 2022-09-09 中国海洋大学 Histogram data publishing method and system based on localized differential privacy model
CN113486402A (en) * 2021-07-27 2021-10-08 平安国际智慧城市科技股份有限公司 Numerical data query method, device, equipment and storage medium
CN113486402B (en) * 2021-07-27 2024-06-04 平安国际智慧城市科技股份有限公司 Numerical data query method, device, equipment and storage medium
CN113672979A (en) * 2021-08-19 2021-11-19 安徽工业大学 Method and device for issuing differential privacy non-equidistant histogram based on barrel structure division
CN113672979B (en) * 2021-08-19 2024-02-09 安徽工业大学 Differential privacy non-equidistant histogram release method and device based on barrel structure division
CN113672956A (en) * 2021-08-20 2021-11-19 山东大学 Localized differential privacy protection method and system for numerical distribution calculation
CN113672956B (en) * 2021-08-20 2023-09-22 山东大学 Localized differential privacy protection method and system for numerical distribution calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant