CN117235800B - Data query protection method of personalized privacy protection mechanism based on secret specification


Info

Publication number
CN117235800B
Authority
CN
China
Prior art keywords
query
attribute
mean
sensitive
dataset
Prior art date
Legal status
Active
Application number
CN202311416556.4A
Other languages
Chinese (zh)
Other versions
CN117235800A (en)
Inventor
胡春强
陈佳俊
张今革
蔡斌
夏晓峰
桑军
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202311416556.4A
Publication of CN117235800A
Application granted
Publication of CN117235800B

Landscapes

  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data query protection method of a personalized privacy protection mechanism based on secret specifications, which comprises the following steps: the access device sends a query instruction to the data storage device; the data storage device performs: analyzing the query instruction to obtain a query function and a query attribute; extracting a query attribute dataset; dividing the query attribute dataset into a sensitive subset and a non-sensitive subset based on the set of user secret specifications; obtaining a mean query result of the query attribute dataset according to a pre-constructed Laplace mechanism based on the secret specification; obtaining a median query result of the query attribute dataset according to a pre-constructed exponential mechanism based on the secret specification; and issuing the mean query result and/or the median query result. The secret specification accurately defines the scope of privacy protection and the protected entities, avoids the strict constraint of treating all attribute records of the data as sensitive, provides less data distortion and more accurate data query results, and achieves a balance between privacy protection and data utility.

Description

Data query protection method of personalized privacy protection mechanism based on secret specification
Technical Field
The invention relates to the technical field of data security, in particular to a data query protection method of a personalized privacy protection mechanism based on secret specifications.
Background
With the advent of the big data age, a large amount of sensitive data, including personal identification information, medical records and financial transactions, is stored in databases. To support research, business decisions and government policy formulation, the demand for statistical queries over these data is growing rapidly. Statistical queries such as the mean and the median are important tools for understanding data distribution and trends, and they provide key insights for many industries. For example, a researcher may need to calculate the average age of patients in a medical study, a financial institution may need the median credit score of its customers, and the average age or median annual income may be obtained from a census dataset. Such queries typically involve sensitive information and therefore require effective privacy protection.
Differential privacy (DP) is becoming the gold standard for privacy protection because of its theoretical provability and its robustness against adversaries with prior knowledge. It can protect individual privacy while meeting data analysis requirements and can be applied to common data analysis operations such as mean and median queries, ensuring that sensitive data are properly protected. Legal frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, the Family Educational Rights and Privacy Act (FERPA) and the General Data Protection Regulation (GDPR) of the European Union aim to ensure that organizations and individuals adhere to the principles of transparency, fairness and security when collecting, processing and sharing personal information. In addition, recent privacy regulations in California, including the California Consumer Privacy Act and the California Privacy Rights Act, strengthen individuals' control over their personal information and require transparency from organizations when collecting and using data. A common goal of these privacy laws and standards is to protect individual privacy, to give individuals rights over how their data are used, and to manage and limit the sharing and processing of their data. From the perspective of legislation and policy, users have the right to control their own privacy and exhibit personalized privacy requirements. The concept of personalization is rooted in each person's unique cultural background, personal privacy preferences or social factors, and reflects the differences in privacy expectations among users.
However, when it comes to an individual's control over the sensitivity of his or her own data, traditional differential privacy tends to be too strict: it generally treats all data in the dataset that relate to an individual as inherently sensitive, whereas in practice, because of differences in individual privacy preferences and attitudes, not all information related to an individual is regarded as sensitive or requires the same level of protection. Consider a scenario such as an intelligent building management system that processes a large amount of sensor data and personal information, including user location details and health indicators. Notably, individuals may view the sensitivity of these attributes differently. Some users may consider their location information sensitive but not their health data; others may treat health data as sensitive while treating location information as non-sensitive; still others may regard both attributes as sensitive, or both as insensitive. Under traditional differential privacy protection, users therefore cannot independently define their own secret specifications, the scope of data protection cannot be accurately delimited, and the query results obtained during data queries suffer from large data distortion and low utility.
A census dataset usually contains information such as age, gender, annual income, telephone number and health status. Information such as annual income, health status and age involves the personal privacy of users, and different users have different privacy-setting requirements: some individuals may not want to disclose their specific annual income, while others may not want to disclose their age or health status, so as to prevent leaked information from being exploited by criminals or used for unwanted marketing.
Disclosure of Invention
The invention aims to solve the above technical problems in the prior art, and provides a data query protection method of a personalized privacy protection mechanism based on secret specifications and a census dataset query protection method.
To achieve the above object of the present invention, according to a first aspect of the present invention, there is provided a data query protection method of a personalized privacy protection mechanism based on a secret specification, comprising: the access device sends a query instruction to the data storage device; the data storage device performs: receiving and analyzing a query instruction to obtain a query function and a query attribute; extracting a query attribute dataset from the set dataset; acquiring a user secret specification set, and dividing a query attribute data set into a sensitive subset and a non-sensitive subset based on the user secret specification set; when the query function is a mean query function, a mean query result of the query attribute dataset is obtained according to a pre-constructed Laplacian mechanism based on a secret specification; when the query function is a median query function, a median query result of the query attribute dataset is obtained according to a pre-constructed exponential mechanism based on a secret specification; and issuing a mean query result and/or a median query result of the query attribute dataset to the access device.
The beneficial effects of the above technical scheme are as follows: the secret specification allows an individual user to set which attribute records are sensitive and which are not, which helps to accurately define the scope of privacy protection and the protected entities, avoids the strict constraint of the traditional differential privacy method that treats all attribute records of the data as sensitive, makes privacy protection more flexible and personalized, and at the same time provides less data distortion and more accurate data query results. A Laplace mechanism based on the secret specification (SSLM) is proposed and applied to the mean query, and an exponential mechanism based on the secret specification (SSEM) is proposed and applied to the median query, improving the accuracy of data analysis while minimizing data distortion, especially when a large portion of the data is non-sensitive, thereby achieving a balance between privacy protection and data utility. Compared with the most advanced differential privacy framework mechanisms, by exploiting non-sensitive data SSLM improves utility by about 14 times for the mean query and SSEM improves utility by about 6 times for the median query.
To achieve the above object of the present invention, according to a second aspect of the present invention, there is provided a census data set query protection method, including: the access device sends a query instruction to the data storage device which stores the census data set; the data storage device performs: receiving and analyzing a query instruction to obtain a query function and a query attribute, wherein the query attribute comprises age and annual income; extracting a query attribute dataset from the census dataset; acquiring a user secret specification set, and dividing a query attribute data set into a sensitive subset and a non-sensitive subset based on the user secret specification set; when the query function is a mean query function, a mean query result of the query attribute dataset is obtained according to a pre-constructed Laplacian mechanism based on a secret specification; when the query function is a median query function, a median query result of the query attribute dataset is obtained according to a pre-constructed exponential mechanism based on a secret specification; and issuing a mean query result and/or a median query result of the query attribute dataset to the access device.
The beneficial effects of the above technical scheme are as follows: the secret specification allows an individual user to set which attribute records of the census dataset are sensitive and which are not, which helps to accurately define the scope of privacy protection and the protected entities, avoids the strict constraint of the traditional differential privacy method that treats all attribute records of the census dataset as sensitive, makes privacy protection more flexible and personalized, and at the same time provides less data distortion and more accurate data query results. A Laplace mechanism based on the secret specification (SSLM) is proposed and applied to the mean query of the census dataset, and an exponential mechanism based on the secret specification (SSEM) is proposed and applied to the median query of the census dataset, improving data query accuracy while minimizing data distortion, especially when a large portion of the data is non-sensitive, thereby achieving a balance between privacy protection and data utility. Compared with the most advanced differential privacy framework mechanisms, by exploiting non-sensitive data SSLM improves utility by about 14 times for the mean query and SSEM improves utility by about 6 times for the median query.
Drawings
FIG. 1 is a flow chart of a method for protecting data query of a personalized privacy protection mechanism based on secret specifications in a preferred embodiment of the invention;
FIG. 2 is a first example of the present invention calculating a median score function value;
FIG. 3 is a second example of the present invention calculating a median score function value;
FIG. 4 is a flow chart of a census dataset query protection method in another preferred embodiment of the present invention;
FIG. 5 is a diagram showing the variation of the RMSE of SSLM in the mean query results as the proportion of non-sensitive attributes changes, in another preferred embodiment of the present invention;
FIG. 6 is a diagram showing the variation of the RMSE of SSLM in the mean value query result as the degree of privacy protection varies in another preferred embodiment of the present invention;
FIG. 7 is a graph showing the variation of the RMSE of SSEM in the median query results as the proportion of non-sensitive attributes changes, in another preferred embodiment of the present invention;
fig. 8 is a diagram showing RMSE variation of SSEM in median query results as the degree of privacy protection varies in another preferred embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, unless otherwise specified and defined, it should be noted that the terms "mounted," "connected," and "coupled" are to be construed broadly, and may be, for example, mechanical or electrical, or may be in communication with each other between two elements, directly or indirectly through intermediaries, as would be understood by those skilled in the art, in view of the specific meaning of the terms described above.
From the perspective of legislation and policy, users have the right to control their own privacy and exhibit personalized privacy requirements. The concept of personalization is rooted in each person's unique cultural background, personal privacy preferences or social factors, and reflects the differences in privacy expectations among users. Existing research on personalized differential privacy mainly focuses on personalizing the privacy budget within the differential privacy (DP) framework, that is, an individual specifies the strength of privacy protection for his or her own data but cannot specify, according to his or her privacy requirements, which data are sensitive and need protection and which are not. When it comes to an individual's control over the sensitivity of his or her own data, traditional differential privacy is often too strict: it usually regards all data related to the individual in the dataset as inherently sensitive, which leads to problems such as large data distortion, increased computational cost, and low accuracy when the query results are accessed for later analysis. Based on this, the present application seamlessly integrates the secret specification into the differential privacy framework and introduces secret-specification-based differential privacy (SSDP).
The relevant definitions and mechanisms of the differential privacy based on Secret Specifications (SSDP) proposed by the present application are explained and illustrated below.
The set dataset is the dataset to be protected and is denoted D_0 = (r_1, …, r_i, …, r_n), where n denotes the number of users and the user index i ∈ [1, n]. r_i denotes the record of user u_i, U = {u_1, …, u_n} denotes the user set, and r_i is a k-dimensional variable r_i = (r_i[A_1], …, r_i[A_k]), where k denotes the attribute dimension and r_i takes values in the attribute domain. r_i[A_j] denotes the record of user u_i on attribute A_j, with attribute index j ∈ [1, k]. Each attribute record r_i[A_j] can be set as a sensitive record or a non-sensitive record according to the user's secret specification, and sensitive records and non-sensitive records are variables that can take different record values. The set dataset may be a census dataset.
Definition 1 (secret specification): the secret specification of user u_i is formally defined as a binary function S_i: r_i → {0,1}^k, where r_i is the k-dimensional attribute record associated with user u_i. The function S_i determines the sensitivity classification of each attribute A_j (1 ≤ j ≤ k) in the record r_i. If the secret specification value of attribute A_j is S_i(A_j) = 1, the record of attribute A_j is regarded as a sensitive record; if S_i(A_j) = 0, the record of attribute A_j is regarded as a non-sensitive record.
By Definition 1, the secret specification specified by each user divides his or her record (containing the k attributes) into two sub-records. Specifically, the function S_i divides the attribute values in record r_i into two sub-records: the sensitive sub-record r_i^s and the non-sensitive sub-record r_i^ns. In particular, when k = 1, i.e., each record is associated with only one attribute, the users' secret specification set S divides the set dataset D_0 into two subsets, namely the sensitive data subset D_s and the non-sensitive data subset D_ns.
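For illustration only (not part of the claimed method), the splitting of a single k-dimensional record according to Definition 1 can be sketched as follows. The dictionary representation, the function name split_record and the conservative default for unspecified attributes are assumptions made for this sketch.

```python
from typing import Dict, Tuple

def split_record(record: Dict[str, float],
                 secret_spec: Dict[str, int]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split one user's record r_i into a sensitive sub-record and a non-sensitive
    sub-record according to the binary secret specification S_i: attribute -> {0, 1}."""
    # Attributes missing from the specification are treated as sensitive here,
    # a conservative default chosen only for this sketch.
    sensitive = {a: v for a, v in record.items() if secret_spec.get(a, 1) == 1}
    non_sensitive = {a: v for a, v in record.items() if secret_spec.get(a, 1) == 0}
    return sensitive, non_sensitive

# Hypothetical example: a user who marks annual income as sensitive but age as non-sensitive.
r_i = {"age": 34, "annual_income": 52000}
S_i = {"age": 0, "annual_income": 1}
r_s, r_ns = split_record(r_i, S_i)   # r_s = {"annual_income": 52000}, r_ns = {"age": 34}
```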
Definition 2 (S-adjacent records): let S = {S_1, …, S_n} denote the set of secret specifications associated with the user set U. A record r_i of user u_i and a record r_i' are S_i-adjacent if, for any j, r_i[A_j] ≠ r_i'[A_j] holds only when S_i(A_j) = 1, i.e., the two records may differ only in the attributes marked as sensitive.
Definition 3 (S-adjacent datasets): dataset D and dataset D' are S-adjacent if and only if one record in D differs from the corresponding record in D', and the two differing records are S_i-adjacent.
Definition 4 (S-sensitivity): for any pair of S-adjacent datasets D and D', the global sensitivity of the query function f under the secret specification S is denoted Δf_S and is measured by the L_1 norm: Δf_S = max_{S-adjacent D, D'} ||f(D) - f(D')||_1.
Starting from the privacy rights of users, the present application realizes personalization of privacy requirements by allowing users to independently specify the sensitivity of their own records, and proposes secret-specification-based differential privacy (Secret Specification-based Differential Privacy, SSDP).
Definition 5 (secret-specification-based differential privacy SSDP): under the secret specification set S, a random algorithm M satisfies ε-SSDP if, for any pair of S-adjacent datasets D and D' and any subset O of Range(M), the algorithm M satisfies: Pr[M(D) ∈ O] ≤ e^ε · Pr[M(D') ∈ O].
Range(M) denotes the output space of the random algorithm M; Pr[M(D) ∈ O] denotes the probability that the result obtained by applying the random algorithm M to dataset D falls in O, and Pr[M(D') ∈ O] denotes the probability that the result obtained by applying M to dataset D' falls in O. ε denotes the preset degree of privacy protection, ε > 0.
The main objective of the secret-specification-based differential privacy SSDP proposed by the present application is to protect the sensitive records in the set dataset. Its distinguishing characteristic is that the sensitivity of a record's attributes is determined by the user and is independent of the recorded values of those attributes. This means that changing the value of a sensitive record does not affect its sensitivity, which yields a symmetric neighborhood relationship. As set forth in Definition 5, for the sensitive attributes of a record, secret-specification-based differential privacy SSDP ensures the same level of privacy protection as traditional differential privacy DP and therefore resists strong attacks.
The present application proposes a Laplace mechanism based on the secret specification (Secret Specification-based Laplace Mechanism, SSLM) as a basic mechanism for realizing the secret-specification-based differential privacy SSDP scheme, and applies this mechanism to the mean query of a database or set dataset.
The secret-specification-based Laplace mechanism SSLM satisfies the following theorems:
Theorem 1 (secret-specification-based Laplace mechanism SSLM): given a function f: D → R and the users' secret specification set S, the mechanism M_SSLM(D) = f(D) + Lap(Δf_S/ε) satisfies ε-SSDP, where Δf_S denotes the S-sensitivity of f, R denotes the value range of the function f (the set of real numbers when f is a median query function or a mean query function), and Lap(Δf_S/ε) denotes noise drawn from the Laplace distribution with scale Δf_S/ε.
Theorem 2 (secret-specification-based Laplace mechanism SSLM): when the query function f is the mean function f_mean, i.e., M_SSLM(D) = f_mean(D) + Lap(Δf_S/ε), SSLM satisfies ε-SSDP.
The present application also provides an exponential mechanism based on the secret specification (Secret Specification-based Exponential Mechanism, SSEM) as a basic mechanism for realizing the SSDP scheme, and applies it to the median query of a database or set dataset.
Consider a query function f: D → O, where the query function may be a median query function. The real-valued score function over the output space O of f can be expressed as s(D, o) = -min_{D*: f(D*) = o} |Δ_s(D, D*)|, where D* denotes a dataset satisfying f(D*) = o that is formed by changing the record values of any number of sensitive records in dataset D; Δ_s(D, D*) denotes the set of sensitive records whose values differ between D* and D; and |Δ_s(D, D*)| denotes the cardinality of this set, i.e., the number of sensitive records whose values are changed in transforming dataset D into D*. The score function s(D, o) is non-positive; the formula means that the score of o is the negative of the minimum number of sensitive records that must be changed (the fewer sensitive records changed, the larger s(D, o)). Δs is the global sensitivity of the score function s; for basic statistical functions such as the median query function, Δs = 1.
Definition 6 (exponential mechanism EM): let O denote the set of all possible outputs of a random algorithm M, i.e., the output space. For a score function s: (D, o) → R, if M generates an output o ∈ O with probability proportional to exp(ε·s(D, o)/(2Δs)), then M satisfies ε-DP, where ε-DP is defined with respect to the adjacent datasets D and D', and R denotes the real number domain.
According to the secret specification set S, the present application modifies the score function s(D, o) of the original exponential mechanism and expresses it in another way, namely u_S(D, o) = -ε · min_{D*: f(D*) = o} |Δ_s(D, D*)|, where r ∈ Δ_s(D, D*) denotes a sensitive record associated with a one-dimensional attribute whose value differs between D and D*. In other words, among all datasets D* obtained from dataset D by changing only sensitive records such that f(D*) = o, the one with the minimum number of changed sensitive records is taken, and that minimum number multiplied by -ε gives the (maximum) value u_S(D, o).
Definition 7 (secret-specification-based exponential mechanism SSEM): given a function f: D → O and the users' secret specification set S, the mechanism M_SSEM outputs o ∈ O with probability Pr[M_SSEM(D) = o] = exp(u_S(D, o)/2) / Σ_{z ∈ O} exp(u_S(D, z)/2), where z, o ∈ O, o denotes the median query result output according to this probability, and exp(·) denotes the exponential function.
Privacy analysis:
Theorem 2 (secret-specification-based Laplace mechanism SSLM): when the query function f is the mean function f_mean, i.e., M_SSLM(D) = f_mean(D) + Lap(Δf_S/ε), SSLM satisfies ε-SSDP and meets the privacy requirement.
Theorem 3: when the query function f is the median function f_med, i.e., M_SSEM outputs o with probability proportional to exp(u_S(D, o)/2), SSEM satisfies ε-SSDP and meets the privacy requirement.
The application provides a data query protection method of a personalized privacy protection mechanism based on secret specifications, as shown in fig. 1, in a preferred embodiment, the method comprises the following steps:
In step S101, the access device sends a query instruction to the data storage device. The access device is preferably but not limited to a mobile terminal or a PC or a notebook. The data storage device is preferably, but not limited to, a data server or cloud server. The access device and the data storage device communicate via an internet connection.
The data storage device performs:
Step S102, receiving and analyzing the query instruction to obtain a query function and a query attribute. The query instruction contains a query function and a query attribute; because the record of each user in the set dataset comprises multidimensional attributes, if, for example, the average age or average annual income is to be obtained from the census dataset, the age attribute dataset or the annual income attribute dataset needs to be processed.
Step S103, extracting a query attribute dataset from the set dataset, including:
The set dataset is denoted D_0 = (r_1, …, r_i, …, r_n), where n denotes the number of users, r_i denotes the record of user u_i, the user index i ∈ [1, n], r_i = (r_i[A_1], …, r_i[A_k]), k denotes the attribute dimension, r_i takes values in the attribute domain, and r_i[A_j] denotes the record of user u_i on attribute A_j, with attribute index j ∈ [1, k]. If the query attribute is attribute A_j, the query attribute dataset is D = (r_1[A_j], r_2[A_j], …, r_n[A_j]).
Step S104, a user secret specification set is obtained, and the query attribute dataset is divided into a sensitive subset and a non-sensitive subset based on the user secret specification set. The user secret specification set comprises the binary secret specification functions of all users over all attributes; after the query attribute is parsed, the secret specifications for the query attribute are extracted from the user secret specification set, and the query attribute dataset is divided into a sensitive subset and a non-sensitive subset based on the extracted secret specifications. Specifically, step S104 comprises:
Step S1041, the user secret specification set is S = {S_1, …, S_n}, and the secret specification of user u_i is defined as a binary function S_i: r_i → {0,1}^k. If user u_i defines the record of attribute A_j as a sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 1 on attribute A_j; if user u_i defines the record of attribute A_j as a non-sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 0 on attribute A_j.
Step S1042, with attribute A_j set as the query attribute, the following is executed for all users: if the secret specification of user u_i takes the value S_i(A_j) = 1 on attribute A_j, the record value r_i[A_j] of attribute A_j of user u_i is placed into the sensitive subset D_s; if the secret specification of user u_i takes the value S_i(A_j) = 0 on attribute A_j, the record value r_i[A_j] of attribute A_j of user u_i is placed into the non-sensitive subset D_ns.
Through step S104, the query attribute dataset is divided into the sensitive subset and the non-sensitive subset; in the mean query or the median query, the sensitive subset is protected while the non-sensitive subset is not, so that the distortion of the query data is reduced while the sensitive data are protected, and the utility of the query data is improved.
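For illustration only, a minimal sketch of the partition performed in steps S1041 to S1042 for a single query attribute (k = 1) is given below; the list-based representation and the function name partition_by_secret_spec are assumptions of this sketch, not part of the claimed method.

```python
def partition_by_secret_spec(values, specs):
    """Split the query-attribute dataset D into the sensitive subset D_s and the
    non-sensitive subset D_ns (steps S1041 to S1042, single attribute, k = 1).

    values : record values r_i[A_j] of the query attribute, one per user
    specs  : secret specification values S_i(A_j) in {0, 1}, one per user (1 = sensitive)
    """
    D_s = [v for v, s in zip(values, specs) if s == 1]
    D_ns = [v for v, s in zip(values, specs) if s == 0]
    return D_s, D_ns

# Hypothetical usage: seven age records, four of which their owners mark as sensitive.
ages = [23, 29, 45, 52, 31, 40, 67]
specs = [1, 0, 1, 0, 1, 0, 1]
D_s, D_ns = partition_by_secret_spec(ages, specs)  # D_s = [23, 45, 31, 67], D_ns = [29, 52, 40]
```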
Step S105, when the query function is a mean query function, a mean query result of a query attribute dataset is obtained according to a pre-constructed Laplace mechanism based on a secret specification;
When the query function is a median query function, a median query result of the query attribute dataset is obtained according to a pre-constructed exponential mechanism based on the secret specification.
Step S106, issuing a mean query result and/or a median query result of the query attribute dataset to the access device.
In this embodiment, preferably, in step S105, when the query function is a mean query function, the obtaining the mean query result of the query attribute dataset according to the pre-constructed laplace mechanism based on the secret specification includes:
Step A1, calculating the mean f_mean(D) of the query attribute dataset from the sensitive subset D_s and the non-sensitive subset D_ns:
f_mean(D) = (|D_s| · f_mean(D_s) + |D_ns| · f_mean(D_ns)) / |D|,
wherein |·| denotes the cardinality of a dataset, i.e., the number of data items in the dataset; f_mean(·) denotes the mean query function, f_mean(D_s) denotes the mean of the sensitive subset D_s, and f_mean(D_ns) denotes the mean of the non-sensitive subset D_ns.
Step A2, constructing a dataset D' that is S-adjacent to the query attribute dataset D (specifically, constructed according to Definition 3 above), and obtaining the global sensitivity Δf_S of the mean query function:
Δf_S = max_{S-adjacent D, D'} ||f_mean(D) - f_mean(D')||_1,
wherein f_mean(D') denotes the mean of the S-adjacent dataset D' and ||·||_1 denotes the L_1 norm.
Step A3, adding noise η ~ Lap(Δf_S/ε) satisfying the Laplace distribution to the mean f_mean(D) of the query attribute dataset to obtain the mean query result:
M_SSLM(D) = f_mean(D) + η,
wherein Lap(·) denotes the Laplace distribution, ε denotes the preset degree of privacy protection, and ε > 0.
It can be seen that using the non-sensitive subset D_ns exactly improves the utility of the mean query result and minimizes data distortion, while the contribution of the sensitive subset D_s to the mean is privacy-protected by the noise satisfying the Laplace distribution, so that personalized privacy protection is realized and subsequent data analysis remains convenient and effective.
In a simplified application scenario of this embodiment, the specific procedure of mean value query is as follows:
Step 101: input the dataset D, the users' secret specification set S, the privacy budget ε, and the mean query function f_mean.
Step 102: for the mean query, consider the case where each record is associated with only one attribute, i.e., k = 1, in which case the sensitivity of a user's record coincides with the sensitivity of the record's attribute. The users' secret specification set S divides the dataset D = (r_1, …, r_n) into two subsets, namely the sensitive data subset D_s and the non-sensitive data subset D_ns. Based on this, the sensitive subset and the non-sensitive subset of the dataset are initialized as D_s = ∅ and D_ns = ∅.
Step 103: based on the users' secret specification set S, the sensitive subset D_s and the non-sensitive subset D_ns are obtained. Without loss of generality, assume D_s = (r_1, r_2, …, r_m) and D_ns = (r_{m+1}, …, r_n), and that each record is associated with a numerical attribute. Based on the obtained D_s and D_ns, the mean query result for dataset D is f_mean(D) = (|D_s| · f_mean(D_s) + |D_ns| · f_mean(D_ns)) / |D|, where |·| denotes the cardinality of a dataset.
Step 104: let the datasets D = D_s ∪ D_ns and D' = D'_s ∪ D'_ns be S-adjacent; then |D_s| = |D'_s| and D_ns = D'_ns. First, the S-sensitivity Δf_S of the mean query function is calculated; since the non-sensitive records in dataset D have no S-adjacent records, Δf_S = max_{S-adjacent D, D'} ||f_mean(D) - f_mean(D')||_1. It is worth emphasizing that Δf_S does not exceed the global sensitivity Δf of the mean query. Then, noise drawn from the Laplace distribution Lap(Δf_S/ε) is added to the mean query result f(D_s) calculated based on the sensitive subset D_s.
Step 105: returning noise mean query results
In this embodiment, preferably, in step S105, when the query function is a median query function, the median query result of the query attribute dataset is obtained according to a pre-constructed exponential mechanism based on the secret specification, including:
Step B1, the record values of the query attribute records in the query attribute dataset D (i.e., the values r_i[A_j]) are sorted in ascending order, i.e., the sorted query attribute dataset D satisfies r_i ≤ r_{i+1}.
Step B2, let the cardinality of the sorted query attribute dataset D be |D| = 2m + 1, with the first intermediate parameter m = (|D| - 1)/2, and let O denote the median output space of the query attribute dataset D. The median query function returns record values in the query dataset D as median query results; therefore r_i ∈ O, i.e., the record values belong to the median output space O.
A dataset D' that is S-adjacent to the query attribute dataset D is constructed (specifically, according to Definition 3 above). Since the datasets D = D_s ∪ D_ns and D' = D'_s ∪ D'_ns are assumed to be S-adjacent, it follows that |D_s| = |D'_s| and D_ns = D'_ns, where D'_s denotes the sensitive subset of the S-adjacent dataset D' and D'_ns denotes the non-sensitive subset of the S-adjacent dataset D'. It must be emphasized that, according to the secret specification S, the non-sensitive records in dataset D have no S-adjacent records, so changes of non-sensitive records are not considered.
Step B3, for any output median o ∈ O, the score function value u_S(D, o) of the median o is calculated according to the following formula:
u_S(D, o) = -ε · min_{D*: f_med(D*) = o} |Δ_s(D, D*)|,
wherein D* denotes a dataset satisfying f_med(D*) = o that is formed by changing the record values of any number of sensitive records in the query attribute dataset D (the non-sensitive record values are not changed in this process); Δ_s(D, D*) denotes the set of sensitive records whose values differ between D* and D; r denotes a sensitive record belonging to Δ_s(D, D*); min_{D*: f_med(D*) = o} |Δ_s(D, D*)| denotes the minimum number of sensitive records in Δ_s(D, D*), i.e., the minimum number of sensitive records of D whose values must be changed, and this minimum number multiplied by -ε gives the score function value of the median o; f_med(·) denotes the median query function, ε denotes the preset degree of privacy protection, and ε > 0.
Further preferably, in order to obtain the score function value u_S(D, o) of the median o rapidly, it can be shown by reasoning that the score function value can be obtained quickly in the following way: for any record r_i ∈ D with r_i ∈ O, u_S(D, r_i) is calculated in three cases:
(1) if i < m, meaning that m - i sensitive records to the right of record r_i are changed, then u_S(D, r_i) = -ε·(m - i);
(2) if i = m, meaning that no sensitive record is changed, then u_S(D, r_i) = 0;
(3) if i > m, meaning that i - m sensitive records to the left of record r_i are changed, then u_S(D, r_i) = -ε·(i - m).
Step B4, the output probabilities of all medians in the median output space O are calculated; the output probability of median o is set as: Pr[M_SSEM(D) = o] = exp(u_S(D, o)/2) / Σ_{z ∈ O} exp(u_S(D, z)/2).
Step B5, a median output according to the output probabilities of all medians in the median output space O is taken as the median query result M_SSEM(D).
In a simplified application scenario of this embodiment, the median query specific process is as follows:
Step 101: input the dataset D, the users' secret specification set S, the privacy budget ε, and the median query function f_med.
Step 102: for the median query, consider the case where each record is associated with only one attribute, i.e., k = 1, in which case the sensitivity of a user's record coincides with the sensitivity of the record's attribute. The users' secret specification set S divides the dataset D = (r_1, …, r_n) into two subsets, namely the sensitive data subset D_s and the non-sensitive data subset D_ns. Based on this, the sensitive subset and the non-sensitive subset of the dataset are initialized as D_s = ∅ and D_ns = ∅.
Step 103: based on the users' secret specification set S, the sensitive subset D_s and the non-sensitive subset D_ns are obtained. Without loss of generality, assume |D| = 2m + 1 and r_i ≤ r_{i+1}, where |·| denotes the cardinality of the dataset. The median function f_med returns the record ranked m in dataset D, so the true median of dataset D is f_med(D) = r_m.
Step 104: since the datasets D = D_s ∪ D_ns and D' = D'_s ∪ D'_ns are assumed to be S-adjacent, it follows that |D_s| = |D'_s| and D_ns = D'_ns. It must be emphasized that, according to the secret specification S, the non-sensitive records in dataset D have no S-adjacent records, so changes of non-sensitive records are not considered.
For any record r_i ∈ D, u_S(D, r_i) is calculated in three cases:
(1) if i < m, meaning that m - i sensitive records to the right of record r_i are changed, then u_S(D, r_i) = -ε·(m - i);
(2) if i = m, meaning that no record is changed, then u_S(D, r_i) = 0;
(3) if i > m, meaning that i - m sensitive records to the left of record r_i are changed, then u_S(D, r_i) = -ε·(i - m).
Step 105: returning noise median query results
For ease of detailed description, two examples of calculating the score function value u_S(D, o) of the median o in the process of obtaining the median query result are given below.
Example 1: D = (1, 2, 3, 4, 5, 6, 7), D_s = (1, 3, 5, 6), D_ns = (2, 4, 7), ε = 1, O = (1, 2, 3, 4, 5, 6, 7); FIG. 2 illustrates the score function value of each possible output median, as well as the probability of each median.
Example 2: D = (1, 2, 3, 4, 5, 6, 7), D_s = (1, 3, 5), D_ns = (2, 4, 6, 7), ε = 1, O = (1, 2, 3, 4, 5, 6, 7); FIG. 3 shows the score function value of each possible output median, as well as the probability of each median.
The tables in FIGS. 2 and 3 give specific examples of the calculation process of u_S(D, o). The two tables differ in the secret specifications applied to the query dataset D, which results in different proportions of sensitive records (sensitive attribute records) in D. As is evident from the results in the tables of FIGS. 2 and 3, when the proportion of sensitive records is relatively small (less than 50%), some record values in D cannot be output as the median. This phenomenon is illustrated in FIG. 3, in which the record value 1 cannot be output as the median because fewer than m - i sensitive records lie to its right; this increases the output probability of the other record values (possible medians) and thus enhances data utility.
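For illustration only, a minimal sketch of the SSEM median query (steps B1 to B5) applied to Example 1 is given below. The handling of candidates for which too few sensitive records are available on the required side (they are assigned probability 0, consistent with the FIG. 3 discussion), the 0-based indexing and all function names are assumptions of this sketch.

```python
import math
import random

def ssem_scores(values, specs, epsilon):
    """Score each candidate median o = r_i with u_S(D, o) = -epsilon * (minimum number
    of sensitive records that must be changed so that r_i becomes the median).

    values : record values of the query attribute, sorted in non-decreasing order
    specs  : S_i(A_j) in {0, 1} for each record, aligned with `values` (1 = sensitive)
    """
    n = len(values)
    m = n // 2                      # 0-based index of the true median (|D| = 2m + 1)
    scores = {}
    for i in range(n):
        if i == m:
            scores[i] = 0.0
            continue
        if i < m:
            needed = m - i          # records to the right of r_i must move below it
            available = sum(specs[i + 1:])
        else:
            needed = i - m          # records to the left of r_i must move above it
            available = sum(specs[:i])
        # Candidates without enough sensitive records on the required side are infeasible.
        scores[i] = -epsilon * needed if available >= needed else -math.inf
    return scores

def ssem_sample(values, specs, epsilon):
    """Sample a noisy median with probability proportional to exp(u_S(D, o) / 2)."""
    scores = ssem_scores(values, specs, epsilon)
    weights = [math.exp(s / 2) if s != -math.inf else 0.0 for s in scores.values()]
    return random.choices(values, weights=weights, k=1)[0]

# Example 1 from the description: D = (1..7), D_s = (1, 3, 5, 6), epsilon = 1.
D = [1, 2, 3, 4, 5, 6, 7]
S = [1, 0, 1, 0, 1, 1, 0]           # 1 where the record value belongs to D_s
print(ssem_scores(D, S, epsilon=1))  # under this sketch: -3, -2, -1, 0, -1, -2, -3
print(ssem_sample(D, S, epsilon=1))
```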
The invention also discloses a method for protecting the query of the census data set, and in a preferred embodiment, as shown in fig. 4, the method comprises the following steps:
In step S201, the access device sends a query instruction to a data storage device storing a population census data set. The access device is preferably but not limited to a mobile terminal or a PC or a notebook. The data storage device is preferably, but not limited to, a data server or cloud server for government or public security systems. The access device and the data storage device communicate via an internet connection.
The data storage device performs:
Step S202, receiving and analyzing a query instruction to obtain a query function and a query attribute, wherein the query attribute comprises age and annual income; the query attributes may also include health status, educational background, identification numbers, etc.
Step S203, extracting a query attribute dataset from the census dataset, comprising:
The census dataset is recorded as the set dataset and is denoted D_0 = (r_1, …, r_i, …, r_n), where n denotes the number of users, r_i denotes the record of user u_i, the user index i ∈ [1, n], r_i = (r_i[A_1], …, r_i[A_k]), k denotes the attribute dimension, r_i takes values in the attribute domain, and r_i[A_j] denotes the record of user u_i on attribute A_j, with attribute index j ∈ [1, k]. If the query attribute is attribute A_j, the query attribute dataset is D = (r_1[A_j], r_2[A_j], …, r_n[A_j]).
Step S204, a user secret specification set is obtained, and the query attribute dataset is divided into a sensitive subset and a non-sensitive subset based on the user secret specification set. The user secret specification set comprises the binary secret specification functions of all users over all attributes; after the query attribute is parsed, the secret specifications for the query attribute are extracted from the user secret specification set, and the query attribute dataset is divided into a sensitive subset and a non-sensitive subset based on the extracted secret specifications. Specifically, step S204 comprises:
Step S2041, the user secret specification set is S = {S_1, …, S_n}, and the secret specification of user u_i is defined as a binary function S_i: r_i → {0,1}^k. If user u_i defines the record of attribute A_j as a sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 1 on attribute A_j; if user u_i defines the record of attribute A_j as a non-sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 0 on attribute A_j.
Step S2042, with attribute A_j set as the query attribute, the following is executed for all users: if the secret specification of user u_i takes the value S_i(A_j) = 1 on attribute A_j, the record value r_i[A_j] of attribute A_j of user u_i is placed into the sensitive subset D_s; if the secret specification of user u_i takes the value S_i(A_j) = 0 on attribute A_j, the record value r_i[A_j] of attribute A_j of user u_i is placed into the non-sensitive subset D_ns.
Through step S204, the query attribute dataset is divided into the sensitive subset and the non-sensitive subset; in the mean query or the median query, the sensitive subset is protected while the non-sensitive subset is not, so that the distortion of the query data is reduced while the sensitive data are protected, and the utility of the query data is improved.
Step S205, when the query function is a mean query function, a mean query result of the query attribute dataset is obtained according to a pre-constructed Laplace mechanism based on a secret specification;
When the query function is a median query function, a median query result of the query attribute dataset is obtained according to a pre-constructed exponential mechanism based on a secret specification;
step S206, issuing a mean query result and/or a median query result of the query attribute dataset to the access device.
In this embodiment, preferably, in step S205, when the query function is a mean query function, a mean query result of the query attribute dataset is obtained according to a pre-constructed laplace mechanism based on a secret specification, including:
Step C1, calculating the mean f_mean(D) of the query attribute dataset from the sensitive subset D_s and the non-sensitive subset D_ns:
f_mean(D) = (|D_s| · f_mean(D_s) + |D_ns| · f_mean(D_ns)) / |D|,
wherein |·| denotes the cardinality of a dataset, f_mean(·) denotes the mean query function, f_mean(D_s) denotes the mean of the sensitive subset D_s, and f_mean(D_ns) denotes the mean of the non-sensitive subset D_ns.
Step C2, constructing a dataset D' that is S-adjacent to the query attribute dataset D (specifically, constructed according to Definition 3 above), and obtaining the global sensitivity Δf_S of the mean query function:
Δf_S = max_{S-adjacent D, D'} ||f_mean(D) - f_mean(D')||_1,
wherein f_mean(D') denotes the mean of the S-adjacent dataset D' and ||·||_1 denotes the L_1 norm.
Step C3, adding noise η ~ Lap(Δf_S/ε) satisfying the Laplace distribution to the mean f_mean(D) of the query attribute dataset to obtain the final mean query result:
M_SSLM(D) = f_mean(D) + η,
wherein Lap(·) denotes the Laplace distribution, ε denotes the preset degree of privacy protection, and ε > 0.
It can be seen that using the non-sensitive subset D_ns exactly improves the utility of the mean query result and minimizes data distortion, while the contribution of the sensitive subset D_s to the mean is privacy-protected by the noise satisfying the Laplace distribution, so that personalized privacy protection is realized and subsequent data analysis remains convenient and effective.
In this embodiment, preferably, in step S205, when the query function is a median query function, the median query result of the query attribute dataset is obtained according to a pre-constructed exponential mechanism based on the secret specification, including:
Step D1, the record values of the query attribute records in the query attribute dataset D (i.e., the values r_i[A_j]) are sorted in ascending order, i.e., the sorted query attribute dataset D satisfies r_i ≤ r_{i+1}.
Step D2, let the cardinality of the sorted query attribute dataset D be |D| = 2m + 1, with the first intermediate parameter m = (|D| - 1)/2, and let O denote the median output space of the query attribute dataset D. The median query function returns record values in the query dataset D as median query results; therefore r_i ∈ O, i.e., the record values belong to the median output space O.
A dataset D' that is S-adjacent to the query attribute dataset D is constructed (specifically, according to Definition 3 above). Since the datasets D = D_s ∪ D_ns and D' = D'_s ∪ D'_ns are assumed to be S-adjacent, it follows that |D_s| = |D'_s| and D_ns = D'_ns, where D'_s denotes the sensitive subset of the S-adjacent dataset D' and D'_ns denotes the non-sensitive subset of the S-adjacent dataset D'. It must be emphasized that, according to the secret specification S, the non-sensitive records in dataset D have no S-adjacent records, so changes of non-sensitive records are not considered.
Step D3, for any output median o ∈ O, the score function value u_S(D, o) of the median o is calculated according to the following formula:
u_S(D, o) = -ε · min_{D*: f_med(D*) = o} |Δ_s(D, D*)|,
wherein D* denotes a dataset satisfying f_med(D*) = o that is formed by changing the record values of any number of sensitive records in the query attribute dataset D (the non-sensitive record values are not changed in this process); Δ_s(D, D*) denotes the set of sensitive records whose values differ between D* and D; r denotes a sensitive record belonging to Δ_s(D, D*); min_{D*: f_med(D*) = o} |Δ_s(D, D*)| denotes the minimum number of sensitive records in Δ_s(D, D*), i.e., the minimum number of sensitive records of D whose values must be changed, and this minimum number multiplied by -ε gives the score function value of the median o; f_med(·) denotes the median query function, ε denotes the preset degree of privacy protection, and ε > 0.
Further preferably, in order to obtain the score function value u_S(D, o) of the median o rapidly, it can be shown by reasoning that the score function value can be obtained quickly in the following way: for any record r_i ∈ D with r_i ∈ O, u_S(D, r_i) is calculated in three cases:
(1) if i < m, meaning that m - i sensitive records to the right of record r_i are changed, then u_S(D, r_i) = -ε·(m - i);
(2) if i = m, meaning that no sensitive record is changed, then u_S(D, r_i) = 0;
(3) if i > m, meaning that i - m sensitive records to the left of record r_i are changed, then u_S(D, r_i) = -ε·(i - m).
Step D4, the output probabilities of all medians in the median output space O are calculated; the output probability of median o is set as: Pr[M_SSEM(D) = o] = exp(u_S(D, o)/2) / Σ_{z ∈ O} exp(u_S(D, z)/2).
Step D5, a median output according to the output probabilities of all medians in the median output space O is taken as the median query result M_SSEM(D).
Utility experiments and analysis are carried out on the census dataset query protection method provided by the present application.
Experimental background: experimental verification was performed on the 2012 U.S. Census dataset. Specifically, 1000 and 10000 records were randomly selected from the 2012 U.S. Census dataset, and the mean query was evaluated on the record attribute Age. Furthermore, 1001 and 10001 records were randomly selected from the 2012 U.S. Census dataset, and the median query was evaluated on the record attribute Annual Income. The parameter δ denotes the proportion of non-sensitive records in the dataset (default δ = 0.8), and the parameter ε denotes the degree of privacy protection.
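As an illustration of how such an RMSE evaluation could be estimated, a sketch is given below. The number of repetitions, the synthetic Age values, the sensitivity and the ε value are placeholders and not the experimental settings of the present application, which are not fully specified in the text; the mechanism perturbed here is a generic Laplace-noised mean used only to demonstrate the harness.

```python
import numpy as np

def rmse_of_mechanism(true_value, mechanism, runs=1000):
    """Estimate the RMSE of a randomized query mechanism over repeated runs (sketch).

    mechanism : a zero-argument callable returning one noisy query result
    """
    errors = np.array([mechanism() - true_value for _ in range(runs)])
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical usage with synthetic Age values; sensitivity and epsilon are placeholders.
rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1000)
sensitivity = (90 - 18) / len(ages)   # assumed sensitivity of the mean over this synthetic range
noisy_mean = lambda: ages.mean() + rng.laplace(scale=sensitivity / 0.1)  # epsilon = 0.1
print(rmse_of_mechanism(ages.mean(), noisy_mean, runs=2000))
```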
The data utility of SSLM was evaluated using the root mean square error (Root Mean Square Error, RMSE) as the evaluation index. FIGS. 5 and 6 show the corresponding variation trends of the RMSE of SSLM (mean query scenario) proposed by the present invention as the parameters δ and ε vary, on 2012 U.S. Census data subsets of different sizes, where the data size of sub-graph (a) in FIG. 5 and sub-graph (a) in FIG. 6 is 1000, and the data size of sub-graph (b) in FIG. 5 and sub-graph (b) in FIG. 6 is 10000. FIGS. 7 and 8 show the corresponding variation trends of the RMSE of SSEM (median query scenario) proposed by the present invention as the parameters δ and ε vary, on 2012 U.S. Census data subsets of different sizes, where the data size of sub-graph (a) in FIG. 7 and sub-graph (a) in FIG. 8 is 1001, and the data size of sub-graph (b) in FIG. 7 and sub-graph (b) in FIG. 8 is 10001.
FIG. 5 shows how the proportion δ of non-sensitive attribute values affects the data utility of the proposed SSLM. It can be seen from FIG. 5 (a) and FIG. 5 (b) that, as the parameter δ increases from 0 to 0.9, the RMSE of SSLM shows a consistent decreasing trend on datasets of different sizes. In other words, the RMSE of SSLM gradually decreases and becomes increasingly better than the baseline approach (the classical Laplace mechanism, LM). In particular, when δ = 0, i.e., all attribute values in the dataset are sensitive, SSLM is equivalent to LM and exhibits the same RMSE as LM. Conversely, when δ = 1, meaning that all attribute values are non-sensitive, the RMSE of SSLM decreases to 0. Furthermore, as shown in FIG. 5 (a) and FIG. 5 (b), compared with LM, the utility of SSLM on the 2012 U.S. Census dataset subsets with data sizes 1000 and 10000 is improved by about 6 times and 5 times, respectively, when δ = 0.8, and by about 14 times and 10 times, respectively, when δ = 0.9. The utility is significantly improved because in SSLM the non-sensitive attribute values remain unchanged in the mean query response, thereby improving the overall accuracy of the mean.
The parameter ε denotes the privacy budget of the users associated with the sensitive record attribute. A higher ε value means a lower level of privacy protection and therefore higher utility. As shown in FIG. 6 (a) and FIG. 6 (b), as ε increases from 0.01 to 0.5, the RMSE of both LM and SSLM (default δ = 0.8) decreases on datasets of different sizes. The main point conveyed by FIG. 6 is that SSLM is significantly better than the baseline mechanism LM under any value of ε. Furthermore, when ε > 0.2, SSLM improves utility over LM by a factor of about 2 on the 2012 U.S. Census data subset with data size 1000 and by a factor of about 3 at data size 10000.
Similar to the effect of the parameter δ on the mean query, as shown in FIG. 7, the data utility of SSEM significantly exceeds that of the baseline method EM as δ increases from 0 to 0.9. Considering the boundary case, SSEM is equivalent to EM when δ = 0. In addition, FIG. 7 (a) and FIG. 7 (b) show that when the parameter δ ∈ (0, 0.5) the RMSE of SSEM is comparable to that of EM, whereas when δ ∈ (0.5, 0.9) SSEM exhibits a significant improvement in data utility over EM.
In FIG. 8, the effect of the parameter ε on the median query is similar to that shown in FIG. 6. As expected, the RMSE of both SSEM and EM decreases on datasets of different sizes as ε increases. Notably, in FIG. 8 (a) and FIG. 8 (b), SSEM shows a significant utility improvement compared with EM. Specifically, when ε > 0.2, as shown in FIG. 8 (a), SSEM achieves about a 2-fold improvement in utility over EM on the 2012 U.S. Census data subset with data size 1001.
The data query protection method provided by the application has the following technical effects:
1. We have introduced a new privacy definition SSDP that enables individuals to better control their private information, ensuring that only data marked as sensitive by users are privacy protected.
2. By allowing individuals to independently define secret specifications about their own data, SSDP achieves personalized privacy protection, facilitating efficient data analysis.
3. We provide a specific SSDP mechanism for mean value queries, improving the accuracy of data analysis while minimizing data distortion, especially when a significant portion of the data is non-sensitive, thereby better exploring the trade-off between privacy and utility.
4. We evaluate SSLM and SSEM performance by comparative experiments on real datasets. Experimental results indicate that SSLM improves utility by approximately a factor of 14 for mean queries by using non-sensitive data compared to the most advanced DP mechanism. SSEM improves utility by approximately 6-fold for median queries by using non-sensitive data.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (2)

1. The data query protection method of the personalized privacy protection mechanism based on the secret specification is characterized by comprising the following steps:
The access device sends a query instruction to the data storage device;
The data storage device performs:
receiving and analyzing a query instruction to obtain a query function and a query attribute;
extracting a query attribute dataset from the set dataset, comprising:
the set dataset is denoted D_0 = (r_1, …, r_i, …, r_n), wherein n denotes the number of users, r_i denotes the record of user u_i, the user index i ∈ [1, n], r_i = (r_i[A_1], …, r_i[A_k]), k denotes the attribute dimension, r_i takes values in the attribute domain, and r_i[A_j] denotes the record of user u_i on attribute A_j, the attribute index j ∈ [1, k];
if the query attribute is attribute A_j, the query attribute dataset is D = (r_1[A_j], r_2[A_j], …, r_n[A_j]);
Obtaining a set of user secret specifications, dividing the query attribute data set into a sensitive subset and a non-sensitive subset based on the set of user secret specifications, comprising:
the user secret specification set is S = {S_1, …, S_n}, and the secret specification of user u_i is defined as a binary function S_i: r_i → {0,1}^k; if user u_i defines the record of attribute A_j as a sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 1 on attribute A_j, and if user u_i defines the record of attribute A_j as a non-sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 0 on attribute A_j;
with attribute A_j set as the query attribute, the following is executed for all users: if the secret specification of user u_i takes the value S_i(A_j) = 1 on attribute A_j, the record value r_i[A_j] of attribute A_j of user u_i is placed into the sensitive subset D_s, and if the secret specification of user u_i takes the value S_i(A_j) = 0 on attribute A_j, the record value r_i[A_j] of attribute A_j of user u_i is placed into the non-sensitive subset D_ns;
when the query function is a mean query function, obtaining a mean query result of the query attribute dataset according to a pre-constructed laplace mechanism based on a secret specification, including:
step A1, calculating the mean f_mean(D) of the query attribute dataset from the sensitive subset D_s and the non-sensitive subset D_ns: f_mean(D) = (|D_s| · f_mean(D_s) + |D_ns| · f_mean(D_ns)) / |D|,
wherein |·| denotes the cardinality of a dataset, f_mean(·) denotes the mean query function, f_mean(D_s) denotes the mean of the sensitive subset D_s, and f_mean(D_ns) denotes the mean of the non-sensitive subset D_ns;
step A2, constructing a dataset D' that is S-adjacent to the query attribute dataset D, and obtaining the global sensitivity Δf_S of the mean query function: Δf_S = max_{S-adjacent D, D'} ||f_mean(D) - f_mean(D')||_1,
wherein f_mean(D') denotes the mean of the S-adjacent dataset D' and ||·||_1 denotes the L_1 norm;
step A3, adding noise η ~ Lap(Δf_S/ε) satisfying the Laplace distribution to the mean f_mean(D) of the query attribute dataset, obtaining the mean query result M_SSLM(D) = f_mean(D) + η,
wherein Lap(·) denotes the Laplace distribution, ε denotes the preset privacy protection degree, and ε > 0;
When the query function is a median query function, obtaining a median query result of the query attribute dataset according to a pre-constructed exponential mechanism based on a secret specification, including:
Step B1, sorting the record values in the query attribute dataset D in ascending order;
Step B2, setting the cardinality of the ordered query attribute dataset D to |D| = 2m + 1, with a first intermediate parameter m = (|D| − 1)/2, and letting O denote the median output space of the query attribute dataset D;
Step B3, for any output median o ∈ O, calculating the score function value u(D, o) of the median o according to the following formula:
u(D, o) = ε · min_{D*: f_med(D*) = o} |Δ_s(D, D*)|,
where D* denotes a dataset satisfying f_med(D*) = o formed by changing the record values of any number of sensitive records in the query attribute dataset D; Δ_s(D, D*) denotes the set of sensitive records on which D and D* differ; r denotes a sensitive record belonging to Δ_s(D, D*); the score function value of the median o is the product of ε and the minimum number of sensitive records in Δ_s(D, D*); f_med(·) denotes the median query function; ε denotes the preset degree of privacy protection, and ε > 0;
Step B4, calculating the output probability of every median in the median output space O, the output probability of a median o being set to:
Pr[o] = exp(−u(D, o)/2) / Σ_{o′ ∈ O} exp(−u(D, o′)/2);
Step B5, sampling a median from the median output space O according to the output probabilities of all medians and taking the sampled median as the median query result f̃_med(D);
And issuing a mean query result and/or a median query result of the query attribute dataset to the access device.
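Purely as an illustration of the mean-query branch of claim 1 (steps A1 to A3), the following Python sketch splits one attribute's record values into sensitive and non-sensitive subsets according to per-user secret specifications and then perturbs the exact mean with Laplace noise. It is not part of the claims; the function names (split_by_secret_spec, ss_laplace_mean), the list-based data layout, and the fact that the global sensitivity of step A2 is supplied by the caller rather than derived from the neighboring-dataset construction are simplifying assumptions of this sketch.

import numpy as np

def split_by_secret_spec(records, secret_flags):
    """Split one attribute's record values into sensitive / non-sensitive subsets.

    records      : list of numeric record values, one per user
    secret_flags : list of 0/1 values, the secret specification S_i(A_j) per user
    """
    sensitive = [r for r, s in zip(records, secret_flags) if s == 1]
    non_sensitive = [r for r, s in zip(records, secret_flags) if s == 0]
    return sensitive, non_sensitive

def ss_laplace_mean(records, secret_flags, sensitivity, epsilon, rng=None):
    """Illustrative secret-specification Laplace mechanism for a mean query.

    sensitivity : caller-supplied global sensitivity of the mean query over
                  secret-specification neighboring datasets (step A2)
    epsilon     : preset degree of privacy protection, epsilon > 0
    """
    rng = rng or np.random.default_rng()
    d_s, d_ns = split_by_secret_spec(records, secret_flags)

    # Step A1: exact mean as a cardinality-weighted combination of both subsets
    # (mathematically equal to the plain mean over all records).
    mean_s = float(np.mean(d_s)) if d_s else 0.0
    mean_ns = float(np.mean(d_ns)) if d_ns else 0.0
    true_mean = (len(d_s) * mean_s + len(d_ns) * mean_ns) / len(records)

    # Step A3: perturb the exact mean with Laplace noise of scale sensitivity / epsilon.
    return true_mean + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical usage on a toy "age" attribute with made-up secret specifications;
# sensitivity=1.0 is only a placeholder value.
ages = [34, 27, 45, 52, 61, 38, 29]
flags = [1, 0, 1, 0, 0, 1, 0]
print(ss_laplace_mean(ages, flags, sensitivity=1.0, epsilon=1.0))

In this sketch the noisy answer is released directly to the caller, which corresponds to issuing the mean query result to the access device; how the sensitivity value is actually derived follows the neighboring-dataset definition of step A2 and is not reproduced here.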
2. A method for protecting a census data set query, comprising:
The access device sends a query instruction to the data storage device which stores the census data set;
The data storage device performs:
receiving and parsing the query instruction to obtain a query function and a query attribute, wherein the query attribute comprises age and annual income;
extracting a query attribute dataset from a population census dataset, comprising:
the census dataset is noted as a set dataset, the set dataset is denoted D_0 = (r_1, …, r_i, …, r_n), where n denotes the number of users, r_i denotes the record of user u_i, the user index i ∈ [1, n], k denotes the attribute dimension, r_i is drawn from the k-dimensional attribute domain A_1 × … × A_k, r_i^{A_j} denotes the attribute-A_j record of user u_i, and the attribute index j ∈ [1, k];
if the query attribute is attribute A_j, the query attribute dataset is D = (r_1^{A_j}, …, r_n^{A_j});
Obtaining a set of user secret specifications, dividing the query attribute data set into a sensitive subset and a non-sensitive subset based on the set of user secret specifications, comprising:
a user secret specification set S = {S_1, …, S_i, …, S_n} is obtained, which includes the secret specifications of all users covered by the census dataset, wherein the secret specification of user u_i is defined as a binary function S_i : r_i → {0, 1}^k; if user u_i defines its attribute-A_j record as a sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 1 for attribute A_j, and if user u_i defines its attribute-A_j record as a non-sensitive record, the secret specification of user u_i takes the value S_i(A_j) = 0 for attribute A_j;
letting attribute A_j be the query attribute, the following is executed for all users: if the secret specification of user u_i takes the value S_i(A_j) = 1 for attribute A_j, the record value of the attribute-A_j record r_i^{A_j} of user u_i is placed into the sensitive subset D_s, and if the secret specification of user u_i takes the value S_i(A_j) = 0 for attribute A_j, the record value of the attribute-A_j record r_i^{A_j} of user u_i is placed into the non-sensitive subset D_ns;
when the query function is a mean query function, obtaining a mean query result of the query attribute dataset according to a pre-constructed Laplace mechanism based on a secret specification, including:
Step C1, calculating the mean f_mean(D) of the query attribute dataset from the sensitive subset D_s and the non-sensitive subset D_ns:
f_mean(D) = (|D_s| · f_mean(D_s) + |D_ns| · f_mean(D_ns)) / |D|,
where |·| denotes the cardinality of a dataset, f_mean(·) denotes the mean query function, f_mean(D_s) denotes the mean of the sensitive subset D_s, and f_mean(D_ns) denotes the mean of the non-sensitive subset D_ns;
Step C2, constructing a secret-specification-neighboring dataset D′ of the query attribute dataset D, and obtaining the global sensitivity of the mean query function:
Δf_mean = max_{D′} ‖f_mean(D) − f_mean(D′)‖_1,
where f_mean(D′) denotes the mean of the neighboring dataset D′ and ‖·‖_1 denotes the L1 norm;
Step C3, adding noise Lap(Δf_mean/ε) that follows the Laplace distribution to the mean f_mean(D) of the query attribute dataset, obtaining the final mean query result:
f̃_mean(D) = f_mean(D) + Lap(Δf_mean/ε),
where Lap(·) denotes the Laplace distribution probability density function, ε denotes a preset degree of privacy protection, and ε > 0;
When the query function is a median query function, obtaining a median query result of the query attribute dataset according to a pre-constructed exponential mechanism based on a secret specification, including:
Step D1, sorting the record values in the query attribute dataset D in ascending order;
Step D2, setting the cardinality of the ordered query attribute dataset D to |D| = 2m + 1, with a first intermediate parameter m = (|D| − 1)/2, and letting O denote the median output space of the query attribute dataset D;
Step D3, for any output median o ∈ O, calculating the score function value u(D, o) of the median o according to the following formula:
u(D, o) = ε · min_{D*: f_med(D*) = o} |Δ_s(D, D*)|,
where D* denotes a dataset satisfying f_med(D*) = o formed by changing the record values of any number of sensitive records in the query attribute dataset D; Δ_s(D, D*) denotes the set of sensitive records on which D and D* differ; r denotes a sensitive record belonging to Δ_s(D, D*); the score function value of the median o is the product of ε and the minimum number of sensitive records in Δ_s(D, D*); f_med(·) denotes the median query function; ε denotes the preset degree of privacy protection, and ε > 0;
Step D4, calculating the output probability of every median in the median output space O, the output probability of a median o being set to:
Pr[o] = exp(−u(D, o)/2) / Σ_{o′ ∈ O} exp(−u(D, o′)/2);
Step D5, sampling a median from the median output space O according to the output probabilities of all medians and taking the sampled median as the median query result f̃_med(D);
And issuing a mean query result and/or a median query result of the query attribute dataset to the access device.
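As a companion to the sketch following claim 1, the Python sketch below illustrates one possible reading of the median-query branch (steps B1 to B5 and D1 to D5): the score of a candidate output o is taken to be ε times the minimum number of sensitive records whose values would have to change for o to become the median, and candidates with smaller scores are sampled with probability proportional to exp(−score/2). Restricting the output space to the distinct record values, the feasibility handling when too few sensitive records are available, the sign convention in the sampling weight, and the function names (min_sensitive_changes, ss_exponential_median) are assumptions of this illustration rather than details fixed by the claims.

import math
import numpy as np

def min_sensitive_changes(values, secret_flags, o):
    """Minimum number of sensitive records whose values must change so that o
    becomes the median of an odd-sized dataset; math.inf if that is impossible
    by altering sensitive records alone."""
    n = len(values)
    m = (n - 1) // 2                                 # median is the (m+1)-th smallest value
    below = sum(v < o for v in values)
    equal = sum(v == o for v in values)
    above = n - below - equal
    need_low = max(0, (m + 1) - (below + equal))     # records that must be moved down to <= o
    need_high = max(0, (m + 1) - (above + equal))    # records that must be moved up to >= o
    sens_above = sum(1 for v, s in zip(values, secret_flags) if v > o and s == 1)
    sens_below = sum(1 for v, s in zip(values, secret_flags) if v < o and s == 1)
    if need_low > sens_above or need_high > sens_below:
        return math.inf                              # not enough sensitive records on that side
    return max(need_low, need_high)

def ss_exponential_median(values, secret_flags, epsilon, rng=None):
    """Illustrative secret-specification exponential mechanism for a median query."""
    rng = rng or np.random.default_rng()
    candidates = sorted(set(values))                 # illustrative median output space O
    scores = np.array([epsilon * min_sensitive_changes(values, secret_flags, o)
                       for o in candidates])
    weights = np.exp(-scores / 2.0)                  # fewer required changes -> higher weight
    probs = weights / weights.sum()
    return rng.choice(candidates, p=probs)

# Hypothetical usage on a toy "annual income" attribute with made-up secret specifications.
incomes = [32000, 41000, 47000, 52000, 58000, 64000, 90000]
flags = [1, 0, 0, 1, 0, 1, 0]
print(ss_exponential_median(incomes, flags, epsilon=1.0))

Because only sensitive records may be altered when forming D*, candidates reachable only by rewriting non-sensitive records receive zero probability in this sketch; whether the claimed method handles that case in the same way is an assumption of the illustration.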
CN202311416556.4A 2023-10-27 2023-10-27 Data query protection method of personalized privacy protection mechanism based on secret specification Active CN117235800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311416556.4A CN117235800B (en) 2023-10-27 2023-10-27 Data query protection method of personalized privacy protection mechanism based on secret specification


Publications (2)

Publication Number Publication Date
CN117235800A CN117235800A (en) 2023-12-15
CN117235800B true CN117235800B (en) 2024-05-28

Family

ID=89082737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311416556.4A Active CN117235800B (en) 2023-10-27 2023-10-27 Data query protection method of personalized privacy protection mechanism based on secret specification

Country Status (1)

Country Link
CN (1) CN117235800B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704491A (en) * 2019-09-30 2020-01-17 京东城市(北京)数字科技有限公司 Data query method and device
CN111177213A (en) * 2019-12-16 2020-05-19 北京淇瑀信息科技有限公司 Privacy cluster self-service query platform and method and electronic equipment
CN114328640A (en) * 2021-02-07 2022-04-12 湖南科技学院 Differential privacy protection and data mining method and system based on mobile user dynamic sensitive data
CN116541883A (en) * 2023-05-10 2023-08-04 重庆大学 Trust-based differential privacy protection method, device, equipment and storage medium
CN116611101A (en) * 2023-03-03 2023-08-18 广州大学 Differential privacy track data protection method based on interactive query

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1873676B1 (en) * 2005-03-25 2013-11-20 Panasonic Corporation Program converting device, secure processing device, computer program, and recording medium
GB201908442D0 (en) * 2019-06-12 2019-07-24 Privitar Ltd Lens Platform 2

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Federated Recommendation System Based on Local Differential Privacy Clustering; Hu, Chunqiang et al.; IEEE SmartWorld, Ubiquitous Intelligence and Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Internet of People, and Smart City Innovations (SmartWorld/ScalCom/UIC/ATC/IoP/SCI); 2021-01-01; 364-369 *
Hierarchical data publishing mechanism under differential privacy protection; Zhang Wenjing, Li Hui; Chinese Journal of Network and Information Security; 2015-12-15 (01); 62-69 *
Research on secret sharing theory and related applications; Hu Chunqiang; China Doctoral Dissertations Electronic Journals Database; 2014-02-15; I136-16 *

Similar Documents

Publication Publication Date Title
Soria-Comas et al. Enhancing data utility in differential privacy via microaggregation-based k-anonymity
US20220277097A1 (en) Method or system for querying a sensitive dataset
Prasser et al. Putting statistical disclosure control into practice: The ARX data anonymization tool
US20190065775A1 (en) Calculating differentially private queries using local sensitivity on time variant databases
Sánchez et al. Utility-preserving differentially private data releases via individual ranking microaggregation
US20070233711A1 (en) Method and apparatus for privacy preserving data mining by restricting attribute choice
US11853329B2 (en) Metadata classification
CN108885673B (en) System and method for computing data privacy-utility tradeoffs
CN110555316A (en) privacy protection table data sharing algorithm based on cluster anonymity
Gal et al. A data recipient centered de-identification method to retain statistical attributes
Li et al. Digression and value concatenation to enable privacy-preserving regression
Gkoulalas-Divanis et al. Anonymization of electronic medical records to support clinical analysis
Abidi et al. Hybrid microaggregation for privacy preserving data mining
Carvalho et al. Survey on privacy-preserving techniques for data publishing
Bewong et al. A relative privacy model for effective privacy preservation in transactional data
CN108959956B (en) Differential privacy data publishing method based on Bayesian network
Zhang et al. Differential privacy medical data publishing method based on attribute correlation
CN113743496A (en) K-anonymous data processing method and system based on cluster mapping
CN117235800B (en) Data query protection method of personalized privacy protection mechanism based on secret specification
Podlesny et al. Attribute compartmentation and greedy UCC discovery for high-dimensional data anonymization
US20230359769A1 (en) Systems and Methods for Anonymizing Large Scale Datasets
Saranya et al. Privacy-preserving data publishing based on sanitized probability matrix using transactional graph for improving the security in medical environment
Shen et al. Friendship links-based privacy-preserving algorithm against inference attacks
Mohammed et al. Complementing privacy and utility trade-off with self-organising maps
Vadrevu et al. A hybrid approach for personal differential privacy preservation in homogeneous and heterogeneous health data sharing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant