CN108363928B - Adaptive Differential Privacy Protection Method in Linked Medical Data


Info

Publication number
CN108363928B
CN108363928B (application CN201810129671.6A)
Authority
CN
China
Prior art keywords
attribute
user
medical
sensitive
privacy protection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810129671.6A
Other languages
Chinese (zh)
Other versions
CN108363928A (en)
Inventor
李先贤
罗春枫
王利娥
刘鹏
于东然
赵华兴
唐雨薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201810129671.6A priority Critical patent/CN108363928B/en
Publication of CN108363928A publication Critical patent/CN108363928A/en
Application granted granted Critical
Publication of CN108363928B publication Critical patent/CN108363928B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an adaptive differential privacy protection method in associated medical data. To address the privacy problems caused by correlations between attributes, the correlation between the quasi-identifiers and the sensitive attribute is analyzed using rough set theory, and a local differential privacy method is proposed to protect the quasi-identifiers; compared with traditional differential privacy, this preserves data utility. To address the privacy leakage caused by correlations between records, and the data-utility problem caused by the differing sensitivity of different diseases, a global differential privacy protection method is proposed to protect the sensitive attribute. The invention can effectively improve the security of privacy protection for medical data.

Description

Adaptive differential privacy protection method in associated medical data
Technical Field
The invention relates to the technical field of data privacy protection, and in particular to an adaptive differential privacy protection method in associated medical data.
Background
Advances in information technology have promoted the development of electronic medical records: more and more medical institutions adopt electronic medical record systems, generating massive medical data, which is analyzed and mined to provide evidence for clinical decision support, clinical pathway optimization, personalized medicine, and other applications. Because these data contain a large amount of sensitive information, privacy-preserving data mining and privacy-preserving data publishing have received much attention. However, electronic medical data exhibits complex association characteristics, including associations between records (such as inheritance and complications) and between attributes (such as age and disease), which make privacy protection for the records more difficult. Kifer et al. first pointed out in 2011 that if correlations between records are ignored, differential privacy does not provide a sufficient privacy guarantee, because attackers can exploit these correlations to improve their inferences about a target. Correlations between attributes matter as well: a smoker is more likely to have bronchitis, so the smoking attribute also needs protection, because it increases the probability that an attacker can deduce whether a patient has bronchitis.
In existing privacy protection research there is much work on the privacy protection of medical data, but the privacy protection of associated medical data remains largely unexplored. Because the associations in medical data increase an attacker's success rate in deducing that a given person has a given disease, it is necessary to take these associations into account, which raises several main privacy challenges:
(1) Electronic medical data has multi-dimensional attributes, redundant records, high sensitivity, and similar characteristics; directly applying the mainstream k-anonymity model, which was designed for relational data, leads to high information loss and low data utility.
(2) Associations between records, such as genetic and complication relations, greatly increase an attacker's background knowledge, so directly applying the other mainstream privacy model, differential privacy, cannot meet the expected privacy requirements.
(3) Because there are also associations between quasi-identifier attributes and the sensitive attribute, privacy protection operations should be performed on the associated quasi-identifier attributes in addition to the sensitive attribute; how to preserve data utility while protecting the quasi-identifier attributes is also an open challenge.
Among current privacy protection methods for the medical industry, apart from privacy protection algorithms for genome-associated data, privacy protection techniques for electronic medical data mainly apply k-anonymity, l-diversity, ρ-uncertainty, differential privacy, and their extensions, but do not consider the associations within the data. Since correlations in electronic medical data do exist, ignoring them greatly increases an attacker's background knowledge and leads to privacy disclosure. Recent studies on the privacy protection of associated data are mainly: the Bayesian differential privacy of Yang et al., which determines the privacy budget ε by defining Bayesian differential privacy leakage; and the correlated differential privacy of Zhu et al., which determines the magnitude of the added noise by computing the correlated sensitivity. However, because of the characteristics of electronic medical data, these privacy protection methods for associated data cannot be directly applied to associated medical data.
Disclosure of Invention
The invention aims to solve the problem that existing privacy protection methods for associated data cannot be directly applied to associated medical data, and provides an adaptive differential privacy protection method in associated medical data.
In order to solve this problem, the invention is realized by the following technical scheme:
The adaptive differential privacy protection method in associated medical data comprises the following steps:
Step 1. A user issues a query request against the original medical data set.
Step 2. Determine whether the user is querying a quasi-identifier attribute or the sensitive attribute. If a quasi-identifier attribute is queried, go to step 3A and adopt the local differential privacy protection strategy; if the sensitive attribute is queried, go to step 3B and adopt the global differential privacy protection strategy.
Step 3A. Local differential privacy protection strategy:
Step 3A-1. Use rough set theory to perform a correlation analysis between the quasi-identifiers and the sensitive attribute, determining whether each quasi-identifier attribute in the original medical data set is related to the sensitive attribute.
Step 3A-2. According to the quasi-identifier attribute queried by the user, classify the medical records in the original medical data set that satisfy the query request: if a record satisfying the query request contains a quasi-identifier attribute related to the sensitive attribute, place it in the data subset that needs protection; otherwise, place it in the data subset that does not need protection.
Step 3A-3. Add Laplace noise to the medical records in the subset that needs protection before returning them to the user; return the records in the subset that does not need protection to the user directly.
Step 3B. Global differential privacy protection strategy:
Step 3B-1. Compute the correlated sensitivity $CS^Q$ of the medical records in the original medical data set that satisfy the query request.
Step 3B-2. Determine the privacy budget $\varepsilon_i$ of each sensitive attribute value from the given total privacy budget.
Step 3B-3. Add $\mathrm{Lap}(CS^Q/\varepsilon_i)$ noise to the query result over the medical records that satisfy the query request, then return it to the user.
Compared with the prior art, the invention applies different privacy protection strategies according to the type of the queried attribute. For quasi-identifier attributes, the correlation between the quasi-identifiers and the sensitive attribute is analyzed using rough set theory, and local differential privacy is applied to the quasi-identifier attributes according to that correlation, which improves data utility. For the sensitive attribute, the correlations between records are considered, and correlated differential privacy is used to compute the correlated sensitivity $CS^Q$ of the sensitive attribute; different privacy budgets $\varepsilon_i$ are then allocated to the sensitive attribute values, and $\mathrm{Lap}(CS^Q/\varepsilon_i)$ Laplace noise is added to statistical queries, achieving adaptive differential privacy. The invention can effectively improve the security of privacy protection for medical data.
Drawings
FIG. 1 is a flow chart of the adaptive differential privacy protection method in associated medical data.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the invention is further described in detail below with reference to the accompanying drawings and specific examples.
The research of the invention addresses interactive counting queries: a user issues a query request to the system, and the system adds noise to the objects that need protection before returning the query result. Because the interactive query operates on the original medical data set, the invention adopts different protection strategies for the quasi-identifier attributes and the sensitive attribute: for quasi-identifier attributes, a local differential privacy protection strategy; for the sensitive attribute, a global (adaptive) differential privacy protection strategy.
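As an illustration only, and not the patent's reference implementation, the following minimal Python sketch shows this step-2 dispatch between the two strategies for a counting query; the attribute names, function signature, and default correlated sensitivity are hypothetical:

```python
import numpy as np

# Hypothetical attribute sets for the data of Table 1; "disease" is sensitive.
QUASI_IDENTIFIERS = {"sex", "age", "body_temperature"}
SENSITIVE = {"disease"}

def answer_count_query(attribute, true_count, protected, epsilon, corr_sensitivity=1.0):
    """Dispatch of step 2: return a noisy count according to the attribute type."""
    if attribute in QUASI_IDENTIFIERS:
        # Local strategy (step 3A): add noise only when the matching records
        # contain quasi-identifiers correlated with the sensitive attribute.
        if not protected:
            return float(true_count)
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    if attribute in SENSITIVE:
        # Global strategy (step 3B): scale the noise by the correlated sensitivity.
        return true_count + np.random.laplace(loc=0.0, scale=corr_sensitivity / epsilon)
    raise ValueError(f"unknown attribute: {attribute}")

print(answer_count_query("age", true_count=3, protected=True, epsilon=0.5))
```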
Referring to fig. 1, a method for adaptive differential privacy protection in associated medical data specifically includes the following steps:
step 1: the user makes a query request. In the original medical data set shown in table 1, the name is ID, gender, age, and body temperature are quasi-identifier attributes, and diseases are sensitive attributes.
No.  Name    Sex  Age  Body temperature  Disease
1    Bob     F    25   High              Influenza
2    Alice   F    8    Normal            Cancer
3    Mike    M    35   Normal            Heart disease
4    Lonia   M    21   High              Cancer
5    Jasper  F    13   Normal            Heart disease
6    Jake    F    41   Very high         Influenza
7    Linda   M    56   High              Cancer
8    Helen   F    60   Normal            Influenza
9    David   M    37   Very high         Heart disease

Table 1: Original table data
Step 2: Determine whether the user is querying a quasi-identifier attribute or the sensitive attribute. If a quasi-identifier attribute is queried, the local differential privacy protection strategy for quasi-identifier attributes is adopted; if the sensitive attribute is queried, the global differential privacy protection strategy for the sensitive attribute is adopted.
(1) Local differential privacy protection strategy for quasi-identifier attributes:
Step 3: Although quasi-identifier attributes are non-sensitive, some of them increase the probability of leaking the sensitive attribute, and such quasi-identifier attributes should be regarded as sensitive. For example, user Bob has influenza, and the age attribute is unrelated to influenza, so Bob's age attribute is non-sensitive; user Alice has cancer, and the age attribute is highly related to cancer, so Alice's age attribute is sensitive. Therefore, for each disease in the original table data, rough set theory is used to analyze the correlation between the quasi-identifier attributes and the sensitive attribute, to obtain which quasi-identifier attributes are related to each disease. For example, to answer the query "how many people aged 20-40 are among the first 3 medical records of Table 1", we first analyze, according to the disease each user has, whose age attribute is sensitive. Starting with the 1st medical record, whose user Bob has influenza, rough set theory determines whether the age attribute is related to influenza; the 2nd and 3rd medical records are handled in the same way. The specific procedure is as follows:
step 3.1: some of the attributes in the raw medical data set are defined according to rough set theory. In rough set theory, the quasi-identifier attribute is called a conditional attribute and the sensitive attribute is called a decision attribute. In the present embodiment, for the disease influenza in the 1 st medical record, the user set in the original table is U ═ e1,e2,e3,…,u9The condition attribute set is C ═ gender, age, body temperature, the decision attribute is D ═ influenza, let C be C ═ influenza }1Gender ═ C2Age, C3Body temperature. The following steps are all processed around the attribute of the flu to find out which sensitive attributes are related to the flu.
Step 3.2: Classify each attribute. In this example, sex divides into two classes (male, female); age into three classes (0-20, 21-40, 41-60); body temperature into three classes (normal, high, very high); and disease into two classes (has influenza, does not have influenza). This yields:
U/C1 = {{e3, e4, e7, e9}, {e1, e2, e5, e6, e8}},
U/C2 = {{e2, e5}, {e1, e3, e4, e9}, {e6, e7, e8}},
U/C3 = {{e2, e3, e5, e8}, {e1, e4, e7}, {e6, e9}},
U/D = {{e1, e6, e8}, {e2, e3, e4, e5, e7, e9}}.
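For illustration, a minimal Python sketch (hypothetical encoding and helper names, not part of the claimed method) that reproduces these equivalence classes from Table 1:

```python
from collections import defaultdict

# Table 1 encoded by hand (age buckets and temperature levels per step 3.2);
# the record keys e1..e9 follow the embodiment's numbering.
records = {
    "e1": ("F", "21-40", "high",      "influenza"),
    "e2": ("F", "0-20",  "normal",    "cancer"),
    "e3": ("M", "21-40", "normal",    "heart disease"),
    "e4": ("M", "21-40", "high",      "cancer"),
    "e5": ("F", "0-20",  "normal",    "heart disease"),
    "e6": ("F", "41-60", "very high", "influenza"),
    "e7": ("M", "41-60", "high",      "cancer"),
    "e8": ("F", "41-60", "normal",    "influenza"),
    "e9": ("M", "21-40", "very high", "heart disease"),
}

def partition(column):
    """Equivalence classes induced by one conditional attribute column."""
    classes = defaultdict(set)
    for rid, row in records.items():
        classes[row[column]].add(rid)
    return list(classes.values())

U_C1, U_C2, U_C3 = partition(0), partition(1), partition(2)
# Decision attribute D = {influenza}: has influenza vs. does not.
U_D = [{r for r, v in records.items() if v[3] == "influenza"},
       {r for r, v in records.items() if v[3] != "influenza"}]
print(U_C2)  # [{'e2','e5'}, {'e1','e3','e4','e9'}, {'e6','e7','e8'}] in some order
```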
Step 3.3: Obtain the knowledge base. The partition obtained by intersecting the three classifications is U/C = {{e1}, {e2}, {e3, e4, e9}, {e5}, {e7}, {e6, e8}}, and the knowledge base consists of U/C together with the unions of sets in U/C. For example: in U/C1, {e1, e2, e5, e6, e8} denotes the women, and in U/C2, {e2, e5} denotes the people aged 0-20; the intersection of these two sets, {e2, e5}, denotes the women aged 0-20, which constitutes one piece of knowledge.
Step 3.4: Delete the age attribute and obtain another knowledge base in the same way, using only the remaining conditional attributes.
Step 3.5: Determine whether the age attribute is related to influenza. In the original medical data set, the set of records with influenza is {e1, e6, e8}. If in both knowledge bases the upper and lower approximations of {e1, e6, e8} both equal {e1, e6, e8}, then the age attribute is irrelevant to the sensitive attribute influenza; otherwise it is relevant.
Upper and lower approximations are core concepts of rough set theory. Some sets cannot be obtained exactly as unions of sets in the knowledge base, no matter how those sets are combined, which is what motivates the two approximations: the lower approximation of a set is the union of all knowledge-base sets contained in it, and the upper approximation is the union of all knowledge-base sets that intersect it. For example, the set {e1, e6, e7} cannot be obtained exactly from the knowledge base; its lower approximation is {e6, e7} and its upper approximation is {e1, e2, e6, e7}.
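A minimal Python sketch of the two approximations and the relevance test of step 3.5, for illustration only; the knowledge bases passed in are hypothetical partitions, and in the method they would come from steps 3.3 and 3.4:

```python
def lower_approximation(target, blocks):
    """Union of the knowledge-base sets fully contained in the target set."""
    return set().union(*[b for b in blocks if b <= target])

def upper_approximation(target, blocks):
    """Union of the knowledge-base sets that intersect the target set."""
    return set().union(*[b for b in blocks if b & target])

def attribute_relevant(disease_set, base_with_attr, base_without_attr):
    """Step 3.5: the attribute is irrelevant iff, in both knowledge bases, the
    lower and upper approximations of the disease set equal the set itself."""
    for base in (base_with_attr, base_without_attr):
        if lower_approximation(disease_set, base) != disease_set:
            return True
        if upper_approximation(disease_set, base) != disease_set:
            return True
    return False

# Hypothetical usage with the influenza set {e1, e6, e8} and two partitions.
flu = {"e1", "e6", "e8"}
base_a = [{"e1"}, {"e2", "e5"}, {"e3", "e4", "e9"}, {"e6", "e8"}, {"e7"}]
base_b = [{"e1", "e2"}, {"e3", "e4", "e5", "e9"}, {"e6", "e7", "e8"}]
print(attribute_relevant(flu, base_a, base_b))  # True: base_b cannot express the flu set
```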
Step 4: According to the quasi-identifier attribute queried by the user, divide the original medical data set into two parts: the data subset that needs protection, whose medical records contain a quasi-identifier attribute related to the sensitive attribute, and the data subset that does not need protection, whose records contain no quasi-identifier attribute related to the sensitive attribute.
For example, for the ages 20-40 query over the first 3 medical records of Table 1: as shown above, the age of the 1st record is not sensitive, while the ages of the 2nd and 3rd records are sensitive, so the 1st record falls into the subset that does not need protection, and the 2nd and 3rd records fall into the subset that needs protection.
Step 5: Local noise addition. Add Laplace noise to the query result over the set that needs protection; add no noise to the set that does not need protection.
The Laplace distribution has two parameters: a scale parameter, determined by the query sensitivity and the privacy budget and equal to the query sensitivity divided by the privacy budget; and a location parameter, which defaults to 0. Here the query sensitivity of a count query is 1, and the privacy budget is given.
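For illustration, a minimal Python sketch of this local noise addition under the stated assumptions (count query, sensitivity 1); the function and variable names are hypothetical:

```python
import numpy as np

def local_dp_count(matching_ids, protected_ids, epsilon):
    """Step 5: count query where only the protected subset is noised; the
    sensitivity of a count query is 1, so the Laplace scale is 1 / epsilon."""
    protected = [r for r in matching_ids if r in protected_ids]
    exact_part = len(matching_ids) - len(protected)  # no noise needed here
    noisy_part = len(protected) + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return exact_part + noisy_part

# The ages 20-40 query over the first three records of Table 1, classified as
# in step 4: e1 needs no protection, while e2 and e3 do.
print(local_dp_count(["e1", "e2", "e3"], {"e2", "e3"}, epsilon=0.5))
```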
Step 6: Return the count result to the user.
(2) Global differential privacy protection strategy for the sensitive attribute:
Step 3: Obtain the correlation matrix. Obtaining the degree of correlation means analyzing the degree of correlation between the records. This can be done in various ways based on background knowledge and data characteristics; the most typical case is that the correlation matrix is already known as background knowledge.
For example: a user queries how many people in the table have heart disease. From background knowledge, the degrees of correlation among the people with heart disease are known, yielding a correlation matrix Δ with entries δij ∈ Δ, which has the following properties:
1) δij = δji: the association between two records is independent of their order;
2) the diagonal elements are 1: each record is fully correlated with itself;
3) a threshold δ0 eliminates weak correlations: only entries with |δij| ≥ δ0 are kept in Δ; if |δij| < δ0, then δij is set to 0;
4) only some of the records are correlated with each other.
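A minimal Python sketch of property 3), the thresholding of weak correlations, for illustration only; the example matrix values are hypothetical:

```python
import numpy as np

def threshold_correlation(delta, delta0):
    """Zero out weak correlations: keep delta_ij only when |delta_ij| >= delta0.
    The matrix is symmetric (delta_ij = delta_ji) with ones on the diagonal."""
    out = np.where(np.abs(delta) >= delta0, delta, 0.0)
    np.fill_diagonal(out, 1.0)
    return out

# A hypothetical 3-record correlation matrix, thresholded at delta0 = 0.3.
delta = np.array([[1.0, 0.5, 0.1],
                  [0.5, 1.0, 0.4],
                  [0.1, 0.4, 1.0]])
print(threshold_correlation(delta, 0.3))
```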
Step 4: Compute the correlated sensitivity.
Step 4.1: Compute the record sensitivity of each record.
For a correlation matrix Δ and a query Q, the record sensitivity of the i-th record is:

$$CS_i = \sum_{j=1}^{n} \delta_{ij} \left\| Q(D_j) - Q(D_{-j}) \right\|_1$$

where $Q(D_j)$ denotes the query over the data set D, $Q(D_{-j})$ denotes the query over the data set that differs from D by the deletion of record j, n is the number of records in D, and $\delta_{ij} \in \Delta$ is the degree of correlation between the i-th and j-th records. The record sensitivity represents the impact of the current record $r_i$ on all records in the data set; it combines the number of correlated records with their degree of correlation. When the records in D are mutually independent, the record sensitivity $CS_i$ equals the global sensitivity.
Step 4.2: Compute the correlated sensitivity over all records.
The correlated sensitivity is the maximum of all record sensitivities. For a query Q:

$$CS^Q = \max_{i \in q} CS_i$$

where q is the set of records involved in query Q.
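For illustration, a minimal Python sketch of steps 4.1 and 4.2 for a count query, where deleting any single record changes the count by 1, so each $\|Q(D_j) - Q(D_{-j})\|_1$ term equals 1; the correlation values are hypothetical:

```python
import numpy as np

def record_sensitivity(i, delta, query_diffs):
    """CS_i = sum_j delta_ij * |Q(D_j) - Q(D_-j)| (step 4.1)."""
    return float(np.sum(np.abs(delta[i]) * query_diffs))

def correlated_sensitivity(delta, query_diffs):
    """CS^Q = max_i CS_i over the records involved in the query (step 4.2)."""
    return max(record_sensitivity(i, delta, query_diffs)
               for i in range(delta.shape[0]))

# For a count query, deleting any single record changes the count by 1.
delta = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])
print(correlated_sensitivity(delta, query_diffs=np.ones(3)))  # 1.9 (middle record)
```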
Step 5: Compute the privacy budget $\varepsilon_i$ allocated to each disease.
Although all sensitive attribute values need protection, adding the same noise to every disease, i.e., allocating the same privacy budget, would over-protect some diseases while leaving others insufficiently protected. We therefore allocate the privacy budget according to the distribution of the diseases: for the disease attribute, we have reason to believe that the more frequently a disease occurs, the lower its sensitivity. Thus, assuming a total privacy budget ε, a data set of size n, and a frequency of occurrence $m_i$ for each disease, the privacy budget allocated to the i-th disease is:

$$\varepsilon_i = \frac{m_i}{n}\,\varepsilon$$

where $m_i$ is the frequency of occurrence of the i-th sensitive attribute value in the original medical data set, n is the number of records in the original medical data set, and ε is the total privacy budget.
It can be seen that the more frequently a disease occurs, the larger its allocated privacy budget. This is because a higher frequency of occurrence means lower sensitivity, and by the definition of differential privacy, a larger privacy budget means weaker privacy protection and less added noise.
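A minimal Python sketch of this budget allocation, using the disease frequencies of Table 1 (3 occurrences each of influenza, cancer, and heart disease); illustration only:

```python
def allocate_budgets(disease_counts, total_epsilon):
    """epsilon_i = (m_i / n) * epsilon: more frequent (less sensitive) diseases
    receive a larger share of the total budget and therefore less noise."""
    n = sum(disease_counts.values())
    return {d: (m / n) * total_epsilon for d, m in disease_counts.items()}

budgets = allocate_budgets({"influenza": 3, "cancer": 3, "heart disease": 3},
                           total_epsilon=1.0)
print(budgets)  # each disease receives epsilon_i = 1/3
```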
Step 6: Add Laplace noise to the exact count result.
From the correlated sensitivity $CS^Q$ computed in step 4 and the privacy budget $\varepsilon_i$ computed in step 5, the added noise is:

$$\mathrm{Lap}\!\left(\frac{CS^Q}{\varepsilon_i}\right)$$
and 7: and returning the counting result to the user.
For example: we query how many of the 9 medical records have heart disease, compute the correlated sensitivity $CS^Q$ as in step 4 and the privacy budget $\varepsilon_i$ as in step 5, and by step 6 obtain the count query result $Q(D) + \mathrm{Lap}(CS^Q/\varepsilon_i)$, as shown in Table 2.

[Table 2 (count query results) is reproduced only as an image in the original document.]
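For illustration, a minimal Python sketch of the noisy count of step 6; the $CS^Q$ and $\varepsilon_i$ values are placeholders standing in for the outputs of steps 4 and 5, not the embodiment's actual numbers:

```python
import numpy as np

def global_dp_count(true_count, corr_sensitivity, epsilon_i):
    """Step 6: noisy answer Q(D) + Lap(CS^Q / epsilon_i)."""
    return true_count + np.random.laplace(loc=0.0, scale=corr_sensitivity / epsilon_i)

# Heart-disease count over Table 1 (true count 3); CS^Q and epsilon_i here
# are illustrative placeholders.
print(global_dp_count(true_count=3, corr_sensitivity=1.9, epsilon_i=1.0 / 3))
```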
Since associations within electronic medical data do exist, it is necessary to take them into account when implementing privacy protection. For the privacy problem caused by associations between attributes, the correlation between the quasi-identifiers and the sensitive attribute is analyzed using rough set theory, and a local differential privacy method is proposed to protect the quasi-identifiers, preserving data utility compared with traditional differential privacy. For the privacy leakage caused by associations between records, and the data-utility problem caused by the differing sensitivity of different diseases, an adaptive (global) differential privacy protection method is proposed to protect the sensitive attribute.
It should be noted that although the above embodiments of the invention are illustrative, the invention is not limited to them. Other embodiments that those skilled in the art can derive from the teachings of the invention without departing from its principles are considered to fall within the scope of the invention.

Claims (1)

1. An adaptive differential privacy protection method in associated medical data, characterized by comprising the following steps:
Step 1. A user issues a query request against the original medical data set;
Step 2. Determine whether the user is querying a quasi-identifier attribute or the sensitive attribute; if a quasi-identifier attribute is queried, go to step 3A and adopt the local differential privacy protection strategy; if the sensitive attribute is queried, go to step 3B and adopt the global differential privacy protection strategy;
Step 3A. Local differential privacy protection strategy:
Step 3A-1. Use rough set theory to perform a correlation analysis between the quasi-identifiers and the sensitive attribute, determining whether each quasi-identifier attribute in the original medical data set is related to the sensitive attribute;
Step 3A-2. According to the quasi-identifier attribute queried by the user, classify the medical records in the original medical data set that satisfy the query request: if a record satisfying the query request contains a quasi-identifier attribute related to the sensitive attribute, place it in the data subset that needs protection; otherwise, place it in the data subset that does not need protection;
Step 3A-3. Add Laplace noise to the medical records in the subset that needs protection before returning them to the user; return the records in the subset that does not need protection to the user directly;
Step 3B. Global differential privacy protection strategy:
Step 3B-1. Compute the correlated sensitivity $CS^Q$ of the medical records in the original medical data set that satisfy the query request;
Step 3B-2. Determine the privacy budget $\varepsilon_i$ of each sensitive attribute value from the given total privacy budget;
Step 3B-3. Add $\mathrm{Lap}(CS^Q/\varepsilon_i)$ noise to the query result over the medical records in the original medical data set that satisfy the query request, then return it to the user.
CN201810129671.6A 2018-02-08 2018-02-08 Adaptive Differential Privacy Protection Method in Linked Medical Data Active CN108363928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810129671.6A CN108363928B (en) 2018-02-08 2018-02-08 Adaptive Differential Privacy Protection Method in Linked Medical Data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810129671.6A CN108363928B (en) 2018-02-08 2018-02-08 Adaptive Differential Privacy Protection Method in Linked Medical Data

Publications (2)

Publication Number Publication Date
CN108363928A CN108363928A (en) 2018-08-03
CN108363928B true CN108363928B (en) 2021-08-03

Family

ID=63005171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810129671.6A Active CN108363928B (en) 2018-02-08 2018-02-08 Adaptive Differential Privacy Protection Method in Linked Medical Data

Country Status (1)

Country Link
CN (1) CN108363928B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020046398A1 (en) * 2018-08-31 2020-03-05 Google Llc Privacy-first on-device federated health modeling and intervention
CN110968887B (en) * 2018-09-28 2022-04-05 第四范式(北京)技术有限公司 Method and system for executing machine learning under data privacy protection
CN111177521B (en) * 2018-10-24 2025-05-06 北京搜狗科技发展有限公司 A method and device for determining a query term classification model
CN109388972A (en) * 2018-10-29 2019-02-26 山东科技大学 Medical data Singular variance difference method for secret protection based on OPTICS cluster
US11647041B2 (en) * 2019-04-08 2023-05-09 United States Of America As Represented By The Secretary Of The Air Force System and method for privacy preservation in cyber threat
CN110188567B (en) * 2019-05-23 2022-12-20 复旦大学 An Associative Access Control Method Against Sensitive Data Mosaic
CN110472437B (en) * 2019-07-29 2023-07-04 上海电力大学 A Cycle Sensitivity Differential Privacy Protection Method for Consumer Electricity Data
CN110889141B (en) * 2019-12-11 2022-02-08 百度在线网络技术(北京)有限公司 Data distribution map privacy processing method and device and electronic equipment
CN111079179A (en) * 2019-12-16 2020-04-28 北京天融信网络安全技术有限公司 Data processing method and device, electronic equipment and readable storage medium
CN111797428B (en) * 2020-06-08 2024-02-27 武汉大学 Medical self-correlation time sequence data differential privacy release method
CN114386083A (en) * 2020-10-22 2022-04-22 阿里巴巴集团控股有限公司 Budget processing method and device based on privacy protection
CN112329069B (en) * 2020-11-30 2022-05-03 海南大学 User differential privacy protection method across data, information and knowledge modalities
CN112749407A (en) * 2020-12-18 2021-05-04 广东精点数据科技股份有限公司 Data desensitization device based on medical data
CN113553363B (en) * 2021-09-23 2021-12-14 支付宝(杭州)信息技术有限公司 Query processing method and device
CN113742781B (en) * 2021-09-24 2024-04-05 湖北工业大学 K anonymous clustering privacy protection method, system, computer equipment and terminal
CN115587385A (en) * 2022-10-09 2023-01-10 九有技术(深圳)有限公司 Data desensitization method combining localized differential privacy and centralized differential privacy


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727534A (en) * 2008-10-30 2010-06-09 北大方正集团有限公司 Patient document retrieval authorization control method and system
CN103533049A (en) * 2013-10-14 2014-01-22 无锡中盛医疗设备有限公司 Electronic privacy information protection system for intelligent medical care
CN105608389A (en) * 2015-10-22 2016-05-25 广西师范大学 Differential privacy protection method of medical data dissemination

Also Published As

Publication number Publication date
CN108363928A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN108363928B (en) Adaptive Differential Privacy Protection Method in Linked Medical Data
Lee et al. Utility-preserving anonymization for health data publishing
Poulis et al. Apriori-based algorithms for km-anonymizing trajectory data
Loukides et al. Disassociation for electronic health record privacy
Poulis et al. Anonymizing data with relational and transaction attributes
CN104809408B (en) A kind of histogram dissemination method based on difference privacy
CN106021541B (en) Quadratic k-anonymous privacy-preserving algorithm for distinguishing quasi-identifier attributes
Anjum et al. An efficient approach for publishing microdata for multiple sensitive attributes
CN107358113A (en) Based on the anonymous difference method for secret protection of micro- aggregation
Poulis et al. Distance-based k^ m-anonymization of trajectory data
CN110348238B (en) Application-oriented privacy protection classification method and device
Gkoulalas-Divanis et al. Utility-guided clustering-based transaction data anonymization.
CN107292195A (en) The anonymous method for secret protection of k divided based on density
CN112632612B (en) Medical data publishing anonymization method
Ruggieri et al. Anti-discrimination analysis using privacy attack strategies
Srijayanthi et al. Design of privacy preserving model based on clustering involved anonymization along with feature selection
Loukides et al. Utility-aware anonymization of diagnosis codes
CN116186757A (en) A utility-enhanced conditional feature selection method for differentially private data publishing
Bewong et al. A relative privacy model for effective privacy preservation in transactional data
CN112131608B (en) Classification tree differential privacy protection method meeting LKC model
Hemkumar et al. Impact of prior knowledge on privacy leakage in trajectory data publishing
Li et al. A generalization model for multi-record privacy preservation
Oishi et al. Algorithm to satisfy l‐diversity by combining dummy records and grouping
Jafer et al. Privacy-aware filter-based feature selection
Kanwal et al. Fuzz-classification (p, l)-Angel: An enhanced hybrid artificial intelligence based fuzzy logic for multiple sensitive attributes against privacy breaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20180803

Assignee: Guangxi Huoxin Intelligent Technology Co.,Ltd.

Assignor: Guangxi Normal University

Contract record no.: X2024980031008

Denomination of invention: Adaptive differential privacy protection method in associated medical data

Granted publication date: 20210803

License type: Common License

Record date: 20241208

Application publication date: 20180803

Assignee: Guilin Huoyun Bianduan Technology Co.,Ltd.

Assignor: Guangxi Normal University

Contract record no.: X2024980030994

Denomination of invention: Adaptive differential privacy protection method in associated medical data

Granted publication date: 20210803

License type: Common License

Record date: 20241205

Application publication date: 20180803

Assignee: Guilin Baijude Intelligent Technology Co.,Ltd.

Assignor: Guangxi Normal University

Contract record no.: X2024980030975

Denomination of invention: Adaptive differential privacy protection method in associated medical data

Granted publication date: 20210803

License type: Common License

Record date: 20241208
