CN117521117A - Medical data application security and privacy protection method and system

Info

Publication number
CN117521117A
CN117521117A (Application No. CN202410013983.6A)
Authority
CN
China
Prior art keywords
data
data set
differential privacy
privacy protection
trust level
Prior art date
Legal status
Pending
Application number
CN202410013983.6A
Other languages
Chinese (zh)
Inventor
鲜湛
贺昕
曾柏霖
张海滨
Current Assignee
Shenzhen Wanhaisi Digital Medical Co ltd
Original Assignee
Shenzhen Wanhaisi Digital Medical Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Wanhaisi Digital Medical Co ltd filed Critical Shenzhen Wanhaisi Digital Medical Co ltd
Priority to CN202410013983.6A
Publication of CN117521117A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G06F 21/70 Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
    • G06F 21/78 Protecting specific internal or peripheral components to assure secure storage of data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a medical data application security and privacy protection method and system. The method comprises: acquiring original medical data and desensitizing it; defining trust levels and the differential privacy budget parameter corresponding to each trust level; and constructing, for each trust level, a corresponding publishable data set protected by differential privacy, and publishing it. Constructing the publishable data set comprises: determining the differential privacy budget parameter corresponding to the current trust level; normalizing the desensitized medical data to obtain a sample data set; classifying the sample data set by characteristic attribute based on a decision tree, and determining the sub data set and weight corresponding to each characteristic attribute class; allocating a privacy budget to each sub data set according to its weight, based on the differential privacy budget parameter of the current trust level, and adding noise of the corresponding influence level to obtain the data set after differential privacy protection; and determining the differentially private data set that passes evaluation as the publishable data set.

Description

Medical data application security and privacy protection method and system
Technical Field
The present disclosure relates to the field of data protection technologies, and in particular, to a method and system for protecting security and privacy of medical data application.
Background
The development of medical technology and Internet technology has expanded the informatization of medical data, and various huge medical data sets have become important resources. Privacy protection of diagnosis and treatment data is therefore essential to prevent the disclosure of private information about platform enterprises and users, and it plays an important role in the acquisition, analysis and exploitation of medical big data.
Existing approaches reduce noise-induced errors through methods such as mathematical regression analysis, data distortion adjustment and noise scale parameter adjustment, so as to improve data usability. Privacy protection techniques based on anonymity are suited to closed scenarios with few participants and carry a large risk of privacy exposure in large-scale data sharing; moreover, these schemes do not consider that, when querying users with different authorities and reputation levels issue related queries against data of different sensitivities, returning the same query result can leak private information. When a novel attack backed by strong background knowledge appears, an anonymity-based privacy protection model must be continuously adjusted and refined to resist it, and the effectiveness of its privacy protection is hard to prove. The encryption techniques commonly used in encryption-based privacy protection (e.g., homomorphic encryption and secret sharing) only protect the confidentiality of the data and typically incur significant computational overhead and energy consumption.
By comparison, the differential privacy technique fully accounts for a powerful attacker model and provides a strict, provable privacy guarantee. It quantifies privacy, has low computation and transmission cost, does not rely on assumptions about background knowledge, does not greatly affect the usability of the data set, and therefore addresses the shortcomings of anonymization-based privacy protection algorithms well; it also has flexible composition properties. Most existing differential privacy protection algorithms oriented to classification are developed on the ID3 and C4.5 decision tree algorithms. To reduce the added noise as much as possible, they apply an exponential mechanism, design scoring strategies to adaptively allocate the privacy budget, and subdivide continuous data; these measures improve performance to some extent, but the algorithms still suffer from insufficient use of the privacy budget and heavy noise interference at leaf nodes.
Disclosure of Invention
In order to overcome, at least to some extent, the problems in the related art that, when medical data is protected with differential privacy, the privacy budget is used inefficiently and leaf nodes suffer heavy noise interference, the application provides a medical data application security and privacy protection method and system.
The scheme of the application is as follows:
according to a first aspect of embodiments of the present application, there is provided a medical data application security and privacy protection method, including:
acquiring original medical data, and performing desensitization treatment on sensitive information in the original medical data;
defining a trust level and a differential privacy budget parameter corresponding to the trust level;
constructing a publishable data set which corresponds to each trust level and is subjected to differential privacy protection;
publishing the publishable data set corresponding to each trust level and the corresponding differential privacy budget parameter;
the method for constructing the publishable data set after differential privacy protection corresponding to each trust level comprises the following steps:
determining differential privacy budget parameters corresponding to the current trust level;
normalization processing is carried out on the medical data subjected to desensitization processing, so that a sample data set is obtained;
classifying the sample data set according to the characteristic attribute based on a decision tree, and determining a sub data set and a weight corresponding to each characteristic attribute classification;
based on the differential privacy budget parameters corresponding to the current trust level, privacy budgets are distributed to each sub-data set based on the weight, noise with corresponding influence level is added, and the data set after differential privacy protection is obtained; the greater the weight of the sub-data set, the smaller the corresponding noise impact level;
and evaluating the data set subjected to differential privacy protection, and determining the data set as the issuable data set after the evaluation is passed.
Preferably, the method further comprises:
when a data query request is received, confirming the trust level corresponding to the visitor account;
searching and matching are carried out among the publishable data sets corresponding to the trust level of the visitor account, and if the matching succeeds, the matched publishable data set is provided to the visitor account.
Preferably, defining the trust level and the differential privacy budget parameter corresponding to the trust level includes:
the higher the defined trust level, the larger the corresponding differential privacy budget parameter.
Preferably, the normalization processing is performed on the medical data after the desensitization processing, including:
performing zero centering treatment on the medical data subjected to desensitization treatment;
compressing the medical data after zero centralization treatment to a specified range; the specified range is related to sensitivity in differential privacy protection, and the specified range is related to differential privacy algorithm classification accuracy assessment test model.
Preferably, evaluating the data set after differential privacy protection includes:
the sample data set and the data set after differential privacy protection are each used as a training data set and input into a deep neural network to train a DNN (Deep Neural Network) classifier, obtaining two DNN classification models;
determining whether the use accuracy of the data set subjected to differential privacy protection meets a data availability threshold value or not by comparing classification results of the two DNN classification models;
and evaluating whether the classification accuracy of the data set subjected to differential privacy protection meets a classification accuracy threshold or not through a differential privacy algorithm classification accuracy evaluation test model.
Preferably, the method further comprises:
and when the evaluation fails, iteratively optimizing the distribution of the privacy budget parameters among the sub-data sets, and re-evaluating until the evaluation passes.
Preferably, classifying the sample data set according to the characteristic attribute based on a decision tree, and determining a sub data set and a weight corresponding to each characteristic attribute classification, including:
constructing a CART (Classification And Regression Tree) decision tree as a classifier, selecting continuous-feature splitting points with an exponential mechanism during tree construction, selecting splitting features with the Gini index, and invoking the exponential mechanism only once per iteration.
Preferably, the sensitivity in differential privacy protection is the maximum difference in query results obtained for two data sets that differ for only one record.
Preferably, if the match fails, a notification message of the query failure is returned to the visitor account.
According to a second aspect of embodiments of the present application, there is provided a medical data application security and privacy protection system comprising:
a processor and a memory;
the processor is connected with the memory through a communication bus:
the processor is used for calling and executing the program stored in the memory;
the memory is used for storing a program at least for executing a medical data application security and privacy protection method as described in any one of the above.
The technical scheme provided by the application can have the following beneficial effects. A medical data application security and privacy protection method comprises: acquiring original medical data and desensitizing the sensitive information in it; defining trust levels and the differential privacy budget parameter corresponding to each trust level; constructing, for each trust level, a corresponding publishable data set protected by differential privacy; and publishing the publishable data set and the corresponding differential privacy budget parameter for each trust level. Constructing the publishable data set for each trust level comprises: determining the differential privacy budget parameter corresponding to the current trust level; normalizing the desensitized medical data to obtain a sample data set; classifying the sample data set by characteristic attribute based on a decision tree, and determining the sub data set and weight corresponding to each characteristic attribute class; allocating a privacy budget to each sub data set according to its weight, based on the differential privacy budget parameter of the current trust level, and adding noise of the corresponding influence level to obtain the data set after differential privacy protection, where the greater the weight of a sub data set, the smaller its noise influence level; and evaluating the data set after differential privacy protection, determining it as the publishable data set once the evaluation passes. In this method, trust levels are classified and different differential privacy budget parameters are allocated to different trust levels, so the publishable data sets corresponding to different trust levels differ in privacy and data availability. Within each trust level, the privacy budget is allocated to each sub data set of the sample data according to its weight: the larger the weight of a sub data set, the smaller the noise added to it. This further reduces the negative effect of differential privacy on data availability, achieves a better balance between data availability and privacy, uses the privacy budget more fully, and leaves leaf nodes less disturbed by noise.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a method for medical data application security and privacy protection provided in one embodiment of the present application;
FIG. 2 is a schematic flow chart of constructing a differential privacy-protected issuable data set corresponding to each trust level in a medical data application security and privacy protection method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a medical data application security and privacy protection system according to an embodiment of the present application.
Reference numerals: a processor-21; and a memory 22.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
Fig. 1 is a schematic flow chart of a method for protecting security and privacy of medical data application according to an embodiment of the present application, and referring to fig. 1, a method for protecting security and privacy of medical data application includes:
s11: acquiring original medical data, and performing desensitization treatment on sensitive information in the original medical data;
Data sets usually need to be desensitized before release. Common desensitization methods, such as deleting identifier attributes like name and ID, can protect personal privacy to some extent, but are easily broken when the data set is not large. Note that the data set also contains other attributes, such as gender, nationality and age, called quasi-identifier attributes, which an attacker can exploit to mount a linkage attack. In this embodiment, a k-anonymity algorithm is used to desensitize the sensitive information in the original medical data, preventing an attacker from mounting a linkage attack through the quasi-identifier attributes.
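The sketch below illustrates this kind of desensitization step on a tabular data set; it is a minimal example assuming a pandas DataFrame with hypothetical column names (name, id_number, age), an illustrative age-banding generalization rule and a default k value, not the exact k-anonymity algorithm of the embodiment.

```python
import pandas as pd

def desensitize(df: pd.DataFrame, quasi_identifiers: list, k: int = 5) -> pd.DataFrame:
    """Drop direct identifiers, coarsen quasi-identifiers, and suppress records
    whose quasi-identifier combination appears fewer than k times (k-anonymity)."""
    out = df.drop(columns=["name", "id_number"], errors="ignore")   # direct identifiers (hypothetical names)
    if "age" in out.columns:                                        # illustrative generalization rule
        band = (out["age"] // 10) * 10
        out["age"] = band.astype(int).astype(str) + "-" + (band + 9).astype(int).astype(str)
    sizes = out.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return out[sizes >= k].reset_index(drop=True)

# Example: desensitize(raw_df, quasi_identifiers=["gender", "nationality", "age"], k=5)
```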
S12: defining a trust level and a differential privacy budget parameter corresponding to the trust level;
the higher the defined trust level, the larger the corresponding differential privacy budget parameter.
Defining hierarchical access rights and trust levels for different visitors in the platform; and respectively setting initial privacy budget parameters to achieve differential privacy protection effects.
For example, three trust levels are defined from high to low according to the type of user participating in the platform, with corresponding privacy budget parameters ε1, ε2 and ε3; the higher the level, the higher the trust. From the nature of the differential privacy noise mechanism, the smaller the privacy budget parameter, the harder it is for an attacker to obtain detailed record information. A smaller privacy budget satisfies the purpose of differential privacy protection, but data availability decreases with it, possibly even leaving the published data without value. Selecting the privacy budget parameter is therefore the key to ensuring the effectiveness of a differential privacy algorithm. Thus, the higher the defined trust level, the larger the corresponding differential privacy budget parameter.
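As a minimal illustration of this trust-level-to-budget mapping, the sketch below hard-codes three levels; the concrete ε values and the helper name are assumptions for the example, not values prescribed by the application.

```python
# Hypothetical mapping: a higher trust level gets a larger privacy budget epsilon,
# i.e. less noise and higher data availability for more trusted users.
TRUST_LEVEL_EPSILON = {
    1: 0.1,   # lowest trust level, smallest budget (example value)
    2: 0.5,
    3: 1.0,   # highest trust level, largest budget (example value)
}

def epsilon_for(trust_level: int) -> float:
    """Return the differential privacy budget parameter assigned to a trust level."""
    return TRUST_LEVEL_EPSILON[trust_level]
```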
S13: constructing a publishable data set which corresponds to each trust level and is subjected to differential privacy protection;
s14: issuing a issuable data set corresponding to each trust level and a corresponding differential privacy budget parameter;
wherein, constructing the publishable data set after differential privacy protection corresponding to each trust level, referring to fig. 2, includes:
s131: determining differential privacy budget parameters corresponding to the current trust level;
In this embodiment, the publishable data sets protected by differential privacy corresponding to each trust level are constructed one by one. Before construction, the differential privacy budget parameter corresponding to the current trust level must be determined; for example, the privacy budget parameter corresponding to the current trust level 1 is ε1.
S132: normalization processing is carried out on the medical data subjected to desensitization processing, so that a sample data set is obtained; comprising the following steps:
performing zero centering treatment on the medical data subjected to desensitization treatment;
compressing the medical data after zero centralization treatment to a specified range; the specified range is related to sensitivity in differential privacy protection, and the specified range is related to differential privacy algorithm classification accuracy assessment test model.
In view of the non-centralized distribution of medical data and its overly large variance, the z-score method and the min-max method are combined: the medical data are first zero-centered and then compressed with the min-max method into a specified range r. The variable r is related to the sensitivity in differential privacy protection; limiting r limits the sensitivity and thereby controls the influence of differential privacy protection on data usability. The size of r is also related to the differential privacy algorithm classification accuracy evaluation test model, and the r giving the best test accuracy can be found by training that evaluation model.
Before normalization, the local sensitivities differ from one another. After normalization the data are distributed more tightly and the global sensitivity and all local sensitivities are unified to r, which improves data quality and greatly reduces the computation required by differential privacy protection. For two adjacent data sets that differ in only one record, the sensitivity reaches its maximum, the value r, only when the differing records take the maximum and minimum values respectively; hence the sensitivity Δf ≤ r.
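The sketch below shows this zero-centering plus min-max compression into [0, r]; the function name and the default r are illustrative assumptions.

```python
import numpy as np

def normalize_to_range(x: np.ndarray, r: float = 1.0) -> np.ndarray:
    """Zero-center each column (z-score style), then min-max compress it into
    [0, r], so that the contribution of any single record is bounded by r."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    centered = (x - x.mean(axis=0)) / np.where(std > 0, std, 1.0)   # z-score zero-centering
    col_min, col_max = centered.min(axis=0), centered.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)      # avoid divide-by-zero
    return (centered - col_min) / span * r                          # min-max compression to [0, r]
```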
S133: classifying the sample data set according to the characteristic attribute based on the decision tree, and determining the sub data set and the weight corresponding to each characteristic attribute classification;
s134: based on the differential privacy budget parameters corresponding to the current trust level, privacy budgets are distributed to each sub-data set based on the weight, noise with corresponding influence level is added, and the data set after differential privacy protection is obtained; the greater the weight of the sub-data set, the smaller the corresponding noise impact level;
common access attacks are link attacks, skew attacks, background knowledge attacks, and the like. However, if excessive noise is applied to an attribute that affects classification in a particular way, the usability of the data may be suddenly reduced, which may be reflected by the accuracy of the machine-learned test. Thus, if these attributes, which have a great influence on the classification result, can be identified and relatively gentle noise is applied to these attributes, the usability of the data will be ensured. Important features can be extracted by both feature selection and feature extraction.
In the feature extraction method, most of the existing methods can perform data conversion and dimension reduction operations, and the processed data can be used as training data of a machine learning model for classification training, but lose important functions such as statistical analysis and data analysis. There is therefore a need for methods that circumvent such data conversion and dimension reduction.
In this embodiment, attribute classifications requiring noise disturbance increase are screened by training a CART decision tree, and weights required to be allocated to each attribute classification are calculated. By constructing the CART decision tree and observing the corresponding attribute of each node, the important attributes can be clearly analyzed, and then the specific weight value corresponding to each attribute can be obtained by initializing the decision tree. In order to realize that different levels of noise are added to different attributes by differential privacy protection, each attribute weight is combined with the differential privacy protection, and differential variance noise adding is performed, so that relatively gentle noise is applied to important attributes to reduce the negative influence of the differential privacy protection on the usability of data. The method is simple and quick.
In the embodiment, a CART decision tree is constructed as a classifier, a continuous feature splitting point is selected by using an exponential mechanism in the tree construction process, splitting features are selected by using Gini exponents, and the exponential mechanism is only called once in the iteration process, so that the whole algorithm ensures the full utilization of privacy budget.
The CART decision tree uses the Gini index to select partitioning attributes and can be used for both classification and regression tasks.
Gini index refers to the probability that two samples are randomly drawn from a dataset with inconsistent class labels, the smaller the Gini index value, the higher the purity of the dataset.
The attribute weights are initialized as follows: the sample data set is used as training data for a CART decision tree and the trained final tree is output; the depth of the tree is computed and denoted d; the attribute at the root node is assigned weight d-1, the attributes of all nodes on the second layer weight d-2, and so on, decrementing layer by layer; finally, the total weight of each attribute is computed and assigned.
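A sketch of this weight initialization is shown below, using a fitted scikit-learn DecisionTreeClassifier as a stand-in for the CART tree; the helper name and the normalization of the weights are assumptions.

```python
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def attribute_weights(clf: DecisionTreeClassifier) -> dict:
    """Weight each split attribute by layer: d-1 at the root, d-2 on the second
    layer, decreasing layer by layer; weights of the same attribute are summed."""
    tree, weights = clf.tree_, defaultdict(float)
    d = clf.get_depth() + 1                       # number of layers

    def visit(node: int, level: int) -> None:
        if tree.children_left[node] == -1:        # leaf node: no split attribute
            return
        weights[int(tree.feature[node])] += d - 1 - level
        visit(tree.children_left[node], level + 1)
        visit(tree.children_right[node], level + 1)

    visit(0, 0)
    total = sum(weights.values()) or 1.0
    return {f: w / total for f, w in weights.items()}    # normalized attribute weights

# Example: clf = DecisionTreeClassifier(criterion="gini").fit(X, y); w = attribute_weights(clf)
```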
Specifically, the detailed procedure for adding differential privacy noise to the sample data set is as follows.
Algorithm: Build_DP_CART_Tree(D(i), A, ε, d)
Input: the desensitized medical health data set to be published D(i) = {(x1, y1), (x2, y2), ..., (xn, yn)}; attribute set A; tree depth d; privacy budget ε.
Output: the data set D' after adding noise and the CART classifier model.
1) REPEAT
2) Determine the privacy budget εi available to the current node.
3) Select the splitting attribute A from the attribute set with a random subspace algorithm according to the Gini index minimization principle.
4) If the node reaches a termination condition, make it a leaf node; add Laplace noise to its instance count, with noise amount Lap(Δf/εi); then classify the current leaf node and add noise to each class count, with noise amount Lap(Δf/εi).
5) If the node does not reach a termination condition, split the current node, with added noise amount Lap(Δf/εi). If the subspace contains n continuous attributes, a per-attribute budget is specified and each continuous-attribute split point s is selected with probability proportional to |S_s| · exp(εi · q(D, s) / (2Δq)), where q is the utility function, Δq is the global sensitivity of the utility function, and |S_s| is the size of the corresponding interval set.
6) Split the data set D(i) into two subsets Dl(i) and Dr(i).
7) Recursively call Build_DP_CART_Tree to build the left and right subtrees:
tl = Build_DP_CART_Tree(Dl(i), A, ε, d+1);
tr = Build_DP_CART_Tree(Dr(i), A, ε, d+1);
8) UNTIL a node reaches a termination condition: all record labels are identical, the maximum depth is reached, or the privacy budget is exhausted.
9) Save the model parameters, and return and output the noise-added data set.
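For intuition, the following is a simplified, illustrative sketch of a differentially private CART-style tree in the spirit of the procedure above: the Gini index chooses splits, Laplace noise perturbs leaf counts, and the budget is divided evenly over the levels. It omits the random-subspace attribute selection and the exponential-mechanism choice of continuous split points, and all function names and the budget-splitting rule are assumptions, not the exact claimed algorithm.

```python
import numpy as np

rng = np.random.default_rng()

def laplace_noise(sensitivity: float, eps: float) -> float:
    return rng.laplace(loc=0.0, scale=sensitivity / eps)

def gini(y: np.ndarray) -> float:
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def noisy_leaf(y: np.ndarray, eps_node: float) -> dict:
    labels, counts = np.unique(y, return_counts=True)
    noisy = {l: c + laplace_noise(1.0, eps_node) for l, c in zip(labels, counts)}
    return {"leaf": max(noisy, key=noisy.get), "noisy_counts": noisy}

def build_dp_cart(X: np.ndarray, y: np.ndarray, eps: float, depth: int, max_depth: int) -> dict:
    """Return a nested-dict tree; eps is divided evenly over the tree levels."""
    eps_node = eps / (max_depth + 1)                       # per-level budget (simplification)
    if depth >= max_depth or len(np.unique(y)) == 1 or len(y) < 2:
        return noisy_leaf(y, eps_node)
    best = None                                            # (weighted Gini, feature, threshold, mask)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left)
    if best is None:
        return noisy_leaf(y, eps_node)
    _, j, t, left = best
    return {"feature": j, "threshold": float(t),
            "left": build_dp_cart(X[left], y[left], eps, depth + 1, max_depth),
            "right": build_dp_cart(X[~left], y[~left], eps, depth + 1, max_depth)}

# Example: tree = build_dp_cart(X_sample, y_sample, eps=1.0, depth=0, max_depth=4)
```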
S135: and evaluating the data set subjected to differential privacy protection, and determining the data set as the issuable data set after the evaluation is passed.
And when the evaluation fails, iteratively optimizing the distribution of the privacy budget parameters among the sub-data sets, and re-evaluating until the evaluation passes.
Evaluating the data set after differential privacy protection, comprising:
the sample data set and the data set subjected to differential privacy protection are used as training data sets to be input into a deep neural network to train a DNN classifier, so that two DNN classification models are obtained;
determining whether the use accuracy of the data set subjected to differential privacy protection meets a data availability threshold value or not by comparing classification results of the two DNN classification models;
and evaluating whether the classification accuracy of the data set subjected to differential privacy protection meets a classification accuracy threshold or not through a differential privacy algorithm classification accuracy evaluation test model.
In the embodiment, the availability of the data set subjected to differential privacy protection is evaluated by using the DNN classification model, so that controllable sensitive data privacy and availability protection are realized, malicious attackers can be effectively prevented from acquiring user privacy information through information inquiry, and meanwhile, the effective utilization rate of release data can be greatly improved on the premise of reducing privacy disclosure.
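As a sketch of this availability evaluation, the snippet below trains one classifier on the original sample data and one on the noise-added data and compares their accuracy on a held-out split; scikit-learn's MLPClassifier stands in for the DNN, and the threshold value is an assumption.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def availability_ok(X_orig, y_orig, X_dp, y_dp, threshold: float = 0.05) -> bool:
    """Pass when the accuracy drop of the model trained on the differentially
    private data, relative to the model trained on the original data, is within threshold."""
    Xtr, Xte, ytr, yte = train_test_split(X_orig, y_orig, test_size=0.3, random_state=0)
    acc_orig = MLPClassifier(max_iter=500, random_state=0).fit(Xtr, ytr).score(Xte, yte)
    acc_dp = MLPClassifier(max_iter=500, random_state=0).fit(X_dp, y_dp).score(Xte, yte)
    return (acc_orig - acc_dp) <= threshold
```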
It should be noted that the method further includes:
when a data query request is received, confirming the trust level corresponding to the visitor account;
searching and matching are carried out on the corresponding issueable data sets according to the trust level corresponding to the visitor account, and if matching is successful, the matched issueable data sets are provided for the visitor account.
If the matching is unsuccessful, a notification message of failed inquiry is returned to the visitor account.
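An illustrative lookup of the publishable data set by the visitor's trust level is sketched below; the shape of the account store and of the returned message are assumptions.

```python
def handle_query(account_id: str, trust_levels: dict, published: dict) -> dict:
    """Return the publishable data set matching the visitor's trust level,
    or a query-failure notice when no match exists."""
    level = trust_levels.get(account_id)            # trust level of the visitor account
    if level is not None and level in published:
        return {"status": "ok", "trust_level": level, "data": published[level]}
    return {"status": "failed", "message": "query failed: no publishable data set for this account"}
```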
It can be understood that the technical scheme in the embodiment simultaneously meets the data security requirements of two types of scenes of classified release and online inquiry of sensitive data.
It can be understood that all the operations of querying, data mining and the like of the querier with different trust levels are performed on the issuable data sets corresponding to the respective trust levels, so that the privacy information in the original data is ensured not to be revealed.
It should be noted that, the continuous development of information technology and medical health informatization makes medical data emerge on a large scale, provides conditions for deeper applications such as data analysis, data mining, intelligent diagnosis, etc., has a huge medical data set and involves a great deal of user privacy, and how to protect patient privacy while using medical data is very challenging. Privacy protection technology applied to the medical field at present mainly takes anonymization technology as a main principle, but when an attacker has strong background knowledge, the method cannot consider the privacy and the usability of a data set. The system applies differential privacy protection technology, solves the problems of data security and privacy disclosure during platform enterprises and personal health medical data release, statistical data release and interactive query in a remote medical platform, introduces differential privacy in a specific data application process, protects sensitive information and enables release data to exert the maximum usability of the release data.
As the remote medical platform's operating capability improves and its service forms keep expanding, more data security problems arise: the platform integrates multi-level medical service resources into novel health and medical big data about the staff of enterprises resident on the platform, which increases the complexity of data management, control, analysis and publication. Platform data security and privacy protection is therefore a link that the development of intelligent medicine and intelligent health cannot ignore. The generation and use of massive data and big data analysis pose a serious threat to the privacy of users' sensitive data, and leakage of such data seriously threatens the information security of the data owner. The platform therefore needs to protect the privacy of user data while providing data of the highest possible availability during data query, publication and sharing.
Medical data is peculiar in that the medical data set typically contains many patient privacy information, such as the business to which the user belongs, location information of the visit, medical diagnosis results, prescription information, inspection reports, etc. The medical data set is huge, complex and redundant due to the diversity of the detection indexes and the frequent detection times. Furthermore, the correlation of partial signs of a person correlates changes in the sign data, for example, as the blood pressure of a person increases, the heart rate increases in general. These characteristics all conform to the characteristics of the set value data. The set value data is composed of an identification and a data combination, and the unique identification of the patient in the medical data corresponds to a group of clinical sign data modes. The data quantity in the medical large data set can be effectively reduced by converting a plurality of pieces of record data into a form of set value data for storage, and meanwhile, the data with more close relation can be interfered to the same extent. By converting the medical data into a set value data set and performing data normalization processing, the variance of the data is reduced, so that the medical health data is distributed more compactly, and the data quality is improved. Meanwhile, the differential privacy and normalization are combined, so that normalization is a process of multiple adjustment, and the differential privacy protection is added to the normalized data, and the data availability change is observed, so that the optimal normalization range is found, the sensitivity of the differential privacy protection is reduced, and the negative influence of the differential privacy protection on the data availability is reduced.
The technical scheme in this embodiment is different from the traditional differential privacy protection, in this technical scheme, trust levels are firstly classified, and different differential privacy budget parameters are allocated for different trust levels, so that the privacy and the data availability of the issuable data sets corresponding to different trust levels are different. And each sub-data set of the sample data in each trust level is distributed with privacy budget based on the weight, the larger the weight of the sub-data set is, the smaller the noise added to the sub-data set is, so that the negative influence of differential privacy on the data availability is further reduced, the data availability and the privacy can reach a better balance, and the use efficiency of the privacy budget is more sufficient.
The differential privacy-related concept is explained:
1) Differential privacy definition
Let D be a data set sampled from a finite domain Z, with n records and attribute dimension d. A query function f: D → R^d maps the data set D to R^d; F = {f1, f2, ...} is a set of query functions. Algorithm K is a random perturbation algorithm used to perturb the results of the query operations F so that the output satisfies the requirement of differential privacy; this process is called a differential privacy protection mechanism.
Differential privacy protection is independent of the background knowledge of an attacker, has a strict statistical model, and can provide quantifiable privacy guarantee. The data with sensitive attribute can reach the distortion effect by randomly adding noise, but certain data and the data attribute thereof can be kept unchanged, and any data can not influence the output result by adding or deleting certain data to the data set at will, so that the privacy can be protected.
Assume two data sets D and D'; their symmetric difference is the set DΔD'. If the set DΔD' contains exactly one element, i.e. |DΔD'| = 1, then D and D' are called adjacent data sets. For example, for the sets D = {1, 2, 3, 4, 5} and D' = {1, 2, 3, 4}, DΔD' = {5} and |DΔD'| = 1, so by the definition D and D' are adjacent data sets.
2) Privacy preserving budgets
The privacy protection budget is denoted ε, and Pr[·] denotes a probability. Two data sets D and D' are said to be adjacent if at most one record differs between them. A random function K satisfies ε-differential privacy protection if, for any two adjacent data sets D and D' and any subset S of the output range of K,
Pr[K(D) ∈ S] ≤ e^ε · Pr[K(D') ∈ S].
The random function K maps a deterministic query result into an uncertain probability interval through random perturbation, thereby protecting the private data. Furthermore, K adjusts the probability variation range of the output through the privacy budget ε, so that when one record in the data set D changes, the maximum ratio between the probabilities of outputting the same result is e^ε.
3) Sensitivity to
The key parameter to add noise level is determined by the sensitivity of the function. Sensitivity refers to the maximum difference in query results obtained for two data sets that differ in terms of one record only. Differential privacy protection is achieved by adding appropriate noise, the magnitude of which depends on sensitivity, i.e. the maximum query difference for two different sets of recorded data.
Global sensitivity: for any query function f: D → R^d and any pair of adjacent data sets D1 and D2, the sensitivity of the function f is
Δf = max ‖f(D1) - f(D2)‖1,
where R denotes the mapped real space, d denotes the query dimension of f, and ‖f(D1) - f(D2)‖1 is the 1-norm distance between f(D1) and f(D2).
4) Noise mechanism
An essential condition for implementing differential privacy is the noise mechanism; the most common mechanisms are Laplace noise and Gaussian noise. For the privacy protection of large published data, ε-differential privacy protection with the Laplace noise mechanism is adopted. In the Laplace mechanism, the random function K adds noise to the true value of the query function f to protect privacy; the noise-adding formula is
K(D) = f(D) + Lap(Δf/ε).
In this formula, Lap(Δf/ε) is noise drawn from a Laplace distribution with scale parameter b = Δf/ε, mean 0 and standard deviation √2·b, where Δf denotes the sensitivity of the function f.
According to the Laplace probability density function, the mean of the noise is 0 and its variance is 2b². When b is small, the noise-added data stay close to the location parameter μ; when b is relatively large, the noise-added data approach a uniform distribution.
After Lap(Δf/ε) noise is added to the function value f(D), the probability that the result f(D) + Lap(Δf/ε) falls within [f(D) + μ - L, f(D) + μ + L] is 1 - e^(-Lε/Δf), where μ is the location parameter of the probability density and L changes the width of the interval around it; L can also be given a specific value.
Setting the location parameter μ = 0 and L = 0.5, the probability p that an attacker successfully queries the true result from the published data set can be expressed as
p = 1 - e^(-ε/(2Δf)).
Therefore the upper limit for selecting the privacy budget ε only needs to satisfy
ε ≤ -2Δf · ln(1 - p).
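A minimal sketch of the Laplace mechanism defined above, adding noise of scale Δf/ε to a (possibly vector-valued) query answer; the function name is an assumption.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity: float, eps: float, rng=None):
    """K(D) = f(D) + Lap(sensitivity / eps), applied element-wise to the query answer."""
    rng = rng or np.random.default_rng()
    value = np.asarray(true_value, dtype=float)
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps, size=value.shape)

# Example: a noisy count query with sensitivity 1 and budget eps = 0.5
# noisy_count = laplace_mechanism(true_value=128, sensitivity=1.0, eps=0.5)
```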
the output result after adding random perturbations is generally approximately correct for numeric data, while exponential mechanisms are employed to add random perturbations for enumeration or boolean data. The exponential mechanism uses a quality scoring function to score the quality of all possible output results, with the higher the score the greater the likelihood of output results. The index mechanism outputs the result in a probability form, and details of the query result are not displayed, so that the privacy of the data is protected. Meanwhile, the quality scoring function only amplifies the probability of a result which is possibly output, so that a real result is obtained.
Exponential mechanism: let the random function K output an entity object r ∈ R on the input data set d, let f(d, r) be the quality scoring function, and let Δf denote the sensitivity of f(d, r). If the random function K selects and outputs r from the output result set R with probability proportional to exp(ε·f(d, r)/(2Δf)), then the algorithm K satisfies ε-differential privacy protection.
Here r denotes any output, and d and d' are two data sets with at most one differing record. As the definition of the exponential mechanism shows, the score of each candidate enters the output probability exponentially, so candidates with high scores are output with higher probability.
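The sketch below samples one candidate with probability proportional to exp(ε·score/(2Δf)), as in the definition above; the scoring values passed in and the function name are placeholders.

```python
import numpy as np

def exponential_mechanism(candidates, scores, eps: float, sensitivity: float, rng=None):
    """Sample one candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(eps * (scores - scores.max()) / (2.0 * sensitivity))   # shift scores for numerical stability
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Example: exponential_mechanism(["A", "B", "C"], scores=[0.9, 0.5, 0.1], eps=1.0, sensitivity=1.0)
```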
(5) Metrics for differential privacy preserving algorithms
The availability of data and the degree of privacy protection should be regarded as the main evaluation criteria of the differential privacy protection algorithm, and the criteria for specifically measuring the differential privacy protection algorithm can be considered from the following three aspects:
algorithm error: the Euclidean distance, absolute error, relative error, and variance of the error can all be used to measure the error.
Algorithm performance: the temporal complexity and progressive noise error boundaries are generally considered as criteria for performance evaluation.
Allocation of the privacy budget parameter ε: for a static data set, reasonably allocating the privacy budget parameter ε yields better privacy protection while preserving data usability; for a dynamic data stream, the growth of the data causes the budget to accumulate through composition and eventually be exhausted, so privacy protection loses its significance.
6) Correlation properties of differential privacy
Sequential composability: let algorithms M1, M2, ..., Mn have privacy budgets ε1, ε2, ..., εn respectively. For the same data set D, the combined algorithm M(M1(D), M2(D), ..., Mn(D)) provides (ε1 + ε2 + ... + εn)-differential privacy protection.
Parallel composability: let algorithms M1, M2, ..., Mn have privacy budgets ε1, ε2, ..., εn respectively. For pairwise disjoint data sets D1, D2, ..., Dn, the combined algorithm M(M1(D1), M2(D2), ..., Mn(Dn)) provides (max_i εi)-differential privacy protection.
It should be noted that, the technical scheme in this embodiment does not need to perform discretization preprocessing on the data, so as to reduce the consumption of the performance of the classification system. Under the condition that the requirement of differential privacy protection is met, the optimal parameter domain is analyzed, the higher classification accuracy is kept, and the privacy of the released data set is ensured.
Example two
Fig. 3 is a schematic structural diagram of a medical data application security and privacy protection system according to an embodiment of the present application, and referring to fig. 3, a medical data application security and privacy protection system includes:
a processor 21 and a memory 22;
the processor 21 is connected to the memory 22 via a communication bus:
wherein the processor 21 is used for calling and executing the program stored in the memory 22;
a memory 22 for storing a program for executing at least one medical data application security and privacy protection method as in the above embodiments.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (10)

1. A medical data application security and privacy protection method, comprising:
acquiring original medical data, and performing desensitization treatment on sensitive information in the original medical data;
defining a trust level and a differential privacy budget parameter corresponding to the trust level;
constructing a publishable data set which corresponds to each trust level and is subjected to differential privacy protection;
publishing the publishable data set corresponding to each trust level and the corresponding differential privacy budget parameter;
the method for constructing the publishable data set after differential privacy protection corresponding to each trust level comprises the following steps:
determining differential privacy budget parameters corresponding to the current trust level;
normalization processing is carried out on the medical data subjected to desensitization processing, so that a sample data set is obtained;
classifying the sample data set according to the characteristic attribute based on a decision tree, and determining a sub data set and a weight corresponding to each characteristic attribute classification;
based on the differential privacy budget parameters corresponding to the current trust level, privacy budgets are distributed to each sub-data set based on the weight, noise with corresponding influence level is added, and the data set after differential privacy protection is obtained; the greater the weight of the sub-data set, the smaller the corresponding noise impact level;
and evaluating the data set subjected to differential privacy protection, and determining the data set as the issuable data set after the evaluation is passed.
2. The method according to claim 1, wherein the method further comprises:
when a data query request is received, confirming the trust level corresponding to the visitor account;
searching and matching are carried out among the publishable data sets corresponding to the trust level of the visitor account, and if the matching succeeds, the matched publishable data set is provided to the visitor account.
3. The method of claim 1, wherein defining the trust level and the differential privacy budget parameter corresponding to the trust level comprises:
the higher the defined trust level, the larger the corresponding differential privacy budget parameter.
4. The method of claim 1, wherein normalizing the desensitized medical data comprises:
performing zero centering treatment on the medical data subjected to desensitization treatment;
compressing the medical data after zero centralization treatment to a specified range; the specified range is related to sensitivity in differential privacy protection, and the specified range is related to differential privacy algorithm classification accuracy assessment test model.
5. The method of claim 1, wherein evaluating the differential privacy-protected data set comprises:
the sample data set and the data set subjected to differential privacy protection are used as training data sets to be input into a deep neural network to train a DNN classifier, so that two DNN classification models are obtained;
determining whether the use accuracy of the data set subjected to differential privacy protection meets a data availability threshold value or not by comparing classification results of the two DNN classification models;
and evaluating whether the classification accuracy of the data set subjected to differential privacy protection meets a classification accuracy threshold or not through a differential privacy algorithm classification accuracy evaluation test model.
6. The method according to claim 1, wherein the method further comprises:
and when the evaluation fails, iteratively optimizing the distribution of the privacy budget parameters among the sub-data sets, and re-evaluating until the evaluation passes.
7. The method of claim 1, wherein classifying the sample data set according to feature attributes based on a decision tree and determining a sub-data set and weight for each feature attribute classification comprises:
constructing a CART decision tree as a classifier, selecting continuous feature splitting points by using an exponential mechanism in the tree construction process, selecting splitting features by using Gini exponents, and calling the exponential mechanism only once in the iteration process.
8. The method of claim 4, wherein the sensitivity in differential privacy protection is the maximum difference in query results obtained for two different sets of data for only one record.
9. The method of claim 2, wherein if the matching is unsuccessful, returning a notification message to the visitor account that the query failed.
10. A medical data application security and privacy protection system, comprising:
a processor and a memory;
the processor is connected with the memory through a communication bus:
the processor is used for calling and executing the program stored in the memory;
the memory for storing a program for performing at least one medical data application security and privacy protection method according to any one of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410013983.6A CN117521117A (en) 2024-01-05 2024-01-05 Medical data application security and privacy protection method and system

Publications (1)

Publication Number Publication Date
CN117521117A true CN117521117A (en) 2024-02-06

Family

ID=89755323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410013983.6A Pending CN117521117A (en) 2024-01-05 2024-01-05 Medical data application security and privacy protection method and system

Country Status (1)

Country Link
CN (1) CN117521117A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892357A (en) * 2024-03-15 2024-04-16 大连优冠网络科技有限责任公司 Energy big data sharing and distribution risk control method based on differential privacy protection
CN117892357B (en) * 2024-03-15 2024-05-31 国网河南省电力公司经济技术研究院 Energy big data sharing and distribution risk control method based on differential privacy protection

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726758A (en) * 2018-12-28 2019-05-07 辽宁工业大学 A kind of data fusion publication algorithm based on difference privacy
CN110097119A (en) * 2019-04-30 2019-08-06 西安理工大学 Difference secret protection support vector machine classifier algorithm based on dual variable disturbance
CN111027090A (en) * 2018-10-18 2020-04-17 山东科技大学 Medical data privacy protection method based on heteroscedastic difference and K-anonymity mechanism


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination