CN116595584A - Physical medicine data fusion privacy protection method based on cloud and fog architecture longitudinal federal learning - Google Patents
- Publication number
- CN116595584A (application CN202310565639.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- privacy
- model
- parameters
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/40—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mechanical, radiation or invasive therapies, e.g. surgery, laser therapy, dialysis or acupuncture
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/20—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Business, Economics & Management (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- Primary Health Care (AREA)
- Bioethics (AREA)
- Surgery (AREA)
- Urology & Nephrology (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a physical and medical data fusion privacy protection method based on longitudinal federal learning (also known as vertical federated learning) under a cloud and fog architecture, and aims to solve the problems of data independence and privacy between exercise guidance centers and hospitals in cloud and fog architecture research. The method uses differential privacy technology to protect the privacy of data and models, and uses a central server to aggregate the model parameters, thereby avoiding the risks of data disclosure and model privacy leakage. The method is divided into three stages: (1) a data preprocessing stage, in which relevant features are selected from the data sets of the exercise guidance center and the hospital and processed; (2) a longitudinal federal learning model stage, in which a specific neural network architecture is adopted, individual data is protected through a differential privacy mechanism, the processed data is transmitted to the central server for model training, and a shared global model is finally generated; (3) a model prediction stage, in which the jointly trained model is used for fitness and health guidance, disease prediction and decision making, so as to improve prediction accuracy.
Description
Technical Field
The invention belongs to the field of data privacy protection and machine learning of physical and medical fusion, and particularly relates to a physical and medical data fusion privacy protection method based on longitudinal federal learning.
Background
With the rapid development of big data and artificial intelligence technology, more and more medical institutions and sports fitness institutions begin to share own data so as to perform joint modeling, and the accuracy and generalization capability of the model are improved. However, such data sharing often involves personal privacy, and how to protect personal privacy on the premise of guaranteeing the quality of a model becomes an important problem.
To address this problem, federal learning-based data fusion and privacy protection methods have been developed. The longitudinal federal learning is a method for carrying out joint training and analysis on data on the premise of protecting privacy. This approach allows joint learning of multiple data sources without sharing the original data and generating models with predictive capabilities. In longitudinal federal learning, each data source is responsible for providing only a portion of the information, and other data sources are not visible. Thus, even if an attacker obtains information from some data sources, it is not possible to infer information from other data sources from this information. Thus, longitudinal federal learning can guarantee data privacy and security.
In the field of physical and medical data fusion, a data fusion privacy protection method based on longitudinal federal learning can be applied to the joint analysis of exercise data and medical data of different populations. Combined with chronic disease prevention and control, it supports building an exercise prescription library for the personalized health needs of different populations and for intervention in single chronic diseases. Based on data sources held by different institutions, such as basic information on service recipients, health examinations, physical fitness tests, health status monitoring and evaluation, intervention guidance and plan implementation, it enables key information to be shared among Internet of Things terminals and health service systems. This helps to better understand the relationship between human exercise and physical condition, lets users follow the dynamic changes of their health indicators and the effect of exercise interventions throughout an exercise cycle, and further improves people's health and athletic performance. At the same time, because personal private data is shared, data privacy and security are critical. Differential privacy technology can protect personal privacy while preserving the accuracy and utility of the data. Applying a data fusion privacy protection method based on longitudinal federal learning to physical and medical data fusion is therefore feasible, and provides an effective and accurate data fusion technical service for promoting the deep integration of national fitness and national health.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a physical and medical data fusion privacy protection method based on longitudinal federal learning under a cloud and fog architecture. The method improves model performance and prediction accuracy while protecting the privacy of data and models, can be applied to data fusion, collaborative learning and privacy protection in different fields, and has important practical value and application prospects.
A physical and medical data fusion privacy protection method based on longitudinal federal learning under a cloud and fog architecture, characterized by comprising the following steps:
step one, collecting the original data of each participant participating in federal learning and carrying out relevant processing, wherein the specific process is as follows:
step 1.1, first, the exercise guidance center and the hospital collect the original data of all participants taking part in federal learning and clean it; after cleaning, the data is preprocessed to make it more suitable for machine learning algorithms: the data formats, labels and value ranges of the different participants are unified into a standardized data format, and the influence of dimension (unit) differences and abnormal values is eliminated, so that the data of different participants has similar distributions and characteristics, which improves the effect and reliability of federal learning;
step 1.2, random noise is added to the preprocessed data through differential privacy technology, and the privacy features are segmented and encrypted, which protects the privacy of the original data and prevents the disclosure of sensitive personal information; different methods are used for numerical data and non-numerical data, respectively;
step 1.3, finally, the cleaned and preprocessed data is stored in a database or file system to form local data sets for later analysis and model training; the exercise guidance center and the hospital hold the local data sets $D_{center}$ and $D_{hospital}$, respectively;
step two, the exercise guidance center and the hospital each train a model on their own local data set to obtain a local model;
the specific process is as follows:
step 2.1, the exercise guidance center and the hospital each sample and select user data of the participants taking part in federal learning, and perform encrypted ID matching and alignment to obtain a list of user IDs shared by both parties; the data held by the two institutions for these shared users is used for federal learning;
step 2.2, the two parties then homomorphically encrypt the user data and train a local model on their local data sets using the longitudinal federal learning mechanism;
step 2.3, during local model training, noise is added to the gradient update algorithm as required by differential privacy: when the model parameters are updated, corresponding noise is added to the gradient of each participant so as to protect data privacy, and the gradient update formula of each participant can be expressed as:
where $\Delta w_t$ denotes the parameter update, $n$ the number of samples in the training set, $y_i$ the label of sample $i$, $x_i$ the features of sample $i$, $f_t$ the prediction function of the current model, $w_t$ the parameters of the current model, $\lambda$ the regularization parameter, $\sigma_t$ the standard deviation of the noise, and $N(0, \sigma_t^2)$ Gaussian noise with mean 0 and variance $\sigma_t^2$;
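As an illustration of step 2.3, the following is a minimal sketch of one noisy gradient computation. It assumes an L2-regularized logistic regression model, which the text does not specify; the function name `noisy_gradient_update` and the toy data are hypothetical. Calibrating $\sigma_t$ to a concrete privacy budget normally also requires bounding each sample's gradient (for example by clipping), which the text does not describe and the sketch omits.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def noisy_gradient_update(w, X, y, lam, sigma_t, rng):
    """Averaged logistic-loss gradient + L2 term + Gaussian noise N(0, sigma_t^2).

    The logistic model is an assumption made for this sketch only.
    """
    n, d = len(X), len(w)
    grad = [0.0] * d
    for xi, yi in zip(X, y):
        pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j in range(d):
            grad[j] += (pred - yi) * xi[j] / n
    # Regularization term lam * w plus independent Gaussian noise per coordinate.
    return [grad[j] + lam * w[j] + rng.gauss(0.0, sigma_t) for j in range(d)]

rng = random.Random(0)
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]  # toy features (bias, value)
y = [1, 0, 1, 0]                                        # toy labels
w = [0.0, 0.0]

delta_w = noisy_gradient_update(w, X, y, lam=0.01, sigma_t=0.1, rng=rng)
exact = noisy_gradient_update(w, X, y, lam=0.01, sigma_t=0.0, rng=rng)
print("noisy:", delta_w)
print("exact:", exact)
```

Setting `sigma_t=0.0` recovers the exact gradient, which makes it easy to see how much the noise perturbs each coordinate.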
step three, uploading the local model to a federal server by both the exercise guidance center and the hospital; adding noise into the parameters of the local model during uploading, and sending the parameters after noise addition to a federal server for parameter updating;
step four, the federal server performs longitudinal federal learning on the uploaded local models to generate a global model; the specific process is as follows:
in the longitudinal federation learning process, the federation server needs to aggregate the local models of all the participants to update the global model parameters so as to improve the accuracy and performance of the models, and in the process of updating the global model parameters, in order to protect the privacy of each participant, a differential privacy technology is used for adding random noise to protect the privacy of the local model;
step five, the exercise guidance center and the hospital input their local data sets into the global model, train the global model, and then use the trained global model to make predictions on the data and obtain prediction results;
step six, the exercise guidance center and the hospital upload the prediction results to the federal server;
before uploading the predicted result, each mechanism needs to perform differential privacy processing on the predicted result, namely, each mechanism uploads the predicted result with random noise to a federal server instead of the original predicted result, the size of the random noise can be controlled through privacy parameters in differential privacy, and a Laplace mechanism is adopted to add the random noise to each predicted result;
namely:
$$\tilde{y} = f(D) + \mathrm{Lap}(0, \Delta f/\varepsilon)$$
where $\tilde{y}$ denotes the prediction result after random noise is added, $f(D)$ denotes the original prediction result computed on the data set $D$, Lap denotes the Laplace distribution, $\Delta f$ denotes the sensitivity of the function $f$ to changes in the data set $D$, and $\varepsilon$ denotes the privacy parameter. This formula adds a certain amount of noise to each prediction result, thereby protecting its privacy;
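The perturbation in step six can be sketched as follows. This is an illustrative example only: the function names, the toy risk scores and the choice `sensitivity=1.0` are assumptions, and the Laplace sample is drawn as the difference of two exponential variates, which is one standard way to sample it.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize_predictions(predictions, sensitivity, epsilon, rng):
    """Add Lap(0, sensitivity / epsilon) noise to every prediction result."""
    scale = sensitivity / epsilon
    return [p + laplace_noise(scale, rng) for p in predictions]

rng = random.Random(7)
risk_scores = [0.12, 0.80, 0.45]  # hypothetical predicted disease risks
noisy_scores = privatize_predictions(risk_scores, sensitivity=1.0, epsilon=2.0, rng=rng)
print(noisy_scores)
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of less accurate uploaded predictions.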
step seven, after receiving the prediction results with random noise from the different participants, the federal server aggregates them by weighted averaging, encrypts the aggregated prediction results to generate the final prediction results, and returns these results to each participant;
step eight, the trained global model is tested and verified to check its prediction performance. Specifically, the data set is divided into several parts; each time, one part is used as the test set and the remaining parts as the training set; the model is trained on the training set and tested on the test set, and its performance indicators are calculated; this is repeated a number of times, and the average performance indicator is finally obtained. The whole process can be repeated until the global model reaches the expected accuracy, thereby realizing scientific fitness and health guidance based on physical and medical data fusion.
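The validation procedure of step eight corresponds to k-fold cross-validation. The sketch below shows the splitting and averaging logic; the one-dimensional threshold classifier, the accuracy metric and all function names are stand-ins chosen for illustration, not the model described in the text.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k, train_fn, score_fn):
    """Use each fold once as the test set and average the scores."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score_fn(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return sum(scores) / len(scores)

def train_threshold(X, y):
    """Toy model: threshold halfway between the two class means."""
    neg = [x for x, label in zip(X, y) if label == 0]
    pos = [x for x, label in zip(X, y) if label == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2.0

def accuracy(threshold, X, y):
    hits = sum(1 for x, label in zip(X, y) if (x > threshold) == (label == 1))
    return hits / len(X)

X = [-5.0, -4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
mean_accuracy = cross_validate(X, y, k=5, train_fn=train_threshold, score_fn=accuracy)
print(mean_accuracy)
```

Passing `train_fn` and `score_fn` as arguments keeps the splitting logic independent of the model, so the same loop could wrap the training and evaluation of the global model.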
Further, the data cleaning means handling missing values, abnormal values and repeated values in the original data to ensure the accuracy and reliability of subsequent data analysis: missing values are filled by interpolation; abnormal values are detected and repaired using a statistical method or a machine learning algorithm; repeated values are deleted or merged.
Further, the preprocessing refers to data centralization processing, data scaling processing and data normalization processing.
Further, the data centralization process refers to: different participants use different units of measurement to measure the height and weight of the patient, resulting in data in different units; data centering moves the mean of the data to zero, i.e., the mean of each feature over the entire data set is subtracted from every value of that feature, so that the data of different participants has a similar distribution with respect to the units of measurement.
Further, the data scaling process refers to: different participants use different measuring equipment to measure the blood sugar level of the patient, so that the range of the data is different, the data is scaled down or amplified by adopting a data scaling processing mode, so that the range of each feature is the same, and the data of different participants has similar distribution in measurement precision.
Further, the data normalization processing means: different participants sample the patient in different time periods, so that the data are distributed differently on the time axis, and the data are limited in a certain range by adopting a data normalization processing mode, so that the deviation of the data is avoided, and the data of different participants have similar distribution on the time axis.
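The three preprocessing operations above can be sketched as follows. The text does not fix the exact formulas, so this example assumes that scaling means dividing by the population standard deviation and that normalization means min-max mapping to a fixed interval; the function names and sample values are hypothetical.

```python
import statistics

def center(values):
    """Subtract the feature mean so the centered data has mean zero."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]

def scale_to_unit_variance(values):
    """Divide by the population standard deviation (assumed scaling rule)."""
    std = statistics.pstdev(values)
    return [v / std for v in values] if std > 0 else list(values)

def min_max_normalize(values, lo=0.0, hi=1.0):
    """Map values linearly into [lo, hi] (assumed normalization rule)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0]
centered = center(heights_cm)
scaled = scale_to_unit_variance(centered)
normalized = min_max_normalize(heights_cm)
print(centered, scaled, normalized)
```

In a federated setting the mean, standard deviation, minimum and maximum would have to be agreed on across participants; how these statistics are shared is not described in the text.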
Further, in step one, for numerical data, a Laplace mechanism is used to add random dynamic perturbation noise to the data result; for a function $f$, the Laplace mechanism adds to each $f(x)$ a noise term drawn from the Laplace distribution $\mathrm{Lap}(0, \Delta f/\varepsilon)$, where $\varepsilon$ is the privacy parameter; specifically, for a query $q$ and a database $A$, the query result computed by the Laplace mechanism is:
$$q(A) = f(A) + \mathrm{Lap}(0, \Delta f/\varepsilon)$$
where $f(A)$ is the query result on the original data, $\Delta f$ is the sensitivity, which expresses how sensitive the function $f$ is to changes in the data set, and $\mathrm{Lap}(0, b)$ denotes the Laplace distribution with mean 0 and scale parameter $b$ (its standard deviation is $\sqrt{2}\,b$); the noise added by the Laplace mechanism is symmetric about zero, so the expected absolute error of a query result stays at the constant level $\Delta f/\varepsilon$.
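The statement about the error level can be checked numerically. The sketch below applies the Laplace mechanism many times to a hypothetical counting query with sensitivity 1 and compares the empirical mean absolute error with $\Delta f/\varepsilon$; the query value, seed and sample size are arbitrary choices made for this illustration.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return f(A) + Lap(0, sensitivity / epsilon)."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
sensitivity, epsilon = 1.0, 0.5
true_count = 120  # hypothetical answer to a counting query

errors = [
    abs(laplace_mechanism(true_count, sensitivity, epsilon, rng) - true_count)
    for _ in range(20000)
]
mean_abs_error = sum(errors) / len(errors)
print(mean_abs_error, "vs. theoretical", sensitivity / epsilon)
```

The empirical value should land close to $\Delta f/\varepsilon = 2$ for these settings, since the absolute value of a $\mathrm{Lap}(0, b)$ variable has mean $b$.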
Further, in the second step, for non-numerical data, an exponential mechanism is used: a scoring mechanism is introduced, the output scores of the discrete categories are enumerated and computed, and the scores are converted into probability values after normalization; specifically, for a function $f$ and a given data set $D$, the exponential mechanism adds random noise $N$ to the query result such that the release probability of each query result $f(D)$ is proportional to the release probability of $f(D + \Delta D)$, where $\Delta D$ is a small change of $D$ and $N$ is a random variable following an exponential distribution, whose probability density function is:
where $\varepsilon$ is the privacy parameter, which expresses the degree of privacy protection, and $\Delta f$ is the sensitivity, which expresses how sensitive the function is to changes in the data set; the random noise of the exponential mechanism gives the query result a certain uncertainty, thereby improving the privacy protection effect.
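For the scoring-and-normalization description above, the following sketch uses the standard textbook form of the exponential mechanism, in which a category $r$ is released with probability proportional to $\exp(\varepsilon\, u(D, r) / (2\Delta u))$. This form is an assumption; it is not necessarily the exact formula intended by the text. The categories, counts and function names are hypothetical.

```python
import math
import random

def exponential_mechanism(candidates, score_fn, sensitivity, epsilon, rng):
    """Pick one candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
    scores = [score_fn(c) for c in candidates]
    max_score = max(scores)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp(epsilon * (s - max_score) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]  # normalized scores become probabilities
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    return choice, probs

# Hypothetical non-numerical feature: most common exercise intensity.
counts = {"low": 30, "medium": 50, "high": 20}
rng = random.Random(3)
choice, probs = exponential_mechanism(
    list(counts), lambda c: counts[c], sensitivity=1.0, epsilon=0.1, rng=rng
)
print(choice, probs)
```

The category with the highest score is the most likely output, but every category keeps a nonzero probability, which is what provides the uncertainty mentioned in the text.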
Further, the specific method of step three is as follows: let the local model parameters of participant $i$ be $\theta_i$. To satisfy differential privacy, participant $i$ adds noise following a Laplace distribution to the local model parameters $\theta_i$, i.e., $\tilde{\theta}_i = \theta_i + \Delta\theta_i$, where $\Delta\theta_i$ follows the Laplace distribution $\mathrm{Lap}(0, b)$ and meets the requirement of differential privacy, i.e., $(\varepsilon, \delta)$-differential privacy. The noisy local model parameters that participant $i$ sends to the federal server are therefore $\tilde{\theta}_i$, i.e.
$$\tilde{\theta}_i = \theta_i + \Delta\theta_i$$
where $\Delta\theta_i$ denotes the Laplace-distributed noise, whose probability density function is:
$$p(\Delta\theta_i) = \frac{1}{2b}\exp\left(-\frac{|\Delta\theta_i|}{b}\right)$$
After receiving the noisy local model parameters $\tilde{\theta}_i$ sent by each participant $i$, the federal server computes the global model parameters $\theta_{global}$ as a weighted average, i.e.
$$\theta_{global} = \frac{\sum_{i} w_i \tilde{\theta}_i}{\sum_{i} w_i}$$
where $w_i$ is the weight of participant $i$, determined according to the data volume and data quality of the participant.
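A minimal sketch of step three is given below: each participant perturbs its parameter vector with Laplace noise, and the server forms a weighted average normalized by the sum of the weights. The parameter values, weights, noise scale and function names are hypothetical, and the sketch assumes all participants upload parameter vectors of the same length.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_parameters(theta, scale, rng):
    """Participant side: add independent Lap(0, scale) noise to each parameter."""
    return [t + laplace_noise(scale, rng) for t in theta]

def weighted_average(param_sets, weights):
    """Server side: weighted average of the uploaded parameter vectors."""
    total = sum(weights)
    dim = len(param_sets[0])
    return [sum(w * p[j] for w, p in zip(weights, param_sets)) / total for j in range(dim)]

rng = random.Random(11)
theta_center = [0.40, -1.20]    # hypothetical local parameters
theta_hospital = [0.55, -0.90]
uploads = [
    perturb_parameters(theta_center, scale=0.05, rng=rng),
    perturb_parameters(theta_hospital, scale=0.05, rng=rng),
]
theta_global = weighted_average(uploads, weights=[1000, 3000])  # e.g. sample counts
print(theta_global)
```

Using sample counts as weights is only one possible reading of "data volume and data quality"; any positive weights work with the same aggregation function.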
Further, the specific method of step four is as follows: assume there are $m$ participants, each of which has trained a model $M_i$ locally; the local models are uploaded to the federal server with differential privacy protection, and the federal server then combines them into a global model $M$ and updates the parameters of $M$ to optimize the global model;
let the gradient of the local model $M_i$ in the $t$-th iteration be $g_t^i$, the parameters of the global model $M$ before the $t$-th iteration be $\theta_{t-1}$, and the gradient of the global model $M$ in the $t$-th iteration be $g_t$. In the differential privacy technique, when the global model updates its parameters, random noise is added to each gradient $g_t^i$ to protect the privacy of each participant while still maintaining the accuracy of the model; specifically, the update formula of the global model can be expressed as:
where $N(0, \sigma^2 I)$ denotes Gaussian noise with mean 0 and covariance $\sigma^2 I$, and $I$ is the identity matrix. Thus, during the global model parameter update, the differential privacy technique adds noise to the gradient of each participant, which protects each participant's privacy while preserving the accuracy of the global model.
Further, the training process of the global model in step five can be described as the following multi-round iterative process:
(1) Initially, the exercise guidance center and the hospital randomly initialize their local model parameters $\theta_{center}^{(0)}$ and $\theta_{hospital}^{(0)}$, respectively;
(2) In each iteration, the exercise guidance center and the hospital update their local model parameters and upload them to the federal server, namely:
$$\theta_{center}^{(t)} = \theta_{center}^{(t-1)} - \eta \nabla f\left(\theta_{center}^{(t-1)}\right), \qquad \theta_{hospital}^{(t)} = \theta_{hospital}^{(t-1)} - \eta \nabla f\left(\theta_{hospital}^{(t-1)}\right)$$
where $t$ denotes the iteration round, $\eta$ the learning rate, $f(\cdot)$ the loss function, and $\nabla f(\cdot)$ the gradient of the loss function;
(3) In each iteration, the global model server adds some noise to the parameters of each participant so that the output model parameters do not reveal individual private information. Let $\theta_{center}^{(t)}$ and $\theta_{hospital}^{(t)}$ denote the local model parameters of the exercise guidance center and the hospital, respectively; the noisy parameters $\tilde{\theta}_{center}^{(t)}$ and $\tilde{\theta}_{hospital}^{(t)}$ may be calculated using the following formula:
$$\tilde{\theta}_{center}^{(t)} = \theta_{center}^{(t)} + \mathrm{Lap}(0, b), \qquad \tilde{\theta}_{hospital}^{(t)} = \theta_{hospital}^{(t)} + \mathrm{Lap}(0, b)$$
(4) After the global model server receives the differentially private local model parameters uploaded by the participants, it computes their weighted average to update the global model parameters $\theta^{(t)}$:
$$\theta^{(t)} = \frac{\sum_{i=1}^{n} w_i \tilde{\theta}_i^{(t)}}{\sum_{i=1}^{n} w_i}$$
where $n$ denotes the number of participants and $w_i$ denotes the weight of the $i$-th participant, which can generally be determined from the amount and quality of the participant's data.
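The multi-round process can be illustrated with a toy simulation. The sketch makes several assumptions that are not stated in the text: each participant holds a scalar parameter and a quadratic loss, the noise is Laplace-distributed, and each round starts from the current global parameter (the text does not say how the global parameters are fed back to the participants). All names and values are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample; returns 0.0 when no noise is requested."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def local_step(theta, grad_fn, eta):
    """One gradient-descent step: theta - eta * grad f(theta)."""
    return theta - eta * grad_fn(theta)

def federated_rounds(grad_fns, weights, rounds, eta, noise_scale, rng, theta0=0.0):
    """Repeat: local step, add noise to the uploaded parameter, weighted average."""
    theta = theta0
    for _ in range(rounds):
        noisy = [
            local_step(theta, g, eta) + laplace_noise(noise_scale, rng)
            for g in grad_fns
        ]
        theta = sum(w * p for w, p in zip(weights, noisy)) / sum(weights)
    return theta

# Toy losses f_i(theta) = 0.5 * (theta - target_i)^2 with gradient theta - target_i.
grad_center = lambda theta: theta - 2.0
grad_hospital = lambda theta: theta - 4.0

rng = random.Random(5)
no_noise = federated_rounds([grad_center, grad_hospital], [1, 1], 50, 0.5, 0.0, rng)
with_noise = federated_rounds([grad_center, grad_hospital], [1, 1], 50, 0.5, 0.05, rng)
print(no_noise, with_noise)
```

Without noise the global parameter converges to the weighted optimum of the two toy losses; with noise it fluctuates around that value, which shows the trade-off between privacy and accuracy.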
Further, the specific method of step seven is as follows: assuming there are $m$ participants, the local model parameters of each participant being $\theta_i$ and its weight being $w_i$, the global model parameters are calculated as:
$$\theta_{global} = \frac{\sum_{i=1}^{m} w_i \theta_i}{\sum_{i=1}^{m} w_i}$$
where the weight $w_i$ represents the weight of participant $i$ in the training and can be adjusted dynamically according to the amount and quality of its data, and the denominator is the sum of the weights of all participants, which ensures that the global model parameters are a weighted average.
Further, the method includes testing and verifying data by using a global model to verify predicted performance, taking one part as a test set and the other part as a training set each time, performing model training on the training set, performing model testing on the test set, and calculating performance indexes of the model.
Further, the participants comprise various medical, fitness, physical examination and rehabilitation institutions, and each institution can share data and knowledge with other institutions by adding a federal learning network, so that comprehensive analysis and decision making capability of the data is improved.
The training data and the treatment data comprise morphological indexes, physiological and biochemical indexes, kinematic indexes, dynamic indexes, medical records and the like, and are subjected to privacy protection treatment.
The model training process uses differential privacy technology to add noise to information such as model parameters, gradients and updates during the federal learning process, so as to protect users' private data.
The noise level in the differential privacy technology meets the balance of privacy protection and data utility.
The encryption algorithm is based on homomorphic encryption, which allows data to be processed in encrypted form without revealing its content, thereby protecting users' private data; because the key information is retained, only authorized users can obtain the complete model parameters after decryption.
The longitudinal federal learning model established by the exercise guidance center and the hospital is a neural network model.
The computing devices used in the model building and predicting steps comprise a server, a PC, a sports bracelet, a sports watch, a mobile terminal and the like.
The invention aims to solve the problems of data independence and privacy between exercise guidance centers and hospitals in cloud and fog architecture research. The invention uses differential privacy technology to protect the privacy of data and models, and uses the federal server to aggregate the model parameters, thereby avoiding the risks of data disclosure and model privacy leakage.
Drawings
FIG. 1 is a schematic illustration of longitudinal federal learning data fusion privacy protection of the present invention;
FIG. 2 is a schematic diagram of the three-party model training process of the exercise guidance center, hospital and federal center of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In this embodiment, a hospital and an exercise guidance center are selected. The hospital's data set is an electronic medical record data set covering more than 1000 patients, with features such as age, sex, height, weight and blood glucose, as well as sensitive features such as whether the patient has diabetes. The exercise guidance center's data set covers the same number of patients but, unlike the hospital's, holds different features, namely lifestyle features such as the patient's diet, exercise program, exercise duration, exercise intensity and exercise frequency. Through the longitudinal federal learning method, the two institutions can fuse their data to obtain a more comprehensive data set and build a more accurate disease prediction model.
First, the data is preprocessed: the data is grouped by feature, data with the same features is encrypted, and noise is added to the encrypted data through differential privacy technology, thereby protecting the privacy of the sensitive features and the model.
All the data are collected, classified and organized by the exercise guidance center and the hospital. Because the original data may contain missing values, abnormal values and duplicates, data cleaning is required.
Specifically, the following data cleaning operations are required:
Filling in missing data: for partially missing data, interpolation or other methods are used to fill the gaps so as to ensure the integrity of the data.
Processing duplicate data: for fully duplicated data, only one copy is kept and the other copies are deleted. And merging different items for repeated data with incorrect format or missing key information, and deleting invalid data.
Repairing abnormal data: for abnormal data, a statistical method or a machine learning algorithm is adopted for detection and repair. We used a statistical box-plot method to detect outliers and used mean, median, etc. methods for remediation. In addition, a clustering algorithm based on machine learning is also used to cluster data sets, treat data that does not belong to any cluster as outliers, and then use the mean of the cluster for repair.
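The three cleaning operations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cleaning pipeline of the invention: the function names are hypothetical, missing values are filled with the median as one of the "other methods" (interpolation is not shown), records are assumed to be hashable tuples, and the box-plot rule uses the conventional factor of 1.5 times the interquartile range.

```python
import statistics

def fill_missing(values):
    """Replace None entries with the median of the observed values (assumed fill rule)."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def drop_exact_duplicates(records):
    """Keep the first copy of each fully duplicated record (records are tuples)."""
    seen, unique = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique

def repair_outliers(values, k=1.5):
    """Box-plot rule: values outside [Q1 - k*IQR, Q3 + k*IQR] are outliers.

    Each outlier is replaced by the median of the remaining inliers.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    med = statistics.median(inliers)
    return [v if low <= v <= high else med for v in values]
```

For example, `repair_outliers` applied to a list of blood sugar readings that are all close to 5 plus a single reading of 30 would replace only the reading of 30.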
After the data cleaning operation is performed by both the exercise guidance center and the hospital, the data needs to be preprocessed in order to make the data more suitable for the use of machine learning algorithms. The method comprises the following steps:
Data centering: different institutions use different units of measurement for the height and weight of a patient, so the data are expressed in different units. Centering shifts the mean of the data to zero, so that the data of different institutions have a similar distribution with respect to the unit of measurement.
Data scaling: different institutions use different measuring devices to measure the blood glucose level of patients, so the data span different ranges. Scaling shrinks or stretches the data so that every feature has the same range, and the data of different institutions have a similar distribution with respect to measurement precision.
Data normalization: different institutions sample patients over different time periods, so the data are distributed differently along the time axis. Normalization limits the data to a fixed range, which avoids bias in the data and gives the data of different institutions a similar distribution along the time axis.
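The three preprocessing operations can be illustrated on a single feature column. The sketch below is an assumption about how each operation might be realized, since the text does not fix the formulas: centering subtracts the mean, scaling divides by the population standard deviation, and normalization is a min-max mapping onto a chosen interval. All function names are hypothetical.

```python
import statistics

def center(values):
    """Shift the mean of the feature to zero."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]

def scale_to_unit_variance(values):
    """Divide by the population standard deviation (assumed scaling rule)."""
    std = statistics.pstdev(values)
    return [v / std for v in values] if std > 0 else list(values)

def min_max_normalize(values, low=0.0, high=1.0):
    """Map the feature linearly onto [low, high] (assumed normalization rule)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]
```

Each institution would apply these per feature on its own data, so no raw values need to leave the institution for this step.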
Finally, the cleaned and preprocessed data needs to be stored in a database or file system for subsequent analysis and training of the machine learning model.
Table 1 Hospital-side data preprocessing
Table 2 Exercise guidance center data preprocessing
Second, the longitudinal federated learning model is built: the parties train by exchanging encrypted data and build a model jointly, and a gradient update algorithm makes the parameter updates of the model more reasonable and accurate.
The exercise guidance center and the hospital each divide their own data set into a training set and a test set and train a model on their respective training sets. For local training, the exercise guidance center and the hospital may use different model architectures or the same one. In the invention, the model is constructed with a neural-network-based longitudinal federated learning method. Specifically, each participant trains its own model locally and uploads the model parameters to a central server for aggregation, yielding a joint model. This approach uses the data of different institutions for model training and can produce a model with higher accuracy and better generalization while preserving privacy.
Table 3 Neural network structure employed in the present invention
Layer | Input | Output | Activation function |
---|---|---|---|
Input layer | Feature vector | - | - |
Hidden layer 1 | - | 64 | ReLU |
Hidden layer 2 | 64 | 32 | ReLU |
Output layer | 32 | Target variable | Sigmoid |
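A forward pass through the structure of Table 3 can be sketched in plain Python. The layer widths (64 and 32) and the activation functions come from the table; everything else is an assumption made for illustration: the output is a single unit (for example, a diabetes probability), the weights are initialized uniformly at random, and the names `build_model` and `predict` are hypothetical.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer (assumed init)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, activation):
    """Apply one fully connected layer followed by an activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def build_model(n_features, seed=0):
    """Feature vector -> 64 (ReLU) -> 32 (ReLU) -> 1 (Sigmoid)."""
    rng = random.Random(seed)
    return [init_layer(n_features, 64, rng),
            init_layer(64, 32, rng),
            init_layer(32, 1, rng)]  # single output unit is an assumption

def predict(model, x):
    h1 = dense(x, model[0], relu)
    h2 = dense(h1, model[1], relu)
    return dense(h2, model[2], sigmoid)[0]
```

In practice a deep learning framework would replace these hand-written layers; the sketch only makes the shape of the network concrete.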
Each participant trains its own model locally and applies differential privacy processing to the model parameters: noise is added in the gradient update algorithm as required by differential privacy so as to protect data privacy. The encrypted model parameters are then uploaded to the central server for aggregation. The central server also applies a differential privacy method during aggregation to protect the model parameters and data privacy.
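Adding noise inside the gradient update can be sketched as follows. This is a generic noisy-gradient step under explicit assumptions, not the exact update rule of the invention: the gradient is first clipped to a maximum L2 norm (a common way to bound sensitivity that the text does not mention), zero-mean Gaussian noise with standard deviation `sigma` is then added to each coordinate, and a plain gradient descent step with learning rate `lr` follows. All names and parameters are hypothetical.

```python
import math
import random

def clip_l2(grad, max_norm):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def noisy_gradient_step(weights, grad, lr, max_norm, sigma, rng):
    """One descent step on a clipped gradient with added Gaussian noise."""
    clipped = clip_l2(grad, max_norm)
    noisy = [g + rng.gauss(0.0, sigma) for g in clipped]
    return [w - lr * g for w, g in zip(weights, noisy)]
```

With `sigma = 0.0` the function reduces to ordinary clipped gradient descent, which is a convenient way to check the update before enabling the noise.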
Table 4 Differential privacy parameters employed in the present invention and their values
Parameter | Value |
---|---|
ε | 0.01 |
δ | 1e-5 |
Sensitivity | 0.1 |
Where ε represents the privacy budget, δ represents the leakage probability, and Sensitivity represents the Sensitivity of the query function.
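To make the values of Table 4 concrete, the short calculation below derives the noise scales they would imply. It assumes the standard calibrations, which the text does not state explicitly: a Laplace scale of sensitivity divided by ε, and the classical Gaussian-mechanism bound σ = Δf · sqrt(2 ln(1.25/δ)) / ε, which is valid for 0 < ε < 1. The function names are hypothetical.

```python
import math

EPSILON = 0.01      # privacy budget from Table 4
DELTA = 1e-5        # leakage probability from Table 4
SENSITIVITY = 0.1   # sensitivity of the query function from Table 4

def laplace_scale(sensitivity, epsilon):
    """Scale parameter b of Laplace noise: b = sensitivity / epsilon."""
    return sensitivity / epsilon

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism calibration, valid for 0 < epsilon < 1."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

b = laplace_scale(SENSITIVITY, EPSILON)               # about 10
sigma = gaussian_sigma(SENSITIVITY, EPSILON, DELTA)   # about 48.4
```

Under these assumptions the Laplace scale is about 10 and the Gaussian standard deviation about 48.4, both large compared with the sensitivity of 0.1, so such a small ε corresponds to strong noise.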
Finally, the model is applied for prediction: the jointly built model is used for fitness and health guidance, disease prediction and decision making, with the aim of improving prediction accuracy.
On the basis of the model obtained by the joint modeling, the trained model needs to be tested and verified to verify the predicted performance of the model on the new data set. In the present invention, a 10-fold cross-validation method was used to evaluate the model. Specifically, the data set is divided into 10 parts, one of which is the test set at a time, and the rest of which is the training set. Model training is then performed on the training set, model testing is performed on the testing set, and performance indexes (such as accuracy, recall, F1 values, and the like) of the model are calculated. Repeating for 10 times to finally obtain the average performance index.
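The 10-fold procedure described above can be sketched as an index splitter plus an averaging loop. This is a minimal illustration: the samples are shuffled once with a fixed seed (an assumption), and `train_and_score` is a hypothetical callback that trains on the training indices and returns one performance value for the test indices.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, train_and_score, k=10):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

With 1000 samples and k = 10, every fold tests on 100 samples and trains on the remaining 900, and each sample is used for testing exactly once.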
Table 5 Performance indexes used in the present invention and their calculation formulas
Performance index | Formula |
---|---|
Accuracy | (TP+TN)/(TP+TN+FP+FN) |
Recall | TP/(TP+FN) |
F1 value | 2 × (precision × recall)/(precision + recall) |
Where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
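The formulas of Table 5 translate directly into code. The sketch below assumes binary labels encoded as 0 and 1 and uses the standard definition precision = TP/(TP+FP), which Table 5 relies on but does not list; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def performance(tp, tn, fp, fn):
    """Accuracy, recall and F1 as defined in Table 5 (precision is the standard TP/(TP+FP))."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

The guards against zero denominators are a design choice for degenerate folds in which the model predicts no positives; returning 0.0 in that case is one common convention.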
Table 6 comparison of the performance of the invention with other methods
The joint modeling process can be repeated until the global model reaches the expected accuracy, and the result is returned to the exercise guidance center and the hospital, thereby providing scientific fitness and health guidance based on the fusion of physical and medical data. As can be seen from the table, compared with other methods, the present method improves the accuracy and generalization ability of the model on both data sets while preserving data privacy, which demonstrates its effectiveness and feasibility.
The foregoing, described with reference to the accompanying drawings, merely illustrates the principles of the present invention; the invention is not limited to the specific embodiments shown.
Claims (13)
1. A physical medical data fusion privacy protection method based on cloud and fog architecture longitudinal federated learning, characterized by comprising the following steps:
step one, collecting the original data of each participant participating in federal learning and carrying out relevant processing, wherein the specific process is as follows:
step 1.1, first, the exercise guidance center and the hospital collect the original data of all participants in federated learning and clean it; after cleaning, in order to make the data more suitable for machine learning algorithms, the data are preprocessed: the data formats, labels and value ranges of the different participants are unified into a standardized data format and the influence of differences in dimension and of abnormal values is eliminated, so that the data of different participants have similar distributions and characteristics, which improves the effectiveness and reliability of federated learning;
step 1.2, adding random noise into the preprocessed data through a differential privacy technology, segmenting and encrypting privacy features, and protecting the privacy of original data, so that the disclosure of personal sensitive information is prevented, and different methods are needed to be adopted for numerical data and non-numerical data respectively;
step 1.3, finally, storing the cleaned and preprocessed data in a database or file system to form a local data set for later analysis and model training, wherein the exercise guidance center and the hospital hold the local data sets $D_{center}$ and $D_{hospital}$, respectively;
step two, the exercise guidance center and the hospital each carry out model training on their respective local data sets to obtain local models;
the specific process is as follows:
step 2.1, sampling and selecting user data of participants participating in federal learning by both sides of the exercise guidance center and the hospital, and carrying out encryption ID matching alignment to obtain a user ID list shared by both sides, wherein the data of the institutions are used for federal learning;
step 2.2, then, the two parties homomorphically encrypt the user data and train local models on their local data sets using the longitudinal federated learning mechanism;
step 2.3, in the process of local model training, noise needs to be added into a gradient update algorithm according to the requirement of differential privacy so as to protect data privacy, and meanwhile, when model parameters are updated, corresponding noise is added to the gradient of each participant so as to protect the data privacy, and a gradient update formula of each participant can be expressed as follows:
wherein $\Delta w_t$ denotes the update amount of the parameters, $n$ the number of samples in the training set, $y_i$ the label of sample $i$, $x_i$ the features of sample $i$, $f_t$ the prediction function of the current model, $w_t$ the parameters of the current model, $\lambda$ the regularization parameter, and $\sigma_t$ the standard deviation, and $\mathcal{N}(0, \sigma_t^2)$ denotes noise with mean 0 and variance $\sigma_t^2$;
step three, uploading the local model to a federal server by both the exercise guidance center and the hospital; adding noise into the parameters of the local model during uploading, and sending the parameters after noise addition to a federal server for parameter updating;
fourthly, the federation server carries out longitudinal federation learning on the uploaded local model to generate a global model, and the specific process is as follows:
in the longitudinal federation learning process, the federation server needs to aggregate the local models of all the participants to update the global model parameters so as to improve the accuracy and performance of the models, and in the process of updating the global model parameters, in order to protect the privacy of each participant, a differential privacy technology is used for adding random noise to protect the privacy of the local model;
step five, the exercise guidance center and the hospital input their local data sets into the global model, train the global model, and then use the trained global model to predict on the data to obtain prediction results;
step six, the exercise guidance center and the hospital upload the prediction results to the federated server;
before uploading its prediction results, each institution applies differential privacy processing to them, that is, each institution uploads prediction results with added random noise to the federated server instead of the original prediction results; the magnitude of the random noise is controlled by the privacy parameter of differential privacy, and a Laplace mechanism is used to add random noise to each prediction result;
namely:
$\tilde{f}(D) = f(D) + \mathrm{Lap}(\Delta f / \varepsilon)$
wherein $\tilde{f}(D)$ denotes the prediction result after random noise is added, $\mathrm{Lap}$ denotes the Laplace distribution, $\Delta f$ denotes the sensitivity of the function $f$ to changes in the data set $D$, and $\varepsilon$ denotes the privacy parameter. This formula adds a controlled amount of noise to each prediction result, thereby protecting its privacy;
step seven, after receiving the prediction results with random noise from different participants, the federal server aggregates the parameters through weighted average, the aggregated prediction results are encrypted to generate final prediction results, and the results are returned to each participant;
step eight, testing and verifying the trained global model to verify its prediction performance; specifically, the data set is divided into several parts, one part is used as the test set each time and the remaining parts as the training set, the model is trained on the training set and tested on the test set, and the performance indexes of the model are calculated; this is repeated several times to obtain the average performance indexes; the process can be repeated until the global model reaches the expected accuracy, thereby realizing scientific fitness and health guidance based on the fusion of physical and medical data.
2. The method for protecting the privacy of the fusion of body medical data based on cloud architecture longitudinal federal learning as set forth in claim 1, wherein the data cleaning means: removing missing values, abnormal values and repeated values in the original data to ensure the accuracy and reliability of subsequent data analysis; filling the missing values by adopting an interpolation method; for abnormal values, detecting and repairing the abnormal values by adopting a statistical method or a machine learning algorithm; for the repeated values, a deleting or merging method is adopted for processing.
3. The method for protecting the privacy of the fusion of the body medical data based on the cloud and fog architecture longitudinal federal learning according to claim 1, wherein the preprocessing is data centralization processing, data scaling processing and data normalization processing.
4. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning as set forth in claim 3, wherein the data centering process is: different participants use different units of measurement for the height and weight of the patient, so the data are expressed in different units; the data centering process moves the mean of the data to zero, i.e., subtracts from each feature value the mean of that feature over the entire data set, so that the data of different participants have a similar distribution with respect to the unit of measurement.
5. The method for protecting the privacy of the fusion of body medical data based on cloud architecture longitudinal federal learning as set forth in claim 3, wherein the data scaling process means: different participants use different measuring equipment to measure the blood sugar level of the patient, so that the range of the data is different, the data is scaled down or amplified by adopting a data scaling processing mode, so that the range of each feature is the same, and the data of different participants has similar distribution in measurement precision.
6. The method for protecting the privacy of the fusion of body medical data based on cloud architecture longitudinal federal learning as set forth in claim 3, wherein the data normalization process means: different participants sample the patient in different time periods, so that the data are distributed differently on the time axis, and the data are limited in a certain range by adopting a data normalization processing mode, so that the deviation of the data is avoided, and the data of different participants have similar distribution on the time axis.
7. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning as claimed in claim 1, wherein in step one, for numerical data, a Laplace mechanism is adopted to add random dynamic perturbation noise to the data result; for a function $f$, the Laplace mechanism adds to each $f(x)$ a noise term $\mathrm{Lap}(0, \Delta f/\varepsilon)$ drawn from the Laplace distribution, where $\varepsilon$ is the privacy parameter; specifically, for a query $q$ and a database $A$, the result of the query computed by the Laplace mechanism is:
$q(A) = f(A) + \mathrm{Lap}(0, \Delta f/\varepsilon)$
wherein $f(A)$ is the query result on the original data, $\Delta f$ is the sensitivity, i.e., the degree to which the function $f$ is sensitive to changes in the data set, and $\mathrm{Lap}(0, b)$ denotes the Laplace distribution with mean 0 and scale parameter $b$; the noise added by the Laplace mechanism is symmetric, so the average absolute error of a query result stays at the constant level $\Delta f/\varepsilon$.
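The Laplace mechanism of claim 7 can be sketched with the Python standard library. The sampler below relies on the fact that the difference of two independent exponential variables with mean b follows a Laplace distribution with location 0 and scale b; the function names are hypothetical, and the sketch is an illustration rather than the implementation of the invention.

```python
import random

def sample_laplace(scale, rng):
    """Draw from Lap(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return the query result plus Laplace noise of scale sensitivity / epsilon."""
    return true_value + sample_laplace(sensitivity / epsilon, rng)

# Empirical check of the average absolute error for scale b = 0.1 / 0.01.
rng = random.Random(42)
noise = [sample_laplace(0.1 / 0.01, rng) for _ in range(20000)]
mean_abs_error = sum(abs(x) for x in noise) / len(noise)
```

Over many draws, `mean_abs_error` settles near the scale Δf/ε, which matches the statement in the claim that the average error of a query result stays at a constant level.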
8. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning, wherein, for non-numerical data, an exponential mechanism is adopted so as to introduce a scoring mechanism: the output scores of the discrete categories of the non-numerical data are enumerated and calculated and, after normalization, the non-numerical data are characterized as probability values; specifically, for a function $f$ and a given data set $D$, the exponential mechanism adds a random noise $N$ to the query results such that the probability of releasing each query result $f(D)$ is proportional to the probability of releasing $f(D + \Delta D)$, where $\Delta D$ is a small variation of $D$, and $N$ is a random variable following an exponential distribution, whose probability density function is:
wherein epsilon is a privacy parameter and represents the degree of privacy protection, deltaf is sensitivity and represents the sensitivity of a function to the change of a data set, and random noise of an exponential mechanism can lead a query result to have certain uncertainty, so that the effect of privacy protection is improved.
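The scoring and normalization idea of claim 8 can be illustrated with the standard textbook form of the exponential mechanism, in which a candidate r is released with probability proportional to exp(ε · score(r) / (2Δ)). This standard form is an assumption made for illustration and may differ from the density intended in the claim; the function and parameter names are hypothetical.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity, rng):
    """Pick one candidate with probability proportional to exp(eps * score / (2 * sensitivity)).

    Returns the chosen candidate and the normalized selection probabilities.
    """
    scores = [score(c) for c in candidates]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    probabilities = [w / total for w in weights]
    choice = rng.choices(candidates, weights=probabilities, k=1)[0]
    return choice, probabilities

# Hypothetical example: release the most common exercise-intensity category.
counts = {"low": 10, "medium": 30, "high": 5}
choice, probs = exponential_mechanism(list(counts), counts.get, 1.0, 1.0,
                                      random.Random(0))
```

Categories with higher scores are released more often, but every category keeps a non-zero probability, which is what provides the uncertainty mentioned in the claim.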
9. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning as set forth in claim 1, wherein step three specifically is: assume that the local model parameter of participant $i$ is $\theta_i$; to protect differential privacy, participant $i$ adds noise following the Laplace distribution to the local model parameter $\theta_i$, where the noise $\Delta\theta_i$ obeys the Laplace distribution $\mathrm{Lap}(0, b)$ and meets the requirement of differential privacy, i.e., $(\varepsilon, \delta)$-differential privacy; the noisy local model parameter sent by participant $i$ to the federated server is therefore $\tilde{\theta}_i$, i.e.
$\tilde{\theta}_i = \theta_i + \Delta\theta_i$
wherein $\Delta\theta_i$ denotes the Laplace-distributed noise, whose probability density function is:
$p(x) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$
after receiving the noisy local model parameters $\tilde{\theta}_i$ sent by participant $i$, the federated server calculates the global model parameters $\theta_{global}$ as a weighted average, i.e.
$\theta_{global} = \frac{\sum_i w_i \tilde{\theta}_i}{\sum_i w_i}$
wherein $w_i$ is the weight of participant $i$, determined according to the data quantity and data quality of the participant.
10. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning as set forth in claim 1, wherein step four specifically is: assume $m$ participants, each of which has locally trained a model $M_i$ and uploaded it to the federated server through the differential privacy technique; the federated server then combines the models into a global model $M$ and updates the parameters of the global model $M$ to optimize it;
consider the gradient of the local model $M_i$ in the $t$-th iteration, let the parameters of the global model $M$ before the $t$-th iteration be $\theta_{t-1}$, and let the gradient of the global model $M$ in the $t$-th iteration be $g_t$; in the differential privacy technique, when the global model updates its parameters, random noise is added to each local gradient to protect the privacy of each participant while still maintaining the accuracy of the model; specifically, the update formula of the global model can be expressed as:
wherein $\mathcal{N}(0, \sigma^2 I)$ denotes Gaussian noise with mean 0 and variance $\sigma^2$, and $I$ is the identity matrix; thus, in the parameter updating process of the global model, the differential privacy technique adds noise to the gradient of each participant so as to protect the privacy of each participant while ensuring the accuracy of the global model.
11. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning according to claim 1, wherein the training process of the global model in step five can be described as the following multi-round iterative process:
(1) Initially, the exercise guidance center and the hospital each randomly initialize their local model parameters;
(2) In each iteration, the exercise guidance center and the hospital upload their local model parameters to the federated server, namely:
wherein $t$ denotes the iteration round, $\eta$ the learning rate, $f(\cdot)$ the loss function, and $\nabla f(\cdot)$ the gradient of the loss function;
(3) In each iteration, the global model server adds noise for each participant so that the output model parameters do not reveal individual private information; given the local model parameters of the exercise guidance center and of the hospital, their differentially private versions may be calculated using the following formula:
(4) After the global model server receives the differentially private local model parameters uploaded by the participants, it takes their weighted average to update the global model parameters $\theta^{(t)}$, namely:
wherein $n$ denotes the number of participants and $w_i$ the weight of the $i$-th participant, which may generally be determined based on the amount and quality of the participant's data.
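The multi-round process of claim 11 can be sketched end to end. The code below is one possible reading under explicit assumptions, not the update rules of the claim: every participant starts each round from the current global parameters, takes one gradient descent step with learning rate `lr`, perturbs the result with Laplace noise of scale `noise_scale`, and the server forms the weighted average normalized by the sum of the weights. All function names are hypothetical.

```python
import random

def local_update(theta, grad_fn, lr):
    """One local gradient descent step: theta - lr * grad(theta)."""
    grad = grad_fn(theta)
    return [t - lr * g for t, g in zip(theta, grad)]

def add_laplace_noise(theta, scale, rng):
    """Perturb each parameter with Lap(0, scale) noise; scale 0 disables noise."""
    if scale == 0:
        return list(theta)
    return [t + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
            for t in theta]

def weighted_average(param_sets, weights):
    """Weighted average of parameter vectors, normalized by the sum of weights."""
    total = sum(weights)
    dim = len(param_sets[0])
    return [sum(w * p[d] for w, p in zip(weights, param_sets)) / total
            for d in range(dim)]

def federated_round(theta_global, grad_fns, weights, lr, noise_scale, rng):
    """Local step and noisy upload per participant, then server-side aggregation."""
    uploads = []
    for grad_fn in grad_fns:
        theta_local = local_update(theta_global, grad_fn, lr)
        uploads.append(add_laplace_noise(theta_local, noise_scale, rng))
    return weighted_average(uploads, weights)
```

As a toy usage, two participants with quadratic losses centered at 0 and 2 and equal weights would drive the noiseless global parameter toward 1, the weighted compromise between the two local optima.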
12. The method for protecting the privacy of physical medical data fusion based on cloud and fog architecture longitudinal federated learning as set forth in claim 1, wherein step seven specifically is: assume $m$ participants, where the local model parameter of each participant is $\theta_i$ and its weight is $w_i$; the calculation formula of the global model parameters is:
$\theta_{global} = \frac{\sum_{i=1}^{m} w_i \theta_i}{\sum_{i=1}^{m} w_i}$
wherein the weight $w_i$ denotes the weight of each participant in training and can be dynamically adjusted according to the quantity and quality of the participant's data, and the denominator is the sum of the weights of all participants, which ensures that the global model parameters are a weighted average.
13. The method for protecting the privacy of the fusion of body medical data based on the longitudinal federal learning of cloud and fog architecture as claimed in claim 1, wherein the method is characterized in that the global model is used for testing and verifying the data to verify the prediction performance of the data, one part is used as a test set each time, the other part is used as a training set, model training is carried out on the training set, model testing is carried out on the test set, and the performance index of the model is calculated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310565639.3A CN116595584A (en) | 2023-05-19 | 2023-05-19 | Physical medicine data fusion privacy protection method based on cloud and fog architecture longitudinal federal learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310565639.3A CN116595584A (en) | 2023-05-19 | 2023-05-19 | Physical medicine data fusion privacy protection method based on cloud and fog architecture longitudinal federal learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116595584A true CN116595584A (en) | 2023-08-15 |
Family
ID=87589402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310565639.3A Pending CN116595584A (en) | 2023-05-19 | 2023-05-19 | Physical medicine data fusion privacy protection method based on cloud and fog architecture longitudinal federal learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116595584A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||