CN116611115A

CN116611115A - Medical data diagnosis model, method, system and memory based on federated learning

Info

Publication number: CN116611115A
Application number: CN202310889420.9A
Authority: CN
Inventors: 吴艳平; 马韵洁; 王佐成; 王飞
Original assignee: Data Space Research Institute
Current assignee: Data Space Research Institute
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2023-08-18

Abstract

The invention relates to the technical field of medical data diagnosis and machine learning, and in particular to a medical data diagnosis model, method, system and memory based on federated learning. The medical data diagnosis model is learned through federated aggregation: the diagnosis models of the hospitals are trained jointly through federated learning without sharing any hospital database, and noise and perturbation are added during the aggregation of the local model parameters, which keeps the training of the diagnosis model private and avoids the risk of medical data leakage; at the same time, federated learning improves the training accuracy of the model. Noise is added to the global parameter and to the local model parameters respectively, which further reduces the possibility of obtaining the data of any medical unit by analyzing the model, and adding noise multiple times reduces the privacy budget of model training.

Description

Medical data diagnosis model, method, system and memory based on federated learning

Technical Field

The invention relates to the technical field of medical data diagnosis and machine learning, and in particular to a medical data diagnosis model, method, system and memory based on federated learning.

Background

With the popularization of machine learning, the automatic diagnosis, prediction and classification of diagnosis and treatment data have opened a new direction for the medical field. However, the medical databases of the individual hospitals are independent of one another and medical data is highly private, so a shared medical database cannot be established. At present, many large hospitals train medical data diagnosis models on their own medical databases. Because the data of a single hospital is limited and different hospitals specialize in different areas, a model learned only from the database of a single hospital performs poorly.

Federated learning is a form of distributed machine learning that significantly reduces the exposure of clients' private data. Nonetheless, private information can still be revealed by analyzing the parameters uploaded by a client, such as the weights trained in a deep neural network.

Disclosure of Invention

In the prior art, medical data cannot be shared, which limits the machine learning of medical data diagnosis models. To overcome this defect, the invention provides a training method of a medical data diagnosis model based on federated learning, which obtains a high-precision medical data diagnosis model through machine learning while ensuring privacy and security.

The training method of a medical data diagnosis model based on federated learning provided by the invention comprises the following steps:

S1. Acquire the participants, each of which has a local medical database; take the part to be trained in the medical data diagnosis model of each participant as its local model; the local model performs diagnosis on the input medical data to obtain a diagnosis result; the local models of all participants have the same structure;

S2. The server assigns the global parameter w(0) to the local models of all participants, and each participant trains its medical data diagnosis model locally; after the local training is finished, the parameter of the local model of the i-th participant is recorded as w(i,0); w(0) is the initialized global parameter;

S3. At time t, each local model uploads its parameter w(i,t) to the server; the server performs federated aggregation to obtain the global parameter w(t,0), adds noise to w(t,0), and records the noised global parameter as w(t,1); the initial value of t is 0;

S4. Each local model updates its parameters using the local medical database, the global parameter w(t,1) and a set optimization objective; the updated parameter of the local model of the i-th participant is w(i,t+1,1);

S5. Judge whether t+1 is greater than or equal to a set value T; if yes, the training of each local model is complete, and each local model is substituted into the medical data diagnosis model of the corresponding participant so as to fix the medical data diagnosis model of each participant; if not, execute step S6;

S6. Add noise and perturbation to the parameter w(i,t+1,1) of each local model, and record the parameter of the local model of the i-th participant after the noise and perturbation are added as w(i,t+1); update t to t+1 and return to S3;

$$w(i,t+1)=|w(i,t+1,1)|\times L(r)+n_D(i,t+1)$$

where L(r) and r are intermediate terms, $r=w(i,t+1,1)/|w(i,t+1,1)|$;

the value of L(r) is determined as follows:

take a random number x in [0, 1];

if $x<(e^{\varepsilon}-1)/(e^{\varepsilon}+1)$, L(r) is a random number taken from the interval $[\,r/2-(C-1)/2,\ r/2+(C-1)/2\,]$;

if $x\ge(e^{\varepsilon}-1)/(e^{\varepsilon}+1)$, L(r) is a random number taken from the interval $[\,-(r/2+(C-1)/2),\ (C-1)/2-r/2\,]$;

C is an intermediate term, $C=(e^{\varepsilon}+1)/(e^{\varepsilon}-1)$; ε is the set privacy budget; $n_D(i,t+1)$ is the noise added at iteration t+1 to the local model of the i-th participant; e is the natural constant (the base of the natural logarithm).
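The perturbation rule of S6 can be illustrated with a short Python sketch. It is not the patent's implementation and rests on several assumptions that the text does not state: the rule is applied to each scalar weight separately, so that |w| is an absolute value and r is the sign of the weight; "a random number is taken" is read as a uniform draw; r is set to 0 for a zero weight; and the standard deviation `sigma` of n_D(i,t+1) is supplied by the caller. The function names are hypothetical.

```python
import math
import random


def sample_L(r, epsilon, rng):
    """Draw L(r) with the two-branch rule of S6 (uniform draws are assumed)."""
    C = (math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)
    half = (C - 1.0) / 2.0
    threshold = (math.exp(epsilon) - 1.0) / (math.exp(epsilon) + 1.0)
    x = rng.random()  # random number in [0, 1)
    if x < threshold:
        low, high = r / 2.0 - half, r / 2.0 + half
    else:
        low, high = -(r / 2.0 + half), half - r / 2.0
    return rng.uniform(low, high)


def perturb_weights(weights, epsilon, sigma, rng):
    """Apply w(i,t+1) = |w(i,t+1,1)| * L(r) + n_D(i,t+1) to each scalar weight."""
    out = []
    for w in weights:
        magnitude = abs(w)
        # r = w / |w|; using r = 0 for a zero weight is an assumption of this sketch.
        r = w / magnitude if magnitude > 0 else 0.0
        noise = rng.gauss(0.0, sigma)  # n_D(i,t+1) ~ N(0, sigma^2)
        out.append(magnitude * sample_L(r, epsilon, rng) + noise)
    return out


rng = random.Random(0)
perturbed = perturb_weights([0.3, -1.2, 0.0], epsilon=0.5, sigma=0.1, rng=rng)
```

Because both intervals have width C-1, every draw of L(r) has magnitude at most 1/2 + (C-1)/2 when r is +1 or -1, so a smaller privacy budget (larger C) widens the range of the perturbed weight.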

Preferably, the global parameter update formula in S3 is:

$$w(t,0)=\sum_{i=1}^{N} w(i,t)$$

$$w(t,1)=w(t,0)+n_D(t)$$

N is the number of participants and i is the participant index; $n_D(t)$ is the set noise.

Preferably, the noise $n_D(t)$ follows a Gaussian distribution $N(0,\sigma^2)$ with expectation 0 and variance $\sigma^2$; $\sigma^2=[2\times\ln(1.25/\delta)]/\varepsilon^2$, where δ is the set differential privacy significance level and ε is the set privacy budget.
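As a minimal sketch of the S3 update, the following Python code sums the uploaded parameter vectors and adds Gaussian noise with the variance given above. Representing parameters as plain lists of floats and drawing independent noise for every coordinate are assumptions of this sketch; the function names are hypothetical.

```python
import math
import random


def gaussian_variance(epsilon, delta):
    """sigma^2 = 2 * ln(1.25 / delta) / epsilon^2, as given in the text."""
    return 2.0 * math.log(1.25 / delta) / (epsilon ** 2)


def aggregate_global(local_params, epsilon, delta, rng):
    """Return (w(t,0), w(t,1)) for a list of equal-length parameter vectors."""
    dim = len(local_params[0])
    # w(t,0): element-wise sum over the N participants, as the formula is written.
    w_t0 = [sum(p[k] for p in local_params) for k in range(dim)]
    sigma = math.sqrt(gaussian_variance(epsilon, delta))
    # w(t,1): add n_D(t) ~ N(0, sigma^2); per-coordinate independence is assumed.
    w_t1 = [v + rng.gauss(0.0, sigma) for v in w_t0]
    return w_t0, w_t1


w_t0, w_t1 = aggregate_global([[1.0, 2.0], [3.0, 4.0]], 0.5, 0.1, random.Random(1))
```

Since ε appears squared in the denominator, halving the privacy budget quadruples the noise variance.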

Preferably, in S1, the participants are divided into aggregation objects and receivers, and the aggregation objects are divided into J levels; w(t,1) in S3 is calculated as follows:

$$w(j,t,c)=\Big[\sum_{i\in Z_j} w(i,t)\Big]/n(Z_j)$$

$$w(t,0)=\Big[\sum_{j=1}^{J} p(j)\times w(j,t,c)\Big]/J$$

$$w(t,1)=w(t,0)+n_D(t)$$

w(j,t,c) is the aggregation parameter of the j-th level, $Z_j$ is the set of participants that serve as aggregation objects within the j-th level, and $n(Z_j)$ is the number of participants in $Z_j$; c denotes hierarchical aggregation; J is the number of levels, p(j) is the set weight of the j-th level, and 1 ≤ j ≤ J; $n_D(t)$ is the set noise.
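The hierarchical aggregation can be sketched in Python as follows. The data layout (participant ids mapped to lists of floats, one list of ids per level) and the function name are assumptions made for illustration; the noise n_D(t) would be added to the returned w(t,0) to obtain w(t,1).

```python
def aggregate_levels(level_members, level_weights, params):
    """Compute w(t,0) from per-level means, weighted by p(j) and divided by J.

    level_members: list of J lists of participant ids (the sets Z_j)
    level_weights: list of J weights p(j)
    params: dict mapping participant id -> parameter vector w(i,t)
    """
    J = len(level_members)
    dim = len(next(iter(params.values())))
    w_t0 = [0.0] * dim
    for members, p_j in zip(level_members, level_weights):
        # w(j,t,c): mean of the parameters of the participants in Z_j
        level_mean = [sum(params[i][k] for i in members) / len(members)
                      for k in range(dim)]
        for k in range(dim):
            w_t0[k] += p_j * level_mean[k] / J
    return w_t0


params = {"a": [1.0, 3.0], "b": [3.0, 5.0], "c": [10.0, 20.0]}
w_t0 = aggregate_levels([["a", "b"], ["c"]], [1.0, 1.0], params)
```

With this layout, the server only needs one averaged vector per level, and the weights p(j) control how strongly each level influences the global parameter.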

Preferably, the optimization objective in S4 is to minimize the function $F(w(i))+\gamma\times|w(i)-w(t,1)|$; w(i) is the parameter of the local model of the i-th participant, and F(w(i)) is the loss of the local model of the i-th participant at the parameter w(i); γ is the set regularization parameter.
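A minimal sketch of one way to approach this objective is given below. The text does not specify which norm |·| denotes or which optimizer is used; this sketch assumes the Euclidean norm and plain subgradient descent, and the loss and gradient callables (`loss_fn`, `grad_fn`) as well as the step size and step count are hypothetical stand-ins for the local training of a real diagnosis model.

```python
import math


def objective(w, w_global, gamma, loss_fn):
    """F(w) + gamma * |w - w(t,1)|, reading |.| as the Euclidean norm (assumed)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_global)))
    return loss_fn(w) + gamma * dist


def local_update(w_start, w_global, gamma, loss_fn, grad_fn, lr=0.05, steps=200):
    """Approximate the argmin by subgradient descent and keep the best iterate."""
    w = list(w_start)
    best_w, best_val = list(w), objective(w, w_global, gamma, loss_fn)
    for _ in range(steps):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_global)))
        g = grad_fn(w)
        for k in range(len(w)):
            # Subgradient of the distance term; taken as 0 when w equals w(t,1).
            prox = gamma * (w[k] - w_global[k]) / dist if dist > 0 else 0.0
            w[k] -= lr * (g[k] + prox)
        val = objective(w, w_global, gamma, loss_fn)
        if val < best_val:
            best_w, best_val = list(w), val
    return best_w


# Toy quadratic loss standing in for the diagnosis loss F.
toy_loss = lambda w: sum((x - 1.0) ** 2 for x in w)
toy_grad = lambda w: [2.0 * (x - 1.0) for x in w]
w_next = local_update([0.0, 0.0], [0.0, 0.0], 0.1, toy_loss, toy_grad)
```

The regularization term keeps each local model close to the noised global parameter; a larger γ pulls the result further toward w(t,1).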

Preferably, $n_D(i,t+1)$ follows a Gaussian distribution $N(0,\sigma(i)^2)$ with expectation 0 and variance $\sigma(i)^2$;

$$\sigma(i)^2=[2\times\ln(1.25/\delta)]/[\varepsilon^2\times N\times m(i)]$$

where δ is the set differential privacy significance level, N is the number of participants, and m(i) is the sensitivity of the medical data diagnosis model of the i-th participant.
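The local noise variance can be computed directly from the formula. In the short Python sketch below, the function name and the example values (5 participants, sensitivity 1.0, ε = 0.5, δ = 0.1) are assumptions chosen only for illustration.

```python
import math


def local_noise_variance(epsilon, delta, num_participants, sensitivity):
    """sigma(i)^2 = 2 * ln(1.25 / delta) / (epsilon^2 * N * m(i))."""
    return 2.0 * math.log(1.25 / delta) / (
        epsilon ** 2 * num_participants * sensitivity)


# Example with assumed values: N = 5 participants, sensitivity m(i) = 1.0
var_i = local_noise_variance(epsilon=0.5, delta=0.1,
                             num_participants=5, sensitivity=1.0)
sigma_i = math.sqrt(var_i)  # standard deviation used to draw n_D(i,t+1)
```

Compared with the global variance 2·ln(1.25/δ)/ε², the local variance is divided by N·m(i), so it is the smaller of the two whenever N·m(i) exceeds 1.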

Preferably, in S4, at time t, the server sends the global parameter w(t,1) to the local model of each participant; the local model parameter is first updated to the mean of the current model parameter w(i,t) and the global parameter w(t,1), and the medical data diagnosis model is then trained locally using the optimization objective and the local medical database.

The invention also provides a medical data diagnosis method based on federated learning, which uses the medical data diagnosis model to diagnose medical data and improve the level of medical service, and comprises the following steps:

SA1. A medical unit, acting as a participant, executes the training method of the medical data diagnosis model based on federated learning so as to complete the training of its medical data diagnosis model;

SA2. The medical unit inputs the medical data to be diagnosed into the trained medical data diagnosis model and takes the output of the medical data diagnosis model as the diagnosis result.

The invention also provides a medical data diagnosis system and a memory based on federated learning, which provide a carrier for the diagnosis method and facilitate the application and popularization of the medical data diagnosis model. The system comprises a memory and a processor; the memory stores a computer program, the processor is connected to the memory, and the processor executes the computer program to implement the medical data diagnosis method based on federated learning.

The invention also provides a memory storing a computer program which, when executed, implements the medical data diagnosis method based on federated learning.

The invention has the advantages that:

(1) The medical data diagnosis models of the hospitals are trained jointly through federated learning without sharing any hospital database, and noise and perturbation are added during the aggregation of the local model parameters, which keeps the training of the diagnosis model private and avoids the risk of medical data leakage; at the same time, federated learning improves the training accuracy of the model.

(2) Noise is added to the global parameter and to the local model parameters respectively, which further reduces the possibility of obtaining the data of any medical unit by analyzing the model, and adding noise multiple times reduces the privacy budget of model training.

(3) Dividing the participants and aggregating them by level reduces the difficulty and the amount of computation of parameter aggregation and improves computational efficiency. In addition, the division into levels makes it possible to control the data weights of different medical units, which improves the generalization of the trained global parameters and, in turn, the accuracy and generalization of the medical data diagnosis model of each participant.

(4) In a multi-user scenario, the invention reduces the loss of model accuracy under a smaller privacy budget and improves the usability of the model. With privacy protected, the training method provided by the invention obtains more accurate global parameters and model diagnosis results, and performs better in a multi-user scenario.

Drawings

FIG. 1 shows the training method of the medical data diagnosis model based on federated learning;

FIG. 2 is a statistical diagram of test accuracy according to an embodiment.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Noun definition:

Medical database: used for storing medical test information labeled with diagnosis results; the medical test information is a test report, such as a specialist test report or a physical examination report; the diagnosis results are disease categories, such as diabetes, diabetic complications, kidney stones and hypertension.

Medical data diagnosis model: a neural network model whose input is medical test information and whose output is a diagnosis result.

First training method of the medical data diagnosis model based on federated learning

The training method of the medical data diagnosis model based on federated learning provided in this embodiment trains the medical data diagnosis model without the hospitals sharing their medical data.

Referring to fig. 1, the training method includes the following steps S1 to S6.

S1, acquiring a participant, wherein the participant has a local medical database; acquiring a part to be trained in a medical data diagnosis model of a participant as a local model; the local model performs diagnosis based on the input medical data to obtain a diagnosis result; the local model structure of each participant is the same.

Specifically, the local model can be provided in two ways.

First medical data diagnosis model: its input is preprocessed medical data and its output is a diagnosis result; the whole of the first medical data diagnosis model is the local model.

Second medical data diagnosis model: it comprises a preprocessing module and a diagnosis module; the preprocessing module preprocesses the input medical data, the input of the diagnosis module is the output data of the preprocessing module, and the output of the diagnosis module is the diagnosis result. In the second medical data diagnosis model, the preprocessing module is a pre-trained model and only the parameters of the diagnosis module are updated during training; that is, only the diagnosis module serves as the local model.

In this embodiment, data preprocessing includes data cleaning, data conversion and data standardization. Data cleaning removes null values, abnormal values, repeated values and the like, which ensures the accuracy and integrity of the data; data conversion includes numericalization, encoding, feature selection and the like, which facilitates algorithm training and analysis; data standardization includes normalization, dimension reduction and the like. All of these preprocessing steps serve to improve the efficiency and accuracy of model training. Any existing data preprocessing method, or any combination of existing methods, may be used, and they are not described further here.
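The three preprocessing stages can be illustrated with a small Python sketch. The text leaves the concrete methods open, so the choices here (dropping rows with null values and exact duplicates, integer encoding of categories, min-max normalization) are only one possible combination, and the record layout and field names are hypothetical.

```python
def preprocess(records, numeric_keys, categorical_keys):
    """Clean, encode and min-max normalize a list of dict records."""
    keys = numeric_keys + categorical_keys
    seen, cleaned = set(), []
    for rec in records:
        row = tuple(rec.get(k) for k in keys)
        if any(v is None for v in row) or row in seen:
            continue  # cleaning: drop rows with null values and exact duplicates
        seen.add(row)
        cleaned.append(dict(zip(keys, row)))
    if not cleaned:
        return cleaned
    # Conversion: map each category to an integer code in sorted order.
    for k in categorical_keys:
        codes = {v: idx for idx, v in enumerate(sorted({r[k] for r in cleaned}))}
        for r in cleaned:
            r[k] = codes[r[k]]
    # Standardization: min-max scale each numeric field to [0, 1].
    for k in numeric_keys:
        lo = min(r[k] for r in cleaned)
        hi = max(r[k] for r in cleaned)
        for r in cleaned:
            r[k] = (r[k] - lo) / (hi - lo) if hi > lo else 0.0
    return cleaned


rows = [
    {"glucose": 5.0, "sex": "F"},
    {"glucose": 9.0, "sex": "M"},
    {"glucose": None, "sex": "M"},  # removed: null value
    {"glucose": 5.0, "sex": "F"},   # removed: duplicate
]
clean_rows = preprocess(rows, ["glucose"], ["sex"])
```

Abnormal-value removal, feature selection and dimension reduction are omitted from this sketch; they would be added as further stages in the same pipeline.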

S2, the server gives global parameters w (0) to the local models of all the participants, the medical data diagnosis models of the participants perform local training, and after the local training is finished, the parameters of the local models of the ith participant are marked as w (i, 0); w (0) is an initialized global parameter.

In particular, w (0) is a random initialization parameter or a weighted average of local models in the existing medical data diagnostic model of each participant.

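S3. At time t, each local model uploads its parameter w(i,t) to the server; the server performs federated aggregation to obtain the global parameter w(t,0), adds noise to w(t,0), and records the noised global parameter as w(t,1); the initial value of t is 0. The global parameter is updated as:

$$w(t,0)=\sum_{i=1}^{N} w(i,t)$$

$$w(t,1)=w(t,0)+n_D(t)$$

where N is the number of participants, i is the participant index, and $n_D(t)$ is the set noise.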

Specifically, in this embodiment, the noise $n_D(t)$ follows a Gaussian distribution $N(0,\sigma^2)$ with expectation 0 and variance $\sigma^2$, where:

$$\sigma^2=[2\times\ln(1.25/\delta)]/\varepsilon^2$$

δ is the set differential privacy significance level and ε is the set privacy budget.

S4. Each local model updates its parameters using the local medical database, the global parameter w(t,1) and the optimization objective; the updated parameter of the local model of the i-th participant is w(i,t+1,1).

The optimization objective is to minimize the function $F(w(i))+\gamma\times|w(i)-w(t,1)|$, which is expressed as:

$$w(i,t+1,1)\leftarrow\arg\min_{w(i)}\{F(w(i))+\gamma\times|w(i)-w(t,1)|\}$$

w(i) is the parameter of the local model of the i-th participant; arg min over w(i) denotes the value of w(i) at which the expression in braces reaches its minimum; F(w(i)) is the loss of the local model of the i-th participant at the parameter w(i); γ is the set regularization parameter.

It is worth noting that only the local model in the medical data diagnosis model needs to be subjected to parameter updating, so that the local training of the medical data diagnosis model is the local training of the local model, and the loss of the local model is the loss of the medical data diagnosis model.

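S5. Judge whether t+1 is greater than or equal to the set value T; if yes, the training of each local model is complete, and each local model is substituted into the medical data diagnosis model of the corresponding participant so as to fix the medical data diagnosis model of each participant; if not, execute step S6.

S6. Add noise and perturbation to the parameter w(i,t+1,1) of each local model, and record the parameter of the local model of the i-th participant after the noise and perturbation are added as w(i,t+1); update t to t+1 and return to S3.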

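Specifically, w(i,t+1) in S6 is obtained as follows:

$$w(i,t+1)=|w(i,t+1,1)|\times L(r)+n_D(i,t+1)$$

where L(r) and r are intermediate terms, $r=w(i,t+1,1)/|w(i,t+1,1)|$;

the value of L(r) is determined as follows:

take a random number x in [0, 1];

if $x<(e^{\varepsilon}-1)/(e^{\varepsilon}+1)$, L(r) is a random number taken from the interval $[\,r/2-(C-1)/2,\ r/2+(C-1)/2\,]$;

if $x\ge(e^{\varepsilon}-1)/(e^{\varepsilon}+1)$, L(r) is a random number taken from the interval $[\,-(r/2+(C-1)/2),\ (C-1)/2-r/2\,]$;

C is an intermediate term, $C=(e^{\varepsilon}+1)/(e^{\varepsilon}-1)$; ε is the set privacy budget; e is the natural constant (the base of the natural logarithm).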

$n_D(i,t+1)$ is the noise added at iteration t+1 to the local model of the i-th participant; it follows a Gaussian distribution $N(0,\sigma(i)^2)$ with expectation 0 and variance $\sigma(i)^2$, where:

$$\sigma(i)^2=[2\times\ln(1.25/\delta)]/[\varepsilon^2\times N\times m(i)]$$
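The control flow of steps S2 to S6 can be summarized in a short Python sketch. It shows only the order of the steps and the stopping rule; the three callables `local_train`, `aggregate` and `perturb` are hypothetical hooks standing in for local training, the noised server aggregation of S3 and the perturbation of S6, and are not part of the patent text.

```python
def federated_training(num_participants, w0, T, local_train, aggregate, perturb):
    """Control flow of S2 to S6; the three callables are hypothetical hooks.

    local_train(i, w_ref) -> parameters of participant i after local training
    aggregate(list_of_params) -> noised global parameter w(t,1)
    perturb(i, params) -> parameters with noise and perturbation added
    """
    # S2: every participant starts from w(0) and trains locally -> w(i,0)
    uploads = [local_train(i, w0) for i in range(num_participants)]
    t = 0
    while True:
        # S3: the server aggregates the uploads and adds noise -> w(t,1)
        w_global = aggregate(uploads)
        # S4: each participant updates against w(t,1) -> w(i,t+1,1)
        updated = [local_train(i, w_global) for i in range(num_participants)]
        # S5: stop once t+1 reaches T; the local models are then fixed
        if t + 1 >= T:
            return updated, w_global
        # S6: perturb before the next upload -> w(i,t+1), then t <- t+1
        uploads = [perturb(i, w) for i, w in enumerate(updated)]
        t += 1


# Toy hooks, for illustration only.
final_models, final_global = federated_training(
    num_participants=2, w0=[0.0], T=3,
    local_train=lambda i, w: [w[0] + i],
    aggregate=lambda ps: [sum(p[0] for p in ps)],
    perturb=lambda i, w: w,
)
```

Note that the parameters returned at S5 are the unperturbed w(i,t+1,1): perturbation is applied only to what is uploaded for the next round.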

Second training method of the medical data diagnosis model based on federated learning

The second training method improves on the first training method. Specifically, steps S1 to S3 are modified relative to the first training method, and the modified steps are implemented as follows.

S1. Acquire the participants, select aggregation objects from the participants, and divide the aggregation objects into J levels; the participants other than the aggregation objects serve as receivers. Each participant has a local medical database and a medical data diagnosis model, and the medical data diagnosis model diagnoses medical data to obtain diagnosis results. The part to be trained in the medical data diagnosis model serves as the local model; the local models of all participants have the same structure.

S2. The server sends the global parameter w(0) to the local model of each participant; the medical data diagnosis model updates its model parameters to the mean of the current model parameter w(i,0) and the global parameter w(0), and is then trained locally on the local medical database; after the local training is finished, the parameter of the local model of the i-th participant is recorded as w(i,0); w(0) is the initialized global parameter.

S3. At time t, the local models of the participants within each level are aggregated within that level to obtain the parameter of each level; let the level parameter of the j-th level be w(j,t,c). The server performs federated aggregation on the level parameters to obtain the global parameter w(t,0), adds noise to w(t,0), and records the noised global parameter as w(t,1); the initial value of t is 0.

$$w(j,t,c)=\Big[\sum_{i\in Z_j} w(i,t)\Big]/n(Z_j)$$

$Z_j$ is the set of participants within the j-th level, and $n(Z_j)$ is the number of participants in $Z_j$; c denotes hierarchical aggregation;

$$w(t,0)=\Big[\sum_{j=1}^{J} p(j)\times w(j,t,c)\Big]/J$$

$$w(t,1)=w(t,0)+n_D(t)$$

J is the number of levels, p(j) is the set weight of the j-th level, and 1 ≤ j ≤ J; $n_D(t)$ is the set noise.

The medical data diagnosis model obtained by the first training method of the medical data diagnosis model based on federated learning provided by the invention is verified below with a specific embodiment.

In the present embodiment, the differential privacy significance level δ=0.1 is set.

In this embodiment, 3 tertiary hospitals and 2 secondary hospitals are selected as participants. The local medical database of each participant is an Excel relational database, that is, the medical test information is presented in Excel tables.

In this embodiment, the medical data diagnosis model of each participant is defined as a second medical data diagnosis model.

In this embodiment, the local model of each participant is trained with the first training method of the medical data diagnosis model based on federated learning described above, so as to obtain the final medical data diagnosis model of each participant.

Finally, the server substitutes the local model parameters of the final medical data diagnosis model of each participant into S3 and calculates the global parameter w(t,1) as the final global parameter. A medical test model is constructed from the final global parameter and is used to diagnose the medical data of any primary hospital. The medical test model is a medical data diagnosis model whose local model adopts the final global parameter.

In this embodiment, the medical databases of the participants are divided into a training set and a testing set; in the federal learning process, the local model learns only the training set to update the model parameters.

In this embodiment, two evaluation indexes, namely, training accuracy and testing accuracy, are constructed.
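The split into training and test sets and the two accuracy indexes can be sketched in Python as follows. The text gives neither the split ratio nor the exact accuracy definition, so the 20 % test fraction, the shuffled split and the fraction-of-correct-diagnoses metric are assumptions of this sketch; `model` is any hypothetical callable that maps a sample's features to a diagnosis label.

```python
import random


def split_dataset(samples, test_fraction, rng):
    """Shuffle (features, label) pairs and split them into train and test sets."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, dataset):
    """Fraction of (features, label) pairs for which model(features) == label."""
    if not dataset:
        return 0.0
    correct = sum(1 for features, label in dataset if model(features) == label)
    return correct / len(dataset)


# Toy data and toy model, for illustration only.
samples = [(x, x % 2) for x in range(10)]
train_set, test_set = split_dataset(samples, 0.2, random.Random(0))
toy_model = lambda features: features % 2
training_accuracy = accuracy(toy_model, train_set)
test_accuracy = accuracy(toy_model, test_set)
```

Training accuracy is evaluated on the data the local model learned from, while test accuracy is evaluated on held-out data that never enters the federated learning process.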

In this example, the diagnostic accuracy of each level of hospitals was counted at privacy budgets of 0.7, 0.6, 0.5 and 0.4, respectively, as shown in tables 1 and 2 below.

Table 1: Accuracy of the medical data diagnosis model at hospitals of each level

In Table 1, the tertiary hospital training accuracy average is the average diagnosis accuracy of the medical data diagnosis models of the tertiary hospitals on their corresponding training sets;

the tertiary hospital test accuracy average is the average diagnosis accuracy of the final medical data diagnosis models of the tertiary hospitals on their corresponding test sets;

the secondary hospital training accuracy average is the average diagnosis accuracy of the final medical data diagnosis models of the secondary hospitals on their corresponding training sets;

the secondary hospital test accuracy average is the average diagnosis accuracy of the final medical data diagnosis models of the secondary hospitals on their corresponding test sets;

the primary hospital test accuracy average is the average diagnosis accuracy of the medical test model on the medical databases of a plurality of primary hospitals;

none of the primary hospitals is a participant.
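For the settings of this embodiment (δ = 0.1 and privacy budgets of 0.7, 0.6, 0.5 and 0.4), the noise variance of the global parameter and the intermediate term C follow directly from the formulas of the training method. The Python sketch below only evaluates those formulas; the function name is hypothetical and no measured results are implied.

```python
import math


def noise_settings(epsilon, delta):
    """Return (sigma^2, C, threshold) for one privacy budget."""
    sigma_sq = 2.0 * math.log(1.25 / delta) / epsilon ** 2
    C = (math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)
    threshold = 1.0 / C  # equals (e^eps - 1) / (e^eps + 1)
    return sigma_sq, C, threshold


delta = 0.1  # significance level set in this embodiment
for epsilon in (0.7, 0.6, 0.5, 0.4):
    sigma_sq, C, threshold = noise_settings(epsilon, delta)
    print(f"epsilon={epsilon}: sigma^2={sigma_sq:.3f}, "
          f"C={C:.3f}, threshold={threshold:.3f}")
```

As the privacy budget decreases, both the Gaussian noise variance and C grow, so the global noise becomes stronger and the interval from which L(r) is drawn becomes wider.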

As can be seen with reference to Table 2, with the training method provided by the invention, the test accuracy of the participants' medical data diagnosis models is higher than 82% at the lowest privacy budget of 0.4, that is, at the highest level of security, and the model accuracy at the non-participating primary hospitals still reaches 78%. At a higher privacy budget, that is, at a lower level of security, the test accuracy of the participants' medical data diagnosis models is higher than 88%, and the model accuracy at the non-participating primary hospitals reaches 86%. The training method provided by the invention therefore achieves high-precision model training without sharing medical data, and the final global parameters obtained generalize well.

It will be understood by those skilled in the art that the present invention is not limited to the details of the foregoing exemplary embodiments, but includes other specific forms of the same or similar structures that may be embodied without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way for clarity only; those skilled in the art should take the specification as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

The technology, shape, and construction parts of the present invention, which are not described in detail, are known in the art.

Claims

1. A training method of a medical data diagnosis model based on federated learning, characterized by comprising the following steps:

$$w(i,t+1)=|w(i,t+1,1)|\times L(r)+n_D(i,t+1)$$

where L(r) and r are intermediate terms, $r=w(i,t+1,1)/|w(i,t+1,1)|$;

the value of L(r) is determined as follows:

take a random number x in [0, 1];

2. The training method of a medical data diagnosis model based on federated learning according to claim 1, wherein the global parameter update formula in S3 is:

3. The training method of a medical data diagnosis model based on federated learning according to claim 2, wherein the noise $n_D(t)$ follows a Gaussian distribution $N(0,\sigma^2)$ with expectation 0 and variance $\sigma^2$; $\sigma^2=[2\times\ln(1.25/\delta)]/\varepsilon^2$, where δ is the set differential privacy significance level and ε is the set privacy budget.

4. The training method of a medical data diagnosis model based on federated learning according to claim 1, wherein in S1 the participants are divided into aggregation objects and receivers, and the aggregation objects are divided into J levels; w(t,1) in S3 is calculated as follows:

$$w(j,t,c)=\Big[\sum_{i\in Z_j} w(i,t)\Big]/n(Z_j)$$

$$w(t,0)=\Big[\sum_{j=1}^{J} p(j)\times w(j,t,c)\Big]/J$$

$$w(t,1)=w(t,0)+n_D(t)$$

5. The training method of a medical data diagnosis model based on federated learning according to claim 1, wherein the optimization objective in S4 is to minimize the function $F(w(i))+\gamma\times|w(i)-w(t,1)|$; w(i) is the parameter of the local model of the i-th participant, and F(w(i)) is the loss of the local model of the i-th participant at the parameter w(i); γ is the set regularization parameter.

6. The training method of a medical data diagnosis model based on federated learning according to claim 1, wherein $n_D(i,t+1)$ follows a Gaussian distribution $N(0,\sigma(i)^2)$ with expectation 0 and variance $\sigma(i)^2$;

$$\sigma(i)^2=[2\times\ln(1.25/\delta)]/[\varepsilon^2\times N\times m(i)]$$

where δ is the set differential privacy significance level; N is the number of participants; m(i) is the sensitivity of the medical data diagnosis model of the i-th participant.

7. The training method of a medical data diagnosis model based on federated learning according to claim 5, wherein in S4, at time t, the server sends the global parameter w(t,1) to the local model of each participant, the local model parameter is updated to the mean of the current model parameter w(i,t) and the global parameter w(t,1), and the medical data diagnosis model is then trained locally using the optimization objective and the local medical database.

8. A medical data diagnosis method based on federated learning, comprising the steps of:

SA1. a medical unit, acting as a participant, executes the training method of a medical data diagnosis model based on federated learning according to any one of claims 1 to 7 to complete the training of the medical data diagnosis model of the medical unit;

9. A medical data diagnosis system based on federated learning, comprising a memory having a computer program stored therein and a processor coupled to the memory for executing the computer program to implement the medical data diagnosis method based on federated learning of claim 8.

10. A memory, characterized in that it stores a computer program which, when executed, implements the medical data diagnosis method based on federated learning according to claim 8.