CN115310130B - Multi-site medical data analysis method and system based on federated learning
- Publication number
- CN115310130B (application CN202210972939.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/20—ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
The invention relates to a multi-site medical data analysis method and system based on federated learning. The method comprises the following steps: obtaining a plurality of source site models by local learning on a plurality of source site data sets; sending the model parameters of each source site model to a target site, and obtaining, for each source site, the pseudo labels of the target site data set and the predicted values of the source site model; integrating the predicted values to obtain a predicted value set, which is used as an auxiliary data set; building an initial model and training it on the auxiliary data set to obtain a trained auxiliary model; aggregating each source site model and the auxiliary model to obtain a target model; obtaining the target model parameters and optimizing the parameters of each source site model according to them to obtain optimized source site models, which are used to perform medical analysis on the sample data to be detected. The invention enables data sharing without privacy disclosure and improves the accuracy of medical data analysis.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-site medical data analysis method and system based on federated learning.
Background
With the continuous development of machine learning technology, machine learning is now widely applied in the medical field, mainly to the analysis and recognition of biomedical data and images. Machine learning has expanded the boundaries of medical research to some extent and has greatly helped the development of modern medicine.
An important topic combining machine learning with medical research is neuroimaging analysis, which can help medical researchers better reveal the underlying pathological mechanisms of brain diseases and has proven effective in brain disease diagnosis. However, because neuroimaging data are costly to acquire and label, it is difficult for a single medical institution to obtain enough data to train a robust analysis model.
In recent years, to overcome the small sample size of a single site, researchers have proposed many machine learning methods that fuse multi-site data. These methods generally assume that the samples at each site follow the same distribution and train the predictive model directly on samples from all sites. However, due to differences in scanners and scanning protocols between sites, multi-site fusion diagnostic methods often suffer from data heterogeneity, which can severely degrade model performance. To address this problem, researchers have employed multi-source domain adaptation (MSDA) methods to learn domain-invariant representations. MSDA can align the source domain features and the target domain features in a common feature space, thereby improving the generalization of the diagnostic model. MSDA has been used in multi-site brain disease diagnosis.
However, with the continuous growth of big data, leaks of users' private data have become frequent, and many countries, aware of the importance of privacy protection, have begun to strengthen their privacy protection systems through legislation. Traditional multi-site data sharing learning methods need to access the data of the source domain and the target domain locally, which undoubtedly increases the risk of data leakage and ignores the requirement to protect patient privacy in clinical diagnosis. Therefore, there is an urgent need to develop a multi-site medical data analysis method that satisfies privacy protection regulations.
Disclosure of Invention
The invention aims to provide a multi-site medical data analysis method and system based on federated learning, which can realize data sharing without privacy disclosure and improve the accuracy of medical data analysis.
In order to achieve the above object, the present invention provides the following solutions:
the invention provides a multi-site medical data analysis method based on federated learning, which comprises the following steps:
acquiring a plurality of source site data sets, and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data are resting-state functional magnetic resonance imaging images divided into a plurality of brain regions;
sending the model parameters of each source site model to a target site, and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which pathological changes of each brain region in the resting-state functional magnetic resonance imaging image influence the disease;
integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
building an initial model, and training the initial model according to the auxiliary data set to obtain a trained auxiliary model;
aggregating each source site model and the auxiliary model to obtain a target model;
obtaining model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
Optionally, the building an initial model and training the initial model according to the auxiliary data set to obtain a trained auxiliary model specifically includes:
training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model;
and optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
Optionally, the optimizing the intermediate model by adopting a self-paced learning strategy specifically includes:
screening target samples in the target site data set by their credibility score to obtain target samples with a credibility score larger than a set threshold, and obtaining the predicted values of all source site models for those target samples; the credibility score is determined by the number of trusted source sites and the average probability of the pseudo labels of the target site data set;
optimizing the objective function by combining the pseudo labels of the target site and the loss function to obtain a final objective function; the final objective function is used to train the intermediate model.
Optionally, the loss function includes a classification loss function and a contrast loss function; wherein the expression of the per-sample classification loss is:
$L_{cla}(x_i) = l_c(-\log[F_{N+1}(x_i)],\ y_i)$;
wherein $x_i$ is a target sample in the target site, $i$ is the index of the unlabeled samples of the target site, $n_t$ indicates the number of unlabeled samples of the target site, $L_{cla}$ represents the cross-entropy loss of each sample (the classification loss function is taken over these samples), $l_c(\cdot)$ represents the cross-entropy loss, $F_{N+1}(x_i)$ represents the output of the auxiliary model, and $y_i$ is the pseudo label of sample $i$ in the target site;
the expression of the contrast loss function is:
[equation not preserved in the source text] wherein $L_{con}$ represents the contrast loss of each sample, and the contrast loss is computed against a reference sample of the samples of category $j$.
Optionally, the aggregating the source site model and the auxiliary model to obtain a target model specifically includes:
evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federated weight of each source site;
and aggregating each source site model and the auxiliary model according to the federated weights by adopting a weighted average strategy to obtain the target model.
Optionally, the obtaining the model parameters of the target model and carrying out parameter optimization on each source site model to obtain an optimized source site model specifically includes:
extracting the features of the target site through the target model to obtain the feature vector of the target site;
processing the feature vector of the target site with a rectified linear unit (ReLU) function and a hash function in sequence to obtain a target non-zero distribution vector;
extracting the features of each source site with the corresponding source site model to obtain the feature vector of each source site;
calculating the maximum mean discrepancy (MMD) distance between the target non-zero distribution vector and each source site feature vector, and optimizing the MMD distances through an MMD loss function;
based on the optimized MMD distances, each source site optimizes its own model parameters.
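The MMD comparison named in the steps above can be illustrated with a minimal sketch. It assumes a Gaussian (RBF) kernel and the biased estimator of the squared MMD between two sets of feature vectors; the text does not state which kernel or estimator the invention uses, and the names `rbf_kernel`, `mmd_squared` and `gamma` are hypothetical.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (kernel choice is an assumption)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between samples xs and ys."""

    def mean_kernel(p, q):
        # Average kernel value over all pairs drawn from p and q.
        return sum(rbf_kernel(a, b, gamma) for a in p for b in q) / (len(p) * len(q))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical feature vectors from a target site and a source site.
target_features = [[0.0, 0.0], [1.0, 0.0]]
source_features = [[5.0, 5.0], [6.0, 5.0]]
distance = mmd_squared(target_features, source_features)
```

A value near zero suggests the two feature distributions are similar; a source site would lower this quantity when using it as a loss term.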
In order to achieve the above purpose, the present invention also provides the following solutions:
a multi-site medical data analysis system based on federated learning, the system comprising:
the source site model determining unit is used for acquiring a plurality of source site data sets and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data are resting-state functional magnetic resonance imaging images divided into a plurality of brain regions;
the target site pseudo-label and source site model predicted value acquisition unit is used for sending the model parameters of each source site model to a target site and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which pathological changes of each brain region in the resting-state functional magnetic resonance imaging image influence the disease;
the auxiliary data set determining unit is used for integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
the auxiliary model building unit is used for building an initial model, training the initial model according to the auxiliary data set and obtaining a trained auxiliary model;
the target model determining unit is used for aggregating each source site model and the auxiliary model to obtain a target model;
the source site model optimizing unit is used for acquiring model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
Optionally, the auxiliary model building unit specifically includes:
the intermediate model determining subunit is used for training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model;
and the auxiliary model determining subunit is used for optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
Optionally, the auxiliary model determining subunit specifically includes:
the source site model predicted value acquisition module is used for screening samples of the target site by their credibility score to obtain target samples with a credibility score larger than a set threshold, and obtaining the predicted value of each source site model for those target samples;
the final objective function determining module is used for combining the pseudo labels of the target site and the loss function, optimizing the objective function and obtaining a final objective function; the final objective function is used to train the intermediate model.
Optionally, the target model determining unit specifically includes:
the federated weight determining subunit is used for evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site to obtain the federated weight of each source site;
and the model aggregation subunit is used for aggregating each source site model and the auxiliary model according to the federated weights by adopting a weighted average strategy to obtain the target model.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a multi-site medical data analysis method and system based on federated learning, wherein the method comprises the following steps: acquiring a plurality of source site data sets, and obtaining a plurality of source site models through local learning; sending the model parameters of each source site model to a target site, and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set as an auxiliary data set; building an initial model, and training the initial model according to the auxiliary data set to obtain a trained auxiliary model; aggregating each source site model and the auxiliary model to obtain a target model; obtaining model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for carrying out medical analysis on sample data to be detected. Because only model parameters, and not raw data, are exchanged between sites, the invention can realize data sharing without privacy disclosure and improve the accuracy of medical data analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the multi-site medical data analysis method based on federated learning according to the present invention;
FIG. 2 is a schematic block diagram of the multi-site medical data analysis system based on federated learning.
Symbol description:
source site model determining unit-1, target site pseudo-label and source site model predicted value acquisition unit-2, auxiliary data set determining unit-3, auxiliary model building unit-4, target model determining unit-5, and source site model optimizing unit-6.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a multi-site medical data analysis method and system based on federated learning, which can realize data sharing without privacy disclosure and improve the accuracy of medical data analysis.
In order that the above objects, features and advantages of the present invention may become more readily apparent, the invention is described in more detail below with reference to the accompanying drawings and the detailed description.
As shown in FIG. 1, the present invention provides a multi-site medical data analysis method based on federated learning, the method comprising:
S1: acquiring a plurality of source site data sets, and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data is a resting-state functional magnetic resonance imaging image divided into a plurality of brain regions. Specifically, assume that there are $N$ source sites $\{S_1, S_2, \ldots, S_N\}$ and one target site $T$. During federated learning, the $k$-th source site $S_k$ holds a data set of $n_k$ labeled samples, and the target site $T$ holds $n_t$ unlabeled samples $x_i$ ($i = 1, \ldots, n_t$), wherein each sample is a resting-state functional magnetic resonance imaging (rs-fMRI) image, each image consisting of time-series electrical signals measured from 90 brain regions.
S2: sending the model parameters of each source site model to a target site, and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which pathological changes of each brain region in the resting-state functional magnetic resonance imaging image influence the disease. Specifically, the target site holds the model $F_k$ of each source site. For the data set $T$ on the target site, each source site model $F_k$ gives a prediction result $y_k$ and a predicted probability. The prediction result is the pseudo label of the target site samples with respect to that source site model. On the target site, for each source site model $F_k$, the pseudo labels $y_k$ of the target site data set and the predicted values of $F_k$ are obtained, wherein $x_i$ is a target sample in the target site data set and each predicted value is the probability assigned to it. Meanwhile, the importance of each brain region in the classification task is calculated separately; that is, a brain region whose lesions influence the disease to a degree exceeding a set threshold is predicted to be a brain region with a large degree of influence.
S3: integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set. Specifically, the predicted values of the source site models are integrated and aggregated into the auxiliary data set $S_{N+1}$.
Wherein $S_{N+1}$ represents the auxiliary data set, $i$ is the index of the unlabeled samples of the target site, $x_i$ is the target sample, $y_i$ is the pseudo label of sample $i$ in the target site, $P_i$ represents the average probability of the pseudo label of the target sample, and $n_t$ represents the number of unlabeled samples in the target site; the auxiliary data set also records, for each target sample, the number of trusted source sites.
S4: and building an initial model, and training the initial model according to the auxiliary data set to obtain a trained auxiliary model.
S5: and aggregating each source site model and the auxiliary model to obtain a target model.
S6: obtaining model parameters of a target model, and carrying out parameter optimization on each source station model according to the model parameters of the target model to obtain an optimized source station model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
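Steps S2 and S3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each source site model is represented as a plain callable that returns class probabilities, the final pseudo label is chosen by majority vote among the source models (an assumption; the text only says the predicted values are integrated), and the names `build_auxiliary_dataset`, `n_trusted` and `p_mean` are hypothetical.

```python
from collections import Counter


def build_auxiliary_dataset(source_models, target_samples):
    """Collect pseudo labels from every source model and integrate them per target sample."""
    auxiliary = []
    for x in target_samples:
        votes = []
        for model in source_models:
            probs = model(x)  # class probabilities predicted by one source site model
            label = max(range(len(probs)), key=probs.__getitem__)
            votes.append((label, probs[label]))
        # Assumption: the final pseudo label is the label most source models agree on.
        final_label, n_trusted = Counter(label for label, _ in votes).most_common(1)[0]
        # Average probability over the source models that chose the final label.
        p_mean = sum(p for label, p in votes if label == final_label) / n_trusted
        auxiliary.append({"x": x, "y": final_label, "n_trusted": n_trusted, "p_mean": p_mean})
    return auxiliary


# Hypothetical stand-ins for three source site models received at the target site.
toy_models = [lambda x: [0.8, 0.2], lambda x: [0.6, 0.4], lambda x: [0.3, 0.7]]
toy_auxiliary = build_auxiliary_dataset(toy_models, [[1.0]])
```

Only model parameters travel to the target site in this flow; the target samples themselves never leave it.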
Further, in step S4, the building an initial model and training the initial model according to the auxiliary data set to obtain a trained auxiliary model specifically includes:
S41: training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model.
S42: optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model. To improve the robustness of the model, the self-paced learning strategy is applied on the auxiliary data set to optimize the intermediate model.
Further, in step S42, the optimizing the intermediate model by using a self-paced learning strategy specifically includes:
S421: screening target samples in the target site data set by their credibility score to obtain target samples with a credibility score larger than a set threshold, and obtaining the predicted values of all source site models for those target samples; the credibility score is determined by the number of trusted source sites and the average probability of the pseudo labels of the target site data set.
S422: optimizing the objective function by combining the pseudo labels of the target site and the loss function to obtain a final objective function; the final objective function is used to train the intermediate model.
Specifically, the credibility score $score(x_i)$, which describes the trustworthiness of each target sample, is composed of the number of source models (that is, the number of source sites) that select the final pseudo label and the average probability of the pseudo labels of the target site data set:
[equation not preserved in the source text] wherein the number of trusted source sites is counted for each target sample, $N$ is the number of source sites, and $P_i$ represents the average probability of the pseudo label of the target sample.
Target samples whose $score(x_i)$ is greater than the set threshold $\delta$ are added to the model for training, and finally the final objective function is optimized by combining the pseudo labels and the contrast loss:
[equation not preserved in the source text] wherein $L_{cla}(x_i)$ is the classification loss of each sample, $L_{con}(x_i)$ is the contrast loss of each sample, $\delta$ is the set threshold and $\beta$ is a balance factor.
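The screening and objective described above can be sketched as follows. Because the score and objective formulas are not preserved in this text, the sketch makes two loudly stated assumptions: the credibility score is taken as the fraction of agreeing source sites multiplied by the average pseudo-label probability, and the objective is the sum of the classification loss plus `beta` times the contrast loss over the retained samples. The function names and dictionary keys are hypothetical.

```python
def credibility_score(n_trusted, n_sources, p_mean):
    """Assumed form of the score: agreement ratio times average pseudo-label probability."""
    return (n_trusted / n_sources) * p_mean


def select_confident(samples, n_sources, delta):
    """Keep only target samples whose credibility score exceeds the threshold delta."""
    return [
        s for s in samples
        if credibility_score(s["n_trusted"], n_sources, s["p_mean"]) > delta
    ]


def final_objective(selected, cla_loss, con_loss, beta):
    """Assumed form of the objective: classification loss plus beta-weighted contrast loss."""
    return sum(cla_loss(s) + beta * con_loss(s) for s in selected)


# Hypothetical auxiliary records for two target samples and three source sites.
records = [
    {"x": [0.1], "y": 0, "n_trusted": 2, "p_mean": 0.9},
    {"x": [0.2], "y": 1, "n_trusted": 1, "p_mean": 0.9},
]
kept = select_confident(records, n_sources=3, delta=0.5)
```

In a self-paced schedule the threshold would typically be relaxed over training rounds so that harder samples enter later; the text does not specify such a schedule.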
Further, in step S422, the loss function includes a cross-entropy (classification) loss function and a contrast loss function; wherein the expression of the per-sample cross-entropy loss is:
$L_{cla}(x_i) = l_c(-\log[F_{N+1}(x_i)],\ y_i)$;
wherein $x_i$ is the target sample, $i$ is the index of the unlabeled samples of the target site, $n_t$ indicates the number of unlabeled samples of the target site, $L_{cla}$ represents the classification loss of each sample (the cross-entropy loss function is taken over these samples), $l_c(\cdot)$ represents the cross-entropy loss, $F_{N+1}(x_i)$ represents the output of the auxiliary model, and $y_i$ is the pseudo label of sample $i$ in the target site;
the expression of the contrast loss function is:
[equation not preserved in the source text] wherein $L_{con}$ represents the contrast loss of each sample, and the contrast loss is computed against a reference sample of the samples of category $j$.
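The classification term can be illustrated with a small sketch that computes the cross-entropy of the auxiliary model's output against the pseudo labels. It assumes the auxiliary model returns class probabilities and that the per-sample losses are averaged over the samples; the averaging and the names `cross_entropy` and `classification_loss` are assumptions. The contrast loss is not sketched because its expression is not preserved in this text.

```python
import math


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability that the model assigns to the pseudo label."""
    return -math.log(max(probs[label], eps))


def classification_loss(aux_model, samples, pseudo_labels):
    """Mean cross-entropy of the auxiliary model on pseudo-labeled target samples (mean is assumed)."""
    losses = [cross_entropy(aux_model(x), y) for x, y in zip(samples, pseudo_labels)]
    return sum(losses) / len(losses)


# Hypothetical auxiliary model that is maximally uncertain between two classes.
uniform_model = lambda x: [0.5, 0.5]
loss_value = classification_loss(uniform_model, [[0.0], [1.0]], [0, 1])
```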
Further, in step S5, the aggregating the source site model and the auxiliary model to obtain a target model specifically includes:
S51: evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federal weight of each source site.
S52: aggregating each source site model and the auxiliary model by a weighted average strategy, using the federal weight of each source site, to obtain the target model.
Specifically, the quality of the target site pseudo labels corresponding to each source site is calculated by the following formula:
$Q(S_k) = Q(S) - Q(S \mid S_k)$;
the federal weight $\alpha_k$ of each source site is calculated by the following formula:
wherein $Q(S_k)$ represents the quality of the target site pseudo labels corresponding to source site $S_k$, $Q(S)$ represents the total pseudo-label quality, $Q(S \mid S_k)$ represents the total pseudo-label quality without source site $S_k$, $\alpha_{N+1}$ is the weight with which the auxiliary model $F_{N+1}$ participates in the weighted average, and $n_k$ represents the number of labeled samples of the k-th source site.
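A minimal sketch of the aggregation step is given here. The leave-one-out quality $Q(S_k) = Q(S) - Q(S \mid S_k)$ follows the formula above. The weight formula is not reproduced in the text, so the normalisation below (weights proportional to the non-negative quality times $n_k$, with the source weights and $\alpha_{N+1}$ summing to 1) is an assumption, and all function names are hypothetical.

```python
import numpy as np

def source_quality(q_total, q_without):
    """Q(S_k) = Q(S) - Q(S | S_k) for every source site k."""
    return q_total - np.asarray(q_without, dtype=float)

def federal_weights(quality, n_labeled, alpha_aux):
    """ASSUMED weighting: alpha_k is proportional to max(Q(S_k), 0) * n_k,
    rescaled so that sum(alpha_k) + alpha_aux == 1."""
    raw = np.clip(quality, 0.0, None) * np.asarray(n_labeled, dtype=float)
    if raw.sum() == 0:
        raw = np.ones_like(raw)          # fall back to uniform weights
    return (1.0 - alpha_aux) * raw / raw.sum()

def aggregate(source_params, aux_params, weights, alpha_aux):
    """Weighted average of flattened source model parameters and the
    auxiliary model parameters."""
    stacked = np.stack(source_params)                    # (N, P)
    return np.tensordot(weights, stacked, axes=1) + alpha_aux * aux_params
```

Under this assumption, a source site whose removal does not lower the total pseudo-label quality receives zero weight in the target model.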
Further, in step S6, acquiring the model parameters of the target model and performing parameter optimization on each source site model to obtain an optimized source site model specifically includes:
S61: extracting the features of the target site through the target model to obtain the target site feature vector.
S62: processing the target site feature vector with a linear rectification function (ReLU) and a hash function in sequence to obtain a target non-zero distribution vector. The hash function Hash is defined as follows:
Further, the expression of the target non-zero distribution vector is as follows:
wherein the feature vector of the target site has $d_i$ as its i-th dimension element, and $v_t$ represents the target non-zero distribution vector.
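Step S62 can be sketched as below. The definition of the hash function is not reproduced in the text, so this sketch assumes a simple binarising hash that marks which dimensions remain non-zero after rectification; the function name `target_nonzero_vector` is hypothetical.

```python
import numpy as np

def target_nonzero_vector(features):
    """Apply ReLU, then an ASSUMED binarising hash (1 where the rectified
    activation is positive, 0 elsewhere) to get the vector v_t."""
    rectified = np.maximum(features, 0.0)       # linear rectification function
    return (rectified > 0).astype(np.float32)   # assumed hash: non-zero indicator
```

Under this assumption only the activation pattern leaves the target site, not the raw feature values.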
S63: extracting the features of each source site with the corresponding source site model to obtain each source site feature vector.
S64: calculating the MMD distance between the target non-zero distribution vector and each source site feature vector, and optimizing the MMD distance through an MMD loss function. Specifically, the maximum mean discrepancy (MMD) method is adopted to alleviate the data heterogeneity between the source sites and the target site. The MMD method constructs a reproducing kernel Hilbert space (RKHS) with a characteristic kernel k, and the model can be optimized by minimizing the MMD distance on this space. The MMD distance is expressed as follows:
wherein $MMD(S_k, v_t)$ represents the MMD distance between the target non-zero distribution vector and the feature vector of each source site, and the source-side term is the representation produced by the k-th source site model $F_k$.
S65: based on the optimized MMD distance, each source site optimizes its own model parameters, and medical data analysis is carried out with the optimized source site model. The target site features are encoded by a non-linear hash map and transferred to each source site. Through the target non-zero distribution vector $v_t$, the MMD loss between each source site and the target site is optimized to learn a representation similar to the target. The expression of the source site training loss function (the MMD loss function) is as follows:
wherein the expression denotes the MMD loss function, and $MMD(S_k, v_t)$ represents the MMD distance between the target non-zero distribution vector and each source site feature vector.
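For orientation, a standard biased estimate of the squared MMD between two sets of feature vectors is sketched below. The text does not specify the kernel, so the Gaussian (RBF) kernel and the bandwidth `sigma` are assumptions; a single target vector $v_t$ can be passed as a one-row array.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and b (m, d)."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd_squared(source_feats, target_feats, sigma=1.0):
    """Biased estimate of MMD^2 in the RKHS induced by the kernel:
    E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]."""
    k_ss = rbf_kernel(source_feats, source_feats, sigma).mean()
    k_tt = rbf_kernel(target_feats, target_feats, sigma).mean()
    k_st = rbf_kernel(source_feats, target_feats, sigma).mean()
    return float(k_ss + k_tt - 2.0 * k_st)
```

A source site would minimise this quantity with respect to the parameters of its own feature extractor, which is why only $v_t$, and not the target data, has to be transferred.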
Further, as shown in fig. 2, the present invention also provides a multi-site medical data analysis system based on federal learning, the system comprising: a source site model determining unit 1, a unit 2 for acquiring the target site pseudo labels and the predicted values of the source site models, an auxiliary data set determining unit 3, an auxiliary model building unit 4, a target model determining unit 5 and a source site model optimizing unit 6.
A source site model determining unit 1, configured to obtain a plurality of source site data sets, and obtain a plurality of source site models by local learning based on each source site data set; the source site data set includes a number of sample data; the sample data is a resting state functional magnetic resonance imaging image divided into a plurality of brain regions.
The unit 2 for acquiring the target site pseudo labels and the predicted values of the source site models is used for sending the model parameters of each source site model to the target site and acquiring the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is a prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is a predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree of influence of pathological changes in each brain region of the resting state functional magnetic resonance imaging image on the disease.
The auxiliary data set determining unit 3 is used for integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set.
The auxiliary model building unit 4 is used for building an initial model, and training the initial model according to the auxiliary data set to obtain a trained auxiliary model.
The target model determining unit 5 is used for aggregating each source site model and the auxiliary model to obtain a target model.
The source site model optimizing unit 6 is used for acquiring the model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
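The work of units 2 and 3 can be illustrated with a small sketch that collects the predicted probabilities of every source site model on the target site data and derives pseudo labels from them. How the final pseudo label is chosen is not stated in this passage, so taking the class with the highest average probability is an assumption, and the function name `build_auxiliary_dataset` is hypothetical.

```python
import numpy as np

def build_auxiliary_dataset(source_probs):
    """source_probs: list with one (n_samples, n_classes) array per source
    site model, holding its predicted probabilities on the target site data.
    Returns the stacked predicted value set (the auxiliary data set), the
    pseudo labels of each source model, and an ASSUMED final pseudo label."""
    stacked = np.stack(source_probs)                 # (N_sources, n_samples, n_classes)
    per_source_labels = stacked.argmax(axis=2)       # pseudo labels per source site
    final_labels = stacked.mean(axis=0).argmax(axis=1)  # assumed: average then argmax
    return stacked, per_source_labels, final_labels
```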
Further, the auxiliary model building unit 4 specifically includes:
The intermediate model determining subunit is used for training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges, to obtain an intermediate model.
The auxiliary model determining subunit is used for optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
Further, the auxiliary model determining subunit specifically includes:
The source site model predicted value acquisition module is used for screening the samples of the target site by the credibility score to obtain the target samples whose credibility score is larger than a set threshold value, and obtaining the predicted value of each source site model according to those target samples.
The final objective function determining module is used for optimizing the objective function by combining the pseudo labels of the target site and the loss function, to obtain a final objective function; the final objective function is used to train the intermediate model.
Further, the target model determining unit 5 specifically includes:
The federal weight determining subunit is used for evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federal weight of each source site.
The model aggregation subunit is used for aggregating each source site model and the auxiliary model by a weighted average strategy, using the federal weight of each source site, to obtain the target model.
The invention can also be applied to machine learning in other fields that need to share data, helping researchers in those fields to train models better without worrying about the risk of privacy disclosure.
The invention has the technical effects that:
1) Unlike a traditional medical data sharing system, which ignores the privacy protection of patients, the federal learning system of the invention does not require data to be shared directly into a centralized data storage platform in order to construct a machine learning model; instead, the target model is trained across the isolated data sites while the data remain local, thereby achieving privacy protection.
2) In a traditional unsupervised multi-site model training process, a target site data set that is unlabeled cannot join the federal training process, which blocks local training. Furthermore, because data distributions differ between sites, directly applying the central model to a site may give unsatisfactory results. The algorithm adopted in the invention can effectively solve these problems and can significantly improve the performance of the target model.
3) The invention is mainly used for neuroimaging analysis and brain disease data analysis in medical institutions and scientific research institutions. When the data set of each institution is small, the medical data of a large hospital can be used indirectly through the method, and data sharing is achieved without privacy leakage, which helps researchers at institutions with smaller data sets to better identify and analyze brain diseases and to reveal their underlying pathological mechanisms.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples; the description of these examples is intended only to assist in understanding the method of the present invention and its core ideas. Also, modifications made by those of ordinary skill in the art in light of the present teachings fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.
Claims (10)
1. A multi-site medical data analysis method based on federal learning, the method comprising:
acquiring a plurality of source site data sets, and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data are resting state functional magnetic resonance imaging images divided into a plurality of brain areas;
sending the model parameters of each source site model to a target site, and acquiring the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is a prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is a predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree of influence of pathological changes in each brain region of the resting state functional magnetic resonance imaging image on the disease;
integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
setting up an initial model, and training the initial model according to the auxiliary data set to obtain a trained auxiliary model;
aggregating each source site model and the auxiliary model to obtain a target model;
obtaining model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
2. The multi-site medical data analysis method based on federal learning according to claim 1, wherein the setting up an initial model and training the initial model according to the auxiliary data set to obtain a trained auxiliary model specifically comprises:
training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model;
and optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
3. The multi-site medical data analysis method based on federal learning according to claim 2, wherein the optimizing the intermediate model by adopting a self-paced learning strategy specifically comprises:
screening the target samples in the target site data set by the credibility score to obtain the target samples whose credibility score is larger than a set threshold value, and obtaining the predicted values of all source site models according to those target samples; the credibility score is determined by the number of trusted source sites and the average probability of the pseudo labels of the target site data set;
optimizing the objective function by combining the pseudo labels of the target site and the loss function to obtain a final objective function; the final objective function is used to train the intermediate model.
4. The multi-site medical data analysis method based on federal learning according to claim 2, wherein the loss functions include a cross entropy loss function and a contrast loss function; wherein the expression of the cross entropy loss function is:
$L_{cla}(x_i) = l_c(-\log[F_{N+1}(x_i)], y_i)$;
wherein the expression denotes the cross entropy loss function, $x_i$ is a sample of the target site, $i$ is the index of an unlabeled sample of the target site, $n_t$ is the number of unlabeled samples of the target site, $L_{cla}$ represents the classification loss of each sample, $l_c(\cdot)$ represents the cross entropy loss, $F_{N+1}(x_i)$ represents the output of the auxiliary model, and $y_i$ represents the pseudo label of sample $i$ in the target site;
the expression of the contrast loss function is:
wherein the expression denotes the contrast loss function, $L_{con}$ represents the contrast loss of each sample, the reference sample is a reference sample of the samples of category $j$, and the remaining symbol represents the number of trusted source sites.
5. The multi-site medical data analysis method according to claim 1, wherein the aggregating each of the source site model and the auxiliary model to obtain a target model specifically comprises:
based on the quality of the pseudo labels of the target sites corresponding to the source sites, the quality of the source sites is evaluated to obtain the federal weight of the source sites;
and aggregating the federal weight of each source site, each source site model and the auxiliary model by adopting a weighted average strategy to obtain the target model.
6. The multi-site medical data analysis method according to claim 1, wherein the obtaining model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model, specifically comprises:
extracting the characteristics of the target site through the target model to obtain the characteristic vector of the target site;
processing the characteristic vector of the target site by adopting a linear rectification function and a hash function in sequence to obtain a target non-zero distribution vector;
extracting the features of each source site with the corresponding source site model to obtain each source site feature vector;
calculating the MMD distance between the target non-zero distribution vector and each source site feature vector, and optimizing the MMD distance through an MMD loss function;
based on the optimized MMD distance, each source site optimizes own model parameters.
7. A multi-site medical data analysis system based on federal learning, the system comprising:
the source site model determining unit is used for acquiring a plurality of source site data sets and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data are resting state functional magnetic resonance imaging images divided into a plurality of brain areas;
the unit for acquiring the target site pseudo labels and the predicted values of the source site models is used for sending the model parameters of each source site model to a target site and acquiring the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is a prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is a predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree of influence of pathological changes in each brain region of the resting state functional magnetic resonance imaging image on the disease;
the auxiliary data set determining unit is used for integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
the auxiliary model building unit is used for building an initial model, training the initial model according to the auxiliary data set and obtaining a trained auxiliary model;
the target model determining unit is used for aggregating each source site model and the auxiliary model to obtain a target model;
the source site model optimizing unit is used for acquiring the model parameters of the target model, and carrying out parameter optimization on each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
8. The multi-site medical data analysis system based on federal learning according to claim 7, wherein the auxiliary model building unit specifically comprises:
the intermediate model determining subunit is used for training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model;
and the auxiliary model determining subunit is used for optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
9. The multi-site medical data analysis system based on federal learning of claim 8, wherein the auxiliary model determination subunit specifically comprises:
the prediction value acquisition module of the source site model is used for screening samples of the target site by adopting the credibility score to obtain target samples with the credibility score larger than a set threshold value, and obtaining the prediction value of each source site model according to the target samples with the credibility score larger than the set threshold value;
the final objective function determining module is used for optimizing the objective function by combining the pseudo labels of the target site and the loss function, to obtain a final objective function; the final objective function is used to train the intermediate model.
10. The multi-site medical data analysis system based on federal learning according to claim 7, wherein the target model determining unit specifically comprises:
the federal weight determining subunit is used for evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federal weight of each source site;
and the model aggregation subunit is used for aggregating the federal weight of each source site, each source site model and the auxiliary model by adopting a weighted average strategy to obtain the target model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210972939.9A CN115310130B (en) | 2022-08-15 | 2022-08-15 | Multi-site medical data analysis method and system based on federal learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115310130A CN115310130A (en) | 2022-11-08 |
CN115310130B true CN115310130B (en) | 2023-11-17 |
Family
ID=83862353
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210972939.9A Active CN115310130B (en) | 2022-08-15 | 2022-08-15 | Multi-site medical data analysis method and system based on federal learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115310130B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117496191B (en) * | 2024-01-03 | 2024-03-29 | 南京航空航天大学 | Data weighted learning method based on model collaboration |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112686385A (en) * | 2021-01-07 | 2021-04-20 | 中国人民解放军国防科技大学 | Multi-site three-dimensional image oriented federal deep learning method and system |
CN112686388A (en) * | 2020-12-10 | 2021-04-20 | 广州广电运通金融电子股份有限公司 | Data set partitioning method and system under federated learning scene |
CN113052333A (en) * | 2021-04-02 | 2021-06-29 | 中国科学院计算技术研究所 | Method and system for data analysis based on federal learning |
CN113094758A (en) * | 2021-06-08 | 2021-07-09 | 华中科技大学 | Gradient disturbance-based federated learning data privacy protection method and system |
DE112020000281T5 (en) * | 2019-03-22 | 2021-10-14 | International Business Machines Corporation | COMBINING MODELS THAT HAVE RESPECTIVE TARGET CLASSES WITH DISTILLATION |
CN113570069A (en) * | 2021-07-28 | 2021-10-29 | 神谱科技(上海)有限公司 | Model evaluation method for self-adaptive starting model training based on safe federal learning |
EP3940604A1 (en) * | 2020-07-09 | 2022-01-19 | Nokia Technologies Oy | Federated teacher-student machine learning |
CN113962988A (en) * | 2021-12-08 | 2022-01-21 | 东南大学 | Power inspection image anomaly detection method and system based on federal learning |
CN113989595A (en) * | 2021-11-05 | 2022-01-28 | 西安交通大学 | Federal multi-source domain adaptation method and system based on shadow model |
WO2022043741A1 (en) * | 2020-08-25 | 2022-03-03 | 商汤国际私人有限公司 | Network training method and apparatus, person re-identification method and apparatus, storage medium, and computer program |
CN114564743A (en) * | 2022-02-18 | 2022-05-31 | 华中科技大学 | Privacy protection transfer learning method applied to motor imagery brain-computer interface system |
CN114897063A (en) * | 2022-04-29 | 2022-08-12 | 中山大学 | Indoor positioning method based on-line pseudo label semi-supervised learning and personalized federal learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210042645A1 (en) * | 2019-08-06 | 2021-02-11 | doc.ai, Inc. | Tensor Exchange for Federated Cloud Learning |
Non-Patent Citations (1)
Title |
---|
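Principles and applications of federated learning in privacy protection of biomedical big data; 窦佐超; 陈峰; 邓杰仁; 陈如梵; 郑灏; 孙琪; 谢康; 沈百荣; 王爽; Journal of Medical Informatics (医学信息学杂志) (05); full text * |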
Also Published As
Publication number | Publication date |
---|---|
CN115310130A (en) | 2022-11-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||