CN115310130B - Multi-site medical data analysis method and system based on federal learning - Google Patents

Multi-site medical data analysis method and system based on federal learning

Info

Publication number
CN115310130B
CN115310130B CN202210972939.9A CN202210972939A CN115310130B CN 115310130 B CN115310130 B CN 115310130B CN 202210972939 A CN202210972939 A CN 202210972939A CN 115310130 B CN115310130 B CN 115310130B
Authority
CN
China
Prior art keywords
model
site
source
target
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210972939.9A
Other languages
Chinese (zh)
Other versions
CN115310130A (en
Inventor
朱旗
杨启鸣
王明明
邵伟
张道强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202210972939.9A
Publication of CN115310130A
Application granted
Publication of CN115310130B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G16H30/20 ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Pathology (AREA)
  • Radiology & Medical Imaging (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-site medical data analysis method and system based on federated learning. The method comprises the following steps: obtaining a plurality of source site models by local learning on a plurality of source site data sets; sending the model parameters of each source site model to a target site, obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model, integrating the predicted values into a predicted value set, and taking the predicted value set as an auxiliary data set; building an initial model and training it on the auxiliary data set to obtain a trained auxiliary model; aggregating each source site model and the auxiliary model to obtain a target model; obtaining the target model parameters and optimizing the parameters of each source site model according to them to obtain optimized source site models; the optimized source site models are used to perform medical analysis on the sample data to be analyzed. The invention enables data sharing without disclosing private data and improves the accuracy of medical data analysis.

Description

Multi-site medical data analysis method and system based on federal learning
Technical Field
The invention relates to the technical field of machine learning, and in particular to a multi-site medical data analysis method and system based on federated learning.
Background
With the continuous development of machine learning technology, machine learning has been widely applied in the medical field, mainly in the analysis and recognition of biomedical data and images. Machine learning has, to some extent, expanded the boundaries of medical research and greatly helped the development of modern medicine.
An important topic combining machine learning with medical research is neuroimaging analysis, which can help medical researchers better reveal the underlying pathological mechanisms of brain diseases and has proven effective in brain disease diagnosis. However, because neuroimaging data are expensive to acquire and label, it is difficult for a single medical institution to obtain enough data to train a robust analysis model.
In recent years, to overcome the small sample size of a single site, researchers have proposed many machine learning methods that fuse multi-site data. These methods generally assume that the samples at each site follow the same distribution and train the predictive model directly on samples from all sites. However, because scanners and scanning protocols differ between sites, multi-site fusion diagnostic methods often suffer from data heterogeneity, which can severely degrade model performance. To address data heterogeneity, researchers have employed multi-source domain adaptation (MSDA) methods to learn domain-invariant representations. MSDA aligns the source domain features and the target domain features in a common feature space, thereby improving the generalization of the diagnostic model, and has been used in multi-site brain disease diagnosis.
However, with the continuous development of big data, leaks of users' private data have become frequent, and many countries, aware of the importance of privacy protection, have begun to strengthen privacy protection through legislation. Traditional multi-site data-sharing learning methods need to access the data of the source domains and the target domain locally, which increases the risk of data leakage and ignores the requirement to protect patient privacy in clinical diagnosis. Therefore, there is an urgent need for a multi-site medical data analysis method that complies with privacy protection regulations.
Disclosure of Invention
The invention aims to provide a multi-site medical data analysis method and system based on federated learning, which can realize data sharing without disclosing private data and improve the accuracy of medical data analysis.
In order to achieve the above object, the present invention provides the following solutions:
the invention provides a multi-site medical data analysis method based on federated learning, which comprises the following steps:
acquiring a plurality of source site data sets, and obtaining a plurality of source site models by local learning on each source site data set; each source site data set includes a number of sample data; the sample data are resting-state functional magnetic resonance imaging images divided into a plurality of brain regions;
sending the model parameters of each source site model to a target site, and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which lesions in each brain region of the resting-state functional magnetic resonance imaging image influence the disease;
integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
building an initial model, and training the initial model on the auxiliary data set to obtain a trained auxiliary model;
aggregating each source site model and the auxiliary model to obtain a target model;
obtaining the model parameters of the target model, and optimizing the parameters of each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for performing medical analysis on the sample data to be analyzed.
Optionally, building an initial model and training the initial model on the auxiliary data set to obtain a trained auxiliary model specifically includes:
training the initial model with a contrast constraint method based on the auxiliary data set until the loss function converges, to obtain an intermediate model;
and optimizing the intermediate model with a self-paced learning strategy to obtain the trained auxiliary model.
Optionally, optimizing the intermediate model with a self-paced learning strategy specifically includes:
screening the target samples in the target site data set by credibility score to obtain the target samples whose credibility score is greater than a set threshold, and obtaining the predicted values of each source site model for those target samples; the credibility score is determined by the number of trusted source sites and the average probability of the pseudo labels of the target site data set;
optimizing the objective function by combining the pseudo labels of the target site and the loss function to obtain a final objective function; the final objective function is used to train the intermediate model.
Optionally, the loss function includes a classification loss function and a contrast loss function; wherein the expression of the classification loss function is:
$L_{cla}(x_i) = l_c\left(-\log\left[F_{N+1}(x_i)\right],\ y_i\right)$;
wherein $x_i$ is a target sample in the target site, $i$ is the index of an unlabeled sample in the target site, $n_t$ is the number of unlabeled samples in the target site, $L_{cla}$ represents the cross-entropy loss of each sample, $l_c(\cdot)$ represents the cross-entropy loss, $F_{N+1}(x_i)$ represents the output of the auxiliary model, and $y_i$ represents the pseudo label of sample $i$ in the target site;
the expression of the contrast loss function is:
wherein $L_{con}$ represents the contrast loss of each sample, and the reference sample is a reference sample for the samples of category $j$.
Optionally, aggregating the source site models and the auxiliary model to obtain a target model specifically includes:
evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federated weight of each source site;
and aggregating each source site model and the auxiliary model with a weighted average strategy, using the federated weight of each source site, to obtain the target model.
Optionally, obtaining the model parameters of the target model and optimizing the parameters of each source site model to obtain an optimized source site model specifically includes:
extracting the features of the target site through the target model to obtain the feature vector of the target site;
processing the feature vector of the target site with a linear rectification function (ReLU) and a hash function in sequence to obtain a target non-zero distribution vector;
extracting the features of each source site with the corresponding source site model to obtain the feature vector of each source site;
calculating the MMD (maximum mean discrepancy) distances between the target non-zero distribution vector and each source site feature vector, and optimizing the MMD distances through an MMD loss function;
based on the optimized MMD distance, each source site optimizes its own model parameters.
In order to achieve the above purpose, the present invention also provides the following solutions:
a multi-site medical data analysis system based on federated learning, the system comprising:
the source site model determining unit is used for acquiring a plurality of source site data sets and obtaining a plurality of source site models by local learning on each source site data set; each source site data set includes a number of sample data; the sample data are resting-state functional magnetic resonance imaging images divided into a plurality of brain regions;
the target site pseudo label and source site model predicted value acquisition unit is used for sending the model parameters of each source site model to a target site and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which lesions in each brain region of the resting-state functional magnetic resonance imaging image influence the disease;
the auxiliary data set determining unit is used for integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
the auxiliary model building unit is used for building an initial model and training the initial model on the auxiliary data set to obtain a trained auxiliary model;
the target model determining unit is used for aggregating each source site model and the auxiliary model to obtain a target model;
the source site model optimizing unit is used for obtaining the model parameters of the target model and optimizing the parameters of each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for performing medical analysis on the sample data to be analyzed.
Optionally, the auxiliary model building unit specifically includes:
the intermediate model determining subunit is used for training the initial model with a contrast constraint method based on the auxiliary data set until the loss function converges, to obtain an intermediate model;
and the auxiliary model determining subunit is used for optimizing the intermediate model with a self-paced learning strategy to obtain the trained auxiliary model.
Optionally, the auxiliary model determining subunit specifically includes:
the source site model predicted value acquisition module is used for screening the samples of the target site by credibility score to obtain the target samples whose credibility score is greater than a set threshold, and obtaining the predicted values of each source site model for those target samples;
the final objective function determining module is used for optimizing the objective function by combining the pseudo labels of the target site and the loss function, to obtain a final objective function; the final objective function is used to train the intermediate model.
Optionally, the target model determining unit specifically includes:
the federated weight determining subunit is used for evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federated weight of each source site;
and the model aggregation subunit is used for aggregating each source site model and the auxiliary model with a weighted average strategy, using the federated weight of each source site, to obtain the target model.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a multi-site medical data analysis method and system based on federated learning, wherein the method comprises the following steps: acquiring a plurality of source site data sets, and obtaining a plurality of source site models through local learning; sending the model parameters of each source site model to a target site, and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set as an auxiliary data set; building an initial model, and training the initial model on the auxiliary data set to obtain a trained auxiliary model; aggregating each source site model and the auxiliary model to obtain a target model; obtaining the model parameters of the target model, and optimizing the parameters of each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for performing medical analysis on the sample data to be analyzed. The invention can realize data sharing without disclosing private data and improve the accuracy of medical data analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a multi-site medical data analysis method based on federated learning according to the present invention;
FIG. 2 is a schematic block diagram of a multi-site medical data analysis system based on federated learning according to the present invention.
Symbol description:
source site model determining unit-1, target site pseudo label and source site model predicted value acquisition unit-2, auxiliary data set determining unit-3, auxiliary model building unit-4, target model determining unit-5, and source site model optimizing unit-6.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a multi-site medical data analysis method and system based on federated learning, which can realize data sharing without disclosing private data and improve the accuracy of medical data analysis.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in FIG. 1, the present invention provides a multi-site medical data analysis method based on federated learning, the method comprising:
S1: acquiring a plurality of source site data sets, and obtaining a plurality of source site models by local learning on each source site data set; each source site data set includes a number of sample data; the sample data are resting-state functional magnetic resonance imaging images divided into a plurality of brain regions. Specifically, assume that there are $N$ source sites $S_1, S_2, \dots, S_N$ and one target site $T$. During federated learning, the $k$-th source site $S_k$ contains $n_k$ labeled samples and the target site $T$ contains $n_t$ unlabeled samples, where each sample is a resting-state functional magnetic resonance imaging (rs-fMRI) image consisting of time-series electrical signals measured from 90 brain regions.
S2: sending the model parameters of each source site model to a target site, and obtaining the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the predicted probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which lesions in each brain region of the resting-state functional magnetic resonance imaging image influence the disease. Specifically, the target site holds the model $F_k$ of each source site. For the data set $T$ on the target site, each source site model $F_k$ gives a prediction result $y_k$ and a predicted probability; the prediction result is the pseudo label that this source site model assigns to the samples on the target site. On the target site, for each source site model $F_k$, the pseudo labels $y_k$ of the target site data set and the predicted values of $F_k$ are obtained, where $x_i$ is a target sample in the target site data set and the predicted value is its probability value. Meanwhile, the importance of each brain region in the classification task is calculated, that is, the degree to which a lesion in each brain region influences the disease; a brain region whose degree of influence exceeds a set threshold is predicted to be a brain region with a large degree of influence.
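Step S2 can be illustrated with a minimal sketch. It assumes each source site model is available on the target site as a callable that returns class logits; the names `softmax`, `predict_on_target` and the toy models are hypothetical and only stand in for the trained models $F_k$.

```python
import math

def softmax(logits):
    """Convert raw scores into class probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_on_target(source_models, target_samples):
    """For each source model F_k, return (pseudo_labels, probabilities) on the target set.

    source_models: callables mapping a sample to a list of class logits
                   (hypothetical stand-ins for the trained F_k).
    """
    results = []
    for model in source_models:
        labels, probs = [], []
        for x in target_samples:
            p = softmax(model(x))
            y = max(range(len(p)), key=p.__getitem__)  # index of the most likely class
            labels.append(y)
            probs.append(p[y])  # probability assigned to the predicted class
        results.append((labels, probs))
    return results

# Toy usage: two hand-written "models" over 2-dimensional samples.
models = [lambda x: [x[0], x[1]], lambda x: [2 * x[0], x[1]]]
target = [[1.0, 0.0], [0.0, 3.0]]
preds = predict_on_target(models, target)
```

Only model parameters travel to the target site in this sketch; the raw target samples never leave it.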
S3: integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set. Specifically, the predicted values of the source site models are aggregated into the auxiliary data set $S_{N+1}$,
wherein $S_{N+1}$ represents the auxiliary data set, $i$ is the index of an unlabeled sample in the target site, $x_i$ is the target sample, $y_i$ represents the pseudo label of sample $i$ in the target site and is accompanied by the number of trusted source sites, $P_i$ represents the average probability of the pseudo label of the target sample, and $n_t$ represents the number of unlabeled samples in the target site.
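One way to assemble such an auxiliary data set is sketched below. The text does not spell out how the final pseudo label is chosen, so this sketch assumes a majority vote over the source models, counts the models that agree with it as the trusted source sites, and averages their probabilities to get $P_i$; the function name and tuple layout are hypothetical.

```python
from collections import Counter

def build_auxiliary_dataset(per_source_labels, per_source_probs):
    """Merge per-source predictions into tuples (sample_index, y_i, n_trusted, P_i).

    Assumption: y_i is the majority-vote label, n_trusted counts the source
    models that voted for it, and P_i averages the probabilities of those models.
    """
    n_samples = len(per_source_labels[0])
    aux = []
    for i in range(n_samples):
        votes = [labels[i] for labels in per_source_labels]
        y_i, n_trusted = Counter(votes).most_common(1)[0]
        agreeing = [probs[i]
                    for labels, probs in zip(per_source_labels, per_source_probs)
                    if labels[i] == y_i]
        p_i = sum(agreeing) / n_trusted
        aux.append((i, y_i, n_trusted, p_i))
    return aux

# Three source models, two target samples.
labels = [[0, 1], [0, 0], [0, 1]]
probs = [[0.9, 0.6], [0.8, 0.7], [0.7, 0.8]]
aux = build_auxiliary_dataset(labels, probs)
```

In the toy data all three models agree on the first sample, while only two of them agree on the second, which lowers its trusted-site count.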
S4: building an initial model, and training the initial model on the auxiliary data set to obtain a trained auxiliary model.
S5: aggregating each source site model and the auxiliary model to obtain a target model.
S6: obtaining the model parameters of the target model, and optimizing the parameters of each source site model according to the model parameters of the target model to obtain an optimized source site model; the optimized source site model is used for performing medical analysis on the sample data to be analyzed.
Further, in step S4, building an initial model and training the initial model on the auxiliary data set to obtain a trained auxiliary model specifically includes:
S41: training the initial model with a contrast constraint method based on the auxiliary data set until the loss function converges, to obtain an intermediate model.
S42: optimizing the intermediate model with a self-paced learning strategy to obtain the trained auxiliary model. To improve the robustness of the model, the self-paced learning strategy is applied on the auxiliary data set to optimize the intermediate model.
Further, in step S42, optimizing the intermediate model with a self-paced learning strategy specifically includes:
S421: screening the target samples in the target site data set by credibility score to obtain the target samples whose credibility score is greater than a set threshold, and obtaining the predicted values of each source site model for those target samples; the credibility score is determined by the number of trusted source sites and the average probability of the pseudo labels of the target site data set.
S422: optimizing the objective function by combining the pseudo labels of the target site and the loss function to obtain a final objective function; the final objective function is used to train the intermediate model.
Specifically, the credibility score $score(x_i)$ describes the trustworthiness of each target sample; it is composed of the number of source models (source sites) that select the final pseudo label and the average probability of the pseudo labels of the target site data set:
wherein $N$ is the number of source sites and $P_i$ represents the average probability of the pseudo label of the target sample; the score also depends on the number of trusted source sites.
Target samples whose $score(x_i)$ is greater than the set threshold $\delta$ are added to the training of the model; finally, the final objective function is optimized by combining the pseudo labels and the contrast loss:
wherein $L_{cla}(x_i)$ is the classification loss of each sample, $L_{con}(x_i)$ is the contrast loss of each sample, $\delta$ is the set threshold, and $\beta$ is a balance factor.
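The screening and the combined objective can be sketched as follows. The exact score and objective formulas are not reproduced in this text, so the sketch assumes a score equal to the fraction of trusted source sites multiplied by $P_i$, and an objective that averages $L_{cla}(x_i) + \beta L_{con}(x_i)$ over the samples that pass the threshold $\delta$. Both forms, and all function names, are illustrative assumptions rather than the patented formulas.

```python
def credibility_score(n_trusted, n_sources, avg_prob):
    """Hypothetical score: fraction of agreeing source sites times average probability."""
    return (n_trusted / n_sources) * avg_prob

def final_objective(samples, n_sources, delta, beta):
    """Average L_cla + beta * L_con over samples whose score exceeds delta.

    samples: list of dicts with keys n_trusted, avg_prob, l_cla, l_con.
    Returns (loss, number_of_selected_samples).
    """
    kept = [s for s in samples
            if credibility_score(s["n_trusted"], n_sources, s["avg_prob"]) > delta]
    if not kept:
        return 0.0, 0
    total = sum(s["l_cla"] + beta * s["l_con"] for s in kept)
    return total / len(kept), len(kept)

samples = [
    {"n_trusted": 3, "avg_prob": 0.9, "l_cla": 0.2, "l_con": 0.4},  # score 0.9
    {"n_trusted": 1, "avg_prob": 0.6, "l_cla": 1.5, "l_con": 0.9},  # score 0.2
]
loss, n_kept = final_objective(samples, n_sources=3, delta=0.5, beta=0.5)
```

With $\delta = 0.5$ only the first sample is kept, so the noisy second sample does not contribute to the loss.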
Further, in step S422, the loss functions include a cross-entropy loss function and a contrast loss function; wherein the expression of the cross-entropy loss function is:
$L_{cla}(x_i) = l_c\left(-\log\left[F_{N+1}(x_i)\right],\ y_i\right)$;
wherein $x_i$ is the target sample, $i$ is the index of an unlabeled sample in the target site, $n_t$ is the number of unlabeled samples in the target site, $L_{cla}$ represents the classification loss, $l_c(\cdot)$ represents the cross-entropy loss, $F_{N+1}(x_i)$ represents the output of the auxiliary model, and $y_i$ represents the pseudo label of sample $i$ in the target site;
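As a minimal sketch of the per-sample term, the cross-entropy against a hard pseudo label reduces to the negative log of the probability that the auxiliary model assigns to that label. Averaging over the $n_t$ target samples is an assumption of this sketch, as are the function names.

```python
import math

def cross_entropy(probabilities, pseudo_label):
    """-log of the probability the auxiliary model assigns to the pseudo label."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probabilities[pseudo_label], eps))

def mean_classification_loss(outputs, pseudo_labels):
    """Average the per-sample losses over the target samples (assumed reduction)."""
    losses = [cross_entropy(p, y) for p, y in zip(outputs, pseudo_labels)]
    return sum(losses) / len(losses)

# Auxiliary model outputs (class probabilities) for two target samples.
outputs = [[0.5, 0.5], [0.25, 0.75]]
loss = mean_classification_loss(outputs, [0, 1])
```

A confident, correct output such as `[0.0, 1.0]` with pseudo label 1 gives a per-sample loss of zero.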
the expression of the contrast loss function is:
wherein $L_{con}$ represents the contrast loss of each sample, and the reference sample is a reference sample for the samples of category $j$.
Further, in step S5, aggregating the source site models and the auxiliary model to obtain a target model specifically includes:
S51: evaluating the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federated weight of each source site.
S52: aggregating each source site model and the auxiliary model with a weighted average strategy, using the federated weight of each source site, to obtain the target model.
Specifically, the quality of the target site pseudo labels corresponding to each source site is calculated by the following formula:
$Q(S_k) = Q(S) - Q(S \mid S_k)$;
the federated weight $\alpha_k$ of each source site is calculated by the following formula:
wherein $Q(S_k)$ represents the quality of the target site pseudo labels corresponding to source site $S_k$, $Q(S)$ represents the total pseudo label quality, $Q(S \mid S_k)$ represents the total pseudo label quality without source site $S_k$, $\alpha_{N+1}$ is the weight with which the auxiliary model $F_{N+1}$ participates in the weighted average, and $n_k$ represents the number of labeled samples of the $k$-th source site.
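The leave-one-out quality and the weighted aggregation can be sketched as below. The formula for $\alpha_k$ is not reproduced in this text, so `normalise` is a hypothetical placeholder that clips negative contributions and rescales them so that, together with a fixed auxiliary weight $\alpha_{N+1}$, the weights sum to 1; it ignores $n_k$. The `quality` callable is likewise an assumed stand-in for $Q(\cdot)$.

```python
def leave_one_out_quality(quality, n_sources):
    """Q(S_k) = Q(S) - Q(S | S_k) for every source site k."""
    everyone = frozenset(range(n_sources))
    total = quality(everyone)
    return [total - quality(everyone - {k}) for k in range(n_sources)]

def normalise(contributions, aux_weight):
    """Hypothetical weighting: clip negatives, then rescale the source weights
    to sum to 1 - aux_weight (when at least one contribution is positive)."""
    clipped = [max(c, 0.0) for c in contributions]
    scale = (1.0 - aux_weight) / (sum(clipped) or 1.0)
    return [c * scale for c in clipped] + [aux_weight]

def weighted_average(param_sets, weights):
    """Element-wise weighted average of flat parameter lists."""
    return [sum(w * p[j] for w, p in zip(weights, param_sets))
            for j in range(len(param_sets[0]))]

# Toy quality: each site adds a fixed amount (an additive stand-in for Q).
site_value = [0.3, 0.1, 0.0]
quality = lambda members: sum(site_value[k] for k in members)
contrib = leave_one_out_quality(quality, 3)
weights = normalise(contrib, aux_weight=0.2)
params = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 2.0]]  # 3 sources + auxiliary
target_params = weighted_average(params, weights)
```

In this toy run the third site contributes nothing to pseudo label quality, so its parameters receive zero weight in the target model.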
Further, in step S6, obtaining the model parameters of the target model and optimizing the parameters of each source site model to obtain an optimized source site model specifically includes:
S61: extracting the features of the target site through the target model to obtain the feature vector of the target site.
S62: processing the feature vector of the target site with a linear rectification function (ReLU) and a hash function in sequence to obtain a target non-zero distribution vector. The hash function Hash is defined as follows:
further, the expression of the target non-zero distribution vector is as follows:
wherein $d_i$ is the $i$-th dimension element of the feature vector of the target site, and $v_t$ represents the target non-zero distribution vector.
S63: Extract the features of each source site with the corresponding source site model to obtain the feature vector of each source site.
S64: Calculate the MMD distance between the target non-zero distribution vector and each source site feature vector, and optimize the MMD distances through an MMD loss function. Specifically, the maximum mean discrepancy (MMD) method is adopted to alleviate the data heterogeneity between the source sites and the target site. The MMD method constructs a reproducing kernel Hilbert space (RKHS) with a characteristic kernel k, and the optimization is achieved by minimizing the MMD distance in this space. The MMD distance is expressed as follows:
wherein MMD(S_k, v_t) represents the MMD distance between the target non-zero distribution vector and the feature vector of each source site, and F_k denotes the k-th source site model.
S65: based on the optimized MMD distance, each source site optimizes the model parameters of the source site, and medical data analysis is carried out according to the optimized source site model. Wherein the target site characteristics are encoded by a non-linear hash map and transferred to each source site. By a target non-zero distribution vector v t MMD loss between the source site and the target site is optimized to learn the target similarity representation. Source site training loss functionThe expression (MMD loss function) is as follows:
wherein,represents the MMD loss function, MMD (S k ,v t ) Representing MMD distances between the target non-zero distribution vector and the respective source site feature vectors.
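For orientation, a squared MMD between two sets of feature vectors can be estimated as in the following Python sketch. This is a generic, biased estimator with a Gaussian (RBF) kernel, which is one common characteristic kernel. The kernel choice, the bandwidth, and the function names are assumptions of this example and are not taken from the patent, which does not reproduce its kernel or its exact MMD expression here.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source_feats, target_feats, bandwidth=1.0):
    """Biased estimate of the squared MMD between two feature sets."""
    return (mean_kernel(source_feats, source_feats, bandwidth)
            + mean_kernel(target_feats, target_feats, bandwidth)
            - 2.0 * mean_kernel(source_feats, target_feats, bandwidth))


# Hypothetical source-site features and a target-side vector set.
source = [[0.0, 0.0], [1.0, 1.0]]
target = [[0.1, 0.0], [0.9, 1.1]]
distance = mmd_squared(source, target)
```

The estimate is zero when both sets coincide and grows as the two feature distributions move apart, which is why minimizing it pulls the source representation toward the target.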
Further, as shown in fig. 2, the present invention also provides a multi-site medical data analysis system based on federal learning, the system comprising: a source site model determining unit 1, a target site pseudo label and source site model predicted value acquiring unit 2, an auxiliary data set determining unit 3, an auxiliary model building unit 4, a target model determining unit 5 and a source site model optimizing unit 6.
The source site model determining unit 1 is configured to acquire a plurality of source site data sets and to obtain a plurality of source site models by local learning on each source site data set; each source site data set includes a number of sample data; the sample data are resting-state functional magnetic resonance imaging images divided into a plurality of brain regions.
The target site pseudo label and source site model predicted value acquiring unit 2 is configured to send the model parameters of each source site model to the target site and to obtain the pseudo labels of the target site data set corresponding to each source site and the predicted values of each source site model; the pseudo label of the target site data set is the prediction result obtained by predicting the target site data set with the source site model; the predicted value of the source site model is the prediction probability obtained by predicting the target site data set with the source site model; the prediction result represents the degree to which lesions in each brain region of the resting-state functional magnetic resonance imaging image influence the disease.
The auxiliary data set determining unit 3 is configured to integrate the predicted values of the source site models into a predicted value set and to take this predicted value set as the auxiliary data set.
The auxiliary model building unit 4 is configured to build an initial model and to train the initial model on the auxiliary data set to obtain a trained auxiliary model.
The target model determining unit 5 is configured to aggregate the source site models and the auxiliary model to obtain a target model.
The source site model optimizing unit 6 is configured to acquire the model parameters of the target model and to optimize the parameters of each source site model according to the model parameters of the target model, to obtain an optimized source site model; the optimized source site model is used for medical analysis of the sample data to be detected.
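The data flow between units 2 and 3 can be illustrated with a short Python sketch. It assumes that each source site model has already produced a probability vector for every target sample; the nested-list layout, the dictionary keys, and the function names are hypothetical choices made for this example rather than structures defined by the patent.

```python
def argmax(values):
    """Index of the largest entry of a list."""
    return max(range(len(values)), key=lambda j: values[j])


def build_auxiliary_dataset(source_predictions):
    """Collect per-sample predictions of all source site models.

    source_predictions[k][i] is the probability vector that source
    model k outputs for target sample i. For each target sample the
    result holds the predicted values of all models and the pseudo
    label (argmax class) assigned by each model.
    """
    n_samples = len(source_predictions[0])
    dataset = []
    for i in range(n_samples):
        per_model = [preds[i] for preds in source_predictions]
        dataset.append({
            "predicted_values": per_model,
            "pseudo_labels": [argmax(p) for p in per_model],
        })
    return dataset


# Two source models, two target samples, two classes.
predictions = [
    [[0.9, 0.1], [0.2, 0.8]],  # source model 1
    [[0.6, 0.4], [0.7, 0.3]],  # source model 2
]
auxiliary = build_auxiliary_dataset(predictions)
```

Only model outputs cross site boundaries in this sketch; the target samples themselves never leave the target site.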
Further, the auxiliary model building unit 4 specifically includes:
An intermediate model determining subunit, configured to train the initial model by a contrast constraint method based on the auxiliary data set until the loss function converges, to obtain an intermediate model.
An auxiliary model determining subunit, configured to optimize the intermediate model by a self-paced learning strategy to obtain the trained auxiliary model.
Further, the auxiliary model determining subunit specifically includes:
A source site model predicted value acquisition module, configured to screen the samples of the target site by credibility score to obtain the target samples whose credibility score is greater than a set threshold, and to obtain the predicted values of each source site model from those target samples.
A final objective function determining module, configured to optimize the objective function by combining the pseudo labels of the target site with the loss function, to obtain a final objective function; the final objective function is used to train the intermediate model.
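The sample screening performed by the first module can be sketched as follows. Claim 3 states that the credibility score is determined by the number of credible source sites and the average probability of the pseudo labels, but the exact formula is not given in this text. The sketch therefore assumes a hypothetical score: a source site counts as credible for a sample when its pseudo label matches the majority label, and the score is the fraction of credible sites multiplied by their mean probability for that label. All names are invented for this example.

```python
def credibility_score(per_model_probs):
    """Return (majority pseudo label, assumed credibility score).

    per_model_probs: one probability vector per source site model
    for a single target sample.
    """
    labels = [max(range(len(p)), key=lambda j: p[j])
              for p in per_model_probs]
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Ties are broken in favor of the smallest label index.
    majority = max(sorted(counts), key=lambda label: counts[label])
    credible = [p for p, label in zip(per_model_probs, labels)
                if label == majority]
    mean_prob = sum(p[majority] for p in credible) / len(credible)
    return majority, (len(credible) / len(per_model_probs)) * mean_prob


def screen_samples(samples, threshold):
    """Keep (index, pseudo label, score) for samples above the threshold."""
    kept = []
    for index, per_model_probs in enumerate(samples):
        label, score = credibility_score(per_model_probs)
        if score > threshold:
            kept.append((index, label, score))
    return kept


# Three source models; the first sample is agreed on more confidently.
samples = [
    [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.4, 0.6], [0.55, 0.45]],
]
selected = screen_samples(samples, threshold=0.5)
```

In a self-paced schedule the threshold would typically be relaxed over training rounds so that easier, more credible samples are used first; that schedule is likewise an assumption and is not shown.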
Further, the target model determining unit 5 specifically includes:
A federal weight determining unit, configured to evaluate the quality of each source site based on the quality of the target site pseudo labels corresponding to that source site, to obtain the federal weight of each source site.
A model aggregation subunit, configured to aggregate the source site models and the auxiliary model with a weighted average strategy, using the federal weight of each source site, to obtain the target model.
The invention can also be applied to machine learning in other fields that need to share data, helping researchers in those fields train better models without having to worry about the risk of privacy disclosure.
The invention has the technical effects that:
1) Unlike traditional medical data sharing systems, which ignore patient privacy protection, the federal learning system of the invention does not require data to be shared directly into a centralized data storage platform to construct a machine learning model. Instead, the target model is trained across the isolated data sites while the data remain local, thereby protecting privacy.
2) In a traditional unsupervised multi-site model training process, a target site data set that is unlabeled cannot join the federal training process, so local training is blocked. Furthermore, because data distributions differ between sites, directly applying the central model to a site may give unsatisfactory results. The algorithm employed in the present invention addresses these problems and can significantly improve the performance of the target model.
3) The invention is mainly used for neuroimaging analysis and brain disease data analysis in medical institutions and scientific research institutions. When the data sets of individual institutions are small, the method lets an institution benefit indirectly from the medical data of a large hospital and share data without privacy leakage, helping researchers at institutions with smaller data sets to better identify and analyze brain diseases and to reveal their potential pathological mechanisms.
In the present specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core ideas. Modifications made by those of ordinary skill in the art in light of the present teachings also fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A multi-site medical data analysis method based on federal learning, the method comprising:
acquiring a plurality of source site data sets, and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data are resting state functional magnetic resonance imaging images divided into a plurality of brain areas;
model parameters of each source station model are sent to a target station, and pseudo labels of target station data sets corresponding to each source station and predicted values of each source station model are obtained; the pseudo tag of the target site data set is a prediction result obtained by predicting the target site data set according to the source site model; the predicted value of the source station model is a predicted probability obtained by predicting the target station data set according to the source station model; the prediction result represents the influence degree of pathological changes on diseases of each brain region in the resting state functional magnetic resonance imaging image;
integrating the predicted values of the source site models to obtain a predicted value set of the source site models, and taking the predicted value set of the source site models as an auxiliary data set;
setting up an initial model, and training the initial model according to the auxiliary data set to obtain a trained auxiliary model;
aggregating each source site model and the auxiliary model to obtain a target model;
obtaining model parameters of a target model, and carrying out parameter optimization on each source station model according to the model parameters of the target model to obtain an optimized source station model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
2. The multi-site medical data analysis method based on federal learning according to claim 1, wherein setting up the initial model and training the initial model according to the auxiliary data set to obtain a trained auxiliary model specifically comprises:
training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model;
and optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
3. The multi-site medical data analysis method based on federal learning according to claim 2, wherein the optimizing the intermediate model by adopting a self-paced learning strategy specifically comprises:
screening target samples in the target site data set by adopting the credibility score to obtain target samples with a credibility score larger than a set threshold value, and obtaining predicted values of all source site models according to the target samples with a credibility score larger than the set threshold value; the credibility score is determined by the number of credible source sites and the average probability of the pseudo labels of the target site data set;
optimizing the objective function by combining the pseudo tag of the objective site and the loss function to obtain a final objective function; the final objective function is used to train the intermediate model.
4. The multi-site medical data analysis method based on federal learning according to claim 2, wherein the loss function includes a cross entropy loss function and a contrast loss function; wherein the expression of the cross entropy loss function is:
L_cla(x_i) = l_c(-log[F_{N+1}(x_i)], y_i);
wherein, for the cross entropy loss function, x_i is a sample of the target site, i is the index of an unlabeled sample of the target site, n_t is the number of unlabeled samples of the target site, L_cla represents the classification loss of each sample, l_c(·) represents the cross entropy loss, F_{N+1}(x_i) represents the output of the auxiliary model, and y_i represents the pseudo label of sample i in the target site;
the expression of the contrast loss function is:
wherein, for the contrast loss function, L_con represents the contrast loss of each sample, and the remaining symbols of the expression denote, respectively, the reference sample for the samples of category j and the number of credible source sites.
5. The multi-site medical data analysis method according to claim 1, wherein the aggregating each of the source site model and the auxiliary model to obtain a target model specifically comprises:
based on the quality of the pseudo labels of the target sites corresponding to the source sites, the quality of the source sites is evaluated to obtain the federal weight of the source sites;
and aggregating the federal weight of each source site, each source site model and the auxiliary model by adopting a weighted average strategy to obtain the target model.
6. The multi-site medical data analysis method according to claim 1, wherein obtaining the model parameters of the target model and carrying out parameter optimization on each source site model to obtain an optimized source site model specifically comprises:
extracting the characteristics of the target site through the target model to obtain the characteristic vector of the target site;
processing the characteristic vector of the target site by adopting a linear rectification function and a hash function in sequence to obtain a target non-zero distribution vector;
extracting the characteristics of each source station by adopting each source station model to obtain the characteristic vector of each source station;
calculating MMD distances between the target non-zero distribution vector and each source station feature vector, and optimizing the MMD distances through an MMD loss function;
based on the optimized MMD distance, each source site optimizes own model parameters.
7. A multi-site medical data analysis system based on federal learning, the system comprising:
the source site model determining unit is used for acquiring a plurality of source site data sets and obtaining a plurality of source site models based on local learning of each source site data set; the source site data set includes a number of sample data; the sample data are resting state functional magnetic resonance imaging images divided into a plurality of brain areas;
the target site pseudo label and source site model predicted value acquisition unit is used for sending model parameters of each source site model to a target site and acquiring pseudo labels of target site data sets corresponding to each source site and predicted values of each source site model; the pseudo label of the target site data set is a prediction result obtained by predicting the target site data set according to the source site model; the predicted value of the source site model is a predicted probability obtained by predicting the target site data set according to the source site model; the prediction result represents the degree to which lesions in each brain region of the resting state functional magnetic resonance imaging image influence the disease;
the auxiliary data set determining unit is used for integrating the predicted values of the source station models to obtain a predicted value set of the source station model, and taking the predicted value set of the source station model as an auxiliary data set;
the auxiliary model building unit is used for building an initial model, training the initial model according to the auxiliary data set and obtaining a trained auxiliary model;
the target model determining unit is used for aggregating each source site model and the auxiliary model to obtain a target model;
the source station model optimizing unit is used for acquiring model parameters of a target model, and carrying out parameter optimization on each source station model according to the model parameters of the target model to obtain an optimized source station model; the optimized source site model is used for carrying out medical analysis on sample data to be detected.
8. The multi-site medical data analysis system based on federal learning according to claim 7, wherein the auxiliary model building unit specifically comprises:
the intermediate model determining subunit is used for training the initial model by adopting a contrast constraint method based on the auxiliary data set until the loss function converges to obtain an intermediate model;
and the auxiliary model determining subunit is used for optimizing the intermediate model by adopting a self-paced learning strategy to obtain the trained auxiliary model.
9. The multi-site medical data analysis system based on federal learning of claim 8, wherein the auxiliary model determination subunit specifically comprises:
the prediction value acquisition module of the source site model is used for screening samples of the target site by adopting the credibility score to obtain target samples with the credibility score larger than a set threshold value, and obtaining the prediction value of each source site model according to the target samples with the credibility score larger than the set threshold value;
the final objective function determining module is used for combining the pseudo tag of the objective station and the loss function, optimizing the objective function and obtaining a final objective function; the final objective function is used to train the intermediate model.
10. The multi-site medical data analysis system based on federal learning according to claim 7, wherein the target model determining unit specifically comprises:
the federation weight determining unit is used for evaluating the quality of each source station based on the quality of the target station pseudo tag corresponding to each source station to obtain federation weight of each source station;
and the model aggregation subunit is used for aggregating the federal weight of each source site, each source site model and the auxiliary model by adopting a weighted average strategy to obtain the target model.
CN202210972939.9A 2022-08-15 2022-08-15 Multi-site medical data analysis method and system based on federal learning Active CN115310130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210972939.9A CN115310130B (en) 2022-08-15 2022-08-15 Multi-site medical data analysis method and system based on federal learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210972939.9A CN115310130B (en) 2022-08-15 2022-08-15 Multi-site medical data analysis method and system based on federal learning

Publications (2)

Publication Number Publication Date
CN115310130A CN115310130A (en) 2022-11-08
CN115310130B true CN115310130B (en) 2023-11-17

Family

ID=83862353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210972939.9A Active CN115310130B (en) 2022-08-15 2022-08-15 Multi-site medical data analysis method and system based on federal learning

Country Status (1)

Country Link
CN (1) CN115310130B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496191B (en) * 2024-01-03 2024-03-29 南京航空航天大学 Data weighted learning method based on model collaboration

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686385A (en) * 2021-01-07 2021-04-20 中国人民解放军国防科技大学 Multi-site three-dimensional image oriented federal deep learning method and system
CN112686388A (en) * 2020-12-10 2021-04-20 广州广电运通金融电子股份有限公司 Data set partitioning method and system under federated learning scene
CN113052333A (en) * 2021-04-02 2021-06-29 中国科学院计算技术研究所 Method and system for data analysis based on federal learning
CN113094758A (en) * 2021-06-08 2021-07-09 华中科技大学 Gradient disturbance-based federated learning data privacy protection method and system
DE112020000281T5 (en) * 2019-03-22 2021-10-14 International Business Machines Corporation COMBINING MODELS THAT HAVE RESPECTIVE TARGET CLASSES WITH DISTILLATION
CN113570069A (en) * 2021-07-28 2021-10-29 神谱科技(上海)有限公司 Model evaluation method for self-adaptive starting model training based on safe federal learning
EP3940604A1 (en) * 2020-07-09 2022-01-19 Nokia Technologies Oy Federated teacher-student machine learning
CN113962988A (en) * 2021-12-08 2022-01-21 东南大学 Power inspection image anomaly detection method and system based on federal learning
CN113989595A (en) * 2021-11-05 2022-01-28 西安交通大学 Federal multi-source domain adaptation method and system based on shadow model
WO2022043741A1 (en) * 2020-08-25 2022-03-03 商汤国际私人有限公司 Network training method and apparatus, person re-identification method and apparatus, storage medium, and computer program
CN114564743A (en) * 2022-02-18 2022-05-31 华中科技大学 Privacy protection transfer learning method applied to motor imagery brain-computer interface system
CN114897063A (en) * 2022-04-29 2022-08-12 中山大学 Indoor positioning method based on-line pseudo label semi-supervised learning and personalized federal learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210042645A1 (en) * 2019-08-06 2021-02-11 doc.ai, Inc. Tensor Exchange for Federated Cloud Learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
联盟学习在生物医学大数据隐私保护中的原理与应用;窦佐超;陈峰;邓杰仁;陈如梵;郑灏;孙琪;谢康;沈百荣;王爽;;医学信息学杂志(05);全文 *

Also Published As

Publication number Publication date
CN115310130A (en) 2022-11-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant