WO2021143781A1 - Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning - Google Patents

Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning

Info

Publication number
WO2021143781A1
Authority
WO
WIPO (PCT)
Prior art keywords
center
model
source
target
data
Prior art date
Application number
PCT/CN2021/071827
Other languages
English (en)
French (fr)
Inventor
李劲松
田雨
陈伟国
马静
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Publication of WO2021143781A1
Priority to US17/543,738 (published as US11456078B2)

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4842 Monitoring progression or stage of a disease
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7271 Specific aspects of physiological measurement analysis
    • A61B5/7275 Determining trends in physiological measurement data; Predicting development of a medical condition based on physiological measurements, e.g. determining a risk factor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/20 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • The invention belongs to the fields of medicine and machine learning, and in particular relates to a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning.
  • Cancer has a high mortality rate, and as its incidence continues to rise, it has become one of the main causes of human death.
  • High-quality cancer prognosis prediction can provide a basis for doctors' clinical decision-making, and is of great significance to cancer control and treatment.
  • The purpose of the present invention is to address the deficiencies of the prior art by providing a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning, which mainly solves the following technical problems:
  • The electronic medical record data of a single institution are limited. Although the number of patients and the total volume of medical records are relatively large, the number of patients with a clear prognostic outcome event (such as death or relapse) in a single institution is limited for the purposes of prognosis research on a specific disease, which restricts the establishment of high-quality prognosis prediction models for that disease;
  • A multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning, the system including a model parameter setting module, a data screening module, and a multi-source transfer learning module.
  • The model parameter setting module is deployed in the management center and is responsible for setting the cancer prognosis prediction model parameters, including the cancer category (such as colorectal cancer), the source centers and the target center, the sample features (such as age, gender, colorectal cancer grade, and histological classification), the sample data preprocessing method, and the prognostic indicator (such as five-year survival status);
  • the management center coordinates and manages the resources of each clinical center, and accepts user visits;
  • the source center is a clinical center that has labeled samples for a specific cancer category, and is responsible for training the source cancer prognosis prediction model;
  • the target center is a clinical center that has unlabeled samples for a specific cancer category, and is responsible for training the target cancer prognosis prediction model;
  • the clinical center is an institution that actually holds clinical data, and is responsible for sample data screening and cancer prognosis prediction model training.
  • the data screening module is arranged in the clinical center, and the management center transmits the set model parameters to each clinical center.
  • Each clinical center uses the data screening module to screen data: it queries its local database for sample feature and prognostic indicator data according to the model parameters and preprocesses the sample data with the configured preprocessing method, so that each source center obtains a labeled sample set and the target center obtains an unlabeled sample set.
  • The multi-source transfer learning module includes a source model training unit, a transfer weight calculation unit, and a target model calculation unit;
  • The source model training unit is deployed in each source center. With K source centers denoted S_1, S_2, S_3, …, S_K, the i-th source center trains a local source cancer prognosis prediction model through its source model training unit and returns the trained source model to the management center;
  • The transfer weight calculation unit is deployed in the target center and receives the K source cancer prognosis prediction models sent by the management center. Assuming the target center has n_T unlabeled samples, each of the K source models makes a prognosis prediction for every unlabeled sample, which gives that sample's predicted label vector;
  • θ′ is the transpose of θ
  • e is the unit vector
  • W_ij represents the similarity between samples i and j
  • The optimization problem is converted into a standard quadratic programming problem, and solving it yields the transfer weights θ;
  • The target model calculation unit is deployed in the target center. It obtains the sample pseudo-labels from the transfer weights θ, uses the pseudo-labels to train the target cancer prognosis prediction model in the target center, and returns the trained target model to the management center.
  • The system also includes a model application module deployed in the management center. It receives the sample features entered by the user, as specified when the model parameters were set, calls the target model for cancer prognosis prediction, and presents the prediction result to the user as a number, a table, a graph, or the like.
  • the cancer prognosis prediction model may adopt a logistic regression model, a support vector machine model, a decision tree model, a neural network model, and the like.
  • the similarity between samples Wij may be cosine similarity, Gaussian similarity, and the like.
  • The sample data preprocessing method includes missing-value handling, dummy-variable encoding, normalization, and the like.
  • The sample features include demographic information, physiological parameters, and cancer pathology examination information extracted from the patient's electronic medical record (such as age, gender, colorectal cancer grade, and histological classification).
  • The present invention uses multi-source transfer learning to address the heterogeneity of data between the source centers and the target center, and to address the shortage of labeled data in the target center, so that a more accurate prediction model is built while taking the heterogeneity of multi-center data into account.
  • The raw data of each institution need not be shared during model training (only trained models are exchanged), which avoids leakage of patient privacy.
  • Figure 1 is a framework diagram of the system distribution of the present invention
  • Figure 2 is a data flow diagram: the rounded rectangle is the management center operation, and the right-angled rectangle is the clinical center operation.
  • The present invention provides a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning.
  • The system includes a model parameter setting module, a data screening module, and a multi-source transfer learning module.
  • The model parameter setting module is deployed in the management center and is responsible for setting the cancer prognosis prediction model parameters.
  • The cancer category is set to colorectal cancer
  • The four source centers are set as S_1, S_2, S_3, and S_4.
  • The target center is set as T
  • The sample features are set to age, gender, colorectal cancer grade, histological classification, number of positive lymph nodes, cancer tissue size, and platelet count; the sample data preprocessing method is set to mean imputation of missing values for all sample features and dummy-variable encoding of the categorical features; the prognostic indicator is set to five-year survival status
  • the management center coordinates and manages the resources of each clinical center, and accepts user visits;
  • the source center is a clinical center that has labeled samples for a specific cancer category, and is responsible for training the source cancer prognosis prediction model;
  • the target center is a clinical center that has unlabeled samples for a specific cancer category, and is responsible for training the target cancer prognosis prediction model;
  • the clinical center is an institution that actually holds clinical data, and is responsible for sample data screening and cancer prognosis prediction model training;
  • β is the model coefficient
  • X is the sample feature vector
  • the data screening module is arranged in the clinical center, and the management center transmits the set model parameters to each clinical center.
  • Each clinical center uses the data screening module to screen data: it queries its local database for sample feature and prognostic indicator data according to the model parameters and preprocesses the sample data with the configured preprocessing method, so that each source center obtains a labeled sample set and the target center obtains an unlabeled sample set;
  • The multi-source transfer learning module includes a source model training unit, a transfer weight calculation unit, and a target model calculation unit;
  • The source model training unit is deployed in each source center. The 4 source centers are denoted S_1, S_2, S_3, S_4, and the i-th source center trains a local source cancer prognosis prediction model through its source model training unit and returns the trained source model to the management center;
  • The transfer weight calculation unit is deployed in the target center and receives the 4 source cancer prognosis prediction models sent by the management center. Assuming the target center has 936 unlabeled samples, each of the 4 source models makes a prognosis prediction for every unlabeled sample, which gives that sample's predicted label vector
  • The 4 predicted labels in each sample's predicted label vector are summed with weights to give that sample's pseudo-label
  • θ′ is the transpose of θ
  • e is the unit vector
  • W_ij represents the similarity between samples i and j, which is calculated as cosine similarity
  • The optimization problem is converted into a standard quadratic programming problem, and solving it yields the transfer weights θ.
  • The target model calculation unit is deployed in the target center. It obtains the sample pseudo-labels from the transfer weights θ, uses the pseudo-labels to train the target cancer prognosis prediction model in the target center, and returns the trained target model to the management center.
  • the model application module is arranged in the management center.
  • The user inputs age, gender, colorectal cancer grade, histological classification, number of positive lymph nodes, cancer tissue size, and platelet count; the module calls the target model for cancer prognosis prediction and presents the predicted five-year survival status to the user.
  • Transfer learning is introduced in the present invention mainly to lift the assumption, made by conventional machine learning methods, that model training data and test data share the same feature space and the same distribution.
  • The system of the present invention uses multi-source transfer learning to address insufficient model generalization when the multi-source data sets used to train the prediction model differ (marginal differences, probability distribution differences) from the target data set on which the model is applied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Veterinary Medicine (AREA)
  • Biophysics (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Physiology (AREA)
  • Evolutionary Computation (AREA)
  • General Business, Economics & Management (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

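Provided is a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning. The system comprises a model parameter setting module, a data screening module, and a multi-source transfer learning module. The model parameter setting module sets the parameters of the cancer prognosis prediction model. The data screening module is deployed at each clinical center: the management center transmits the configured model parameters to the clinical centers, and each clinical center queries sample features and prognostic indicator data from its local database according to those parameters and preprocesses the data. The multi-source transfer learning module comprises source model training, transfer weight calculation, and target model calculation units. Multi-source transfer learning is used to address both the heterogeneity of data between the source centers and the target center and the shortage of labeled data at the target center, so that a more accurate prediction model is built while taking multi-center data heterogeneity into account. In addition, the raw data of each institution need not be shared during model training (only trained models are exchanged), which avoids leaking patient privacy.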

Description

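Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning
Technical Field
The present invention belongs to the fields of medicine and machine learning, and in particular relates to a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning.
Background
Cancer has a high mortality rate and, as its incidence keeps rising, has become one of the leading causes of human death. High-quality cancer prognosis prediction can inform physicians' clinical decisions and is of great significance for cancer control and treatment.
Traditional prognosis prediction relies on expert clinical experience (for example, the TNM model) and lacks evidence-based support. With the development of medical information technology, in particular electronic medical records and medical big-data analysis and mining, data-driven prognosis prediction models have attracted growing attention. These models require large-scale clinical data, but for a single disease a single institution often lacks enough labeled data to support model training, so the resulting models perform poorly; prognosis prediction models therefore need to be built collaboratively across multiple centers.
Existing technical solutions usually pool the data of several institutions and train a general model. Because data are heterogeneous across institutions (mainly in differences of marginal distribution and conditional probability distribution), the resulting general model generalizes poorly and tends to perform badly when the target institution's data differ substantially from the training data. Good performance is obtained only after the target institution has accumulated a certain number of labeled samples and uses them to calibrate the general model. A mechanism that integrates model training with the application environment is still lacking.
Both training a model directly on local labeled samples and calibrating a general model with local labeled samples require a certain number of local labeled samples. When local labels are lacking, existing methods are difficult to apply. Moreover, large-scale data require the joint participation of multiple institutions, which carries a risk of leaking patient privacy.
Summary of the Invention
The purpose of the present invention is to address the deficiencies of the prior art by providing a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning, which mainly solves the following technical problems:
1. The electronic medical record data of a single institution are limited. Although the number of patients and the total volume of records are large, the number of patients with a definite prognostic outcome event (such as death or recurrence) in a single institution is limited for the purposes of prognosis research on a specific disease, which restricts the construction of high-quality prognosis prediction models for that disease;
2. Model generalization has been insufficiently studied. Models built with existing methods (statistical models in particular) predict well on data sets whose feature distribution is similar to that of the training set, but often perform poorly on data sets whose marginal or conditional probability distributions differ from those of the training environment.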
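The purpose of the present invention is achieved through the following technical solution: a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning, comprising a model parameter setting module, a data screening module, and a multi-source transfer learning module.
The model parameter setting module is deployed at the management center and sets the parameters of the cancer prognosis prediction model, including the cancer category (e.g., colorectal cancer), the source centers and the target center, the sample features (e.g., age, sex, colorectal cancer grade, histological classification), the sample data preprocessing method, and the prognostic indicator (e.g., five-year survival status);
The management center coordinates and manages the resources of the clinical centers and accepts user access;
A source center is a clinical center that holds labeled samples for the specific cancer category and is responsible for training a source cancer prognosis prediction model;
The target center is a clinical center that holds unlabeled samples for the specific cancer category and is responsible for training the target cancer prognosis prediction model;
A clinical center is an institution that actually holds clinical data and is responsible for sample data screening and cancer prognosis prediction model training.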
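As an illustration of the parameter set that the management center configures and sends to the clinical centers, here is a minimal sketch. The class name `ModelParameters`, its field names, and the `validate` checks are hypothetical and not taken from the patent; the example values follow the colorectal cancer embodiment described in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelParameters:
    """Hypothetical container for the settings chosen at the management center."""
    cancer_category: str
    source_centers: List[str]
    target_center: str
    sample_features: List[str]
    preprocessing: List[str]
    prognostic_indicator: str

    def validate(self) -> None:
        # Basic consistency checks (assumed, not specified by the patent).
        if not self.source_centers:
            raise ValueError("at least one source center is required")
        if self.target_center in self.source_centers:
            raise ValueError("the target center must differ from the source centers")
        if not self.sample_features:
            raise ValueError("at least one sample feature is required")


# Values taken from the embodiment: four source centers, one target center.
params = ModelParameters(
    cancer_category="colorectal cancer",
    source_centers=["S1", "S2", "S3", "S4"],
    target_center="T",
    sample_features=[
        "age", "sex", "colorectal_cancer_grade", "histological_classification",
        "positive_lymph_node_count", "tumor_tissue_size", "platelet_count",
    ],
    preprocessing=["mean_imputation", "dummy_encoding"],
    prognostic_indicator="five_year_survival_status",
)
params.validate()
```

Keeping the parameters in one immutable object is only one possible design; it makes it easy to send the same configuration to every clinical center.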
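The data screening module is deployed at each clinical center. The management center transmits the configured model parameters to the clinical centers; each clinical center uses the data screening module to screen data, querying sample features and prognostic indicator data from its local database according to the model parameters and preprocessing the sample data with the configured preprocessing method. Each source center thereby obtains a labeled sample set, and the target center obtains an unlabeled sample set.
The multi-source transfer learning module comprises a source model training unit, a transfer weight calculation unit, and a target model calculation unit;
The source model training unit is deployed at each source center. Let there be K source centers, denoted S_1, S_2, S_3, …, S_K; the i-th source center trains, through its source model training unit, a local source cancer prognosis prediction model
Figure PCTCN2021071827-appb-000001
and returns the trained source model to the management center;
The transfer weight calculation unit is deployed at the target center and receives the K source cancer prognosis prediction models sent by the management center. Let the target center have n_T unlabeled samples, with the i-th unlabeled sample denoted
Figure PCTCN2021071827-appb-000002
Each of the K source cancer prognosis prediction models makes a prognosis prediction for sample
Figure PCTCN2021071827-appb-000003
giving the predicted label vector
Figure PCTCN2021071827-appb-000004
Figure PCTCN2021071827-appb-000005
The K predicted labels in the predicted label vector
Figure PCTCN2021071827-appb-000006
are summed with weights to give, for sample
Figure PCTCN2021071827-appb-000007
the pseudo-label
Figure PCTCN2021071827-appb-000008
Figure PCTCN2021071827-appb-000009
where
Figure PCTCN2021071827-appb-000010
denotes the transfer weights of the source models. Based on a smoothness assumption on the target-center sample data (the more similar two samples are, the closer their pseudo-labels should be), the weights are chosen to minimize the difference between the pseudo-labels of pairs of samples in the target-center sample set, which is expressed as the following optimization problem:
Figure PCTCN2021071827-appb-000011
where θ′ is the transpose of θ, e is the unit vector, and W_ij denotes the similarity between samples i and j;
The above optimization problem is converted into:
Figure PCTCN2021071827-appb-000012
where H_S is an n_T × K matrix and L_T denotes the graph Laplacian associated with the target center, computed as L_T = D - W; W is the similarity matrix of the target-center samples, and the formula
Figure PCTCN2021071827-appb-000013
defines the diagonal matrix D;
The optimization problem is thereby converted into a standard quadratic programming problem, and solving it yields the transfer weights θ;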
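The equation images referenced above are not reproduced in this text. A reconstruction consistent with the surrounding definitions is sketched below. The symbols f_k (the k-th source model), x_i (the i-th target sample), h_i (the i-th row of H_S) and \hat{y}_i (the pseudo-label), as well as the non-negativity constraint on θ, are assumptions rather than the patent's own notation. W is assumed symmetric, which holds for cosine and Gaussian similarity, and e is read as the vector whose entries are all 1 so that θ′e = 1 makes the weights sum to one.

```latex
% Predicted label vector and pseudo-label of target sample i (assumed notation)
h_i = \bigl[f_1(x_i),\, f_2(x_i),\, \dots,\, f_K(x_i)\bigr], \qquad
\hat{y}_i = \sum_{k=1}^{K} \theta_k f_k(x_i) = h_i\,\theta

% Smoothness objective over all pairs of target samples
\min_{\theta \,:\, \theta' e = 1,\ \theta \ge 0}\ 
\sum_{i=1}^{n_T}\sum_{j=1}^{n_T} W_{ij}\,\bigl(h_i\theta - h_j\theta\bigr)^2

% Expansion with D_{ii} = \sum_j W_{ij} and L_T = D - W
\sum_{i,j} W_{ij}(\hat{y}_i - \hat{y}_j)^2
  = 2\sum_i D_{ii}\hat{y}_i^2 - 2\sum_{i,j} W_{ij}\hat{y}_i\hat{y}_j
  = 2\,\hat{y}'(D - W)\,\hat{y}
  = 2\,\theta' H_S' L_T H_S\,\theta

% Equivalent quadratic program (the constant factor 2 does not change the minimizer)
\min_{\theta \,:\, \theta' e = 1,\ \theta \ge 0}\ \theta' H_S' L_T H_S\,\theta
```

Because L_T is positive semidefinite when all W_ij are non-negative, the K × K matrix H_S′ L_T H_S is positive semidefinite as well, so the problem is a convex quadratic program in only K variables.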
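The target model calculation unit is deployed at the target center. It obtains the sample pseudo-labels from the transfer weights θ, uses the pseudo-labels to train the target cancer prognosis prediction model at the target center, and returns the trained target model to the management center.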
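The weight calculation and pseudo-labelling steps can be sketched in plain Python as follows. This is not the patent's implementation: the patent only states that a standard quadratic program is solved, and projected gradient descent onto the probability simplex is swapped in here to keep the example dependency-free. All function names, the non-negativity constraint on θ, and the toy data are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(W: Matrix) -> Matrix:
    """Graph Laplacian L = D - W, where D_ii is the i-th row sum of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def quadratic_form(H: Matrix, L: Matrix) -> Matrix:
    """Q = H' L H, a K x K matrix (H is n_T x K, L is n_T x n_T)."""
    n, K = len(H), len(H[0])
    LH = [[sum(L[i][m] * H[m][k] for m in range(n)) for k in range(K)]
          for i in range(n)]
    return [[sum(H[i][a] * LH[i][b] for i in range(n)) for b in range(K)]
            for a in range(K)]


def project_to_simplex(v: List[float]) -> List[float]:
    """Euclidean projection onto {theta : theta >= 0, sum(theta) = 1}."""
    cumulative, tau = 0.0, 0.0
    for j, uj in enumerate(sorted(v, reverse=True), start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            tau = t  # the last j that satisfies the condition fixes the shift
    return [max(x - tau, 0.0) for x in v]


def solve_transfer_weights(H: Matrix, W: Matrix, steps: int = 2000) -> List[float]:
    """Minimize theta' Q theta on the simplex by projected gradient descent."""
    Q = quadratic_form(H, laplacian(W))
    K = len(Q)
    # The max absolute row sum bounds the largest eigenvalue of Q,
    # so 1 / (2 * bound) is a safe step size for the gradient 2 Q theta.
    bound = max(sum(abs(q) for q in row) for row in Q)
    step = 1.0 / (2.0 * bound) if bound > 0 else 1.0
    theta = [1.0 / K] * K
    for _ in range(steps):
        grad = [2.0 * sum(Q[a][b] * theta[b] for b in range(K)) for a in range(K)]
        theta = project_to_simplex([t - step * g for t, g in zip(theta, grad)])
    return theta


def pseudo_labels(H: Matrix, theta: List[float]) -> List[float]:
    """Pseudo-label of each target sample: weighted sum of the source predictions."""
    return [sum(h * t for h, t in zip(row, theta)) for row in H]


# Toy data: 4 target samples, 2 source models. Samples 0/1 and 2/3 are similar.
H = [[0.9, 0.9], [0.9, 0.1], [0.1, 0.9], [0.1, 0.1]]  # rows: samples, cols: models
W = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
theta = solve_transfer_weights(H, W)
labels = pseudo_labels(H, theta)
```

In the toy data the first source model gives the same prediction to similar samples while the second does not, so the smoothness objective shifts essentially all weight to the first model. For a real deployment a dedicated quadratic programming solver would be a natural replacement for the hand-written loop.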
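Further, the system also includes a model application module deployed at the management center. It receives the sample features entered by the user, as specified when the model parameters were set, calls the target model to make a cancer prognosis prediction, and presents the result to the user as a number, a table, a chart, or the like.
Further, the cancer prognosis prediction model may be a logistic regression model, a support vector machine model, a decision tree model, a neural network model, or the like.
Further, the similarity W_ij between samples may be cosine similarity, Gaussian similarity, or the like.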
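The two similarity measures named above could be computed as in the following sketch. The function names, the bandwidth parameter `sigma`, and the choice to set the diagonal of the similarity matrix to zero are assumptions; the diagonal does not affect the graph Laplacian D - W, because each self-similarity cancels between D and W.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gaussian_similarity(a: Sequence[float], b: Sequence[float], sigma: float = 1.0) -> float:
    """exp(-||a - b||^2 / (2 sigma^2)); sigma is an assumed bandwidth parameter."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def similarity_matrix(
    X: List[Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
) -> List[List[float]]:
    """Pairwise similarity matrix W of the target-center samples, with zero diagonal."""
    n = len(X)
    return [[0.0 if i == j else similarity(X[i], X[j]) for j in range(n)]
            for i in range(n)]


X = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
W_cos = similarity_matrix(X, cosine_similarity)
W_gauss = similarity_matrix(X, gaussian_similarity)
```

Cosine similarity can be negative when features take both signs; scaling features to non-negative values first keeps every W_ij non-negative, which the smoothness objective implicitly assumes.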
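Further, the sample data preprocessing method includes missing-value handling, dummy-variable encoding, normalization, and the like.
Further, the sample features include demographic information, physiological parameters, and cancer pathology examination information extracted from the patient's electronic medical record (e.g., age, sex, colorectal cancer grade, histological classification).
The beneficial effects of the present invention are as follows: multi-source transfer learning is used to address the heterogeneity of data between the source centers and the target center, and to address the shortage of labeled data at the target center, so that a more accurate prediction model is built while taking multi-center data heterogeneity into account. In addition, the raw data of each institution need not be shared during model training (only trained models are exchanged), which avoids leaking patient privacy.
Brief Description of the Drawings
Figure 1 is a diagram of the distributed framework of the system of the present invention;
Figure 2 is a data flow diagram: rounded rectangles are management center operations, and square-cornered rectangles are clinical center operations.
Detailed Description
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
As shown in Figure 1, the present invention provides a multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning, comprising a model parameter setting module, a data screening module, and a multi-source transfer learning module.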
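The model parameter setting module is deployed at the management center and sets the parameters of the cancer prognosis prediction model. In this embodiment the cancer category is set to colorectal cancer; four source centers are set, S_1, S_2, S_3, and S_4; the target center is set to T; the sample features are set to age, sex, colorectal cancer grade, histological classification, number of positive lymph nodes, tumor tissue size, and platelet count; the sample data preprocessing method is set to mean imputation of missing values for all sample features and dummy-variable encoding of the categorical features; and the prognostic indicator is set to five-year survival status;
The management center coordinates and manages the resources of the clinical centers and accepts user access;
A source center is a clinical center that holds labeled samples for the specific cancer category and is responsible for training a source cancer prognosis prediction model;
The target center is a clinical center that holds unlabeled samples for the specific cancer category and is responsible for training the target cancer prognosis prediction model;
A clinical center is an institution that actually holds clinical data and is responsible for sample data screening and cancer prognosis prediction model training;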
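The preprocessing configured in this embodiment (mean imputation of missing values and dummy-variable encoding of categorical features) can be sketched as follows. The function names, the use of `None` to mark a missing value, and the decision to drop the first category as the reference level are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def mean_impute(column: List[Optional[float]]) -> List[float]:
    """Replace missing values (None) with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in column]


def dummy_encode(column: List[str]) -> Dict[str, List[int]]:
    """One 0/1 indicator column per category, dropping the first (reference) level."""
    levels = sorted(set(column))
    return {
        f"is_{level}": [int(value == level) for value in column]
        for level in levels[1:]
    }


# Toy records: platelet count with one missing value, and a categorical grade.
platelet_count = mean_impute([210.0, None, 250.0])
grade_dummies = dummy_encode(["II", "III", "I", "II"])
```

Dropping one reference level avoids perfectly collinear indicator columns, which matters for models such as logistic regression that include an intercept.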
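In this embodiment the cancer prognosis prediction model is a logistic regression model:
Figure PCTCN2021071827-appb-000014
where β is the model coefficient vector, X is the sample feature vector, and
Figure PCTCN2021071827-appb-000015
is the prediction result.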
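The formula image is not reproduced here. Assuming the standard logistic form, in which the prediction is 1 / (1 + exp(-β′X)), a source or target model could be fitted as in the sketch below. The function names, the plain gradient-descent training loop, the learning rate, and the toy data are assumptions; the patent does not specify a training procedure.

```python
import math
from typing import List


def predict_proba(beta: List[float], x: List[float]) -> float:
    """Logistic prediction 1 / (1 + exp(-beta . x)), computed in a numerically stable way."""
    z = sum(b * v for b, v in zip(beta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X: List[List[float]], y: List[float],
                 learning_rate: float = 0.1, epochs: int = 2000) -> List[float]:
    """Minimize the mean cross-entropy loss by batch gradient descent."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            error = predict_proba(beta, xi) - yi
            for k, value in enumerate(xi):
                grad[k] += error * value / n
        beta = [b - learning_rate * g for b, g in zip(beta, grad)]
    return beta


# Toy data: the first column is a constant 1 (intercept), the second a single feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 1.0]
beta = fit_logistic(X, y)
low_risk = predict_proba(beta, [1.0, 0.0])
high_risk = predict_proba(beta, [1.0, 3.0])
```

The cross-entropy gradient used here also accepts labels between 0 and 1, so the same routine could in principle be trained on weighted pseudo-labels at the target center.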
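The data screening module is deployed at each clinical center. The management center transmits the configured model parameters to the clinical centers; each clinical center uses the data screening module to screen data, querying sample features and prognostic indicator data from its local database according to the model parameters and preprocessing the sample data with the configured preprocessing method. Each source center thereby obtains a labeled sample set, and the target center obtains an unlabeled sample set;
The multi-source transfer learning module comprises a source model training unit, a transfer weight calculation unit, and a target model calculation unit;
The source model training unit is deployed at each source center. The 4 source centers are denoted S_1, S_2, S_3, S_4; the i-th source center trains, through its source model training unit, a local source cancer prognosis prediction model
Figure PCTCN2021071827-appb-000016
and returns the trained source model to the management center;
The transfer weight calculation unit is deployed at the target center and receives the 4 source cancer prognosis prediction models sent by the management center. Let the target center have 936 unlabeled samples, with the i-th unlabeled sample denoted
Figure PCTCN2021071827-appb-000017
Each of the 4 source cancer prognosis prediction models makes a prognosis prediction for sample
Figure PCTCN2021071827-appb-000018
giving the predicted label vector
Figure PCTCN2021071827-appb-000019
Figure PCTCN2021071827-appb-000020
The 4 predicted labels in the predicted label vector
Figure PCTCN2021071827-appb-000021
are summed with weights to give, for sample
Figure PCTCN2021071827-appb-000022
the pseudo-label
Figure PCTCN2021071827-appb-000023
Figure PCTCN2021071827-appb-000024
where
Figure PCTCN2021071827-appb-000025
denotes the transfer weights of the source models. Based on a smoothness assumption on the target-center sample data (the more similar two samples are, the closer their pseudo-labels should be), the weights are chosen to minimize the difference between the pseudo-labels of pairs of samples in the target-center sample set, which is expressed as the following optimization problem:
Figure PCTCN2021071827-appb-000026
where θ′ is the transpose of θ, e is the unit vector, and W_ij denotes the similarity between samples i and j, computed as cosine similarity;
The above optimization problem is converted into:
Figure PCTCN2021071827-appb-000027
where H_S is a 936 × 4 matrix and L_T denotes the graph Laplacian associated with the target center, computed as L_T = D - W; W is the similarity matrix of the target-center samples, and the formula
Figure PCTCN2021071827-appb-000028
defines the diagonal matrix D;
The optimization problem is thereby converted into a standard quadratic programming problem, and solving it yields the transfer weights θ.
The target model calculation unit is deployed at the target center. It obtains the sample pseudo-labels from the transfer weights θ, uses the pseudo-labels to train the target cancer prognosis prediction model at the target center, and returns the trained target model to the management center.
In this embodiment the model application module is deployed at the management center. It receives the age, sex, colorectal cancer grade, histological classification, number of positive lymph nodes, tumor tissue size, and platelet count entered by the user, as specified when the model parameters were set, calls the target model to make a cancer prognosis prediction, and presents the predicted five-year survival status to the user.
Transfer learning is introduced in the present invention mainly to lift the assumption, made by conventional machine learning methods, that model training data and test data share the same feature space and the same distribution. The system of the present invention uses multi-source transfer learning to address insufficient model generalization when the multi-source data sets used to train the prediction model differ (marginal differences, probability distribution differences) from the target data set on which the model is applied.
The above is only an embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made without creative effort within the spirit and principles of the present invention falls within its scope of protection.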

Claims (6)

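  1. A multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning, characterized in that the system comprises a model parameter setting module, a data screening module, and a multi-source transfer learning module.
    The model parameter setting module is deployed at the management center and sets the parameters of the cancer prognosis prediction model, including the cancer category, the source centers and the target center, the sample features, the sample data preprocessing method, and the prognostic indicator;
    The management center coordinates and manages the resources of the clinical centers and accepts user access;
    A source center is a clinical center that holds labeled samples for the specific cancer category and is responsible for training a source cancer prognosis prediction model;
    The target center is a clinical center that holds unlabeled samples for the specific cancer category and is responsible for training the target cancer prognosis prediction model;
    A clinical center is an institution that actually holds clinical data and is responsible for sample data screening and cancer prognosis prediction model training.
    The data screening module is deployed at each clinical center. The management center transmits the configured model parameters to the clinical centers; each clinical center uses the data screening module to screen data, querying sample features and prognostic indicator data from its local database according to the model parameters and preprocessing the sample data with the configured preprocessing method. Each source center thereby obtains a labeled sample set, and the target center obtains an unlabeled sample set.
    The multi-source transfer learning module comprises a source model training unit, a transfer weight calculation unit, and a target model calculation unit;
    The source model training unit is deployed at each source center. Let there be K source centers, denoted S_1, S_2, S_3, …, S_K; the i-th source center trains, through its source model training unit, a local source cancer prognosis prediction model
    Figure PCTCN2021071827-appb-100001
    and returns the trained source model to the management center;
    The transfer weight calculation unit is deployed at the target center and receives the K source cancer prognosis prediction models sent by the management center. Let the target center have n_T unlabeled samples, with the i-th unlabeled sample denoted
    Figure PCTCN2021071827-appb-100002
    Each of the K source cancer prognosis prediction models makes a prognosis prediction for sample
    Figure PCTCN2021071827-appb-100003
    giving the predicted label vector
    Figure PCTCN2021071827-appb-100004
    Figure PCTCN2021071827-appb-100005
    The K predicted labels in the predicted label vector
    Figure PCTCN2021071827-appb-100006
    are summed with weights to give, for sample
    Figure PCTCN2021071827-appb-100007
    the pseudo-label
    Figure PCTCN2021071827-appb-100008
    Figure PCTCN2021071827-appb-100009
    where
    Figure PCTCN2021071827-appb-100010
    denotes the transfer weights of the source models. Based on a smoothness assumption on the target-center sample data, the weights are chosen to minimize the difference between the pseudo-labels of pairs of samples in the target-center sample set, which is expressed as the following optimization problem:
    Figure PCTCN2021071827-appb-100011
    where θ′ is the transpose of θ, e is the unit vector, and W_ij denotes the similarity between samples i and j;
    The above optimization problem is converted into:
    Figure PCTCN2021071827-appb-100012
    where H_S is an n_T × K matrix and L_T denotes the graph Laplacian associated with the target center, computed as L_T = D - W; W is the similarity matrix of the target-center samples, and the formula
    Figure PCTCN2021071827-appb-100013
    defines the diagonal matrix D;
    The optimization problem is thereby converted into a standard quadratic programming problem, and solving it yields the transfer weights θ;
    The target model calculation unit is deployed at the target center. It obtains the sample pseudo-labels from the transfer weights θ, uses the pseudo-labels to train the target cancer prognosis prediction model at the target center, and returns the trained target model to the management center.
  2. The multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning according to claim 1, characterized in that the system further comprises a model application module deployed at the management center, which receives the sample features entered by the user, as specified when the model parameters were set, calls the target model to make a cancer prognosis prediction, and presents the prediction result to the user.
  3. The multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning according to claim 1, characterized in that the cancer prognosis prediction model may be a logistic regression model, a support vector machine model, a decision tree model, a neural network model, or the like.
  4. The multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning according to claim 1, characterized in that the similarity W_ij between samples may be cosine similarity, Gaussian similarity, or the like.
  5. The multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning according to claim 1, characterized in that the sample data preprocessing method includes missing-value handling, dummy-variable encoding, normalization, and the like.
  6. The multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning according to claim 1, characterized in that the sample features include demographic information, physiological parameters, cancer pathology examination information, and the like extracted from the patient's electronic medical record.
PCT/CN2021/071827 2020-01-14 2021-01-14 Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning WO2021143781A1 (zh)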

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/543,738 US11456078B2 (en) 2020-01-14 2021-12-07 Multi-center synergetic cancer prognosis prediction system based on multi-source migration learning

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010038230.2A CN111261299B (zh) 2020-01-14 2020-01-14 Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning
CN202010038230.2 2020-01-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/543,738 Continuation US11456078B2 (en) 2020-01-14 2021-12-07 Multi-center synergetic cancer prognosis prediction system based on multi-source migration learning

Publications (1)

Publication Number Publication Date
WO2021143781A1 (zh) 2021-07-22

Family

ID=70948784

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/071827 WO2021143781A1 (zh) 2020-01-14 2021-01-14 Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning

Country Status (3)

Country Link
US (1) US11456078B2 (zh)
CN (1) CN111261299B (zh)
WO (1) WO2021143781A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
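CN111261299B (zh) 2020-01-14 2022-02-22 之江实验室 Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning
CN111610768B (zh) * 2020-06-10 2021-03-19 中国矿业大学 Batch process quality prediction method based on a similarity-based multi-source-domain transfer learning strategy
CN112669986B (zh) * 2020-12-30 2023-09-26 华南师范大学 Infectious disease collaborative prediction method and robot based on deep learning of similar big data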

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
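CN106897545A (zh) * 2017-01-05 2017-06-27 浙江大学 Tumor prognosis prediction system based on a deep belief network
WO2018143540A1 (ko) * 2017-02-02 2018-08-09 사회복지법인 삼성생명공익재단 Method, apparatus and program for predicting the prognosis of gastric cancer using an artificial neural network
CN108520780A (zh) * 2018-03-07 2018-09-11 中国科学院计算技术研究所 Transfer-learning-based medical data processing and system
CN108922628A (zh) * 2018-04-23 2018-11-30 华北电力大学 Breast cancer prognosis survival rate prediction method based on a dynamic Cox model
CN109902421A (zh) * 2019-03-08 2019-06-18 山东大学齐鲁医院 Cervical cancer prognosis evaluation method, system, storage medium and computer device
CN110391022A (zh) * 2019-07-25 2019-10-29 东北大学 Deep learning method for fine-grained diagnosis of breast cancer pathology images based on multi-stage transfer
CN111261299A (zh) * 2020-01-14 2020-06-09 之江实验室 Multi-center collaborative cancer prognosis prediction system based on multi-source transfer learning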

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7805385B2 (en) * 2006-04-17 2010-09-28 Siemens Medical Solutions Usa, Inc. Prognosis modeling from literature and other sources
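CN107273922A (zh) * 2017-06-02 2017-10-20 云南大学 Sample screening and weight calculation method for multi-source instance transfer learning
CN108281183A (zh) * 2018-01-30 2018-07-13 重庆大学 Cervical smear image diagnosis system based on convolutional neural networks and transfer learning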
WO2019183596A1 (en) * 2018-03-23 2019-09-26 Memorial Sloan Kettering Cancer Center Systems and methods for multiple instance learning for classification and localization in biomedical imagining
WO2019200404A2 (en) * 2018-04-13 2019-10-17 Grail, Inc. Multi-assay prediction model for cancer detection
CN108846444B (zh) * 2018-06-23 2022-02-01 重庆大学 Multi-stage deep transfer learning method for multi-source data mining
WO2020023671A1 (en) * 2018-07-24 2020-01-30 Protocol Intelligence, Inc. Methods and systems for treating cancer and predicting and optimizing treatment outcomes in individual cancer patients
CN109034080B (zh) * 2018-08-01 2021-10-22 桂林电子科技大学 Multi-source domain adaptive face recognition method
US20200258601A1 (en) * 2018-10-17 2020-08-13 Tempus Labs Targeted-panel tumor mutational burden calculation systems and methods
EP3867410A4 (en) * 2018-10-18 2022-07-13 MedImmune, LLC METHODS OF DETERMINING A TREATMENT FOR CANCER PATIENTS
WO2020132499A2 (en) * 2018-12-21 2020-06-25 Grail, Inc. Systems and methods for using fragment lengths as a predictor of cancer
US11705226B2 (en) * 2019-09-19 2023-07-18 Tempus Labs, Inc. Data based cancer research and treatment systems and methods
CN110660478A (zh) * 2019-09-18 2020-01-07 西安交通大学 Transfer-learning-based cancer image prediction and discrimination method and system
US11896349B2 (en) * 2019-12-09 2024-02-13 Case Western Reserve University Tumor characterization and outcome prediction through quantitative measurements of tumor-associated vasculature

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HU MAN-MAN;CHEN XU;SUN YU-ZHONG;SHEN XI;WANG XIAO-QING;YU TIAN-YANG;MEI YU-DONG;XIAO LI;CHENG WEI;YANG JIE;YANG YAN: "A Disease Prediction Model Based on Dynamic Sampling and Transfer Learning", CHINESE JOURNAL OF COMPUTERS, vol. 42, no. 10, 31 December 2019 (2019-12-31), pages 2339 - 2354, XP055829627, DOI: 10.11897/sp.j.2016.2019.02339 *
RITA CHATTOPADHYAY ; QIAN SUN ; WEI FAN ; IAN DAVIDSON ; SETHURAMAN PANCHANATHAN ; JIEPING YE: "Multi-Source Domain Adaptation and Its Application to Early Detection of Fatigue", ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA (TKDD), vol. 6, no. 4, 18 December 2012 (2012-12-18), pages 1 - 26, XP058009961, ISSN: 1556-4681, DOI: 10.1145/2382577.2382582 *

Also Published As

Publication number Publication date
CN111261299B (zh) 2022-02-22
US11456078B2 (en) 2022-09-27
US20220093258A1 (en) 2022-03-24
CN111261299A (zh) 2020-06-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21740906

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21740906

Country of ref document: EP

Kind code of ref document: A1