WO2021203796A1 - Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis - Google Patents

Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis

Info

Publication number
WO2021203796A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
loss
prediction
survival
model
Prior art date
Application number
PCT/CN2021/073136
Other languages
French (fr)
Chinese (zh)
Inventor
李劲松
池胜强
田雨
周天舒
Original Assignee
之江实验室
Priority date
Filing date
Publication date
Application filed by 之江实验室
Publication of WO2021203796A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 70/00: ICT specially adapted for the handling or processing of medical references
    • G16H 70/20: ICT specially adapted for the handling or processing of medical references relating to practices or guidelines

Definitions

  • the invention belongs to the technical field of medical treatment and machine learning, and in particular relates to a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis.
  • Disease prognosis prediction analysis can provide clinicians with prognostic information for disease treatment, help formulate treatment plans, increase disease cure rate, improve patient prognostic quality of life, and effectively reduce disease burden, which is of great significance for disease control and treatment.
  • Survival analysis is a commonly used data analysis method in the prediction of disease prognosis, which is used to analyze and predict the time of occurrence of an event. In medicine, it plays a key role in determining the course of treatment, developing new drugs, preventing adverse drug reactions and improving hospital procedures.
  • applications of deep learning network structures such as deep neural networks, convolutional neural networks, and long short-term memory networks to disease prognosis prediction have begun to increase.
  • some advanced machine learning strategies are gradually being applied to survival analysis methods based on deep learning, including active learning, transfer learning, and multi-task learning to improve the performance of disease prognosis prediction.
  • Censored data are common in disease prognosis data. Censored data are not missing data, but incomplete data that can only provide prognostic information from the beginning to the censored time, and cannot provide complete information from the beginning to the occurrence of the event.
  • Existing deep-learning-based methods either cannot make full use of censored data; or, when making full use of censored data, cannot effectively handle the time-dependent effects of features; or have insufficient generalization ability; or have poor interpretability.
  • the existing methods based on multi-task learning cannot make full use of censored data.
  • the purpose of the present invention is to provide a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis in view of the deficiencies of the prior art.
  • the present invention is based on a deep neural network model and transforms the survival analysis problem into a multi-task learning model composed of semi-supervised learning problems that predict the survival probability at multiple time points; considering censored data and the non-increasing trend of the survival probability in survival analysis, it proposes fitting the data with a semi-supervised loss function and a ranking loss function, and can handle both traditional survival analysis problems and survival analysis problems considering competing risks.
  • it provides a method for evaluating the importance of features, and visualizes the time dependence and nonlinear effects of features.
  • the deep neural network structure in the model contains multiple layers of nonlinear transformation units, which can fit the nonlinear effects of features.
  • the model directly models survival probability, does not rely on proportional hazards assumptions, can fit time-dependent effects, and has better explanatory properties.
  • the model makes full use of complete data and censored data through the logarithmic loss function and the semi-supervised loss function; exploits the non-increasing trend of the survival probability through the ranking loss function; and realizes automatic feature selection and prevents model overfitting through the L1 and L2 loss functions.
  • the model realizes data sharing among multiple prediction tasks through multi-task learning at multiple time points and, at the same time, realizes mutual constraints among the prediction tasks, improving the generalization ability of the model.
  • a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis, including: a data acquisition module for acquiring disease prognosis data; a data preprocessing module for missing-value processing and normalization of the disease prognosis data; a prediction model building module for modeling the disease prognosis data; and a prediction result display module for displaying the prediction results; the prediction model building module adopts a survival analysis method based on deep semi-supervised multi-task learning, the specific steps of which are as follows:
  • N is the number of samples and M is the number of features.
  • the input of the deep neural network is the feature set X of the data set, the output labels are Y, and each output layer corresponds to one y in Y, that is, each output layer corresponds to the event prediction task at a different time.
  • the deep neural network can make predictions for the same task at K different times.
  • the objective function of the prediction model is composed of five parts: log loss, L1 loss, L2 loss, semi-supervised loss and ranking loss:
  • the model uses the logarithmic loss to penalize incorrect classifications and measure the accuracy of the classifier.
  • let the label be y, y ∈ {0, 1}.
  • the parameter ⁇ is estimated by the maximum likelihood estimation method, and the likelihood function is:
  • l is the number of labeled samples
  • p(X_i; θ) is the posterior probability of sample X_i.
  • the event prediction at each time point is regarded as a multi-classification problem.
  • X i ; ⁇ ), where k 1, 2,...,C, and C is the number of all possible outcomes.
  • the parameter ⁇ is estimated by the maximum likelihood estimation method, and the corresponding log loss function is:
  • for unlabeled data, the unlabeled data are utilized by adding an entropy-constrained regularization term to the objective function.
  • the event state is a random variable that obeys the Bernoulli distribution and the parameter is p. Its entropy is defined as follows:
  • u is the number of unlabeled samples
  • p is the probability of occurrence of the event. If the category of unlabeled data is determined, the entropy constraint regularization term will be small.
  • the non-increasing trend of survival probability is constrained by adding a ranking loss to the objective function.
  • the ranking loss is defined as follows:
  • p_{i,p}(y_i = 1 | X_i; θ) denotes the probability that the event occurs for the i-th sample at time p; when p < q, the predicted probabilities should satisfy p_{i,p}(y_i = 1 | X_i; θ) < p_{i,q}(y_i = 1 | X_i; θ), otherwise a penalty is imposed on this pair of event probabilities; I(p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ)) is the indicator function, which equals 1 when p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ) and 0 otherwise.
  • the objective function of the deep-learning-based semi-supervised multi-task survival analysis model, i.e. of the prediction model, is L_total(θ) = l(θ) + λ_1 L1(θ) + λ_2 L2(θ) + λ_3 Ω(θ) + λ_4 R(θ), where:
  • l( ⁇ ) is the logarithmic loss
  • L1( ⁇ ) is the L1 loss
  • L2( ⁇ ) is the L2 loss
  • ⁇ ( ⁇ ) is the semi-supervised loss
  • R( ⁇ ) is the ranking loss
  • ⁇ 1 , ⁇ 2 , ⁇ 3 , ⁇ 4 are the parameters that control the strength of the regular term.
  • the step (2) transforms the original survival analysis problem into a multi-task learning problem through the process of converting the label information into a vector.
  • the hidden layer parameters in the deep neural network adopt a hard sharing mechanism, thereby reducing the risk of overfitting.
  • in step (4), the deep semi-supervised multi-task learning problem obtained from the survival analysis problem has two important characteristics: unlabeled data caused by censoring and the non-increasing trend of the survival probability.
  • for the unlabeled data caused by censoring, semi-supervised learning is performed using entropy-constrained regularization; for the non-increasing trend of the survival probabilities at different time points, a ranking loss is introduced to constrain the survival probabilities of the different output layers.
  • L1 loss is introduced into the objective function to realize automatic feature selection, and L2 loss is introduced to avoid overfitting.
  • the prediction result display module is used for feature importance evaluation, and visually displays the time dependence and nonlinear effects of features.
  • the specific steps for calculating the importance of a feature F are as follows:
  • the prediction result display module visually displays the influence of the characteristics on the prognosis by drawing the predicted cumulative incidence curves corresponding to different characteristics. To draw the predicted cumulative incidence curve corresponding to a certain feature F, the specific steps are as follows:
  • All possible values of feature F are x_{F,1}, x_{F,2}, ..., x_{F,v}, ..., x_{F,V}, where V is the number of all possible values of feature F.
  • x_{i,o} denotes the values of all features other than feature F in the i-th data record.
  • for continuous variables, the value range of the variable is divided into R equal parts, and the values of all cut points are used for cumulative incidence estimation and curve drawing to reduce the amount of computation; R is determined according to the specific value range of the feature.
  • the present invention is based on a deep neural network model, and converts the survival analysis problem into a multi-task learning model composed of a semi-supervised learning problem of survival probability prediction at multiple time series points.
  • the deep neural network structure can fit the nonlinear effects of features.
  • the model directly models survival probability, does not rely on proportional hazards assumptions, can fit time-dependent effects, and has better explanatory properties.
  • the model realizes data sharing between multiple prediction tasks through multi-task learning at multiple time sequence points, and realizes mutual constraints between multiple prediction tasks at the same time, and improves the generalization ability of the model. At the same time, it provides a method for evaluating the importance of features, and visualizes the time dependence and nonlinear effects of features.
  • Figure 1 is a structural diagram of the disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis of the present invention
  • Figure 2 is a schematic diagram of data set label conversion
  • Figure 3 is a diagram of the neural network structure.
  • censored data in this application means: if the outcome event has not occurred by the specified end time, the data record is called censored data, and the time from the starting point to censoring is called the censoring time.
  • the time-dependence phenomenon is: regardless of the baseline risk, at any point in time the risk of the event for an individual with a given exposure relative to an individual without that exposure is assumed to be constant; when a feature does not satisfy this assumption, its effect on the disease prognosis is considered time-dependent.
  • competing risks are: during the follow-up of the disease prognosis, an event other than the event of interest occurs to the patient, so that the event of interest does not occur, i.e. the other events "compete" with the occurrence of the event of interest; such events are called competing risks; competing risks exist only in survival analysis problems with multiple endpoint events in which only one endpoint event can occur at any given time.
  • a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis includes: a data acquisition module for acquiring disease prognosis data; a data preprocessing module for missing-value processing and normalization of the disease prognosis data; a prediction model building module for modeling the disease prognosis data; and a prediction result display module for visually displaying the prediction results; the prediction model building module adopts a survival analysis method based on deep semi-supervised multi-task learning, whose implementation principle is as follows:
  • N is the number of samples and M is the number of features.
  • An example of the transformation of data set labels is shown in Figure 2.
  • the label of the converted data set can be expressed as:
  • the original survival analysis problem is transformed into a multi-task learning problem.
  • the input of the deep neural network is the feature X of the data set
  • the output label is Y
  • each output layer corresponds to one y in Y, i.e. each output layer corresponds to the event prediction task at a different time.
  • Figure 3 shows a deep neural network with K output layers. If the output k refers to the prediction of the task at time T k , then the network can make predictions for the same task at K different times.
  • the hidden-layer parameters in the network use a hard sharing mechanism. Hard parameter sharing reduces the risk of overfitting: intuitively, the more tasks are learned simultaneously, the more the model has to capture a feature representation common to all of them, so the risk of overfitting on each individual task is smaller.
  • the model uses the logarithmic loss to penalize incorrect classifications and measure the accuracy of the classifier.
  • let the label be y, y ∈ {0, 1}.
  • the parameter ⁇ is estimated by the maximum likelihood estimation method, and the likelihood function is:
  • l is the number of labeled samples
  • p(X_i; θ) is the posterior probability of sample X_i.
  • the definition of the L1 loss is as follows:
  • the L1 loss, i.e. adding the sum of the absolute values of all weight parameters θ to the objective function, drives more of the weights θ to zero and realizes automatic feature selection.
  • the definition of the L2 loss is as follows:
  • for unlabeled data, the unlabeled data can be utilized by adding an entropy-constrained regularization term to the objective function.
  • the event state is a random variable that obeys the Bernoulli distribution and the parameter is p. Its entropy is defined as follows:
  • u is the number of unlabeled samples
  • p is the probability of occurrence of the event. If the category of unlabeled data is determined, the entropy constraint regularization term will be small.
  • the non-increasing trend of survival probability is constrained by adding a ranking loss to the objective function.
  • the ranking loss is defined as follows:
  • p_{i,p}(y_i = 1 | X_i; θ) denotes the probability that the event occurs for the i-th sample at time p; when p < q, the predicted probabilities should satisfy p_{i,p}(y_i = 1 | X_i; θ) < p_{i,q}(y_i = 1 | X_i; θ), otherwise a penalty is imposed on this pair of event probabilities; I(p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ)) is the indicator function, which equals 1 when p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ) and 0 otherwise.
  • the objective function of the deep-learning-based semi-supervised multi-task survival analysis model, i.e. of the prediction model, is L_total(θ) = l(θ) + λ_1 L1(θ) + λ_2 L2(θ) + λ_3 Ω(θ) + λ_4 R(θ), where:
  • l( ⁇ ) is the logarithmic loss
  • L1( ⁇ ) is the L1 loss
  • L2( ⁇ ) is the L2 loss
  • ⁇ ( ⁇ ) is the semi-supervised loss
  • R( ⁇ ) is the ranking loss
  • ⁇ 1 , ⁇ 2 , ⁇ 3 , ⁇ 4 are the parameters that control the strength of the regular term.
  • All possible values of feature F are x_{F,1}, x_{F,2}, ..., x_{F,v}, ..., x_{F,V}, where V is the number of all possible values of feature F.
  • x_{i,o} denotes the values of all features other than feature F in the i-th data record.
  • the averages obtained in step 2) are plotted as a curve.
  • the value range of the variable can be divided into R equal parts, and the values of all cut points are used for cumulative incidence estimation and curve drawing to reduce the amount of calculation. R is usually determined according to the specific characteristic value range.
  • This application uses a deep neural network structure to fit the nonlinear effects of the data; the network structure can be flexibly extended according to the dimensionality of the input data, the length of the survival time, and the required accuracy of the model; the model directly models the survival probability, does not rely on the proportional hazards assumption, can fit the time-dependent effects of features, and has better interpretability; it makes full use of complete data and censored data through the logarithmic loss function and the semi-supervised loss function; it exploits the non-increasing trend of the survival probability through the ranking loss function; it realizes automatic feature selection and prevents model overfitting through the L1 and L2 loss functions; through multi-task learning at multiple time points, the model realizes data sharing among multiple prediction tasks as well as mutual constraints among those tasks, improving its generalization ability; the model can handle traditional survival analysis problems as well as survival analysis problems considering competing risks; it provides a feature importance evaluation method based on the deep learning model; and it visualizes the time-dependent and nonlinear effects of features on the prognosis.

Abstract

A disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis, comprising a data acquisition module, a data preprocessing module, and a prediction model construction module. Taking a deep neural network model as its basis, the system converts the survival analysis problem into a multi-task learning model composed of semi-supervised learning problems that predict the survival probability at multiple time points. The model directly models the survival probability, does not depend on the proportional hazards assumption, can fit time-dependent effects, and has better interpretability. A semi-supervised loss function and a ranking loss function are used to fit the data, so that complete data and censored data are fully utilized, and both traditional survival analysis problems and survival analysis problems considering competing risks can be handled. Through multi-task learning at multiple time points, the model achieves data sharing among multiple prediction tasks as well as mutual constraints among those tasks, improving its generalization ability.

Description

A disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis
Technical Field
The invention belongs to the technical fields of medicine and machine learning, and in particular relates to a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis.
Background Art
Disease prognosis prediction analysis can provide clinicians with prognostic information for disease treatment, help formulate treatment plans, increase the disease cure rate, improve patients' post-treatment quality of life, and effectively reduce the disease burden, which is of great significance for disease control and treatment. Survival analysis is a data analysis method commonly used in disease prognosis prediction to analyze and predict the time at which an event occurs. In medicine, it plays a key role in determining the course of treatment, developing new drugs, preventing adverse drug reactions, and improving hospital procedures. Recently, with the rise of deep learning models and improvements in training techniques, applications of deep learning network structures such as deep neural networks, convolutional neural networks, and long short-term memory networks to disease prognosis prediction have begun to increase. In addition, some advanced machine learning strategies, including active learning, transfer learning, and multi-task learning, are gradually being applied to deep-learning-based survival analysis methods, improving the performance of disease prognosis prediction.
Censored data are common in disease prognosis data. Censored data are not missing data, but incomplete data that can only provide prognostic information from the starting point to the censoring time and cannot provide complete information from the starting point to the occurrence of the event. Existing deep-learning-based methods either cannot make full use of censored data; or, when making full use of censored data, cannot effectively handle the time-dependent effects of features; or have insufficient generalization ability; or have poor interpretability. Existing methods based on multi-task learning cannot make full use of censored data.
Summary of the Invention
The purpose of the present invention is to provide a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis in view of the deficiencies of the prior art.
The present invention is based on a deep neural network model and transforms the survival analysis problem into a multi-task learning model composed of semi-supervised learning problems that predict the survival probability at multiple time points. Considering censored data and the non-increasing trend of the survival probability in survival analysis, it proposes fitting the data with a semi-supervised loss function and a ranking loss function, and can handle both traditional survival analysis problems and survival analysis problems considering competing risks. At the same time, it provides a method for evaluating feature importance and visualizes the time-dependent and nonlinear effects of features.
The deep neural network structure in the model contains multiple layers of nonlinear transformation units and can fit the nonlinear effects of features. The model directly models the survival probability, does not rely on the proportional hazards assumption, can fit time-dependent effects, and has better interpretability. The model makes full use of complete data and censored data through the logarithmic loss function and the semi-supervised loss function; exploits the non-increasing trend of the survival probability through the ranking loss function; and realizes automatic feature selection and prevents model overfitting through the L1 and L2 loss functions. Through multi-task learning at multiple time points, the model realizes data sharing among multiple prediction tasks as well as mutual constraints among those tasks, improving the generalization ability of the model.
The purpose of the present invention is achieved through the following technical solution: a disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis, including: a data acquisition module for acquiring disease prognosis data; a data preprocessing module for missing-value processing and normalization of the disease prognosis data; a prediction model building module for modeling the disease prognosis data; and a prediction result display module for displaying the prediction results. The prediction model building module adopts a survival analysis method based on deep semi-supervised multi-task learning; the specific steps are as follows:
(1) In the survival analysis of prognosis data, the given data set is denoted as D = {(X_1, T_1, δ_1), (X_2, T_2, δ_2), ..., (X_i, T_i, δ_i), ..., (X_N, T_N, δ_N)}, where (X_i, T_i, δ_i) represents one data instance, X_i is the feature vector of the i-th data record, and δ_i is the censoring indicator of the i-th record: when δ_i = 1, the record is uncensored, i.e. the event was observed; when δ_i = 0, the record is censored, i.e. the event was not observed. T_i is the survival time of the i-th record. For uncensored data, T_i equals the observed survival time O_i; for censored data, T_i equals the censoring time C_i:
T_i = O_i if δ_i = 1; T_i = C_i if δ_i = 0
The features of the data set can be expressed as:
X = {X_1, X_2, ..., X_N}, X_i = (x_{i,1}, x_{i,2}, ..., x_{i,M})
where N is the number of samples and M is the number of features.
The labels of the data set can be expressed as:
Y = {(T_1, δ_1), (T_2, δ_2), ..., (T_i, δ_i), ..., (T_N, δ_N)}
(2) The survival time is regarded as multiple time points, and the original label information of each sample is converted into a K-dimensional survival state vector, where K = max(T_i), i = 1, 2, ..., N, is the maximum survival time over all samples. Each element of the survival state vector indicates whether, at the corresponding time point, the event has occurred, has not occurred, or is unknown for that sample. The labels of the converted data set can be expressed as:
Y = {y_1, y_2, ..., y_N}, y_i = (y_{i,1}, y_{i,2}, ..., y_{i,K})
(3) A deep neural network with one input layer and multiple output layers is constructed. The input of the deep neural network is the feature set X of the data set, the output labels are Y, and each output layer corresponds to one y in Y, i.e. each output layer corresponds to the event prediction task at a different time. The deep neural network can therefore make predictions for the same task at K different times.
(4) A prediction model is constructed. The objective function of the prediction model is composed of five parts: log loss, L1 loss, L2 loss, semi-supervised loss, and ranking loss.
1) Log loss
For labeled data, in the binary classification problem without competing risks, the model uses the log loss to penalize incorrect classifications and measure the accuracy of the classifier. Let the label be y, y ∈ {0, 1}. The parameter θ is estimated by maximum likelihood estimation, and the likelihood function is:
L(θ) = ∏_{i=1}^{l} p(X_i; θ)^{y_i} (1 - p(X_i; θ))^{1-y_i}
where l is the number of labeled samples and p(X_i; θ) is the posterior probability of sample X_i. Taking the logarithm of the likelihood function gives the log-likelihood function, i.e. the log loss function:
l(θ) = -∑_{i=1}^{l} [ y_i log p(X_i; θ) + (1 - y_i) log(1 - p(X_i; θ)) ]
That is, the larger the probability that each sample belongs to its true label, the better.
For survival analysis problems considering competing risks, the event prediction at each time point is treated as a multi-class classification problem. Suppose that, given X_i, the conditional probability distribution of y is p(y_i = k | X_i; θ), where k = 1, 2, ..., C and C is the number of all possible outcomes. The parameter θ is estimated by maximum likelihood estimation, and the corresponding log loss function is:
l(θ) = -∑_{i=1}^{l} ∑_{k=1}^{C} I{y_i = k} log p(y_i = k | X_i; θ)
where I{y_i = k} is the indicator function: when y_i = k, I{y_i = k} = 1; otherwise, I{y_i = k} = 0.
2) L1 loss:
L1(θ) = ‖θ‖
3) L2 loss:
L2(θ) = ‖θ‖²
4) Semi-supervised loss
For unlabeled data, the unlabeled data are utilized by adding an entropy-constrained regularization term to the objective function.
For the binary classification problem without competing risks, the event state is a random variable that follows a Bernoulli distribution with parameter p, and its entropy is defined as follows:
H(p) = -p log p - (1 - p) log(1 - p)
For the unlabeled data, the entropy-constraint regularization is then defined as follows:
Ω(θ) = ∑_{i=1}^{u} H(p(X_i; θ)) = -∑_{i=1}^{u} [ p(X_i; θ) log p(X_i; θ) + (1 - p(X_i; θ)) log(1 - p(X_i; θ)) ]
where u is the number of unlabeled samples and p is the probability that the event occurs. If the class of an unlabeled sample is predicted with certainty, the entropy-constraint regularization term is small.
For the multi-class classification problem considering competing risks, the entropy-constraint regularization for unlabeled data is defined as follows:
Ω(θ) = -∑_{i=1}^{u} ∑_{k=1}^{C} p(y_i = k | X_i; θ) log p(y_i = k | X_i; θ)
5) Ranking loss
The non-increasing trend of the survival probability is enforced by adding a ranking loss to the objective function. The ranking loss is defined as follows:
R(θ) = ∑_{i=1}^{N} ∑_{p<q} I( p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ) ) ( p_{i,p}(y_i = 1 | X_i; θ) - p_{i,q}(y_i = 1 | X_i; θ) )
where p_{i,p}(y_i = 1 | X_i; θ) denotes the probability that the death event occurs for the i-th sample at time p. That is, when time p < q, the predicted event probabilities of the i-th sample should satisfy p_{i,p}(y_i = 1 | X_i; θ) < p_{i,q}(y_i = 1 | X_i; θ); otherwise, a penalty is imposed on this pair of event probabilities. I(p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ)) is the indicator function: I = 1 when p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ); otherwise, I = 0.
In summary, the objective function of the deep-learning-based semi-supervised multi-task survival analysis model, i.e. of the prediction model, is:
L_total(θ) = l(θ) + λ_1 L1(θ) + λ_2 L2(θ) + λ_3 Ω(θ) + λ_4 R(θ)
where l(θ) is the log loss, L1(θ) is the L1 loss, L2(θ) is the L2 loss, Ω(θ) is the semi-supervised loss, R(θ) is the ranking loss, and λ_1, λ_2, λ_3, λ_4 are parameters that control the strength of the regularization terms.
The model is trained on disease data to obtain the model parameters θ, thereby determining the prediction model. For new disease data, the prediction model is used to obtain the disease prognosis prediction.
Further, step (2) converts the original survival analysis problem into a multi-task learning problem through the process of converting the label information into vectors.
Further, in step (3), the hidden-layer parameters of the deep neural network use a hard sharing mechanism, thereby reducing the risk of overfitting.
Further, in step (4), the deep semi-supervised multi-task learning problem obtained from the survival analysis problem has two important characteristics: unlabeled data caused by censoring and the non-increasing trend of the survival probability. For the unlabeled data caused by censoring, semi-supervised learning is performed using entropy-constrained regularization. For the non-increasing trend of the survival probabilities at different time points, a ranking loss is introduced to constrain the survival probabilities of the different output layers. In addition, an L1 loss is introduced into the objective function to realize automatic feature selection, and an L2 loss is introduced to avoid overfitting.
Further, the prediction result display module is used for feature importance evaluation and visually displays the time-dependent and nonlinear effects of features. The importance of a feature F is calculated as follows:
1) Select the corresponding test data and compute the model prediction error, denoted error1.
2) Randomly add noise to the feature F of all samples in the test data and compute the model prediction error again, denoted error2. For a continuous variable, a noise perturbation following the normal distribution N(0, σ∈) is added, where σ is the standard deviation of feature F and ∈ is a small constant. For a discrete variable, x_F → x_F*(1-s) + (1-x_F)*s, where s is a noise perturbation following a Bernoulli distribution and x_F is the value of feature F.
3) Compute the difference e between the two prediction errors: e = error2 - error1.
4) Repeat steps 1-3 n times.
5) The importance of feature F is computed as follows:
importance(F) = (1/n) ∑_{j=1}^{n} e_j
If the accuracy on the test data drops sharply after random noise is added, the feature has a large influence on the prediction results for the samples, and its importance is therefore relatively high.
Further, the prediction result display module visually displays the influence of features on the prognosis by drawing the predicted cumulative incidence curves corresponding to different features. The predicted cumulative incidence curve corresponding to a feature F is drawn as follows:
1) All possible values of feature F are x_{F,1}, x_{F,2}, ..., x_{F,v}, ..., x_{F,V}, where V is the number of all possible values of feature F.
2) Set the value of feature F to x_F = x_{F,v}, v = 1, 2, ..., V, keep the values of the other features unchanged, and compute the average of the model-predicted cumulative incidence:
f_avg(x_{F,v}) = (1/N) ∑_{i=1}^{N} f(x_{F,v}, x_{i,o})
where f_avg(x_{F,v}) is the average of the model prediction outputs over all data, f(x_{F,v}, x_{i,o}) is the model prediction output for the i-th data record, and x_{i,o} denotes the values of all features other than feature F in the i-th data record.
3) The values f_avg(x_{F,v}) obtained in step 2) are plotted as a curve.
Further, when drawing the predicted cumulative incidence curve for a continuous variable, the value range of the variable is divided into R equal parts, and the values of all cut points are used for cumulative incidence estimation and curve drawing, which reduces the amount of computation; R is determined according to the specific value range of the feature.
The beneficial effects of the present invention are as follows:
The present invention is based on a deep neural network model and converts the survival analysis problem into a multi-task learning model composed of semi-supervised learning problems that predict the survival probability at multiple time points. The deep neural network structure can fit the nonlinear effects of features. The model directly models the survival probability, does not rely on the proportional hazards assumption, can fit time-dependent effects, and has better interpretability.
Considering censored data and the non-increasing trend of the survival probability in survival analysis, it is proposed to fit the data with a semi-supervised loss function and a ranking loss function, making full use of complete data and censored data and handling both traditional survival analysis problems and survival analysis problems considering competing risks. Through multi-task learning at multiple time points, the model realizes data sharing among multiple prediction tasks as well as mutual constraints among those tasks, improving the generalization ability of the model. At the same time, a method for evaluating feature importance is provided, and the time-dependent and nonlinear effects of features are visualized.
Description of the Drawings
Figure 1 is a structural diagram of the disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis of the present invention;
Figure 2 is a schematic diagram of the data set label conversion;
Figure 3 is a diagram of the neural network structure.
Detailed Description of the Embodiments
In order to make the above objectives, features, and advantages of the present invention more obvious and understandable, the specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In the following description, many specific details are set forth in order to facilitate a full understanding of the present invention, but the present invention can also be implemented in other ways different from those described here, and those skilled in the art can make similar generalizations without departing from the essence of the present invention; therefore, the present invention is not limited by the specific embodiments disclosed below.
In this application, censored data means: if the outcome event has not occurred by the specified end time, the data record is called censored data, and the time from the starting point to censoring is called the censoring time. The time-dependence phenomenon is: regardless of the baseline risk, at any point in time the risk of the event for an individual with a given exposure relative to an individual without that exposure is assumed to be constant; when a feature does not satisfy this assumption, its effect on the disease prognosis is considered time-dependent. Competing risks are: during the follow-up of the disease prognosis, an event other than the event of interest occurs to the patient, so that the event of interest does not occur, i.e. the other events "compete" with the occurrence of the event of interest; such events are called competing risks. Competing risks exist only in survival analysis problems with multiple endpoint events in which only one endpoint event can occur at any given time.
As shown in Figure 1, the disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis proposed by the present application includes: a data acquisition module for acquiring disease prognosis data; a data preprocessing module for missing-value processing and normalization of the disease prognosis data; a prediction model building module for modeling the disease prognosis data; and a prediction result display module for visually displaying the prediction results. The prediction model building module adopts a survival analysis method based on deep semi-supervised multi-task learning, whose implementation principle is as follows:
(1) In the survival analysis of prognosis data, the given data set is denoted as D = {(X_1, T_1, δ_1), (X_2, T_2, δ_2), ..., (X_i, T_i, δ_i), ..., (X_N, T_N, δ_N)}, where (X_i, T_i, δ_i) represents one data instance, X_i is the feature vector of the i-th data record, and δ_i is the censoring indicator of the i-th record: when δ_i = 1, the record is uncensored, i.e. the event was observed; when δ_i = 0, the record is censored, i.e. the event was not observed. T_i is the survival time of the i-th record. For uncensored data, T_i equals the observed survival time O_i; for censored data, T_i equals the censoring time C_i:
T_i = O_i if δ_i = 1; T_i = C_i if δ_i = 0
The features of the data set can be expressed as:
X = {X_1, X_2, ..., X_N}, X_i = (x_{i,1}, x_{i,2}, ..., x_{i,M})
where N is the number of samples and M is the number of features.
The labels of the data set can be expressed as:
Y = {(T_1, δ_1), (T_2, δ_2), ..., (T_i, δ_i), ..., (T_N, δ_N)}
(2) The present invention regards the survival time as multiple time points rather than as a continuous variable. Accordingly, the original label information of each sample can be converted into a K-dimensional survival state vector, where K = max(T_i), i = 1, 2, ..., N, is the maximum survival time over all samples. Each element of the survival state vector indicates that, at the corresponding time point, the event has occurred (value 1), has not occurred (value 0), or is unknown (value 2). An example of the conversion of the data set labels is shown in Figure 2. The labels of the converted data set can be expressed as:
Y = {y_1, y_2, ..., y_N}, y_i = (y_{i,1}, y_{i,2}, ..., y_{i,K}), y_{i,k} ∈ {0, 1, 2}
Through the process of converting the label information into vectors, the original survival analysis problem is transformed into a multi-task learning problem.
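As an illustration of this conversion, below is a minimal Python/NumPy sketch. The exact placement of the 0/1/2 values around the event time and the censoring time is defined by Figure 2 of the application, so the indexing convention used here is an assumption.

```python
import numpy as np

def convert_labels(T, delta):
    """Convert (survival time, censoring indicator) pairs into K-dimensional
    survival state vectors: 1 = event occurred, 0 = not occurred, 2 = unknown."""
    T = np.asarray(T, dtype=int)
    delta = np.asarray(delta, dtype=int)
    K = int(T.max())                      # K = max(T_i), the maximum survival time
    Y = np.zeros((len(T), K), dtype=int)
    for i, (t, d) in enumerate(zip(T, delta)):
        if d == 1:                        # uncensored: the event is observed at time t
            Y[i, :t - 1] = 0              # event has not occurred before time t
            Y[i, t - 1:] = 1              # event has occurred from time t onward
        else:                             # censored at time t: later status is unknown
            Y[i, :t] = 0
            Y[i, t:] = 2
    return Y

# Example: one death at t = 3, one record censored at t = 2, one death at t = 5.
print(convert_labels([3, 2, 5], [1, 0, 1]))
# [[0 0 1 1 1]
#  [0 0 2 2 2]
#  [0 0 0 0 1]]
```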
(3) A deep neural network with one input layer and multiple output layers is used. The input of the deep neural network is the feature set X of the data set, the output labels are Y, and each output layer corresponds to one y in Y, i.e. each output layer corresponds to the event prediction task at a different time. Figure 3 shows a deep neural network with K output layers; if output k refers to the prediction of the task at time T_k, the network can make predictions for the same task at K different times. The hidden-layer parameters in the network use a hard sharing mechanism. Hard parameter sharing reduces the risk of overfitting: intuitively, the more tasks are learned simultaneously, the more the model has to capture a feature representation common to all of them, so the risk of overfitting on each individual task is smaller.
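As an illustration of this architecture, below is a minimal PyTorch sketch of a hard-parameter-sharing network with one input layer, a shared trunk, and K output layers. The hidden-layer sizes and activations are illustrative assumptions; the patent leaves the exact depth and width of the network flexible.

```python
import torch
import torch.nn as nn

class MultiTaskSurvivalNet(nn.Module):
    """Shared trunk (hard parameter sharing) with K output layers, one per time point."""
    def __init__(self, n_features: int, n_timepoints: int, hidden: int = 64):
        super().__init__()
        # hidden layers shared by all K prediction tasks
        self.shared = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # one output layer per time point T_1, ..., T_K
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_timepoints))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.shared(x)
        # (N, K) matrix: column k holds the predicted event probability at time T_{k+1}
        return torch.cat([torch.sigmoid(head(h)) for head in self.heads], dim=1)

model = MultiTaskSurvivalNet(n_features=20, n_timepoints=60)
probs = model(torch.randn(8, 20))        # probs.shape == (8, 60)
```

For the competing-risks setting, each output head would produce C outcome probabilities (for example via a softmax) instead of a single sigmoid output.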
(4) Definition of the objective function
The deep semi-supervised multi-task learning problem obtained from the survival analysis problem has two important characteristics: unlabeled data caused by censoring and the non-increasing trend of the survival probability. Appropriate constraints need to be designed for these two issues. For the unlabeled data caused by censoring, we use entropy-constrained regularization for semi-supervised learning; if the class of an unlabeled sample is predicted with certainty, the entropy-constraint regularization term is small. Considering the non-increasing trend of the survival probability at different time points, we introduce a ranking loss to constrain the survival probabilities of the different output layers. In addition, an L1 loss is introduced into the objective function to realize automatic feature selection, and an L2 loss is introduced to avoid overfitting. The objective function of the model is composed of five parts: log loss, L1 loss, L2 loss, semi-supervised loss, and ranking loss.
1) Log loss
For labeled data, in the binary classification problem without competing risks, the model uses the log loss to penalize incorrect classifications and measure the accuracy of the classifier. Let the label be y, y ∈ {0, 1}. The parameter θ is estimated by maximum likelihood estimation, and the likelihood function is:
L(θ) = ∏_{i=1}^{l} p(X_i; θ)^{y_i} (1 - p(X_i; θ))^{1-y_i}
where l is the number of labeled samples and p(X_i; θ) is the posterior probability of sample X_i. Taking the logarithm of the likelihood function gives the log-likelihood function, i.e. the log loss function:
l(θ) = -∑_{i=1}^{l} [ y_i log p(X_i; θ) + (1 - y_i) log(1 - p(X_i; θ)) ]
That is, the larger the probability that each sample belongs to its true label, the better.
For survival analysis problems considering competing risks, we treat the event prediction at each time point as a multi-class classification problem. Suppose that, given X_i, the conditional probability distribution of y is p(y_i = k | X_i; θ), where k = 1, 2, ..., C and C is the number of all possible outcomes. This model, which solves the classification problem with y ∈ {1, 2, ..., C}, is an extension of the binary classification model, and its parameters can also be estimated by maximum likelihood estimation; the corresponding log loss function is:
l(θ) = -∑_{i=1}^{l} ∑_{k=1}^{C} I{y_i = k} log p(y_i = k | X_i; θ)
where I{y_i = k} is the indicator function: when y_i = k, I{y_i = k} = 1; otherwise, I{y_i = k} = 0.
2) L1 loss
The L1 loss is defined as follows:
L1(θ) = ‖θ‖
The L1 loss, i.e. adding the sum of the absolute values of all weight parameters θ to the objective function, drives more of the weights θ to zero and realizes automatic feature selection.
3) L2 loss
The L2 loss is defined as follows:
L2(θ) = ‖θ‖²
The L2 loss, i.e. adding the sum of the squares of all weight parameters θ to the objective function, keeps all weights θ as close to zero as possible and avoids overfitting.
4) Semi-supervised loss
For unlabeled data, the unlabeled data can be utilized by adding an entropy-constrained regularization term to the objective function. For the binary classification problem without competing risks, the event state is a random variable that follows a Bernoulli distribution with parameter p, and its entropy is defined as follows:
H(p) = -p log p - (1 - p) log(1 - p)
For the unlabeled data, the entropy-constraint regularization is then defined as follows:
Ω(θ) = ∑_{i=1}^{u} H(p(X_i; θ)) = -∑_{i=1}^{u} [ p(X_i; θ) log p(X_i; θ) + (1 - p(X_i; θ)) log(1 - p(X_i; θ)) ]
where u is the number of unlabeled samples and p is the probability that the event occurs. If the class of an unlabeled sample is predicted with certainty, the entropy-constraint regularization term is small.
For the multi-class classification problem considering competing risks, the entropy-constraint regularization for unlabeled data is defined as follows:
Ω(θ) = -∑_{i=1}^{u} ∑_{k=1}^{C} p(y_i = k | X_i; θ) log p(y_i = k | X_i; θ)
5) Ranking loss
The non-increasing trend of the survival probability is enforced by adding a ranking loss to the objective function. The ranking loss is defined as follows:
R(θ) = ∑_{i=1}^{N} ∑_{p<q} I( p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ) ) ( p_{i,p}(y_i = 1 | X_i; θ) - p_{i,q}(y_i = 1 | X_i; θ) )
where p_{i,p}(y_i = 1 | X_i; θ) denotes the probability that the death event occurs for the i-th sample at time p. That is, when time p < q, the predicted event probabilities of the i-th sample should satisfy p_{i,p}(y_i = 1 | X_i; θ) < p_{i,q}(y_i = 1 | X_i; θ); otherwise, a penalty is imposed on this pair of event probabilities. I(p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ)) is the indicator function: I = 1 when p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ); otherwise, I = 0.
In summary, the objective function of the deep-learning-based semi-supervised multi-task survival analysis model, i.e. of the prediction model, is:
L_total(θ) = l(θ) + λ_1 L1(θ) + λ_2 L2(θ) + λ_3 Ω(θ) + λ_4 R(θ)
where l(θ) is the log loss, L1(θ) is the L1 loss, L2(θ) is the L2 loss, Ω(θ) is the semi-supervised loss, R(θ) is the ranking loss, and λ_1, λ_2, λ_3, λ_4 are parameters that control the strength of the regularization terms.
The model is trained on disease data to obtain the model parameters θ, thereby determining the prediction model. For new disease data, the prediction model is used to obtain the disease prognosis prediction.
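As a concrete illustration, below is a minimal PyTorch sketch of this objective for the binary case without competing risks. The function name, the hinge-style form of the ranking penalty, the use of sums rather than means, and the example λ values are assumptions made for illustration; the patent specifies the loss terms but does not publish reference code.

```python
import torch

def total_loss(probs, labels, weights, lambdas):
    """probs:   (N, K) predicted event probabilities, one column per time point.
    labels:  (N, K) converted labels: 1 = event occurred, 0 = not occurred, 2 = unknown.
    weights: list of model weight tensors (for the L1/L2 terms).
    lambdas: (λ1, λ2, λ3, λ4) regularization strengths."""
    lam1, lam2, lam3, lam4 = lambdas
    p = probs.clamp(1e-7, 1 - 1e-7)
    labeled = labels != 2                          # entries with known event status
    y = (labels == 1).float()

    # log loss l(θ) on the labeled entries
    log_loss = -(y * p.log() + (1 - y) * (1 - p).log())[labeled].sum()

    # semi-supervised loss Ω(θ): entropy of the predictions on unlabeled entries
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())[~labeled].sum()

    # ranking loss R(θ): the event probability should not decrease over time,
    # so penalize pairs of time points (a < b) where probs[:, a] > probs[:, b]
    rank = probs.new_zeros(())
    K = probs.shape[1]
    for a in range(K - 1):
        for b in range(a + 1, K):
            rank = rank + torch.clamp(probs[:, a] - probs[:, b], min=0).sum()

    # L1 and L2 penalties on all weight parameters θ
    l1 = sum(w.abs().sum() for w in weights)
    l2 = sum((w ** 2).sum() for w in weights)

    return log_loss + lam1 * l1 + lam2 * l2 + lam3 * entropy + lam4 * rank

# One training step (model and optimizer assumed to exist; λ values purely illustrative):
# loss = total_loss(model(x), y_converted, list(model.parameters()), (1e-4, 1e-4, 0.1, 0.1))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```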
(5) Feature importance
The importance of a feature F is calculated as follows:
1) Select the corresponding test data and compute the model prediction error, denoted error1.
2) Randomly add noise to the feature F of all samples in the test data (i.e. randomly perturb the value of each sample at feature F) and compute the model prediction error again, denoted error2. For a continuous variable, a noise perturbation following the normal distribution N(0, σ∈) is added, where σ is the standard deviation of feature F and ∈ is a small constant. For a discrete variable, x_F → x_F*(1-s) + (1-x_F)*s, where s is a noise perturbation following a Bernoulli distribution and x_F is the value of feature F.
3) Compute the difference e between the two prediction errors: e = error2 - error1.
4) Repeat steps 1-3 n times; n is usually at least 500.
5) The importance of feature F is computed as follows:
importance(F) = (1/n) ∑_{j=1}^{n} e_j
This value reflects the importance of the feature because, if the accuracy on the test data drops sharply after random noise is added (i.e. error2 increases), the feature has a large influence on the prediction results for the samples, and its importance is therefore relatively high.
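A minimal Python/NumPy sketch of this importance calculation follows. The `predict_error` callable, the noise scale `eps`, and the aggregation of the n error differences by their mean are illustrative assumptions.

```python
import numpy as np

def feature_importance(predict_error, X, y, col, is_continuous,
                       n_repeats=500, eps=0.01, rng=None):
    """predict_error(X, y) -> scalar prediction error of the trained model on (X, y)."""
    rng = rng or np.random.default_rng(0)
    error1 = predict_error(X, y)                   # step 1 (deterministic, computed once)
    diffs = []
    for _ in range(n_repeats):                     # step 4: repeat n times
        X_noisy = X.copy()
        if is_continuous:
            sigma = X[:, col].std()                # noise ~ N(0, sigma * eps)
            X_noisy[:, col] += rng.normal(0.0, sigma * eps, size=len(X))
        else:
            s = rng.binomial(1, eps, size=len(X))  # Bernoulli flip noise
            X_noisy[:, col] = X[:, col] * (1 - s) + (1 - X[:, col]) * s
        error2 = predict_error(X_noisy, y)         # step 2: error with noise added
        diffs.append(error2 - error1)              # step 3: e = error2 - error1
    return float(np.mean(diffs))                   # step 5: aggregate the n differences
```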
(6) Visualization of the influence of features on the prognosis
The influence of features on the prognosis is displayed visually by drawing the predicted cumulative incidence curves corresponding to different features. The predicted cumulative incidence curve corresponding to a feature F is drawn as follows:
1) All possible values of feature F are x_{F,1}, x_{F,2}, ..., x_{F,v}, ..., x_{F,V}, where V is the number of all possible values of feature F.
2)令特征F的取值为x F=x F,v,v=1,2,…,V,保持其他特征的取值不变,计算模型预测累积发生率的平均值: 2) Let the value of feature F be x F = x F, v , v = 1, 2, ..., V, keep the values of other features unchanged, and calculate the average value of the cumulative occurrence rate predicted by the model:
Figure PCTCN2021073136-appb-000025
where
Figure PCTCN2021073136-appb-000026
is the average of the model's predicted outputs over all data,
Figure PCTCN2021073136-appb-000027
is the model's predicted output for the i-th data record, and x_{i,o} are the values of all features of the i-th record other than feature F.
3) Plot the values of
Figure PCTCN2021073136-appb-000028
obtained in step 2) as a curve. For a continuous variable, the value range of the variable can be divided into R equal parts and the cumulative incidence estimated and the curve drawn only at the split points, reducing the amount of computation; R is usually chosen according to the specific value range of the feature.
The present application uses a deep neural network structure to fit the non-linear effects in the data, and the network structure can be flexibly extended according to the dimensionality of the input data, the length of the survival times, and the required accuracy of the model. The model directly models the survival probability without relying on the proportional hazards assumption, so it can fit time-dependent effects of features and is easier to interpret. Through the log loss and the semi-supervised loss, it makes full use of both complete and censored data; through the ranking loss it exploits the non-increasing property of the survival probability; through the L1 and L2 losses it achieves automatic feature selection and prevents overfitting. Through multi-task learning over multiple time points, the model shares data across the prediction tasks while the tasks mutually constrain one another, improving generalization. The model can handle both traditional survival analysis problems and survival analysis problems with competing risks. It also provides a feature importance evaluation method for deep learning models and visualizes the time-dependent and non-linear effects of features on prognosis.
The above are only preferred embodiments of the present invention. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit the invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present invention, or modify it into equivalent embodiments with equivalent changes. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of protection of the technical solution of the present invention.

Claims (7)

  1. A disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis, characterized in that it comprises: a data acquisition module for acquiring disease prognosis data; a data preprocessing module for performing missing-value handling and normalization on the disease prognosis data; a prediction model construction module for modeling the disease prognosis data; and a prediction result display module for displaying the data prediction results; wherein the prediction model construction module adopts a survival analysis method based on deep semi-supervised multi-task learning, with the following specific steps:
    (1) In the survival analysis of prognosis data, the given data set is denoted D = {(X_1, T_1, δ_1), (X_2, T_2, δ_2), …, (X_i, T_i, δ_i), …, (X_N, T_N, δ_N)}. (X_i, T_i, δ_i) denotes one data instance, where X_i is the feature vector of the i-th record; δ_i is the censoring indicator of the i-th record: δ_i = 1 indicates uncensored data, i.e. the event was observed, and δ_i = 0 indicates censored data, i.e. the event was not observed; T_i is the survival time of the i-th record. For uncensored data, T_i equals the observed survival time O_i; for censored data, T_i equals the censoring time C_i.
    Figure PCTCN2021073136-appb-100001
    The features of the data set can be expressed as:
    Figure PCTCN2021073136-appb-100002
    where N is the number of samples and M is the number of features.
    The labels of the data set can be expressed as:
    Y = {(T_1, δ_1), (T_2, δ_2), …, (T_i, δ_i), …, (T_N, δ_N)}
    (2) The survival time is regarded as multiple time points, and the original label information of each sample is converted into a K-dimensional survival-state vector, where K = max(T_i), i = 1, 2, …, N, is the maximum survival time over all samples. Each element of the survival-state vector indicates whether, at that time point, the event has occurred, has not occurred, or is unknown for the sample. The labels of the converted data set can be expressed as:
    Figure PCTCN2021073136-appb-100003
    (3) A deep neural network is constructed with one input layer and multiple output layers. The input of the deep neural network is the feature matrix X of the data set and the output labels are Y; each output layer corresponds to one y in Y, i.e. to the event prediction task at a different time. The deep neural network can thus make predictions for the same task at K different times.
    (4) A prediction model is constructed. The objective function of the prediction model consists of five parts: log loss, L1 loss, L2 loss, semi-supervised loss and ranking loss:
    1) Log loss
    For labeled data, and for a binary classification problem that does not consider competing risks, the model uses the log loss to measure the accuracy of the classifier by penalizing incorrect classifications. Let the label be y, y ∈ {0, 1}. The parameters θ are estimated by maximum likelihood; the likelihood function is:
    Figure PCTCN2021073136-appb-100004
    where l is the number of labeled samples and p(X_i; θ) is the posterior probability of sample X_i. Taking the logarithm of the likelihood function gives the log-likelihood function, i.e. the log loss function:
    Figure PCTCN2021073136-appb-100005
    That is, the larger the probability that each sample belongs to its true label, the better.
    For a survival analysis problem that considers competing risks, the event prediction at each time point is treated as a multi-class classification problem. Given X_i, the conditional probability distribution of y is assumed to be p(y_i = k | X_i; θ), where k = 1, 2, …, C and C is the number of possible outcomes. The parameters θ are estimated by maximum likelihood, and the corresponding log loss function is:
    Figure PCTCN2021073136-appb-100006
    where I{y_i = k} is the indicator function: I{y_i = k} = 1 when y_i = k, and I{y_i = k} = 0 otherwise.
    2) L1 loss:
    L1(θ) = ‖θ‖
    3) L2 loss:
    L2(θ) = ‖θ‖²
    4) Semi-supervised loss
    For unlabeled data, the model exploits the unlabeled data by adding an entropy-constraint regularization term to the objective function.
    For a binary classification problem that does not consider competing risks, the event status is a Bernoulli-distributed random variable with parameter p, whose entropy is defined as:
    H(p) = −p log p − (1 − p) log(1 − p)
    For unlabeled data, the entropy-constraint regularization is then defined as:
    Figure PCTCN2021073136-appb-100007
    where u is the number of unlabeled samples and p is the probability that the event occurs. If the class of the unlabeled data is certain, the entropy-constraint regularization term is small.
    For a multi-class problem that considers competing risks, the entropy-constraint regularization for unlabeled data is defined as:
    Figure PCTCN2021073136-appb-100008
    5) Ranking loss
    The non-increasing trend of the survival probability is enforced by adding a ranking loss to the objective function. The ranking loss is defined as follows:
    Figure PCTCN2021073136-appb-100009
    Here, p_{i,p}(y_i = 1 | X_i; θ) denotes the probability that the death event occurs for the i-th sample at time p. That is, when p < q, the predicted event probabilities of the i-th sample should satisfy p_{i,p}(y_i = 1 | X_i; θ) < p_{i,q}(y_i = 1 | X_i; θ); otherwise, a penalty is imposed on this pair of event probabilities. I(p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ)) is the indicator function: I = 1 when p_{i,p}(y_i = 1 | X_i; θ) > p_{i,q}(y_i = 1 | X_i; θ), and I = 0 otherwise.
    In summary, the objective function of the deep-learning-based semi-supervised multi-task survival analysis model, i.e. the prediction model, is:
    L_total(θ) = l(θ) + λ_1 L1(θ) + λ_2 L2(θ) + λ_3 Ω(θ) + λ_4 R(θ)
    where l(θ) is the log loss, L1(θ) is the L1 loss, L2(θ) is the L2 loss, Ω(θ) is the semi-supervised loss, R(θ) is the ranking loss, and λ_1, λ_2, λ_3, λ_4 are parameters that control the strength of the corresponding regularization terms.
    The model is trained on disease data to obtain the parameters θ, which determines the prediction model. For new disease data, the prediction model is used to obtain a prediction of the disease prognosis.
  2. The disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis according to claim 1, characterized in that, in step (2), the process of converting the label information into a vector transforms the original survival analysis problem into a multi-task learning problem.
  3. The disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis according to claim 1, characterized in that, in step (3), the hidden-layer parameters of the deep neural network use a hard parameter-sharing mechanism, thereby reducing the risk of overfitting.
  4. The disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis according to claim 1, characterized in that, in step (4), the deep semi-supervised multi-task learning problem obtained from the survival analysis problem has two important characteristics: unlabeled data caused by censoring, and the non-increasing trend of the survival probability. For the unlabeled data caused by censoring, entropy-constraint regularization is used for semi-supervised learning. For the non-increasing trend of the survival probabilities at different time points, a ranking loss is introduced to constrain the survival probabilities of the different output layers. In addition, an L1 loss is introduced into the objective function to achieve automatic feature selection, and an L2 loss is introduced to avoid overfitting.
  5. The disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis according to claim 1, characterized in that the prediction result display module provides feature importance evaluation and visualizes the time-dependent and non-linear effects of features. The specific steps for computing the importance of a feature F are as follows:
    1) Compute the model prediction error on the corresponding test data; record it as error1.
    2) Randomly add noise to feature F of all samples in the test data, recompute the model prediction error, and record it as error2. For a continuous variable, add a noise perturbation drawn from the normal distribution N(0, σε), where σ is the standard deviation of feature F and ε is a small constant. For a discrete variable, x_F → x_F·(1−s) + (1−x_F)·s, where s is a Bernoulli-distributed noise variable and x_F is the value of feature F.
    3) Compute the difference e between the two prediction errors: e = error2 − error1.
    4) Repeat steps 1)-3) n times.
    5) The importance of feature F is computed as follows:
    Figure PCTCN2021073136-appb-100010
    If adding random noise substantially reduces the accuracy on the test data, the feature has a large influence on the model's predictions and is therefore relatively important.
  6. The disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis according to claim 5, characterized in that the prediction result display module visualizes the influence of features on prognosis by plotting the predicted cumulative incidence curves corresponding to different features. The specific steps for plotting the predicted cumulative incidence curve of a feature F are as follows:
    1) The possible values of feature F are x_{F,1}, x_{F,2}, …, x_{F,v}, …, x_{F,V}, where V is the number of possible values of feature F.
    2) Set the value of feature F to x_F = x_{F,v}, v = 1, 2, …, V, keep the values of all other features unchanged, and compute the average of the cumulative incidence predicted by the model:
    Figure PCTCN2021073136-appb-100011
    where
    Figure PCTCN2021073136-appb-100012
    is the average of the model's predicted outputs over all data,
    Figure PCTCN2021073136-appb-100013
    is the model's predicted output for the i-th data record, and x_{i,o} are the values of all features of the i-th record other than feature F.
    3) Plot the values of
    Figure PCTCN2021073136-appb-100014
    obtained in step 2) as a curve.
  7. The disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis according to claim 6, characterized in that, during plotting of the predicted cumulative incidence curve, for a continuous variable the value range of the variable is divided into R equal parts and the cumulative incidence is estimated and the curve drawn only at the split points, reducing the amount of computation; R is determined according to the specific value range of the feature.
PCT/CN2021/073136 2020-04-09 2021-01-21 Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis WO2021203796A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010273957.9A CN111640510A (en) 2020-04-09 2020-04-09 Disease prognosis prediction system based on deep semi-supervised multitask learning survival analysis
CN202010273957.9 2020-04-09

Publications (1)

Publication Number Publication Date
WO2021203796A1 true WO2021203796A1 (en) 2021-10-14

Family

ID=72331086

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073136 WO2021203796A1 (en) 2020-04-09 2021-01-21 Disease prognosis prediction system based on deep semi-supervised multi-task learning survival analysis

Country Status (2)

Country Link
CN (1) CN111640510A (en)
WO (1) WO2021203796A1 (en)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640510A (en) * 2020-04-09 2020-09-08 之江实验室 Disease prognosis prediction system based on deep semi-supervised multitask learning survival analysis
TWI810510B (en) * 2021-01-04 2023-08-01 鴻海精密工業股份有限公司 Method and device for processing multi-modal data, electronic device, and storage medium
CN112819768B (en) * 2021-01-26 2022-06-17 复旦大学 DCNN-based survival analysis method for cancer full-field digital pathological section
CN112906994B (en) * 2021-04-19 2023-04-07 拉扎斯网络科技(上海)有限公司 Order meal delivery time prediction method and device, electronic equipment and storage medium
CN113314218B (en) * 2021-06-22 2022-12-23 浙江大学 Dynamic survival analysis equipment containing competition risk based on comparison
CN115620902A (en) * 2021-07-15 2023-01-17 华为云计算技术有限公司 Method and device for predicting survival risk rate
CN115565669B (en) * 2022-10-11 2023-05-16 电子科技大学 Cancer survival analysis method based on GAN and multitask learning
CN116403714B (en) * 2023-04-07 2024-01-26 大连市中心医院 Cerebral apoplexy END risk prediction model building method and device, END risk prediction system, electronic equipment and medium


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897545A (en) * 2017-01-05 2017-06-27 浙江大学 A kind of tumor prognosis forecasting system based on depth confidence network
CN107944479A (en) * 2017-11-16 2018-04-20 哈尔滨工业大学 Disease forecasting method for establishing model and device based on semi-supervised learning
CN108053398A (en) * 2017-12-19 2018-05-18 南京信息工程大学 A kind of melanoma automatic testing method of semi-supervised feature learning
CN108564039A (en) * 2018-04-16 2018-09-21 北京工业大学 A kind of epileptic seizure prediction method generating confrontation network based on semi-supervised deep layer
CN110556178A (en) * 2018-05-30 2019-12-10 西门子医疗有限公司 decision support system for medical therapy planning
US10559386B1 (en) * 2019-04-02 2020-02-11 Kpn Innovations, Llc Methods and systems for an artificial intelligence support network for vibrant constituional guidance
CN110580695A (en) * 2019-08-07 2019-12-17 深圳先进技术研究院 multi-mode three-dimensional medical image fusion method and system and electronic equipment
CN111640510A (en) * 2020-04-09 2020-09-08 之江实验室 Disease prognosis prediction system based on deep semi-supervised multitask learning survival analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HASSANZADEH HAMID REZA; PHAN JOHN H.; WANG MAY D.: "A semi-supervised method for predicting cancer survival using incomplete clinical data", 2015 37TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), IEEE, 25 August 2015 (2015-08-25), pages 210 - 213, XP032810166, DOI: 10.1109/EMBC.2015.7318337 *
HOU LI, GUI WEI: "Research on infrared breast cancer detection method based on semi-supervised ladder network", INFORMATION TECHNOLOGY AND INFORMATIZATION - XINXI JISHU YU XINXIHUA, SHANDONG DIANZI XUEHUI, CN, no. 6, 25 June 2018 (2018-06-25), CN , pages 179 - 182, XP055856783, ISSN: 1672-9528, DOI: 10.3969/j.issn.1672-9528.2018.06.056 *
SHENGQIANG CHI: "Doctoral Dissertation", 25 April 2019, ZHEJIANG UNIVERSITY, CN, article SHENGQIANG CHI: "Study on Machine Learning-based Colorectal Cancer Prognosis Model and Its Generalization", pages: 1 - 122, XP055856778, DOI: 10.27461/d.cnki.gzjdx.2019.000967 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114141366A (en) * 2021-12-31 2022-03-04 杭州电子科技大学 Cerebral apoplexy rehabilitation assessment auxiliary analysis method based on voice multitask learning
CN114141366B (en) * 2021-12-31 2024-03-26 杭州电子科技大学 Auxiliary analysis method for cerebral apoplexy rehabilitation evaluation based on voice multitasking learning
CN114566289A (en) * 2022-04-26 2022-05-31 之江实验室 Disease prediction system based on multi-center clinical data anti-cheating analysis
CN114821337A (en) * 2022-05-20 2022-07-29 武汉大学 Semi-supervised SAR image building area extraction method based on time phase consistency pseudo-label
CN114821337B (en) * 2022-05-20 2024-04-16 武汉大学 Semi-supervised SAR image building area extraction method based on phase consistency pseudo tag
CN115184054A (en) * 2022-05-30 2022-10-14 深圳技术大学 Mechanical equipment semi-supervised fault detection and analysis method, device, terminal and medium
CN115184054B (en) * 2022-05-30 2022-12-27 深圳技术大学 Mechanical equipment semi-supervised fault detection and analysis method, device, terminal and medium
CN115458158A (en) * 2022-09-23 2022-12-09 深圳大学 Acute kidney injury prediction system for sepsis patient
CN115458158B (en) * 2022-09-23 2023-09-15 深圳大学 Acute kidney injury prediction system for sepsis patient
CN116072298B (en) * 2023-04-06 2023-08-15 之江实验室 Disease prediction system based on hierarchical marker distribution learning
CN116072298A (en) * 2023-04-06 2023-05-05 之江实验室 Disease prediction system based on hierarchical marker distribution learning
CN116206755B (en) * 2023-05-06 2023-08-22 之江实验室 Disease detection and knowledge discovery device based on neural topic model
CN116206755A (en) * 2023-05-06 2023-06-02 之江实验室 Disease detection and knowledge discovery device based on neural topic model
CN116504423A (en) * 2023-06-26 2023-07-28 北京大学 Drug effectiveness evaluation method
CN116504423B (en) * 2023-06-26 2023-09-26 北京大学 Drug effectiveness evaluation method
CN116564524A (en) * 2023-06-30 2023-08-08 之江实验室 Pseudo tag evolution trend regular prognosis prediction device
CN116564524B (en) * 2023-06-30 2023-10-03 之江实验室 Pseudo tag evolution trend regular prognosis prediction device
CN116832285A (en) * 2023-09-01 2023-10-03 吉林大学 Breathing machine operation abnormity monitoring and early warning system based on cloud platform
CN116832285B (en) * 2023-09-01 2023-11-07 吉林大学 Breathing machine operation abnormity monitoring and early warning system based on cloud platform
CN116959715A (en) * 2023-09-18 2023-10-27 之江实验室 Disease prognosis prediction system based on time sequence evolution process explanation
CN116959715B (en) * 2023-09-18 2024-01-09 之江实验室 Disease prognosis prediction system based on time sequence evolution process explanation
CN117558414A (en) * 2023-11-23 2024-02-13 之江实验室 System, electronic device and medium for predicting early recurrence of multi-tasking hepatocellular carcinoma

Also Published As

Publication number Publication date
CN111640510A (en) 2020-09-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21784817

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21784817

Country of ref document: EP

Kind code of ref document: A1