WO2021120936A1 - Chronic disease prediction system based on multi-task learning model - Google Patents

Chronic disease prediction system based on multi-task learning model

Info

Publication number
WO2021120936A1
Authority
WO
WIPO (PCT)
Prior art keywords
chronic disease
data
model
disease prediction
physical examination
Prior art date
Application number
PCT/CN2020/128427
Other languages
English (en)
French (fr)
Inventor
吴健
姜晓红
应豪超
冯芮苇
刘雪晨
曹燕
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学
Priority to US17/623,555 (US20220254493A1)
Publication of WO2021120936A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • the invention belongs to the field of medical artificial intelligence, and particularly relates to a chronic disease prediction system based on a multi-task learning model.
  • Chronic disease is a class of common diseases with latent onset and long duration, including diabetes, cardiovascular disease, cancer and respiratory diseases.
  • the number of patients with chronic diseases has been increasing rapidly.
  • Chronic diseases generally have complex causes and require continuous treatment. They therefore harm people's health and lives, and their mortality and treatment burden continue to increase. Early detection and intervention could effectively alleviate these problems.
  • Chinese patent document CN107153774A discloses the construction of a hyperbolic model for chronic disease risk assessment and a disease prediction system using this model. Relying on longitudinal health management data from more than 20 health management centers in Shandong Province, it builds a Shandong multi-center longitudinal health management observation cohort, explores the roles of genetics, environment, personal lifestyle, health intervention factors, etc. in the occurrence, development and outcome of major chronic diseases, establishes hyperbolic risk assessment models and disease prediction systems for various chronic diseases applicable to the health checkup population of Shandong Province, and provides a scientific basis for health interventions for chronic diseases.
  • The other category analyzes electronic health record data and other data collected through examinations, including anthropometric characteristics (age, gender, body mass index, etc.) and physiological records (routine blood tests, blood glucose, routine urine tests, etc.).
  • Chinese patent document CN107007284A discloses a multi-disease chronic disease information management system, including a database, an application server, several hospital clients and patient clients. The database stores each patient's physical examination data, doctors' recommendations, the reference ranges of health data for each examination item, and the health status evaluation indicators of various chronic diseases. In response to a first query instruction sent by a hospital/patient client, the application server retrieves from the database the designated patient's physical examination data and the corresponding reference ranges, the health status evaluation indicators of various chronic diseases, and the doctors' recommendations, derives a chronic disease evaluation result, and returns that result together with the above data to the hospital/patient client.
  • the present invention provides a chronic disease prediction system based on a multi-task learning model, which can predict multiple chronic diseases at the same time by using the potential connections that may exist between multiple chronic diseases.
  • a chronic disease prediction system based on a multi-task learning model, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, wherein the computer memory stores a trained chronic disease prediction model, which is composed of a shared-layer convolutional neural network and multiple chronic disease branch networks;
  • the physical examination records to be predicted are input into the shared layer convolutional neural network of the chronic disease prediction model for feature extraction, and feature maps are obtained;
  • the obtained feature maps are input into each chronic disease branch network respectively, and feature extraction and prediction are performed respectively, and the chronic disease prediction results are obtained.
  • the structure of the shared-layer convolutional neural network is as follows: the input first passes through multiple task-shared convolutional layers, which use 3 and 6 convolution kernels of size 3×3, respectively, for feature extraction, with the stride of the convolution kernels set to 1;
  • Each chronic disease branch network has 2 convolutional layers, which use 9 and 12 convolution kernels, respectively, for feature extraction, with strides of 2 and 1, respectively.
  • Finally, each branch passes in turn through two fully connected layers with 32 nodes each and a softmax layer to produce the final output.
  • the training process of the chronic disease prediction model is as follows:
  • the data encoding method includes a content encoding strategy and a spatial encoding strategy, where the content encoding strategy unifies the numerical types of the data and the spatial encoding strategy unifies the format of the data input to the model;
  • the physical examination data used in the present invention is csv format data, and can also be structured data in other formats for the patient's physical examination record.
  • Each csv data corresponds to a patient's physical examination record, and each csv record includes multiple physical examination index items.
  • In the process of model training there may be some patients whose multiple physical examination index items are missing, which will result in larger errors and poor results in model training. Therefore, in this step, we have eliminated these data records.
  • some physical examination index items are missing in multiple patients, which will also lead to poor performance in the model training process. Therefore, these index items are eliminated.
  • the preprocessing includes: correlation analysis and missing-value statistics for the indicators in the physical examination data; from the perspective of records, elimination of records whose missing values exceed a certain percentage; from the perspective of indicators, elimination of indicators whose missing values across all records exceed a certain percentage; and, for the remaining missing data, grouping the records by age and filling in the missing values.
  • a 5-fold cross-validation method is used to divide the data set into groups, thereby averaging the results of 5 different groupings to reduce the variance and reduce the sensitivity of the model's performance to the data division.
  • the specific process of the 5-fold cross-validation method is as follows:
  • Using sampling without replacement, the sample data is randomly divided into 5 parts, each with the same or a similar number of samples; each time, one part is selected as the test set and the remaining 4 parts are used as the training set for model training. This is repeated 5 times to produce 5 different training/validation set pairs. In this way, each subset serves once as the validation set, with the rest as the training set.
  • the content encoding strategy consists of the following two specific operations:
  • Label encoding is used to encode text information in physical examination records into numerical information;
  • One-hot encoding is used to encode continuous variables in physical examination records into categorical variables as input.
  • the physical examination record after content encoding is a one-dimensional vector; pairwise correlations are computed among all variables in this vector; the variables are sorted in descending order of the sum of each variable's correlations with all other variables; the sorted variables are then arranged in sequence to form a two-dimensional vector, which is used as the input data of the network.
  • the specific process of using the training set to train the chronic disease prediction model is as follows:
  • One training set is input and passes through the shared layers, which extract features of potential correlations, and then through the feature extraction for each single chronic disease, and the prediction results are output;
  • When the set ACC threshold or the specified number of iterations is reached, the model stops updating and outputs the result;
  • the above training process also includes: after training on each group's training set, that group's validation set is input into the model to obtain the corresponding classification results; the loss values obtained on all validation sets are averaged and used as the performance evaluation of the model, in order to find the optimal parameters.
  • Model performance evaluation includes the prediction accuracy of multiple single disease types.
  • the present invention has the following beneficial effects:
  • the present invention builds a chronic disease prediction system based on a multi-task learning model. First, the physical examination records are preprocessed and their content and structure are encoded. Then a multi-task learning model is designed: a multi-task shared layer extracts features of the potential connections that may exist among multiple diseases, and single-task branches, each designed for one chronic disease, perform separate feature extraction and final prediction. This allows multiple chronic diseases to be predicted simultaneously while making full use of the potential associations among them. During training, 5-fold cross-validation is used; after multiple iterations, the model reaches relatively stable performance and high accuracy.
  • Figure 1 is a schematic diagram of the preprocessing flow of a physical examination record used in an embodiment of the present invention.
  • Figure 2 is a schematic diagram of the 5-fold cross-validation method used in an embodiment of the present invention.
  • Figure 3 is a flow chart of the overall framework of the network model proposed by the present invention.
  • Figure 4 shows how the content encoding strategy used in an embodiment of the present invention is implemented.
  • Figure 5 is a schematic diagram of the network structure of the chronic disease prediction model used in an embodiment of the present invention.
  • Figure 6 shows the model's prediction results in an embodiment of the present invention.
  • a chronic disease prediction system based on a multi-task learning model, including a computer memory, a computer processor, and a computer program that is stored in the computer memory and can be executed on the computer processor, with a trained chronic disease prediction model stored in the computer memory;
  • the chronic disease prediction model is composed of a shared-layer convolutional neural network and multiple chronic disease branch networks; the following steps are implemented when the computer processor executes the computer program:
  • the physical examination records to be predicted are first input into the shared-layer convolutional neural network of the chronic disease prediction model for feature extraction, yielding feature maps; the feature maps are then input into each chronic disease branch network separately for feature extraction and prediction, yielding the chronic disease prediction results.
  • Physical examination records are obtained and preprocessed. The sample dataset, obtained from 5 partner hospitals, contains 48,953 physical examination records. A single record includes up to 55 physical examination items, each with its own reference range, and there are some outliers. Each record was carefully annotated jointly by more than 3 professional doctors to label the patient as having hypertension, diabetes, both, or neither.
  • the acquired sample dataset is preprocessed, and data is eliminated based on feature correlation and missing features.
  • the Pearson correlation coefficient is mainly used to calculate the correlation between features. For each pair of variables with a Pearson coefficient greater than 0.8, the feature with the larger amount of missing data in the pair is eliminated. In addition, any patient whose missing-feature ratio is greater than 0.2 is discarded. After elimination, 13,358 physical examination records and 49 physical examination indicators remain, and the missing-value ratio of every record is less than 0.2.
  • age is one of the risk factors for hypertension and diabetes. Therefore, age is an important grouping basis for filling in missing values.
  • the patients are first grouped according to age, and divided into 7 groups in total. Then, for a certain feature to be filled, the mode of the feature value in the group is selected for filling.
  • the specific steps of data set preprocessing are shown in Figure 1.
  • For the 49 index items in each record, the one-hot encoding method of the content encoding strategy is first used to encode the data of index items whose values are text.
  • the encoding method is shown in Figure 4.
  • The spatial encoding strategy then maps the 49 index items to a 7×7 matrix as the input of the network model, as shown in the left part of Figure 3.
  • the spatial mapping method here follows the method described in the present invention.
  • the pairwise correlations among the 49 index items are calculated, the index items are arranged in descending order of the sum of each item's correlations with all other items, and the one-dimensional index sequence is then mapped into two-dimensional space, with the h-th of the 49 index items mapped to position (i, j), i.e. element m_ij, of the matrix M.
  • (Within one set of experiments the same mapping is kept, that is, a given index item is mapped to a fixed position in all samples, to support subsequent correlation analysis.)
  • the chronic disease prediction model of the present invention takes a two-dimensional vector as input.
  • a shared layer convolutional neural network shared by multiple diseases is first designed to extract features of potential correlations that may exist in multiple diseases;
  • the feature maps after common feature extraction are used for feature extraction and prediction respectively for each branch of different kinds of chronic diseases.
  • This embodiment constructs a network model for two specific diseases of diabetes and hypertension, and performs feature extraction and disease prediction for the two diseases.
  • the training data set in the first group of data encoded in the above step S03 is input into the model on an individual basis, that is, each input data is data of a two-dimensional matrix containing a physical examination record.
  • the data is input into the model for feature extraction and prediction.
  • the detailed structure of the model is shown in Figure 5.
  • 3 and 6 convolution kernels of size 3×3 are used for feature extraction, and the stride of the convolution kernels is set to 1.
  • feature extraction is carried out separately for the diabetes physical examination data and the hypertension physical examination data.
  • Each branch has two convolutional layers, which use 9 and 12 convolution kernels, respectively, for feature extraction, with strides of 2 and 1, respectively.
  • the two branches, which predict diabetes and hypertension, each pass in turn through two fully connected layers with 32 nodes each and a softmax layer to obtain the final output.
  • Based on the features extracted by the model, each branch determines whether the patient has hypertension or diabetes: branch 1 corresponds to hypertension and branch 2 to diabetes.
  • the discriminant result output by the model and the annotation corresponding to the physical examination record marked by the expert in step 1 are used to calculate the loss through the cross-entropy loss function.
  • the sum of the loss values of the two branches is used as the loss function of the entire model to optimize the model.
  • the prediction accuracy rate for hypertension can reach 73%, and the prediction accuracy rate for diabetes can reach 82%.
  • The AUC can reach 79% and more than 85%, which the patent describes as a considerable advantage over single-task models.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

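A chronic disease prediction system based on a multi-task learning model, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor. The computer memory stores a trained chronic disease prediction model, which consists of a shared-layer convolutional neural network and multiple chronic disease branch networks. When executing the computer program, the computer processor performs the following steps: a physical examination record to be predicted is preprocessed and first input into the shared-layer convolutional neural network of the chronic disease prediction model for feature extraction, yielding feature maps; the feature maps are then input into each chronic disease branch network, each of which performs feature extraction and prediction, yielding the chronic disease prediction results. The system can predict multiple chronic diseases simultaneously.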

Description

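Chronic disease prediction system based on a multi-task learning model
Technical Field
The present invention belongs to the field of medical artificial intelligence, and in particular relates to a chronic disease prediction system based on a multi-task learning model.
Background
Chronic diseases are a class of common diseases characterized by latent onset and long duration, including diabetes, cardiovascular disease, cancer and respiratory diseases. In recent years, the number of patients with chronic diseases has continued to increase rapidly. In general, chronic diseases have complex causes and require continuous treatment. They therefore harm people's health and lives, and their mortality and treatment burden keep rising. Early detection and intervention could effectively alleviate these problems.
At present, several methods attempt to detect and treat chronic diseases early. They generally fall into two categories. The first focuses on data covering people's living habits and demographic variables, identifying physical conditions or living habits that may lead to a given chronic disease so that the disease can be prevented.
For example, Chinese patent document CN107153774A discloses the construction of a hyperbolic model for chronic disease risk assessment and a disease prediction system applying the model. Relying on longitudinal health management data from more than 20 health management centers in Shandong Province, it builds a Shandong multi-center longitudinal health management observation cohort, explores the roles of genetics, environment, personal lifestyle, health interventions and other factors in the occurrence, development and outcome of major chronic diseases, establishes hyperbolic risk assessment models and disease prediction systems for various chronic diseases applicable to the health checkup population of Shandong Province, and provides a scientific basis for health interventions for chronic diseases.
The second category analyzes electronic health record data and other data collected through examinations, including anthropometric characteristics (age, sex, body mass index, etc.) and physiological records (routine blood tests, blood glucose, routine urine tests, etc.). By looking for links between medical indicators and chronic diseases, these methods identify the risk factors of a disease and use them to predict it. Meanwhile, some studies have explored common risk factors and the potential connections among several common chronic diseases.
For example, Chinese patent document CN107007284A discloses a multi-disease chronic disease information management system, comprising a database, an application server, and several hospital clients and patient clients. The database stores each patient's physical examination data, doctors' recommendations, the reference ranges of health data for each examination item, and the health status evaluation indicators of various chronic diseases. In response to a first query instruction sent by a hospital/patient client, the application server retrieves from the database the designated patient's physical examination data and the corresponding reference ranges, the health status evaluation indicators of various chronic diseases, and the doctors' recommendations, derives a chronic disease evaluation result, and returns that result together with the above data to the hospital/patient client.
However, there is still no method that uses these potential connections among chronic diseases to predict multiple chronic diseases simultaneously.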
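Summary of the Invention
The present invention provides a chronic disease prediction system based on a multi-task learning model, which predicts multiple chronic diseases simultaneously by exploiting the potential connections that may exist among them.
A chronic disease prediction system based on a multi-task learning model comprises a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor. The computer memory stores a trained chronic disease prediction model, which consists of a shared-layer convolutional neural network and multiple chronic disease branch networks;
When executing the computer program, the computer processor performs the following steps:
A physical examination record to be predicted is preprocessed and input into the shared-layer convolutional neural network of the chronic disease prediction model for feature extraction, yielding feature maps;
The feature maps are input into each chronic disease branch network, each of which performs feature extraction and prediction, yielding the chronic disease prediction results.
The structure of the shared-layer convolutional neural network is as follows: the input first passes through multiple task-shared convolutional layers, which use 3 and 6 convolution kernels of size 3×3, respectively, for feature extraction, with the kernel stride set to 1;
Each chronic disease branch network has 2 convolutional layers, which use 9 and 12 convolution kernels, respectively, for feature extraction, with kernel strides of 2 and 1, respectively; finally, each branch passes in turn through two fully connected layers with 32 nodes each and a softmax layer to produce the final output.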
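The layer description above can be checked with a small shape calculation. The following is a minimal sketch, not the patent's implementation: the 7×7 input size is taken from the embodiment, while the padding of 1 and the 3×3 kernel size in the branch layers are assumptions, since the text states the kernel size only for the shared layers and never mentions padding. (With no padding, the map would shrink to 1×1 after the first branch layer, too small for another 3×3 kernel, which is why padding is assumed here.)

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial output size of a square convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(input_size: int = 7):
    """Return (layer name, channels, height/width) after each convolutional layer."""
    shapes = []
    size = input_size
    # Shared layers: 3 then 6 kernels of size 3x3, stride 1 (from the text).
    for name, channels, stride in [("shared_conv1", 3, 1), ("shared_conv2", 6, 1)]:
        size = conv_out(size, stride=stride)
        shapes.append((name, channels, size))
    # One disease branch: 9 then 12 kernels, strides 2 and 1 (from the text).
    # Kernel size 3 and padding 1 are assumptions for this sketch.
    for name, channels, stride in [("branch_conv1", 9, 2), ("branch_conv2", 12, 1)]:
        size = conv_out(size, stride=stride)
        shapes.append((name, channels, size))
    return shapes


shapes = trace_shapes()
# Number of values flattened into the first 32-node fully connected layer.
flat_features = shapes[-1][1] * shapes[-1][2] ** 2
```

Under these assumptions each branch flattens a 12×4×4 feature map (192 values) before the two 32-node fully connected layers and the softmax layer.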
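The chronic disease prediction model is trained as follows:
Physical examination data related to chronic disease screening is acquired as sample data, preprocessed, and labeled; the labeled sample data is divided into training and validation sets by 5-fold cross-validation;
A data encoding method is designed for the structured data in the physical examination data to produce the input data of the chronic disease prediction model. The data encoding method comprises a content encoding strategy and a spatial encoding strategy: the content encoding strategy unifies the numerical types of the data, and the spatial encoding strategy unifies the format of the data input to the model;
A chronic disease prediction model based on multi-task learning is built, which uses deep learning to extract features from and classify the encoded structured data and outputs predictions for multiple chronic diseases simultaneously;
The chronic disease prediction model is trained on the training set, and its parameters are adjusted according to the degree of agreement between the model's predictions and the labels until the model converges.
The physical examination data used in the present invention is in csv format, but may also be structured data in other formats describing patients' physical examination records. Each csv entry corresponds to one patient's physical examination record, and each record includes multiple physical examination indicator items. During model training, some patients may be missing many physical examination indicators, which leads to larger errors and poorer results, so these records are removed in this step. Likewise, some indicator items are missing for many patients, which also degrades training, so these indicator items are removed.
Specifically, the preprocessing includes: correlation analysis and missing-value statistics for the indicators in the physical examination data; from the perspective of records, removing records whose missing values exceed a certain proportion; from the perspective of indicators, removing indicators whose missing values across all records exceed a certain proportion; and, for the remaining missing data in the records, grouping by age and filling in the missing values.
More specifically, patients are first grouped by age, and each missing item within a group is filled with the mean or the mode of that item within the group.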
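To make the model's performance more stable, 5-fold cross-validation is used: the dataset is divided into groups, and the results of training on 5 different groupings are averaged to reduce variance and make the model's performance less sensitive to how the data is split. The 5-fold cross-validation procedure is as follows:
Using sampling without replacement, the sample data is randomly divided into 5 parts with equal or nearly equal numbers of samples; each time, one part is selected as the test set and the remaining 4 parts serve as the training set for model training. This is repeated 5 times to produce 5 different training/validation set pairs. In this way, each subset serves once as the validation set, with the remaining subsets as the training set.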
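The content encoding strategy consists of the following two specific operations:
Label encoding is used to encode the text information in physical examination records into numerical information;
One-hot encoding is used to encode the continuous variables in physical examination records into categorical variables, which serve as input.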
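The two content-encoding operations can be illustrated as follows. This is a minimal sketch under stated assumptions: the field names, sample values and cut points are made up for illustration, and binning a continuous value before one-hot encoding it is one plausible reading of "encoding continuous variables into categorical variables", not something the text spells out.

```python
def label_encode(values):
    """Map each distinct text value to an integer code, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def bin_continuous(value, edges):
    """Index of the bin a continuous value falls into; `edges` are ascending cut points."""
    return sum(value >= e for e in edges)


def one_hot(code, num_classes):
    """One-hot vector with a 1 at position `code`."""
    vec = [0] * num_classes
    vec[code] = 1
    return vec


# Hypothetical text-valued indicator.
text_field = ["negative", "positive", "negative", "trace"]
encoded, mapping = label_encode(text_field)

# Hypothetical continuous indicator with made-up cut points (not clinical thresholds).
bin_index = bin_continuous(7.5, edges=[5.0, 10.0])
bin_vector = one_hot(bin_index, num_classes=3)
```

Here `encoded` is `[0, 1, 0, 2]` and the value 7.5 falls in the middle bin, giving `[0, 1, 0]`.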
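The spatial encoding strategy operates as follows:
After content encoding, a physical examination record is a one-dimensional vector. Pairwise correlations are computed among all variables in this vector; the variables are sorted in descending order of the sum of each variable's correlations with all other variables; the sorted variables are then arranged in sequence to form a two-dimensional vector, which serves as the input data of the network.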
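The chronic disease prediction model is trained on the training set as follows:
One training set is input and passes through the shared layers, which extract features of potential correlations, and then through the feature extraction for each single chronic disease, and the prediction results are output;
The output predictions are compared with the labels of the data; the ACC function is used as the current model's loss and is propagated back into the model to update the model's parameters;
When the set ACC threshold or the specified number of iterations is reached, the model stops updating and outputs the result;
The remaining training sets are input in turn and trained in the same way until the model converges.
The training process further includes: after training on each group's training set, that group's validation set is input into the model to obtain the corresponding classification results; the loss values obtained on all validation sets are averaged and used as the performance evaluation of the model, in order to find the optimal parameters. The model performance evaluation includes the prediction accuracy for each individual disease.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention builds a chronic disease prediction system based on a multi-task learning model. The physical examination records are first preprocessed, and their content and structure are encoded. A multi-task learning model is then designed: a multi-task shared layer extracts features of the potential connections that may exist among multiple diseases, and single-task branches, each designed for one chronic disease, perform separate feature extraction and final prediction. This allows multiple chronic diseases to be predicted simultaneously while making full use of the potential associations among them. During training, 5-fold cross-validation is used; after multiple iterations, the model reaches relatively stable performance and high accuracy.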
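Brief Description of the Drawings
Figure 1 is a schematic diagram of the physical examination record preprocessing flow used in an embodiment of the present invention;
Figure 2 is a schematic diagram of the 5-fold cross-validation method used in an embodiment of the present invention;
Figure 3 is a flow chart of the overall framework of the network model proposed by the present invention;
Figure 4 shows how the content encoding strategy used in an embodiment of the present invention is implemented;
Figure 5 is a schematic diagram of the network structure of the chronic disease prediction model used in an embodiment of the present invention;
Figure 6 shows the model's prediction results in an embodiment of the present invention.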
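Detailed Description
The present invention is described in further detail below with reference to the drawings and an embodiment. Note that the embodiment described below is intended to aid understanding of the present invention and does not limit it in any way.
A chronic disease prediction system based on a multi-task learning model comprises a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor. The computer memory stores a trained chronic disease prediction model, which consists of a shared-layer convolutional neural network and multiple chronic disease branch networks. When executing the computer program, the computer processor performs the following steps:
A physical examination record to be predicted is preprocessed and first input into the shared-layer convolutional neural network of the chronic disease prediction model for feature extraction, yielding feature maps; the feature maps are then input into each chronic disease branch network, each of which performs feature extraction and prediction, yielding the chronic disease prediction results.
The construction, training and validation of the model are described in detail below.
S01: Building the sample dataset.
Physical examination records are obtained and preprocessed. The sample dataset, obtained from 5 partner hospitals, contains 48,953 physical examination records. A single record includes up to 55 physical examination items, each with its own reference range, and there are some outliers. Each record was carefully annotated jointly by more than 3 professional doctors to label the patient as having hypertension, diabetes, both, or neither.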
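S02: Dataset preprocessing.
The acquired sample dataset is preprocessed, with data eliminated on the basis of feature correlation and missing features. First, pairwise correlation analysis is performed on all 55 indicators. Considering the number of indicators and the data encoding method described in the invention, some variables are removed so that each record retains as much useful information as possible while adding as little redundant information as possible. Depending on the variable type of each indicator's values, the Pearson correlation coefficient is mainly used to compute the correlation between features. For each pair of variables with a Pearson coefficient greater than 0.8, the feature with the larger amount of missing data in the pair is removed. In addition, any patient whose missing-feature ratio exceeds 0.2 is also discarded. After elimination, 13,358 physical examination records and 49 physical examination indicators remain, and the missing-value ratio of every record is less than 0.2.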
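The elimination rules in step S02 can be sketched in plain Python. This is an illustrative sketch rather than the patent's code: the column names and values are made up, missing entries are represented as `None`, and the 0.2 figure is read as a ratio of missing values, which the text implies but does not state outright.

```python
import math


def pearson(xs, ys):
    """Pearson correlation over positions where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def missing_ratio(values):
    """Fraction of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def drop_redundant_features(columns, threshold=0.8):
    """For each pair with Pearson r > threshold, drop the column with more missing values."""
    dropped = set()
    names = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if pearson(columns[a], columns[b]) > threshold:
                worse = a if missing_ratio(columns[a]) > missing_ratio(columns[b]) else b
                dropped.add(worse)
    return {k: v for k, v in columns.items() if k not in dropped}


def drop_sparse_records(records, max_missing=0.2):
    """Discard records whose missing ratio is greater than max_missing."""
    return [r for r in records if missing_ratio(r) <= max_missing]


# Made-up columns: "bmi" tracks "weight" closely and has more missing values.
columns = {
    "weight": [60, 70, 80, 90, None],
    "bmi": [20, 23.5, None, 30, None],
    "age": [30, 50, 25, 40, 35],
}
kept = drop_redundant_features(columns)
records = [[1, 2, 3, 4, 5], [1, None, None, 4, 5], [1, None, 3, 4, 5]]
dense = drop_sparse_records(records)
```

In this toy data `bmi` is dropped because it is nearly collinear with `weight` and has more gaps, and the second record is dropped because 2 of its 5 values are missing.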
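Next, missing data in these records is filled in by age group. Studies have shown that age is one of the risk factors for hypertension and diabetes, so age is used as an important grouping criterion for filling in missing values. For the different categories of data in the dataset, patients are first grouped by age into 7 groups in total. Then, for each feature to be filled, the mode of that feature's values within the group is used. The specific steps of dataset preprocessing are shown in Figure 1.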
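Age-grouped mode filling can be sketched as below. The text says only that there are 7 age groups, so the decade-wide bins used here (with everyone aged 60 and over in the last group) are a hypothetical choice, as are the field names and values.

```python
from collections import Counter


def age_group(age, width=10, num_groups=7):
    """Hypothetical grouping: decade-wide bins, capped at the last group."""
    return min(age // width, num_groups - 1)


def fill_with_group_mode(records, feature):
    """Fill missing values of `feature` with that feature's mode within each age group."""
    by_group = {}
    for r in records:
        if r[feature] is not None:
            by_group.setdefault(age_group(r["age"]), []).append(r[feature])
    modes = {g: Counter(vals).most_common(1)[0][0] for g, vals in by_group.items()}
    filled = []
    for r in records:
        r = dict(r)  # copy so the input records are left untouched
        if r[feature] is None:
            # Stays None if the whole age group has no observed value.
            r[feature] = modes.get(age_group(r["age"]))
        filled.append(r)
    return filled


# Made-up records; "x" stands for any encoded physical examination indicator.
records = [
    {"age": 34, "x": 1},
    {"age": 36, "x": 1},
    {"age": 38, "x": 2},
    {"age": 31, "x": None},
    {"age": 65, "x": 3},
    {"age": 72, "x": None},
]
filled = fill_with_group_mode(records, "x")
```

The 31-year-old receives the mode of the 30 to 39 group (1), and the 72-year-old receives the mode of the last group (3).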
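The sample dataset is divided approximately evenly into 5 parts for five-fold cross-validation, with sample counts of [2672, 2672, 2672, 2671, 2671], labeled [E1, E2, E3, E4, E5]. Model training and prediction are carried out five times, denoted 1st iteration, 2nd iteration, and so on. The five-fold cross-validation procedure is shown in Figure 2, where Training folds denotes the training set and Test folds denotes the validation set.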
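The fold sizes quoted above follow directly from splitting 13,358 records into 5 nearly equal parts. A minimal sketch of such a split, assuming a simple seeded shuffle (the seed and function names are illustrative, not from the source):

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices once and cut them into k nearly equal, non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds, start = [], 0
    for i in range(k):
        # The first (n_samples % k) folds get one extra sample.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_validation_splits(folds):
    """Yield (training indices, held-out indices), holding out each fold once."""
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


folds = k_fold_indices(13358)
fold_sizes = [len(f) for f in folds]  # [2672, 2672, 2672, 2671, 2671]
splits = list(train_validation_splits(folds))
```

Since 13358 = 5 × 2671 + 3, three folds receive 2672 records and two receive 2671, matching the counts in the text.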
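S03: Data encoding.
For the 49 indicator items in each record, the one-hot encoding method of the content encoding strategy is first used to encode the data of indicator items whose values are text; the encoding is shown in Figure 4. The spatial encoding strategy then maps the 49 indicator items to a 7×7 matrix, which serves as the input of the network model, as shown in the left part of Figure 3. The spatial mapping follows the method described in the present invention: pairwise correlations are first computed among the 49 indicator items, the items are arranged in descending order of the sum of each item's correlations with all other items, and the one-dimensional indicator sequence is then mapped into two-dimensional space, with the h-th of the 49 indicators mapped to position (i, j), i.e. element m_ij, of the matrix M. (Within one set of experiments the same mapping is kept, that is, a given indicator is mapped to a fixed position in all samples, to support subsequent correlation analysis.)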
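The spatial encoding step can be sketched as a sort followed by a reshape. This sketch makes several assumptions the text leaves open: the matrix is filled row by row, the signed correlation values are summed as given (absolute values would be another reasonable choice), and the ordering is computed once and reused for every record. The small 2×2 example uses made-up correlations.

```python
def spatial_encode(values, corr, side=7):
    """Order indicators by total correlation with the others and reshape to side x side."""
    n = side * side
    if len(values) != n or len(corr) != n:
        raise ValueError("expected side*side indicators and a matching correlation matrix")
    # Sum of correlations between each indicator and all the other indicators.
    totals = [sum(corr[h][k] for k in range(n) if k != h) for h in range(n)]
    # Descending by total correlation; ties keep the original indicator order.
    order = sorted(range(n), key=lambda h: -totals[h])
    ranked = [values[h] for h in order]
    # Row-major fill (an assumption): rank h lands at row h // side, column h % side.
    matrix = [ranked[r * side:(r + 1) * side] for r in range(side)]
    return matrix, order


# Tiny illustration with 4 indicators mapped to a 2x2 matrix.
corr = [
    [1.0, 0.1, 0.2, 0.3],
    [0.1, 1.0, 0.5, 0.6],
    [0.2, 0.5, 1.0, 0.9],
    [0.3, 0.6, 0.9, 1.0],
]
matrix, order = spatial_encode([10, 20, 30, 40], corr, side=2)
```

With the default `side=7` the same function maps the 49 encoded indicators of a record to the 7×7 network input; reusing `order` across records keeps each indicator at a fixed position, as the text requires.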
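S04: Construction of the multi-task learning model (the chronic disease prediction model).
The chronic disease prediction model of the present invention takes a two-dimensional vector as input. As shown in Figure 3, a shared-layer convolutional neural network shared by multiple diseases is first designed to extract features of the potential correlations that may exist among the diseases; the feature maps produced by this common feature extraction then pass through separate branches for the different chronic diseases, each of which performs feature extraction and prediction.
This embodiment constructs a network model for two specific diseases, diabetes and hypertension, and performs feature extraction and disease prediction for both. The training dataset of group I, encoded in step S03, is input into the model one individual at a time, that is, each input is a two-dimensional matrix containing one physical examination record. The data is input into the model for feature extraction and prediction; the detailed structure of the model is shown in Figure 5. The input first passes through two task-shared convolutional layers, which use 3 and 6 convolution kernels of size 3×3, respectively, for feature extraction, with the kernel stride set to 1. Then the task-specific branches of the model extract features from the diabetes and hypertension physical examination data separately; each branch has 2 convolutional layers in sequence, which use 9 and 12 convolution kernels, respectively, with kernel strides of 2 and 1, respectively. Finally, the two branches, which predict diabetes and hypertension, each pass in turn through two fully connected layers with 32 nodes each and a softmax layer to produce the final output. Based on the features extracted by the model, each branch determines whether the patient has hypertension or diabetes: branch 1 corresponds to hypertension and branch 2 to diabetes. The model's outputs and the experts' annotations of the corresponding physical examination records from step S01 are passed to the cross-entropy loss function to compute the loss, and the sum of the two branches' loss values is used as the loss function of the whole model to optimize it.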
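The loss described at the end of this step, the sum of the two branches' cross-entropy losses, can be written out for a single record. This is a framework-free sketch for illustration; the branch names and logits are hypothetical, and the logits stand for the values each branch feeds into its softmax layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one sample: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def multitask_loss(branch_logits, labels):
    """Sum of the per-branch cross-entropy losses, as in the embodiment."""
    return sum(cross_entropy(branch_logits[task], labels[task]) for task in labels)


# Hypothetical outputs for one record: each branch is uninformative (equal logits).
loss = multitask_loss(
    {"hypertension": [0.0, 0.0], "diabetes": [0.0, 0.0]},
    {"hypertension": 1, "diabetes": 0},
)
```

With equal logits each branch assigns probability 0.5 to the true class, so the total is 2·ln 2. Because the shared layers receive gradients from both terms of the sum, both tasks shape the shared features.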
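S05: Prediction on the test set data.
The data in the test dataset of the corresponding group I is input into the converged multi-task chronic disease prediction model trained in step S04 to obtain the corresponding predictions, and ACC (prediction accuracy) is computed over all test data of that group, separately for hypertension and for diabetes.
S06: Five-fold cross-validation.
Steps S04 and S05 are repeated five times to complete the five-fold cross-validation, giving prediction accuracies on five test datasets (separately for hypertension and diabetes). These accuracies are averaged and used to evaluate the parameters and the performance of the model, in order to find the optimal parameters.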
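Steps S05 and S06 reduce to computing a per-disease accuracy on each held-out fold and averaging across folds. A minimal sketch, in which the predictions and labels are toy values made up purely to exercise the code (they are not results from the patent) and only two folds are shown for brevity:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels (ACC)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def mean_accuracy_per_task(fold_results):
    """Average each task's accuracy over folds.

    fold_results: one dict per fold, mapping task name -> (predictions, labels).
    """
    tasks = fold_results[0].keys()
    return {
        t: sum(accuracy(*fold[t]) for fold in fold_results) / len(fold_results)
        for t in tasks
    }


# Toy data for two folds; the embodiment would use five.
fold_results = [
    {"hypertension": ([1, 0, 1, 1], [1, 0, 0, 1]), "diabetes": ([0, 0, 1, 1], [0, 0, 1, 1])},
    {"hypertension": ([0, 0], [0, 1]), "diabetes": ([1, 0], [0, 0])},
]
mean_acc = mean_accuracy_per_task(fold_results)
```

Averaging unweighted per-fold accuracies is reasonable here because the folds are nearly equal in size.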
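As shown in Figure 6, after training, the model of the present invention reaches a prediction accuracy of 73% for hypertension and 82% for diabetes. The AUC can reach 79% and more than 85%, a considerable advantage over single-task models.
The embodiment described above explains the technical solution and beneficial effects of the present invention in detail. It should be understood that the above is only a specific embodiment of the present invention and does not limit it; any modification, supplement or equivalent replacement made within the principles of the present invention shall fall within its scope of protection.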

Claims (9)

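  1. A chronic disease prediction system based on a multi-task learning model, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, characterized in that the computer memory stores a trained chronic disease prediction model, and the chronic disease prediction model consists of a shared-layer convolutional neural network and multiple chronic disease branch networks;
    when executing the computer program, the computer processor performs the following steps:
    a physical examination record to be predicted is preprocessed and first input into the shared-layer convolutional neural network of the chronic disease prediction model for feature extraction, yielding feature maps;
    the feature maps are then input into each chronic disease branch network, each of which performs feature extraction and prediction, yielding the chronic disease prediction results.
  2. The chronic disease prediction system based on a multi-task learning model according to claim 1, characterized in that the structure of the shared-layer convolutional neural network is as follows: the input first passes through multiple task-shared convolutional layers, which use 3 and 6 convolution kernels of size 3×3, respectively, for feature extraction, with the kernel stride set to 1;
    each chronic disease branch network has 2 convolutional layers, which use 9 and 12 convolution kernels, respectively, for feature extraction, with kernel strides of 2 and 1, respectively; finally, each branch passes in turn through two fully connected layers with 32 nodes each and a softmax layer to produce the final output.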
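  3. The chronic disease prediction system based on a multi-task learning model according to claim 1, characterized in that the chronic disease prediction model is trained as follows:
    physical examination data related to chronic disease screening is acquired as sample data, preprocessed, and labeled, and the labeled sample data is divided into training and validation sets by 5-fold cross-validation;
    a data encoding method is designed for the structured data in the physical examination data to produce the input data of the chronic disease prediction model; the data encoding method comprises a content encoding strategy and a spatial encoding strategy, wherein the content encoding strategy unifies the numerical types of the data and the spatial encoding strategy unifies the format of the data input to the model;
    a chronic disease prediction model based on multi-task learning is built, which uses deep learning to extract features from and classify the encoded structured data and outputs predictions for multiple chronic diseases simultaneously;
    the chronic disease prediction model is trained on the training set, and its parameters are adjusted according to the degree of agreement between the model's predictions and the labels until the model converges.
  4. The chronic disease prediction system based on a multi-task learning model according to claim 3, characterized in that the preprocessing includes: correlation analysis and missing-value statistics for the indicators in the physical examination data; from the perspective of records, removing records whose missing values exceed a certain proportion; from the perspective of indicators, removing indicators whose missing values across all records exceed a certain proportion; and, for the missing data in the records, grouping by age and filling in the missing values.
  5. The chronic disease prediction system based on a multi-task learning model according to claim 3, characterized in that the 5-fold cross-validation method proceeds as follows:
    using sampling without replacement, the sample data is randomly divided into 5 parts with equal or nearly equal numbers of samples; each time, one part is selected as the test set and the remaining 4 parts serve as the training set for model training; this is repeated 5 times to produce 5 different training/validation set pairs.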
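  6. The chronic disease prediction system based on a multi-task learning model according to claim 3, characterized in that the content encoding strategy consists of the following two specific operations:
    label encoding is used to encode the text information in physical examination records into numerical information;
    One-hot encoding is used to encode the text information in physical examination records into numerical information, which serves as input.
  7. The chronic disease prediction system based on a multi-task learning model according to claim 3, characterized in that the spatial encoding strategy operates as follows:
    after content encoding, a physical examination record is a one-dimensional vector; pairwise correlations are computed among all variables in this vector; the variables are sorted in descending order of the sum of each variable's correlations with all other variables; the sorted variables are then arranged in sequence to form a two-dimensional vector, which serves as the input data of the network.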
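  8. The chronic disease prediction system based on a multi-task learning model according to claim 3, characterized in that the chronic disease prediction model is trained on the training set as follows:
    one training set is input and passes through the shared layers, which extract features of potential correlations, and then through the feature extraction for each single chronic disease, and the prediction results are output;
    the output predictions are compared with the labels of the data; the ACC function is used as the current model's loss and is propagated back into the model to update the model's parameters;
    when the set ACC threshold or the specified number of iterations is reached, the model stops updating and outputs the result;
    the remaining training sets are input in turn and trained in the same way until the model converges.
  9. The chronic disease prediction system based on a multi-task learning model according to claim 8, characterized in that the training process further includes: after training on each group's training set, that group's validation set is input into the model to obtain the corresponding classification results; the loss values obtained on all validation sets are averaged and used as the performance evaluation of the model, in order to find the optimal parameters.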
PCT/CN2020/128427 2019-12-19 2020-11-12 Chronic disease prediction system based on multi-task learning model WO2021120936A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/623,555 US20220254493A1 (en) 2019-12-19 2020-11-12 Chronic disease prediction system based on multi-task learning model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911317824.0A CN111180068A (zh) 2019-12-19 2019-12-19 Chronic disease prediction system based on multi-task learning model
CN201911317824.0 2019-12-19

Publications (1)

Publication Number Publication Date
WO2021120936A1 (zh) 2021-06-24

Family

ID=70657545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128427 WO2021120936A1 (zh) 2019-12-19 2020-11-12 Chronic disease prediction system based on multi-task learning model

Country Status (3)

Country Link
US (1) US20220254493A1 (zh)
CN (1) CN111180068A (zh)
WO (1) WO2021120936A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
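CN113555110A (zh) * 2021-07-15 2021-10-26 北京鹰瞳科技发展股份有限公司 Method and device for training a multi-disease referral model
CN114648509A (zh) * 2022-03-25 2022-06-21 中国医学科学院肿瘤医院 Thyroid cancer detection system based on multi-class classification tasks
CN115116607A (zh) * 2022-08-30 2022-09-27 之江实验室 Brain disease prediction system based on resting-state magnetic resonance transfer learning
CN115579128A (zh) * 2022-10-19 2023-01-06 内蒙古卫数数据科技有限公司 Multi-model feature-enhanced disease screening system
CN116913519A (zh) * 2023-07-24 2023-10-20 东莞莱姆森科技建材有限公司 Smart-mirror-based health monitoring method, apparatus, device and storage medium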

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
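CN111180068A (zh) * 2019-12-19 2020-05-19 浙江大学 Chronic disease prediction system based on multi-task learning model
CN112117009A (zh) * 2020-09-25 2020-12-22 北京百度网讯科技有限公司 Method, apparatus, electronic device and medium for constructing a label prediction model
US11468993B2 (en) * 2021-02-10 2022-10-11 Eyethena Corporation Digital therapeutic platform
CN113076850A (zh) * 2021-03-29 2021-07-06 Oppo广东移动通信有限公司 Multi-task prediction method, multi-task prediction apparatus and electronic device
CN113611411B (zh) * 2021-10-09 2021-12-31 浙江大学 Physical examination decision support system based on false-negative sample identification
US20230290502A1 (en) * 2022-03-10 2023-09-14 Aetna Inc. Machine learning framework for detection of chronic health conditions
CN116130084A (zh) * 2022-12-12 2023-05-16 中国医学科学院医学信息研究所 Method for predicting intervention effects in elderly populations at high risk of lung cancer
CN115862870B (zh) * 2022-12-16 2023-11-24 深圳市携康网络科技有限公司 Artificial-intelligence-based chronic disease management system and method
CN116825360A (zh) * 2023-07-24 2023-09-29 湖南工商大学 Graph-neural-network-based chronic disease comorbidity prediction method, apparatus and related device
CN117421548B (zh) * 2023-12-18 2024-03-12 四川互慧软件有限公司 Convolutional-neural-network-based method and system for handling missing physiological indicator data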

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
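CN106874663A (zh) * 2017-01-26 2017-06-20 中电科软件信息服务有限公司 Cardiovascular and cerebrovascular disease risk prediction method and system
CN109378066A (zh) * 2018-12-20 2019-02-22 翼健(上海)信息科技有限公司 Control method and control device for disease prediction based on feature vectors
CN109684922A (zh) * 2018-11-20 2019-04-26 浙江大学山东工业技术研究院 Multi-model recognition method for finished dishes based on convolutional neural networks
CN109994201A (zh) * 2019-03-18 2019-07-09 浙江大学 Deep-learning-based method for calculating diabetes and hypertension probability
CN111180068A (zh) * 2019-12-19 2020-05-19 浙江大学 Chronic disease prediction system based on multi-task learning model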

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
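CN108831559B (zh) * 2018-06-20 2021-01-15 清华大学 Chinese electronic medical record text analysis method and system
CN109036553B (zh) * 2018-08-01 2022-03-29 北京理工大学 Disease prediction method based on automatic extraction of medical expert knowledge
CN109658419B (zh) * 2018-11-15 2020-06-19 浙江大学 Method for segmenting small organs in medical images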

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
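CN113555110A (zh) * 2021-07-15 2021-10-26 北京鹰瞳科技发展股份有限公司 Method and device for training a multi-disease referral model
CN114648509A (zh) * 2022-03-25 2022-06-21 中国医学科学院肿瘤医院 Thyroid cancer detection system based on multi-classification tasks
CN115116607A (zh) * 2022-08-30 2022-09-27 之江实验室 Brain disease prediction system based on resting-state magnetic resonance transfer learning
CN115579128A (zh) * 2022-10-19 2023-01-06 内蒙古卫数数据科技有限公司 Multi-model feature-enhanced disease screening system
CN115579128B (zh) * 2022-10-19 2023-11-21 内蒙古卫数数据科技有限公司 Multi-model feature-enhanced disease screening system
CN116913519A (zh) * 2023-07-24 2023-10-20 东莞莱姆森科技建材有限公司 Health monitoring method, apparatus, device and storage medium based on a smart mirror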

Also Published As

Publication number Publication date
US20220254493A1 (en) 2022-08-11
CN111180068A (zh) 2020-05-19

Similar Documents

Publication Publication Date Title
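WO2021120936A1 (zh) Chronic disease prediction system based on multi-task learning model
CN111292853B (zh) Multi-parameter-based cardiovascular disease risk prediction network model and construction method thereof
RU2703679C2 (ru) Method and system for supporting medical decision-making using mathematical models of patient representation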
Islam et al. Chronic kidney disease prediction based on machine learning algorithms
Ho et al. The dependence of machine learning on electronic medical record quality
CN113782183B (zh) Pressure injury risk prediction device and method based on multi-algorithm fusion
Bardak et al. Improving clinical outcome predictions using convolution over medical entities with multimodal learning
CN110400610B (zh) Small-sample clinical data classification method and system based on multi-channel random forest
Chen et al. Heterogeneous postsurgical data analytics for predictive modeling of mortality risks in intensive care units
Popkes et al. Interpretable outcome prediction with sparse Bayesian neural networks in intensive care
Juraev et al. Multilayer dynamic ensemble model for intensive care unit mortality prediction of neonate patients
Overweg et al. Interpretable outcome prediction with sparse Bayesian neural networks in intensive care
CN112542242A (zh) Data conversion/symptom scoring
CN114611879A (zh) Clinical risk prediction system based on multi-task learning
Sun et al. A general fine-tuned transfer learning model for predicting clinical task acrossing diverse EHRs datasets
Alshari et al. Machine learning model to diagnose diabetes type 2 based on health behavior
CN114613465A (zh) Stroke risk prediction and personalized treatment recommendation method and system
Bamidele et al. Survival model for diabetes mellitus patients’ using support vector machine
Kasabe et al. Cardio Vascular ailments prediction and analysis based on deep learning techniques
Ravaji et al. CSChO-deep MaxNet: Cat swam chimp optimization integrated deep maxout network for heart disease detection
Shabbeer et al. Liver Disease Prediction Model Using SVM and Logistic Regression
Lawal et al. Heart disease diagnosis using data mining techniques and a decision support system
Shi et al. Assessing palliative care needs using machine learning approaches
Bose et al. Female Diabetic Prediction in India Using Different Learning Algorithms
Sarkar et al. Machine Learning based Early Predication and Detection of Diabetes Mellitus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20901713

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20901713

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.01.2023)
