CN108053885A - A hemorrhagic transformation prediction system

Info

Publication number: CN108053885A
Application number: CN201711209120.2A
Authority: CN
Inventors: 王枫
Original assignee: Shanghai Sixth People's Hospital
Current assignee: Shanghai Sixth People's Hospital
Priority date: 2017-11-27
Filing date: 2017-11-27
Publication date: 2018-05-18
Anticipated expiration: 2037-11-27
Also published as: CN108053885B

Abstract

The invention discloses a hemorrhagic transformation prediction system, belonging to the field of medical technology. The system includes: an acquisition unit for acquiring multiple pieces of training patient data; a model generation unit for generating, from the acquired training patient data, a prediction model for predicting hemorrhagic transformation, the model generation unit further including a feature selection module for selecting training disease features in the training patient data, a feature classification module for classifying the selected training disease features, and a model training module for training the prediction model from the classified training disease features; a collection unit for collecting actual patient data; and a prediction unit for feeding the actual patient data into the trained prediction model to output a corresponding prediction result. The beneficial effect of the above technical solution is that it reduces the probability of hemorrhagic transformation occurring, thereby reducing clinical risk and the corresponding medical expenses.

Description

A Prediction System for Hemorrhagic Transformation

Technical Field

The invention relates to the field of medical technology, and in particular to a hemorrhagic transformation prediction system.

Background Art

Cerebral infarction is one of the major global public health problems, with high incidence, mortality, disability and recurrence rates and limited clinical treatment options. Intravenous thrombolysis has been a major breakthrough in the clinical treatment of ischemic stroke over the past 20 years and can effectively reduce mortality and disability. However, intravenous thrombolysis is a high-risk treatment that may be accompanied by hemorrhagic transformation after thrombolysis. Some hemorrhages can aggravate neurological damage and even endanger life, and hemorrhagic transformation is one of the important reasons why intravenous thrombolysis cannot be more widely adopted. Specifically, in clinical practice intravenous thrombolysis has a strict time limit: only patients with acute cerebral infarction who meet the conditions for intravenous thrombolysis within 4.5 hours of onset can receive this treatment. "Meeting the conditions for intravenous thrombolysis" means that the risk of hemorrhagic transformation is low or essentially absent. Therefore, the pre-assessment of hemorrhagic transformation risk directly affects communication between doctor and patient and treatment decisions.

In the prior art, the prediction of hemorrhagic transformation is usually based on factors identified in relevant clinical studies, such as age, blood glucose and clinical neurological function at onset. There are also clinical scoring scales that help clinicians pre-assess the likelihood of hemorrhagic transformation, such as the post-thrombolysis hemorrhage score, the Multicenter Stroke Survey prediction score, the SITS score, the GRASPS score and the SEDAN score. However, these scoring scales are mostly derived from physicians' clinical experience or from simple logistic regression, and their practical value and accuracy remain to be verified. The article "Comparison of five prediction models in the application of bleeding prediction after thrombolysis in Chinese population" compares the performance of the different scoring models. The GRASPS score performs best, but the AUC of its receiver operating characteristic (ROC) curve is only 0.7056, so its performance still needs improvement. Moreover, the above clinical indicators are relatively independent and the scoring scales are relatively simple: they cover only a few clinical factors that may strongly affect prognosis, do not comprehensively evaluate the patient's clinically relevant factors, and lack individualized assessment.

Summary of the Invention

In view of the problems in the prior art, a technical solution for a hemorrhagic transformation prediction system is provided, which aims to reduce the probability of hemorrhagic transformation during intravenous thrombolytic therapy, thereby reducing clinical risk and the corresponding medical expenses.

The technical solution specifically includes:

A hemorrhagic transformation prediction system, including:

an acquisition unit, configured to acquire multiple pieces of training patient data, each piece of training patient data including multiple training disease features;

a model generation unit, connected to the acquisition unit and configured to generate, from the multiple pieces of acquired training patient data, a prediction model for predicting hemorrhagic transformation, the model generation unit further including:

a feature selection module, configured to select the training disease features in the training patient data;

a feature classification module, connected to the feature selection module and configured to classify the selected training disease features;

a model training module, connected to the feature classification module and configured to train the prediction model from the classified training disease features;

a collection unit, configured to collect actual patient data;

a prediction unit, connected to the collection unit and to the model generation unit, and configured to feed the actual patient data into the trained prediction model to output a corresponding prediction result.

Preferably, in the hemorrhagic transformation prediction system, the feature selection module further includes:

a first feature selection component, configured to select the training disease features using the CM feature selection method;

a second feature selection component, configured to select the training disease features using the wrapper model feature selection method;

a third feature selection component, configured to select the training disease features using the filter model feature selection method;

a selection control component, connected to the first, second and third feature selection components and configured to enable the first feature selection component, the second feature selection component or the third feature selection component according to the correspondence between the training disease features.

Preferably, in the hemorrhagic transformation prediction system, the feature classification module classifies the training disease features using a random forest model.

Preferably, in the hemorrhagic transformation prediction system, the feature classification module classifies the training disease features using a support vector machine.

Preferably, in the hemorrhagic transformation prediction system, the feature classification module classifies the training disease features using Logistic regression or a perceptron.

Preferably, in the hemorrhagic transformation prediction system, the feature classification module classifies the training disease features using the AdaBoost algorithm.

Preferably, the hemorrhagic transformation prediction system further includes:

a data processing unit, connected between the acquisition unit and the model generation unit and configured to perform preset processing on the training patient data so as to balance the training patient data;

the preset processing being: processing the training patient data by oversampling and/or a multivariate support vector machine algorithm.

a data processing unit, connected between the collection unit and the model generation unit and configured to perform preset processing on the training patient data so as to balance the training patient data;

the preset processing being: processing the training patient data by oversampling; and/or

processing the training patient data with a cost-sensitive loss function; and/or

processing the training patient data with a cost-sensitive learning rate.

Preferably, in the hemorrhagic transformation prediction system, the model generation unit further includes:

a risk rating module, connected to the model training module, the risk rating module discretizing risk levels based on the acquired training patient data to form a set of reference discrete points for risk rating, which serve as reference data when the model training module trains the prediction model.

The beneficial effect of the above technical solution is that it provides a hemorrhagic transformation prediction system that can reduce the probability of hemorrhagic transformation during intravenous thrombolytic therapy, thereby reducing clinical risk and the corresponding medical expenses.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the overall structure of a hemorrhagic transformation prediction system in a preferred embodiment of the present invention;

Fig. 2 is a schematic diagram of the specific structure of the feature selection module in a preferred embodiment of the present invention;

Fig. 3 is a schematic diagram of the working principle of the feature selection module in a preferred embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other.

The present invention is further described below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the present invention.

With the continuous development of science and technology, machine learning algorithms have also been applied to the prediction of hemorrhagic transformation, that is, machine learning algorithms are used to identify, from large amounts of unstructured data, patterns that are difficult for humans to recognize. Existing approaches to predicting thrombolysis hemorrhage risk merely feed a database of thrombolytic therapy and hemorrhagic transformation records into a fixed machine learning model. They do not adapt the algorithm to the characteristics of medical data, and they do not account for data imbalance, possible correlations between data features, or the influence of physicians' prior diagnoses on the data set. As a result, the predictive performance of these models is mediocre. For example, in the 2014 study by H. Asadi et al., although the model's accuracy reached 70%, its recall was close to zero and its predicted AUC was only about 0.6. The design of these prediction models is also fixed: their parameters are obtained by training on the data already in the database, and no self-updating function is designed for newly entered patient data. Over time, the prediction accuracy of such systems declines noticeably.

In view of the above problems in the prior art, a hemorrhagic transformation prediction system is provided, which is used to predict the hemorrhagic transformation that may occur during intravenous thrombolytic therapy.

Specifically, as shown in Fig. 1, the hemorrhagic transformation prediction system includes:

an acquisition unit 1, configured to acquire multiple pieces of training patient data, each piece of training patient data including multiple training disease features;

a model generation unit 2, connected to the acquisition unit 1 and configured to generate, from the multiple pieces of acquired training patient data, a prediction model for predicting hemorrhagic transformation, the model generation unit 2 further including:

a feature selection module 21, configured to select the training disease features in the training patient data;

a feature classification module 22, connected to the feature selection module 21 and configured to classify the selected training disease features;

a model training module 23, connected to the feature classification module 22 and configured to train the prediction model from the classified training disease features;

a collection unit 3, configured to collect actual patient data;

a prediction unit 4, connected to the collection unit 3 and to the model generation unit 2, and configured to feed the actual patient data into the trained prediction model to output a corresponding prediction result.

Specifically, in this embodiment, the acquisition unit 1 may be connected to an external database or remotely connected to a server, and obtains pre-prepared training patient data from the database or the remote server. The training patient data may also be entered directly into the acquisition unit 1 by the user.

In this embodiment, the training patient data includes multiple training disease features, all of which have a clear physical meaning, so the system does not need to perform feature extraction on the training patient data. Features with a clear physical meaning may include basic information from patients' historical medical records, such as age and sex, results of preoperative physical examinations, and information on whether hemorrhagic transformation occurred after intravenous thrombolytic therapy (no hemorrhagic transformation, minor hemorrhage or severe hemorrhage).

In this embodiment, after acquiring the data, the acquisition unit 1 needs to preprocess it: data entries in the training patient data that are missing or obviously erroneous are filtered out, and continuous data is standardized. This preprocessing may be performed manually by the user or automatically by the system according to preset screening rules. For example, a data-filling template may be preset for the training patient data and each record matched against it to determine whether data is missing. Each record may also be matched against the data format of each field in the template to determine whether it contains obviously erroneous data.
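The template-based screening and standardization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field names, value ranges and the `TEMPLATE` structure are hypothetical, and z-score scaling is assumed as the standardization method because the text does not name one.

```python
from statistics import mean, pstdev

# Hypothetical data-filling template: field -> (type, allowed range or value set).
TEMPLATE = {
    "age": (float, (0.0, 120.0)),
    "glucose": (float, (1.0, 40.0)),
    "sex": (str, {"M", "F"}),
}
CONTINUOUS = ["age", "glucose"]  # fields assumed to be continuous


def is_valid(record):
    """Return True if every template field is present and plausible."""
    for field, (ftype, allowed) in TEMPLATE.items():
        value = record.get(field)
        if value is None:
            return False  # missing entry
        if ftype is float:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False  # wrong data format
            low, high = allowed
            if not low <= value <= high:
                return False  # obviously erroneous value
        elif value not in allowed:
            return False
    return True


def preprocess(records):
    """Drop invalid records, then z-score the continuous fields."""
    kept = [dict(r) for r in records if is_valid(r)]
    if not kept:
        return kept
    for field in CONTINUOUS:
        values = [r[field] for r in kept]
        mu, sigma = mean(values), pstdev(values)
        for r in kept:
            r[field] = (r[field] - mu) / sigma if sigma > 0 else 0.0
    return kept


records = [
    {"age": 60, "glucose": 6.0, "sex": "M"},
    {"age": 80, "glucose": 8.0, "sex": "F"},
    {"age": 70, "glucose": None, "sex": "F"},  # missing value, dropped
    {"age": 430, "glucose": 7.0, "sex": "M"},  # implausible age, dropped
]
clean = preprocess(records)
```

Only the two complete, plausible records survive, and their continuous fields are rescaled to zero mean and unit variance over the retained records.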

In this embodiment, before the prediction model is trained in the model generation unit 2, features must first be selected. Feature selection is needed because, with different data set sizes, feature dimensions and feature attributes, different feature selection frameworks for the training disease features perform differently and each suits a different setting. Features are therefore selected before the model is trained, so that they are handled by the most suitable feature selection framework and yield the best test results. The specific feature selection methods are described in detail below.

In this embodiment, after feature selection, the feature classification module 22 classifies the selected training disease features. The feature classification module 22 may be implemented with a classifier, and the classified training disease features can then be used in model training.

In this embodiment, the corresponding prediction model is obtained by training on the features in a manner similar to the prior art, which is not described again here.

In this embodiment, the trained prediction model can be applied to the actual prediction of hemorrhagic transformation. Specifically, the patient's actual data is collected according to the input requirements of the prediction model and fed into the model, which outputs a prediction result indicating the likelihood that the patient will develop hemorrhagic transformation after intravenous thrombolytic therapy. The doctor can use this prediction result as reference information when communicating with the patient and drawing up the treatment plan, thereby reducing clinical risk and saving medical expenses.

In a preferred embodiment of the present invention, as shown in Fig. 2, the feature selection module 21 further includes:

a first feature selection component 211, configured to select the training disease features using the CM feature selection method;

a second feature selection component 212, configured to select the training disease features using the wrapper model feature selection method;

a third feature selection component 213, configured to select the training disease features using the filter model feature selection method;

a selection control component 214, connected to the first feature selection component 211, the second feature selection component 212 and the third feature selection component 213, and configured to enable the first feature selection component 211, the second feature selection component 212 or the third feature selection component 213 according to the correspondence between the training disease features.

In this embodiment, the first feature selection component 211, the second feature selection component 212 and the third feature selection component 213 represent three different feature selection frameworks of the system, each of which can be trained and tested automatically by a computer system.

Specifically, the first feature selection component 211 uses the CM feature selection method (Conservative Mean Feature Selection) to select the training disease features. The CM method selects features one at a time and uses sampling to improve the stability of feature selection. It exploits the fact that the AUC value is unchanged under a monotonic mapping: K-fold cross-validation is used to compute K AUC values for a given feature against the classification outcome, and the mean μ and standard deviation α of these K values are then computed. Finally, the features are compared by their values of (μ-α), and the best ones are chosen to form the feature subset.
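The (μ-α) ranking can be sketched in plain Python as below. This is one possible reading of the description, with several assumptions labeled in the comments: the raw feature value is used directly as the ranking score, the folds are simple interleaved splits, and a feature that ranks patients in the reverse direction is treated as equally informative. The data is synthetic.

```python
from statistics import mean, pstdev


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("each fold needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cm_scores(rows, labels, k=3):
    """Return {feature index: mean minus standard deviation of the K fold AUCs}."""
    n_features = len(rows[0])
    # Assumption: interleaved folds; the source does not say how folds are drawn.
    folds = [list(range(i, len(rows), k)) for i in range(k)]
    result = {}
    for j in range(n_features):
        fold_aucs = []
        for idx in folds:
            a = auc([rows[i][j] for i in idx], [labels[i] for i in idx])
            # Assumption: a feature that is informative in reverse also counts.
            fold_aucs.append(max(a, 1.0 - a))
        result[j] = mean(fold_aucs) - pstdev(fold_aucs)  # mu - alpha
    return result


def select_top(rows, labels, n_keep, k=3):
    """Keep the n_keep features with the largest (mu - alpha)."""
    scores = cm_scores(rows, labels, k)
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


# Synthetic data: feature 0 separates the classes, feature 1 is unrelated noise.
labels = [0, 1] * 6
rows = [[labels[i] + 0.1 * (i % 3), (i * 7) % 5] for i in range(12)]
scores = cm_scores(rows, labels)
best = select_top(rows, labels, n_keep=1)
```

Subtracting the standard deviation penalizes features whose AUC varies strongly between folds, which is the stabilizing effect the text attributes to the method.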

The second feature selection component 212 uses the wrapper model feature selection method (Wrapper), and the third feature selection component 213 uses the filter model feature selection method (Filter), to select the training disease features. For both methods, the evaluation function and the search algorithm must be chosen. This technical solution provides several search algorithms for both methods, including forward search, backward search, genetic algorithms and exhaustive search. For the wrapper method, the AUC value output by the learning algorithm under the CFS (Correlation-based Feature Selection) framework is used as the evaluation function. For the filter method, symmetrical uncertainty, RELIEF and minimum description length are used as evaluation functions.
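As an illustration of how an evaluation function and a search algorithm fit together, the sketch below pairs symmetrical uncertainty with a greedy forward search for the filter case. It assumes discrete features, and the subset merit used here (symmetrical uncertainty between the joint value of the selected features and the label) is a simplification chosen for brevity, not a formula taken from the source. For the wrapper case, `evaluate` would instead return the cross-validated AUC of a trained learner.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), between 0 and 1 for discrete data."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def forward_search(n_features, evaluate):
    """Greedily add the feature that most improves evaluate(subset)."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        score, j = max((evaluate(selected + [c]), c) for c in candidates)
        if score <= best:
            break  # no candidate improves the current subset
        selected, best = selected + [j], score
    return selected, best


def filter_merit(columns, labels):
    """Hypothetical filter evaluation: SU between the joint subset value and the label."""
    def evaluate(subset):
        joint = list(zip(*(columns[j] for j in subset)))
        return symmetrical_uncertainty(joint, labels)
    return evaluate


# Synthetic discrete data: column 0 copies the label, column 1 is constant,
# column 2 is independent of the label.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
columns = [list(labels), [1] * 8, [0, 1, 0, 1, 0, 1, 0, 1]]
selected, merit = forward_search(len(columns), filter_merit(columns, labels))
```

Backward search, genetic algorithms or exhaustive search could be substituted for `forward_search` without changing the evaluation function.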

In this embodiment, a selection control component 214 controls the operation of the three feature selection frameworks. Specifically, as shown in Fig. 3, the selection control component 214 first makes a judgment based on the correlations between the training disease features:

1) If the correlations are relatively simple, the selection control component 214 directly enables the first feature selection component 211, that is, the CM feature selection method is used to select the training disease features;

2) If the correlations are relatively complex, the selection control component 214 chooses between the other two, more traditional, feature selection frameworks. If a sufficient amount of data has been acquired, the selection control component 214 enables the second feature selection component 212, that is, the wrapper model feature selection method is used to select the training disease features;

3) If the amount of acquired data is small, the selection control component 214 enables the third feature selection component 213, that is, the filter model feature selection method is used to select the training disease features.
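The three-way decision above can be written as a small dispatch function. The text does not define how "simple" correlations or "sufficient" data are measured, so the boolean flag `relation_is_simple` and the threshold `min_samples_for_wrapper` below are hypothetical placeholders.

```python
from enum import Enum


class Selector(Enum):
    CM = "first feature selection component 211 (CM)"
    WRAPPER = "second feature selection component 212 (wrapper)"
    FILTER = "third feature selection component 213 (filter)"


def choose_selector(relation_is_simple, n_samples, min_samples_for_wrapper=1000):
    """Mirror the decision order of the selection control component 214.

    relation_is_simple and min_samples_for_wrapper are assumptions: the
    source gives no concrete criterion for either.
    """
    if relation_is_simple:
        return Selector.CM       # case 1: simple correlations
    if n_samples >= min_samples_for_wrapper:
        return Selector.WRAPPER  # case 2: complex correlations, enough data
    return Selector.FILTER       # case 3: complex correlations, little data


choice = choose_selector(relation_is_simple=False, n_samples=250)
```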

In a preferred embodiment of the present invention, the feature classification module 22 classifies the training disease features using a random forest model.

Specifically, for the random forest model, the RandomForestClassifier from the Scikit-learn library can be used directly in the system. Because each decision tree in a random forest effectively performs feature selection during training, no separate feature selection algorithm needs to be considered when the random forest model is used for feature classification. During cross-validation, a grid search is performed directly over the parameters to be tuned, and the parameter values that give the best AUC are adopted.
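A grid search of this kind might look like the following sketch with Scikit-learn. The data is a synthetic, imbalanced stand-in generated with `make_classification`, and the parameter grid and the three-fold stratified split are illustrative choices, not values from the source.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training patient data (not real patient data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Hypothetical grid; the source does not list which parameters are tuned.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # select the parameter values with the best mean AUC
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
best_params, best_auc = search.best_params_, search.best_score_
```

Stratified folds are used here so that every fold contains some positive samples, which matters when the positive class is rare.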

In a preferred embodiment of the present invention, the feature classification module 22 classifies the training disease features using a support vector machine.

Specifically, for a conventional support vector machine (SVM), the libsvm library under Python 3.6 is used. The AUC value used to measure performance reflects the relationship between each data point's distance to the optimal hyperplane and its final classification.
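To make the link between hyperplane distance and AUC concrete, the sketch below ranks points by their signed distance to a linear hyperplane and computes the AUC of that ranking. The weights `w`, the bias `b` and the points are made up for illustration and stand in for a model that would, in the described system, come from libsvm.

```python
from math import sqrt


def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


w, b = [3.0, 4.0], -5.0  # hypothetical hyperplane, as if already fitted
points = [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.5], [0.6, 0.8]]
labels = [1, 1, 0, 0, 1]

distances = [signed_distance(p, w, b) for p in points]
svm_auc = auc(distances, labels)
```

Because the AUC depends only on the ordering of the distances, it evaluates the model without committing to a particular decision threshold.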

For a multivariate support vector machine (Multivariate SVM), the AUC value can be used as the loss function between the predicted labels and the true labels. The kernels used are all linear, oversampling is not applied, and the svm-perf library written in C is used.

In practice, different types of support vector machine, the corresponding feature selection algorithms and data balancing methods (described in detail below) can be chosen according to the actual situation, and the best current SVM model is output.

In a preferred embodiment of the present invention, the feature classification module 22 classifies the training disease features using Logistic regression or a perceptron.

Specifically, when Logistic regression is used, different feature selection schemes and data balancing methods (described in detail below) need to be compared during training. In this technical solution, Logistic regression is implemented with the Theano framework under Python 3.6.

When a perceptron is used, feature selection and a suitable nonlinear design of a single-hidden-layer perceptron allow the perceptron model to fit the current data set well. Given the limited size of the current training patient data set, the number of hidden layers should not be too large, otherwise more error is introduced.

Some hyperparameters of the perceptron can be determined by cross-validation. Furthermore, suitable feature selection and data balancing methods (described in detail below) can be chosen according to the actual situation. The perceptron is likewise implemented with the Theano framework under Python 3.6.

In a preferred embodiment of the present invention, the feature classification module 22 classifies the training disease features using the AdaBoost algorithm.

Specifically, in the AdaBoost algorithm, each weak classifier is designed as a relatively simple perceptron model, and the number of weak classifiers can be determined by cross-validation. The AdaBoost algorithm is also implemented mainly with the Theano framework under Python 3.6.
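The boosting loop itself can be sketched in plain Python. Note the substitution: for brevity the weak classifiers here are one-dimensional threshold stumps rather than the simple perceptrons described above, and the number of rounds is fixed at 3 instead of being chosen by cross-validation. The reweighting and the weighted vote are the standard AdaBoost steps, and the data is a small toy set.

```python
from math import exp, log


def fit_stump(xs, ys, weights):
    """Best threshold/polarity stump on 1-D inputs under the given sample weights."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Labels must be +1 or -1. Returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)     # vote weight of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x >= threshold else -polarity for x in xs]
        # Increase the weight of misclassified samples, decrease the others.
        weights = [w * exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, n_rounds=3)
train_preds = [predict(ensemble, x) for x in xs]
```

Replacing `fit_stump` with a weighted perceptron trainer would bring the sketch closer to the design described in the text.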

本发明的较佳的实施例中，仍然如图1中所示，上述出血转化预测系统具体还包括：In a preferred embodiment of the present invention, still as shown in Figure 1, the above-mentioned hemorrhagic transformation prediction system specifically further includes:

A data processing unit 5, connected between the acquisition unit 1 and the model generation unit 2 and configured to perform preset processing on the training patient data in order to: screen out patient data that is obviously erroneous, fill in missing data, and balance the training patient data.

Specifically, regarding missing input features: in the existing multi-center data set, each data center emphasizes different features in its records, so a certain amount of data is missing. The present invention handles missing values with a missing-indicator scheme, which avoids the negative effect that mean or median imputation would have on the accuracy of the data set. Regarding balance of the target classes: in the existing patient data, the ratio between training samples that ultimately developed hemorrhagic transformation and those that did not is highly unbalanced. Samples with symptomatic hemorrhage may amount to only about 1/20 of the samples without hemorrhage, so the training data set as a whole is imbalanced. The system therefore needs to apply data balancing to the data set, so that training on an unbalanced data set does not make the output of the final prediction model inaccurate.
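The missing-indicator idea can be illustrated with a minimal sketch: each feature is paired with a 0/1 flag recording whether the value was missing, so the model sees the absence itself instead of an imputed mean or median. The function name, the use of `None` to mark a missing value, and the placeholder `fill_value` are assumptions made for this example, not details given in the patent.

```python
def add_missing_indicators(rows, fill_value=0.0):
    """Expand each feature into (value, missing_flag).

    rows: list of equal-length feature lists in which None marks a missing value.
    A missing value becomes (fill_value, 1.0); a present value v becomes (v, 0.0).
    """
    out = []
    for row in rows:
        new_row = []
        for value in row:
            if value is None:
                new_row.extend([fill_value, 1.0])
            else:
                new_row.extend([float(value), 0.0])
        out.append(new_row)
    return out
```

For instance, a record with the second of two features missing, `[1, None]`, is expanded to `[1.0, 0.0, 0.0, 1.0]`.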

Further, the preset processing (that is, the data balancing method used by the system) differs according to the feature classification method (that is, the classifier) adopted by the feature classification module 22, specifically:

1) When the feature classification module 22 classifies features with a support vector machine, the data processing unit 5 applies the following preset processing:

① Oversampling: the training patient data is balanced by oversampling. Specifically, minority-class samples (that is, training patient data with symptomatic hemorrhage; not repeated below) are randomly resampled so that the classes have similar numbers of samples.

② Classification is performed with a multivariate support vector machine algorithm, and the AUC value is used in place of the error rate to update the support vector machine.

2) When the feature classification module 22 classifies features with Logistic regression or a perceptron, the data processing unit 5 applies the following preset processing:

① The training patient data is balanced by oversampling. Specifically, minority-class samples are randomly resampled so that the classes have similar numbers of samples.

② A cost-sensitive loss function replaces the ordinary loss function used in the system, so as to increase the penalty for misclassifying minority-class samples.

③ A cost-sensitive learning rate replaces the ordinary learning rate used in the system, so that the learning rate is higher for minority-class samples and lower for majority-class samples (that is, training patient data without symptomatic hemorrhage; not repeated below). In this case, the model parameters are updated with a larger step size for minority-class samples than for majority-class samples.
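The balancing measures listed above (random oversampling, a cost-sensitive loss, and a class-dependent learning rate) might be sketched as follows for a Logistic regression model. This is plain Python rather than Theano, and every name here (`oversample_minority`, `weighted_log_loss`, `sgd_step`, `pos_weight`, `lr_minority`, `lr_majority`) is hypothetical. The sketch also assumes that label 1 denotes the minority class (symptomatic hemorrhage) and label 0 the majority class.

```python
import math
import random


def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # nothing to resample from
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_log_loss(w, b, X, y, pos_weight):
    """Cross-entropy in which errors on class 1 cost pos_weight times more."""
    total = 0.0
    for x, label in zip(X, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        p = min(max(p, 1e-12), 1 - 1e-12)
        total -= pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(y)


def sgd_step(w, b, x, label, lr_minority, lr_majority):
    """One stochastic gradient step whose step size depends on the sample's class."""
    lr = lr_minority if label == 1 else lr_majority
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    err = p - label  # gradient of the plain cross-entropy with respect to the logit
    new_w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return new_w, b - lr * err
```

With roughly one hemorrhage sample per twenty non-hemorrhage samples, a value of `pos_weight` near 20 would be one natural starting point, although the patent does not specify any particular weight.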

In a preferred embodiment of the present invention, still as shown in Figure 1, the model generation unit 2 further includes:

A risk rating module 24, connected to the model training module 23. The risk rating module 24 performs risk level discretization based on the acquired training patient data to form a set of reference discrete points for risk rating, which serve as reference data when the model training module trains the prediction model.

Specifically, the system provides two schemes for risk level discretization:

When the data in the system database is scarce and its quality is not yet ideal, the system uses an unsupervised equal-frequency discretization scheme. Applied to the training set, equal-frequency discretization yields a set of discrete points over the continuous data. These discrete points are the reference points used later when rating a patient's risk;

When the data in the system database is sufficient and its quality can be reasonably guaranteed, the system uses a supervised discretization scheme that minimizes information entropy. Applied to the training set, this scheme uses the sum of the information entropy of the different groups in the training set to make the difference between groups in the ratio of hemorrhage samples to total samples as large as possible, and likewise generates a set of discrete points that can serve as a reference for grading the risk of new patients.
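Both discretization schemes can be illustrated with a short sketch. It assumes that the continuous quantity being discretized is a per-patient risk score and that labels are 1 for hemorrhage and 0 otherwise; neither detail is fixed by the text. The function names are hypothetical, and the entropy-based function finds only a single split point, whereas a full scheme would presumably apply the idea repeatedly to obtain several levels.

```python
import bisect
import math


def equal_frequency_cut_points(scores, n_levels):
    """Cut points that split the training scores into n_levels groups of similar size."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(i * n) // n_levels] for i in range(1, n_levels)]


def risk_level(score, cut_points):
    """Map a new patient's score to a level from 0 to len(cut_points)."""
    return bisect.bisect_right(cut_points, score)


def entropy(labels):
    """Binary information entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_entropy_split(scores, labels):
    """Threshold that minimizes the size-weighted entropy of the two resulting groups."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    best_cut, best_h = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical scores
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if h < best_h:
            best_h, best_cut = h, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut
```

The cut points computed on the training set are then reused unchanged for new patients, for example `risk_level(new_score, cut_points)`.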

The effectiveness of the grading produced by the two risk level discretization schemes above can be verified with the Wilcoxon signed-rank test (Wilcoxon rank-sum test). The system selects, as the reference points for the prediction part, the discrete points generated by the scheme whose Z value deviates further from the origin in the test.
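As a rough illustration of the Z value mentioned above, the following sketch computes the normal-approximation Z statistic of the Wilcoxon rank-sum test for two unpaired groups. The text names both the signed-rank and the rank-sum test; using the rank-sum form here, and comparing the risk levels assigned to hemorrhage patients against those assigned to non-hemorrhage patients, are assumptions of this example. The function names are hypothetical, and the variance omits the correction for ties.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(group_a, group_b):
    """Z statistic of the rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])  # rank sum of group_a
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return (w - mean) / math.sqrt(var)
```

Under this reading, the scheme whose levels give the larger absolute value of `rank_sum_z(levels_hemorrhage, levels_no_hemorrhage)` would be the one selected.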

The above are merely preferred embodiments of the present invention and do not thereby limit its implementations or scope of protection. Those skilled in the art should recognize that any solution obtained through equivalent substitutions or obvious variations made using the description and drawings of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A hemorrhagic transformation prediction system, comprising:

An acquisition unit, configured to acquire a plurality of pieces of patient data for training, each piece of patient data for training including a plurality of disease features for training;

A model generation unit, connected to the acquisition unit and configured to generate a prediction model for predicting hemorrhagic transformation from the acquired pieces of patient data for training, the model generation unit further comprising:

A feature selection module, configured to select the disease features for training in the patient data for training;

A feature classification module, connected to the feature selection module, for feature classification of the selected disease features for training;

A model training module, connected to the feature classification module, for training and forming the prediction model according to the classified training disease features;

A collection unit, configured to collect actual patient data;

A prediction unit, connected to the collection unit and to the model generation unit respectively, configured to feed the actual patient data into the prediction model formed by training so as to output a corresponding prediction result.

2. The hemorrhagic transformation prediction system according to claim 1, wherein the feature selection module further comprises:

The first feature selection component is used to select the disease features for training by using the CM feature selection method;

The second feature selection component is used to select the disease features for training by adopting a wrapper model feature selection method;

The third feature selection component is used to select the disease features for training by adopting a filter model feature selection method;

A selection control component, connected to the first feature selection component, the second feature selection component and the third feature selection component, configured to select and enable the first feature selection component, the second feature selection component or the third feature selection component according to the correspondence between the disease features for training.

3. The hemorrhagic transformation prediction system according to claim 1, wherein the feature classification module classifies the disease features for training using a random forest model.

4. The hemorrhagic transformation prediction system according to claim 1, wherein the feature classification module classifies the disease features for training by means of a support vector machine.

5. The hemorrhagic transformation prediction system according to claim 1, wherein the feature classification module uses Logistic regression or a perceptron to classify the disease features for training.

6. The hemorrhagic transformation prediction system according to claim 1, wherein the feature classification module uses the AdaBoost algorithm to classify the disease features for training.

7. The hemorrhagic transformation prediction system as claimed in claim 4, further comprising:

A data processing unit, connected between the acquisition unit and the model generation unit, for performing preset processing on the patient data for training, so as to achieve data balance of the patient data for training;

The preset processing is: processing the patient data for training by oversampling and/or by a multivariate support vector machine algorithm.

8. The hemorrhagic transformation prediction system as claimed in claim 5, further comprising:

The preset processing is: processing the patient data for training by oversampling; and/or

processing the patient data for training with a cost-sensitive loss function; and/or

processing the patient data for training with a cost-sensitive learning rate.

9. The hemorrhagic transformation prediction system according to claim 1, wherein the model generation unit further comprises:

A risk rating module, connected to the model training module, the risk rating module performing risk level discretization according to the acquired patient data for training to form a set of reference discrete points for risk rating, which serve as reference data when the model training module trains the prediction model.