CN113345581B - Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning - Google Patents

Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning

Info

Publication number
CN113345581B
CN113345581B (application CN202110525660.1A)
Authority
CN
China
Prior art keywords
model
training
folds
features
prediction
Prior art date
Legal status
Active
Application number
CN202110525660.1A
Other languages
Chinese (zh)
Other versions
CN113345581A (en)
Inventor
梁浩然
倪文田
李峰
施超
孙阳
梁荣华
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110525660.1A priority Critical patent/CN113345581B/en
Publication of CN113345581A publication Critical patent/CN113345581A/en
Application granted granted Critical
Publication of CN113345581B publication Critical patent/CN113345581B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Theoretical Computer Science (AREA)
  • Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The ensemble-learning-based method for predicting the probability of post-thrombolysis hemorrhage in cerebral stroke comprises: preprocessing the raw data so that the data used for model training are sufficiently compact while still fully representing the original information; feeding the preprocessed data into four sub-learners, each of which is trained to produce a preliminary result, and aggregating these results; and finally feeding the aggregated results into a CatBoost model, which is trained to yield the final probability-prediction model. The invention analyzes, both theoretically and empirically, the many factors related to the probability of post-thrombolysis hemorrhage in stroke, extracts the key factors, and exploits the performance advantage of ensemble learning to train a prediction model of higher accuracy. Through in-depth study of the correlations among the features and a judicious use of ensemble learning, the model finally presented by the invention can give clinicians reliable theoretical guidance.

Description

Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning
Technical Field
The invention relates to a cerebral apoplexy post-thrombolysis bleeding probability prediction method based on ensemble learning.
Background
Cerebral stroke, i.e. the blockage of cerebral vessels by a thrombus, causes great harm to the patient's life and health once it occurs. The common clinical treatment is thrombolysis with thrombolytic drugs, but after such drugs are administered there is a certain probability of serious hemorrhagic complications, and the probability of severe disability or death in patients who bleed exceeds 90 percent. In practice, the post-thrombolysis bleeding probability that a doctor reports to the patient's family is mostly based on the doctor's personal experience; there is no individualized basis for judging a given patient, and each doctor applies his or her own criteria, so the estimate is rather arbitrary. An effective prediction method is therefore needed in the clinic.
In recent years, with the advent of products such as AlphaGo, machine learning has attracted more and more attention and has been applied in more and more fields, such as natural language processing and image style transfer, where it plays a considerable role. Ensemble learning is a major branch of machine learning; its main characteristic is that several learners are combined to complete the learning task: 1) the sub-learners are not restricted in type, so the combined ensemble can be homogeneous (all sub-learners of the same type) or heterogeneous (sub-learners of different types), which improves the performance and compatibility of ensemble learning; 2) the ensemble usually achieves generalization performance significantly better than that of a single learner, and the gain is more pronounced when the sub-learners are weak.
Disclosure of Invention
To overcome the shortcomings of the prior art and to give doctors a sufficiently reliable basis for judging the bleeding probability of a cerebral stroke patient after thrombolysis, the invention provides an ensemble-learning-based method for predicting the probability of post-thrombolysis hemorrhage in cerebral stroke; its ensemble learning model weighs factors from many aspects of the patient and outputs a final judgment.
In order to solve the technical problems, the invention provides the following technical scheme:
an ensemble learning-based post-thrombolytic hemorrhage probability prediction method for cerebral apoplexy, comprising the following steps:
(1) Input feature processing: after the raw data are obtained, irrelevant features and the like must be removed so as to improve the performance of the final model;
(2) Single-model training: the four sub-learners are each trained on the preprocessed data set to obtain their respective prediction results, which await further processing; the process is as follows:
first, the data set is divided into several folds, (N+1) in total; the first N folds contain the same number of samples, while the (N+1)-th fold has no such requirement, although its sample count should be as close to that of the first N folds as possible;
for each sub-learner, N rounds of training are required in total; in the T-th round, the training set is the first N folds with fold_T removed, i.e. the remaining (N-1) folds, and the trained model is used to predict fold_T and fold_{N+1}; suppose fold_T and fold_{N+1} contain M and m samples respectively; after one round of training two vectors are obtained, namely the model's prediction for fold_T, denoted P_T (length M), and its prediction for fold_{N+1}, denoted Q_T (length m); after the N rounds of training, all P_T are concatenated to give the sub-learner's prediction P for the first N folds, and all Q_T are averaged to give the sub-learner's final prediction Q for fold_{N+1}; since four sub-learners are used, two prediction matrices are obtained after the single-model training step, namely the prediction-result matrix for the first N folds, whose columns are the four vectors P, and the prediction-result matrix for fold_{N+1}, whose columns are the four vectors Q;
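As an illustration of step (2), the following minimal sketch implements the out-of-fold scheme just described for a single sub-learner, here XGBoost; the fold count, the hyperparameters and the assumption that X_train, y_train and X_holdout are NumPy arrays are illustrative choices rather than values fixed by the invention.

```python
# Minimal sketch of one sub-learner's N-round training in step (2).
# X_train/y_train hold the first N folds, X_holdout is fold_{N+1};
# the sub-learner (XGBoost) and its hyperparameters are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

def out_of_fold_predictions(X_train, y_train, X_holdout, n_folds=5, seed=0):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    p_train = np.zeros(len(X_train))                 # P: predictions for the first N folds
    q_rounds = []                                    # Q_T: one holdout prediction per round
    for train_idx, val_idx in kf.split(X_train):     # round T: fold_T is held out
        model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
        model.fit(X_train[train_idx], y_train[train_idx])
        p_train[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]   # predict fold_T
        q_rounds.append(model.predict_proba(X_holdout)[:, 1])            # predict fold_{N+1}
    q_holdout = np.mean(q_rounds, axis=0)            # average of the N fold_{N+1} predictions
    return p_train, q_holdout                        # P (length N*M) and Q (length m)
```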
(3) Multi-model fusion: the results output in the previous step are fused and the final probability prediction is output; the multi-model fusion operates as follows:
a new learner, denoted I, is constructed to learn from the outputs of the sub-learners; I is a CatBoost model whose training set is the prediction-result matrix for the first N folds and whose validation set is the prediction-result matrix for fold_{N+1}; the trained I is the final model and is used to predict new samples; when a new sample is to be predicted, it is first processed by steps (1) and (2), so that the input finally received by I is the preliminary prediction output by the single-model training of step (2), after which the probability of post-thrombolysis hemorrhage can be output.
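A matching sketch of step (3), assuming the routine above has been run once per sub-learner and that y_train and y_holdout are the labels of the first N folds and of fold_{N+1}; the CatBoost hyperparameters are again illustrative.

```python
# Sketch of the multi-model fusion: stack the four sub-learners' predictions
# column-wise and train the CatBoost meta-learner I on them.
import numpy as np
from catboost import CatBoostClassifier

def fuse(sub_learner_outputs, y_train, y_holdout):
    # sub_learner_outputs: list of (p_train, q_holdout) pairs, one per sub-learner
    P = np.column_stack([p for p, _ in sub_learner_outputs])   # matrix over the first N folds
    Q = np.column_stack([q for _, q in sub_learner_outputs])   # matrix over fold_{N+1}
    meta = CatBoostClassifier(iterations=300, depth=4, learning_rate=0.05, verbose=0)
    meta.fit(P, y_train, eval_set=(Q, y_holdout))               # fold_{N+1} acts as validation set
    return meta
```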
Further, in step (1), the input feature processing comprises the following steps:
(1-1) conventional features can be input into the model directly, without any processing;
(1-2) directly rejecting useless features so as not to pollute the trained model;
(1-3) merging the relevant features;
(1-4) continuous feature normalization, compressing numerical features into the same range;
(1-5) one-hot encoding of discrete features.
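A possible scikit-learn realization of steps (1-1) to (1-5) is sketched below; the column names (blood_type, glucose, diabetes, sbp, bmi, gender) and the rule used to merge the related features are hypothetical placeholders, since the patent does not fix them.

```python
# Illustrative preprocessing for steps (1-1)-(1-5): keep conventional features,
# drop useless ones, merge related ones, standardize continuous ones and
# one-hot encode discrete ones. Column names and the merge rule are placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def preprocess(df: pd.DataFrame):
    df = df.drop(columns=["blood_type"], errors="ignore")             # (1-2) useless feature
    if {"glucose", "diabetes"} <= set(df.columns):                    # (1-3) merge related features
        df["glycemic_status"] = df["glucose"] * (1 + df["diabetes"])  #       (illustrative rule)
        df = df.drop(columns=["glucose", "diabetes"])
    continuous = [c for c in ("sbp", "bmi", "glycemic_status") if c in df.columns]  # (1-4)
    discrete = [c for c in ("gender",) if c in df.columns]                          # (1-5)
    ct = ColumnTransformer(
        [("num", StandardScaler(), continuous),
         ("cat", OneHotEncoder(handle_unknown="ignore"), discrete)],
        remainder="passthrough")                                      # (1-1) conventional features untouched
    return ct.fit_transform(df)
```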
The ensemble learning model provided by the invention comprises four sub-learners: CatBoost, XGBoost, LightGBM and Factorization Machine. It should be noted that, apart from the Factorization Machine, the other three sub-learners are themselves ensembles of many weak learners, so from this point of view the invention can be regarded as an 'ensemble of ensembles'. To avoid ambiguity, 'sub-learner' refers to any of the four frameworks unless otherwise specified.
Ensemble learning can be divided into two main categories according to how the sub-learners are generated: serial and parallel. Serially generated sub-learners have strong dependencies on one another; parallel ones do not. The three ensemble learning frameworks used in the invention (all except the Factorization Machine) generate their sub-learners serially; specifically, they belong to the Boosting family of algorithms.
The earliest Boosting algorithm was AdaBoost, which appeared in 1997. During training, AdaBoost first gives every sample the same initial weight; after each round, the sample weights are adjusted according to the performance of that round's learner, with the weights of misclassified samples increased so that they receive more attention from subsequent learners; many learners are trained in this way and finally combined by weighting. Gradient Boosting, proposed in 2001, is broadly similar to AdaBoost; the biggest difference lies in how each round attends to the samples handled poorly in the previous round: AdaBoost increases the weights of the misclassified samples, whereas Gradient Boosting corrects the previous round's error by fitting its negative gradient.
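To make the "fit the negative gradient" idea concrete, the toy loop below performs gradient boosting for the squared-error loss, where the negative gradient is simply the residual; it illustrates the general principle only, not any particular framework used by the invention.

```python
# Toy Gradient Boosting: each round fits a shallow tree to the negative gradient
# of the squared-error loss, which for 1/2*(y-F)^2 is just the residual y - F.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

F = np.zeros(len(y))                         # current ensemble prediction
learning_rate, trees = 0.1, []
for _ in range(50):
    residual = y - F                         # negative gradient at the current prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    F += learning_rate * tree.predict(X)     # the new round corrects the previous error

print("training MSE:", round(float(np.mean((y - F) ** 2)), 4))
```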
XGBoost, short for Extreme Gradient Boosting, appeared in 2015 and is a new implementation of Gradient Boosting; the main differences lie in the loss function: 1) a regularization term is added to the original loss function, producing a new objective function; 2) the objective function is expanded to second order by a Taylor expansion and optimized in a manner similar to Newton's method. Equation (1) is the objective function of the m-th learner in XGBoost, where N is the total number of samples and Ω(f_m) is the newly added regularization term.
L^(m) = Σ_{i=1}^{N} l(y_i, ŷ_i^{(m-1)} + f_m(x_i)) + Ω(f_m)    (1)
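Written out explicitly, the second-order expansion mentioned above takes the form below; the definitions of g_i and h_i follow the standard XGBoost formulation and are stated here as an assumption, since the patent text does not spell them out.

```latex
% Second-order Taylor expansion of equation (1) around the previous prediction.
\mathcal{L}^{(m)} \simeq \sum_{i=1}^{N}\Bigl[\, l\bigl(y_i,\hat{y}_i^{(m-1)}\bigr)
    + g_i\, f_m(x_i) + \tfrac{1}{2}\, h_i\, f_m^{2}(x_i) \Bigr] + \Omega(f_m),
\qquad
g_i = \frac{\partial\, l\bigl(y_i,\hat{y}_i^{(m-1)}\bigr)}{\partial\, \hat{y}_i^{(m-1)}},
\quad
h_i = \frac{\partial^{2} l\bigl(y_i,\hat{y}_i^{(m-1)}\bigr)}{\partial\, \bigl(\hat{y}_i^{(m-1)}\bigr)^{2}}
```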
LightGBM, proposed by Microsoft in 2017 and short for Light Gradient Boosting Machine, reveals its greatest feature, being lightweight, in its name. Compared with XGBoost, LightGBM trains faster, achieves better accuracy and consumes less memory; in addition, it supports distributed computing and can rapidly process massive data. In principle, LightGBM, like XGBoost, fits the negative gradient; its optimizations lie mainly in multithreading, decision-tree construction, and so on.
CatBoost was open-sourced by the Russian search giant Yandex in April 2017 and, together with XGBoost and LightGBM, is one of the three most mainstream members of the Boosting family. CatBoost, short for Categorical Boosting, is based on symmetric trees; its developers claim better accuracy than XGBoost and LightGBM, and the main pain point it addresses is the efficient and reasonable handling of categorical features. Compared with the former two, CatBoost's biggest innovation is precisely its use of symmetric trees.
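As a minimal, self-contained illustration of the categorical-feature handling mentioned above (kept separate from the one-hot encoding chosen in step (1)), with toy columns and labels invented for the example:

```python
# CatBoost consumes a raw categorical column directly (via ordered target
# statistics) instead of requiring one-hot encoding. Toy data, illustrative only.
import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({"gender": ["M", "F", "F", "M", "F", "M"],
                  "sbp":    [150, 132, 168, 121, 145, 160]})
y = [1, 0, 1, 0, 0, 1]

model = CatBoostClassifier(iterations=50, depth=3, verbose=0)
model.fit(X, y, cat_features=["gender"])
print(model.predict_proba(X)[:, 1])
```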
Factorization Machine (FM). In a traditional linear model every feature is independent; if interactions between features need to be considered, the features may have to be crossed manually, which is clearly infeasible when the feature dimension is too high. A nonlinear SVM can apply a kernel mapping to the features, but it does not learn well when the features are highly sparse. Other factorization models, such as matrix factorization (MF), SVD++ and so on, do learn latent interactions between features, but each of them basically serves only a particular scenario. For these reasons, FM emerged for highly sparse data scenarios such as recommender systems. Equation (2) is the model equation of FM; its most critical part is ⟨V_i, V_j⟩ x_i x_j, where ⟨V_i, V_j⟩ is the dot product of the i-th and j-th rows of the matrix V and x_i and x_j are two features; through this operation, all features of a sample are connected with one another to different degrees, which makes it convenient to discover hidden relationships among the features.
ŷ(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} ⟨V_i, V_j⟩ x_i x_j    (2)
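The pairwise term of equation (2) can be evaluated in O(kn) time via the standard reformulation used in the sketch below, which scores a single sample with randomly initialized parameters; it illustrates the interaction term only and is not a trained FM.

```python
# Factorization Machine score for one sample. The pairwise term uses the identity
# sum_{i<j} <V_i,V_j> x_i x_j = 1/2 * sum_f [ (sum_i V_{i,f} x_i)^2 - sum_i V_{i,f}^2 x_i^2 ].
import numpy as np

def fm_score(x, w0, w, V):
    # x: features (n,), w0: bias, w: linear weights (n,), V: factor matrix (n, k)
    linear = w0 + w @ x
    xv = x @ V                                       # (k,): sum_i V_{i,f} x_i for each factor f
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 6, 3                                          # toy sizes, illustrative only
x = rng.normal(size=n)
print(fm_score(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```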
The beneficial effects of the invention are as follows: many factors related to the probability of post-thrombolysis hemorrhage in cerebral stroke are analyzed theoretically and empirically, the key factors are extracted, and a prediction model of higher accuracy is trained by exploiting the performance advantage of ensemble learning; through in-depth study of the correlations among the features and a judicious use of ensemble learning, the model finally presented by the invention can give clinicians reliable theoretical guidance.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a data preprocessing diagram of the present invention.
FIG. 3 is a single-model training diagram of the present invention.
Detailed description of the preferred embodiments
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 3, a method for predicting post-thrombolytic hemorrhage probability of cerebral apoplexy based on ensemble learning includes the following steps:
1) Input feature processing. As shown in fig. 1, in the original feature space the features can be divided into five categories according to the subsequent processing they require: a) conventional features, which need no processing and can be input into the model directly; b) useless features, such as blood type, which by experience obviously make no positive contribution to the target; the method of the invention rejects these directly so that they do not pollute the trained model; c) in many cases the raw data contain features with similar implicit meaning, such as blood glucose level and diabetes; if such features are kept as they are, the implicit information they share may occupy excessive weight in the final model and distort the model's judgment of the result, so in this case the method of the invention merges the features; d) continuous feature standardization: a large class of features common in raw data are continuous numerical features, but different numerical features often differ in magnitude, for example systolic blood pressure and BMI; this is not a problem in itself, but when such features are used to train a model, larger values tend to take up more weight; for this situation the invention therefore performs feature standardization and compresses the numerical features into the same range; e) one-hot encoding of discrete features: discrete features are categorical features such as gender; one way to handle a categorical feature is to assign each category a natural number to represent it; however, as in d), even though the differences between the numbers representing the categories are not particularly large, numbers of different sizes still influence the result to some extent, so the invention chooses to handle categorical features with one-hot encoding, which increases the memory burden but greatly improves the reasonableness of the model;
2) Single-model training. FIG. 2 illustrates the learning process of one sub-learner during single-model training. Training consists of N rounds; N can be set according to the specific situation, or several values of N can be tried and the best one chosen according to the final model performance. As shown in fig. 2, the data set obtained after the preceding feature preprocessing is divided into (N+1) folds, of which the first N folds serve as the training set and the (N+1)-th fold serves as the validation set used to verify the performance of the model. Each fold in the training set contains M samples and the (N+1)-th fold contains m samples. In fig. 2, the right-hand side shows the overall single-model training process: the four sub-learners are trained separately and the results of all sub-learners are then combined to generate the final prediction matrices. The left-hand side shows the detailed training process of a single sub-learner: there are N rounds in total; in the T-th round, fold_T (the T-th fold) is the validation fold and the remaining (N-1) folds form the training folds; the preliminary model that the sub-learner obtains on these further-divided training folds is used to predict fold_T and fold_{N+1}, producing two prediction vectors, for example P_1 and Q_1 after the first round. Operating in this way for N rounds, the sub-learner's predictions for the first N folds are merged to give its prediction vector for fold_1 through fold_N, while its prediction vector for fold_{N+1} is obtained by averaging the N vectors Q_T. After all sub-learners have completed their N rounds of training, the results are combined into the corresponding prediction matrices;
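Putting the embodiment together, the sketch below runs the same N-round scheme for each sub-learner and assembles the two prediction matrices; because there is no single canonical Python Factorization Machine library, the FM entry is left as a placeholder for any implementation exposing fit and predict_proba, and all hyperparameters are illustrative.

```python
# Sketch: build the first-N-folds prediction matrix and the fold_{N+1} prediction
# matrix from all sub-learners. X_train, y_train, X_holdout are assumed to be
# NumPy arrays; the FM sub-learner is a placeholder (see comment below).
import numpy as np
from sklearn.model_selection import KFold
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

SUB_LEARNER_FACTORIES = [
    lambda: CatBoostClassifier(iterations=300, depth=4, verbose=0),
    lambda: XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1),
    lambda: LGBMClassifier(n_estimators=300, max_depth=4, learning_rate=0.1),
    # lambda: SomeFMClassifier(...),   # placeholder: any FM with fit/predict_proba
]

def build_prediction_matrices(X_train, y_train, X_holdout, n_folds=5, seed=0):
    folds = list(KFold(n_splits=n_folds, shuffle=True, random_state=seed).split(X_train))
    P_cols, Q_cols = [], []
    for make_model in SUB_LEARNER_FACTORIES:
        p = np.zeros(len(X_train))
        q_rounds = []
        for train_idx, val_idx in folds:              # same folds for every sub-learner
            model = make_model()
            model.fit(X_train[train_idx], y_train[train_idx])
            p[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
            q_rounds.append(model.predict_proba(X_holdout)[:, 1])
        P_cols.append(p)
        Q_cols.append(np.mean(q_rounds, axis=0))
    return np.column_stack(P_cols), np.column_stack(Q_cols)
```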
3) Multi-model fusion. The results output in the previous step are fused, after which the final probability prediction can be output; the multi-model fusion operates as follows:
a new learner, denoted I, is constructed to learn from the outputs of the sub-learners; I is a CatBoost model whose training set is the prediction-result matrix for the first N folds and whose validation set is the prediction-result matrix for fold_{N+1}; the trained I is the final model and is used to predict new samples; when a new sample is to be predicted, it is processed by steps (1) and (2), so that the input finally received by I is the preliminary prediction output by the single-model training of step (2), after which the probability of post-thrombolysis hemorrhage can be output.
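Finally, a sketch of how a new patient sample could be scored with the trained pipeline; refitting each sub-learner on all of the first N folds for deployment is an assumption about the inference step, which the text leaves implicit.

```python
# Score one new, already preprocessed sample: each fitted sub-learner produces a
# preliminary probability (the step (2) output), the four values are stacked into
# one row, and the trained CatBoost meta-learner I turns them into the final
# post-thrombolysis bleeding probability.
import numpy as np

def predict_new_sample(x_new, fitted_sub_learners, meta_model):
    # x_new: preprocessed feature row of shape (1, d)
    cols = [m.predict_proba(x_new)[:, 1] for m in fitted_sub_learners]
    stacked = np.column_stack(cols)                  # shape (1, 4)
    return float(meta_model.predict_proba(stacked)[:, 1])
```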
According to this embodiment, many factors related to the probability of post-thrombolysis hemorrhage in cerebral stroke are analyzed theoretically and empirically, the key factors are extracted, and a prediction model of higher accuracy is trained by exploiting the performance advantage of ensemble learning; through in-depth study of the correlations among the features and a judicious use of ensemble learning, the model finally presented by the invention can give clinicians reliable theoretical guidance.

Claims (2)

1. The cerebral apoplexy post-thrombolysis bleeding probability prediction method based on ensemble learning is characterized by comprising the following steps of:
(1) Input feature processing: after the raw data are obtained, irrelevant features must be removed so as to improve the performance of the final model;
in the original feature space, the features are divided into five main categories: a) conventional features, which need no processing and are input into the model directly; b) useless features, including blood type, which are removed directly; c) related features, including blood glucose level and the presence of diabetes, which are merged; d) continuous features, including systolic blood pressure and BMI, which are standardized so that the numerical features are compressed into the same range; e) discrete features, which are one-hot encoded;
(2) Single-model training: the four sub-learners are each trained on the preprocessed data set to obtain their respective prediction results, which await further processing; the process is as follows:
first, the data set is divided into several folds, (N+1) in total; the first N folds contain the same number of samples, while the (N+1)-th fold has no such requirement, although its sample count should be as close to that of the first N folds as possible;
for each sub-learner, N rounds of training are required in total; in the T-th round, the training set is the first N folds with fold_T removed, i.e. the remaining (N-1) folds, and the trained model is used to predict fold_T and fold_{N+1}; suppose fold_T and fold_{N+1} contain M and m samples respectively; after one round of training two vectors are obtained, namely the model's prediction for fold_T and its prediction for fold_{N+1}; after the N rounds of training, all the fold_T predictions are concatenated to give the sub-learner's prediction for the first N folds, and all the fold_{N+1} predictions are averaged to give the sub-learner's final prediction for fold_{N+1}; since four sub-learners are used, two prediction matrices are obtained after the single-model training step, namely a prediction-result matrix for the first N folds and a prediction-result matrix for fold_{N+1};
(3) Multi-model fusion: the results output in the previous step are fused and the final probability prediction is output; the multi-model fusion operates as follows:
constructing a new learner I to learn from the outputs of the sub-learners, wherein I is a CatBoost model whose training set is the prediction-result matrix for the first N folds obtained in step (2) and whose validation set is the prediction-result matrix for fold_{N+1}; the trained I is the final model and is used to predict new samples; when a new sample is predicted, it is processed by step (1) and step (2), so that the input finally received by I is the preliminary prediction output by the single-model training of step (2), after which the probability prediction of post-thrombolysis hemorrhage can be output.
2. The ensemble-learning-based method for predicting post-thrombolysis hemorrhage probability according to claim 1, wherein in said step (1), the input feature processing comprises the following steps:
(1-1) conventional features can be input into the model directly, without any processing;
(1-2) directly rejecting useless features so as not to pollute the trained model;
(1-3) merging the relevant features;
(1-4) continuous feature normalization, compressing numerical features into the same range;
(1-5) one-hot encoding of discrete features.
CN202110525660.1A 2021-05-14 2021-05-14 Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning Active CN113345581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110525660.1A CN113345581B (en) 2021-05-14 2021-05-14 Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110525660.1A CN113345581B (en) 2021-05-14 2021-05-14 Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning

Publications (2)

Publication Number Publication Date
CN113345581A CN113345581A (en) 2021-09-03
CN113345581B true CN113345581B (en) 2023-06-27

Family

ID=77469741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110525660.1A Active CN113345581B (en) 2021-05-14 2021-05-14 Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN113345581B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113951845B (en) * 2021-12-01 2022-08-05 中国人民解放军总医院第一医学中心 Method and system for predicting severe blood loss and injury condition of wound
CN115064255A (en) * 2022-06-27 2022-09-16 上海梅斯医药科技有限公司 Medical expense prediction method, system, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033860A (en) * 2019-02-27 2019-07-19 杭州贝安云科技有限公司 A kind of Inherited Metabolic Disorders recall rate method for improving based on machine learning
CN110472778A (en) * 2019-07-29 2019-11-19 上海电力大学 A kind of short-term load forecasting method based on Blending integrated study
CN111199343A (en) * 2019-12-24 2020-05-26 上海大学 Multi-model fusion tobacco market supervision abnormal data mining method
CN111968741A (en) * 2020-07-15 2020-11-20 华南理工大学 Diabetes complication high-risk early warning system based on deep learning and integrated learning
CN112700325A (en) * 2021-01-08 2021-04-23 北京工业大学 Method for predicting online credit return customers based on Stacking ensemble learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2492691A1 (en) * 2011-02-22 2012-08-29 Institut de Recerca Hospital Universitari Vall d'Hebron Method of predicting the evolution of a patient suffering of a neurovascular disease
US20210125207A1 (en) * 2019-10-29 2021-04-29 Somnath Banerjee Multi-layered market forecast framework for hotel revenue management by continuously learning market dynamics

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033860A (en) * 2019-02-27 2019-07-19 杭州贝安云科技有限公司 A kind of Inherited Metabolic Disorders recall rate method for improving based on machine learning
CN110472778A (en) * 2019-07-29 2019-11-19 上海电力大学 A kind of short-term load forecasting method based on Blending integrated study
CN111199343A (en) * 2019-12-24 2020-05-26 上海大学 Multi-model fusion tobacco market supervision abnormal data mining method
CN111968741A (en) * 2020-07-15 2020-11-20 华南理工大学 Diabetes complication high-risk early warning system based on deep learning and integrated learning
CN112700325A (en) * 2021-01-08 2021-04-23 北京工业大学 Method for predicting online credit return customers based on Stacking ensemble learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Predict Onset Age of Hypertension Using Catboost and Medical Big Data; Huanhuan Zhao et al.; 2020 International Conference on Networking and Network Applications; full text *
Research on a coronary heart disease prediction model based on multi-source feature analysis; 暨思铭; China Master's Theses Full-text Database, Information Science and Technology; full text *
A blood uric acid prediction model based on multi-dimensional features and model fusion; 安永利; China Master's Theses Full-text Database, Information Science and Technology; full text *
Research on a machine-learning-based prediction model for intracerebral-hemorrhage-associated pneumonia; 王孟 et al.; Chinese Journal of Stroke; full text *

Also Published As

Publication number Publication date
CN113345581A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN108733742B (en) Global normalized reader system and method
Chowdhury et al. An artificial neural network model for neonatal disease diagnosis
CN113345581B (en) Cerebral apoplexy post thrombolysis bleeding probability prediction method based on ensemble learning
US11328125B2 (en) Method and server for text classification using multi-task learning
CN109801705A (en) Treat recommended method, system, device and storage medium
CN111145912B (en) Machine learning-based prediction device for personalized ovulation promotion scheme
US20230154627A1 (en) Semi-Supervised Machine Learning Method and System Suitable for Identification of Patient Subgroups in Electronic Healthcare Records
CN116779091B (en) Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report
CN114023412A (en) ICD code prediction method and system based on joint learning and denoising mechanism
CN114036934A (en) Chinese medical entity relation joint extraction method and system
Jamshidi et al. Symptom prediction and mortality risk calculation for COVID-19 using machine learning
Juraev et al. Multilayer dynamic ensemble model for intensive care unit mortality prediction of neonate patients
CN117153393A (en) Cardiovascular disease risk prediction method based on multi-mode fusion
Zheng et al. Chinese medical named entity recognition using crf-mt-adapt and ner-mrc
CN112052874A (en) Physiological data classification method and system based on generation countermeasure network
Togunwa et al. Deep hybrid model for maternal health risk classification in pregnancy: synergy of ANN and random forest
Sampath et al. Ensemble Nonlinear Machine Learning Model for Chronic Kidney Diseases Prediction
CN116072265B (en) Sleep stage analysis system and method based on convolution of time self-attention and dynamic diagram
Wang et al. DeepTriager: a neural attention model for emergency triage with electronic health records
CN116469534A (en) Hospital number calling management system and method thereof
EP4141744A1 (en) Semi-supervised machine learning method and system suitable for identification of patient subgroups in electronic healthcare records
CN113488165B (en) Text matching method, device, equipment and storage medium based on knowledge graph
Chraibi et al. A Deep Learning Framework for Automated ICD-10 Coding.
Liu et al. Using Artificial Neural Network Condensation to Facilitate Adaptation of Machine Learning in Medical Settings by Reducing Computational Burden: Model Design and Evaluation Study
Le et al. KNN loss and deep KNN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant