CN115206527A

CN115206527A - Cerebral infarction surgery patient survival risk classification method based on machine learning

Info

Publication number: CN115206527A
Application number: CN202210760511.8A
Authority: CN
Inventors: 卢莉; 黄文弘; 王琳娜
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-10-18

Abstract

The invention provides a machine learning-based survival risk classification method for cerebral infarction surgery patients, implemented on a survival prediction system for perioperative cerebral infarction patients. The system comprises an acquisition module, a prediction module and an output module: the acquisition module is used for inputting cerebral infarction patient data; the prediction module is used for inputting the patient data into a prediction model to predict the survival period; and the output module is used for outputting the prediction result. The prediction model comprises first-layer base models and a second-layer logistic regression model; the base models are a first base model, a second base model and a third base model, wherein the first base model is a comprehensive random forest model, the second base model is an XGBoost model, and the third base model is an MLP model. The invention addresses the problems in the prior art that the golden survival time of patients is delayed due to the uneven capability of medical staff, or that the survival time of cerebral infarction patients is shortened by other serious side effects of excessive care.

Description

Cerebral infarction surgery patient survival risk classification method based on machine learning

Technical Field

The invention belongs to the technical field of medical equipment intelligence, and particularly relates to a method for classifying survival risks of patients with cerebral infarction surgery based on machine learning.

Background

Cerebral infarction, also called ischemic stroke, refers to the softening and necrosis of local brain tissue caused by impaired blood circulation, ischemia and hypoxia. The survival of patients with cerebral infarction has long been one of the most serious concerns for physicians. During the perioperative period, in addition to the treatment decisions required for the patient, the attending physician's clinical ability, ability to adapt to changing circumstances, and choice of medication and treatment are closely related to the survival rate of patients with cerebral infarction.

In the prior art, the attending physician usually decides on a specific treatment by examining the physical state of the cerebral infarction patient and drawing on personal medical experience. However, in many remote areas or places where medical care is less developed, the golden survival time of the patient is delayed because the ability of attending physicians is uneven, or the survival time of the patient is shortened because excessive care causes other serious side effects. Therefore, targeted treatment decisions for such patients that rely only on medical personnel and conventional medical techniques still carry considerable risk and limitations.

Disclosure of Invention

The invention aims to solve at least the technical problems in the prior art. It provides a machine learning-based method for classifying the survival risk of cerebral infarction surgery patients, and addresses the problems in the prior art that the golden survival time of patients is delayed due to the uneven capabilities of medical staff, or that the survival time of patients is shortened by other serious side effects of excessive care.

In order to achieve the above object, according to a first aspect of the present invention, the present invention provides a method for classifying survival risks of a cerebral infarction surgery patient based on machine learning. The method is implemented based on a survival prediction system for perioperative cerebral infarction patients, which comprises an acquisition module, a prediction module and an output module; the acquisition module is used for inputting cerebral infarction patient data; the prediction module is used for inputting the cerebral infarction patient data into the prediction model for survival prediction; and the output module is used for outputting prediction results. The prediction model comprises first-layer base models and a second-layer logistic regression model; the base models are a first base model, a second base model and a third base model, wherein the first base model is a comprehensive random forest model, the second base model is an XGBoost model, and the third base model is an MLP model.

In another preferred embodiment of the present invention, the system further comprises a data processing module for performing data transformation processing on the cerebral infarction patient data.

In another preferred embodiment of the present invention, the data processing module comprises a rejection module, a cleaning module and a transformation module;

the rejection module is used for removing features whose actual missing rate in the cerebral infarction patient data is greater than the standard missing rate;

the cleaning module is used for processing missing values in the cerebral infarction patient data;

the transformation module is used for performing feature encoding and data normalization on the cerebral infarction patient data.

In another preferred embodiment of the present invention, the missing value processing fills in the missing values using the MissForest filling method.

In another preferred embodiment of the present invention, the feature encoding process is specifically to encode the cerebral infarction patient data by using One-hot encoding rule.

In another preferred embodiment of the present invention, the data normalization process specifically uses a standard deviation normalization method combined with a maximum value normalization method to process the data of the cerebral infarction patient.

In another preferred embodiment of the present invention, the system further comprises an optimization module for optimizing the hyper-parameters of the prediction model, wherein the optimization module comprises a screening module, a feature selection module and an assignment module;

the screening module is used for screening the cerebral infarction patient data from the patient database;

the feature selection module is used for ranking the feature importance scores of the processed cerebral infarction patient data and screening out the features whose importance scores are greater than the standard score to form a test set;

and the assignment module is used for putting the test set into the base model for hyper-parameter optimization to obtain the optimized base model hyper-parameters, and assigning the optimized hyper-parameters back to the base model.

In another preferred embodiment of the present invention, the feature selection module is specifically configured to rank the feature importance scores of the processed data of the cerebral infarction patients, screen out features with importance scores greater than a standard score, form a data set, and divide the data set into a test set and a training set;

the optimization module further comprises a training module, and the training module is used for putting the training set into the base model that has been assigned the optimized hyper-parameters and training it again.

In another preferred embodiment of the present invention, the hyper-parameter optimization process employs a method combining genetic algorithm and cross-validation.

The beneficial technical effects of the technical scheme of the invention include the following. The invention introduces an ensemble learning method. Random forest is a Bagging algorithm, XGBoost is a Boosting algorithm, and MLP is a deep network method; the three differ greatly in how they learn, and each has its own advantages and disadvantages. The invention integrates random forest, XGBoost and MLP and constructs the RF-XBM model using the Stacking integration strategy, which gives full play to the advantages of the different learners and prevents overfitting. The prediction results of the first-layer learners are used as input to the second-layer logistic regression model, which outputs the final prediction result, thereby improving the accuracy of the prediction.

In the invention, a large amount of cerebral infarction patient data was repeatedly combined with various data processing modes and hyper-parameter optimization modes, and the processed data was put into the prediction model for prediction. Data processing modes with a poor prediction effect were discarded and the mode with the highest prediction accuracy was selected, yielding the data processing module of the invention.

The prediction model of the invention can predict the survival period of the patient more quickly, and the attending physician can use its prediction result to make more targeted decisions about the patient's treatment. Patients with a higher probability of death are given more medical resources to save their lives, while for patients with a lower probability of death excessive treatment is avoided, which prevents side effects on the patient's body and avoids wasting medical resources. Because the prediction model is trained on a large amount of data, it can judge the survival time of cerebral infarction patients more accurately than common diagnosis and treatment methods, and it reduces the risk and limitations of medical staff's targeted treatment decisions for cerebral infarction patients.

Drawings

FIG. 1 is a schematic diagram of the architecture of the survival prediction system for perioperative patients with cerebral infarction of the present invention;

FIG. 2 is a schematic diagram of the implementation steps of the survival prediction system for perioperative patients with cerebral infarction according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.

In the description of the present invention, unless otherwise specified and limited, it should be noted that the terms "mounted," "coupled," and "connected" are to be interpreted broadly, and may be, for example, a mechanical connection or an electrical connection, a communication between two elements, a direct connection, or an indirect connection through an intermediate medium, and those skilled in the art will understand the specific meaning of the terms as they are used in the specific case.

The invention provides a survival prediction system for perioperative patients with cerebral infarction, which comprises an acquisition module, a data processing module, a prediction module and an output module, wherein the data processing module is connected with the acquisition module;

the acquisition module is used for inputting cerebral infarction patient data, which comprises patient age, sex, duration of disease onset, basic physical data and disease state; the data processing module is used for performing data transformation processing on the cerebral infarction patient data so that the data is suitable for machine learning; the prediction module is used for inputting the cerebral infarction patient data into the prediction model to predict the survival time; and the output module is used for outputting the prediction result, which is information on the patient's survival time.

As shown in FIG. 2, the prediction model comprises first-layer base models and a second-layer logistic regression model; the base models are a first base model, a second base model and a third base model, wherein the first base model is a comprehensive random forest model, the second base model is an XGBoost model, and the third base model is an MLP model; XGBoost is a Boosting algorithm based on the Gradient Boosting Decision Tree (GBDT); MLP is a basic algorithm of deep learning networks.

In the experiments of the invention, eight machine learning models were studied: K-nearest neighbors, naive Bayes, decision tree, AdaBoost, random forest, GBDT, XGBoost and MLP. Random forest is a Bagging algorithm, XGBoost is a Boosting algorithm, and MLP is a deep network method; the three differ greatly in how they learn, and each has its own advantages and disadvantages. Both Bagging and Boosting train weak learners and then fuse them by averaging, voting or other methods to obtain a strong learner.

On this basis, the invention introduces an ensemble learning method with a Stacking integration strategy. Stacking differs from Bagging and Boosting in that it adds a further layer of learners: the cerebral infarction patient data is sent to each first-layer learner for training, the outputs of the first-layer learners are sent as input to the second-layer learner for further training, and the result of the second layer is taken as the output of the model. Random forest, XGBoost and MLP are integrated, and the RF-XBM model is constructed with the Stacking integration strategy, which gives full play to the advantages of the different learners and prevents overfitting.
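The two-layer data flow of Stacking described above can be illustrated with a minimal, standard-library-only sketch. It is not the RF-XBM model itself: the three first-layer learners here are simple nearest-centroid classifiers used as stand-ins for random forest, XGBoost and MLP, the data is synthetic, and the names `make_centroid_learner`, `fit_logistic` and `fit_stacking` are hypothetical. Only the structure (out-of-fold first-layer predictions fed to a second-layer logistic regression) follows the description.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp never overflows.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def make_centroid_learner(cols):
    """Stand-in base learner (NOT RF/XGBoost/MLP): nearest class centroid on `cols`."""
    def fit(X, y):
        cents = {}
        for label in (0, 1):
            rows = [[x[c] for c in cols] for x, t in zip(X, y) if t == label]
            cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
        def predict_proba(x):
            d0 = sum((x[c] - m) ** 2 for c, m in zip(cols, cents[0]))
            d1 = sum((x[c] - m) ** 2 for c, m in zip(cols, cents[1]))
            return sigmoid(d0 - d1)  # nearer to the class-1 centroid gives > 0.5
        return predict_proba
    return fit

def fit_logistic(Z, y, lr=0.5, epochs=300):
    """Second-layer learner: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            gw = [g + err * zi for g, zi in zip(gw, z)]
            gb += err
        w = [wi - lr * g / len(Z) for wi, g in zip(w, gw)]
        b -= lr * gb / len(Z)
    return lambda z: sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)

def fit_stacking(X, y, base_fits, k=3):
    n = len(X)
    # First layer: out-of-fold predictions, so the second layer never sees
    # a prediction made by a model that was trained on that same sample.
    oof = [[0.0] * len(base_fits) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        for j, fit in enumerate(base_fits):
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held:
                oof[i][j] = model(X[i])
    meta = fit_logistic(oof, y)                 # second layer
    finals = [fit(X, y) for fit in base_fits]   # refit base learners on all data
    return lambda x: 1 if meta([m(x) for m in finals]) >= 0.5 else 0

# Synthetic two-class data (an assumption; no patient data is used here).
rng = random.Random(0)
y = [i % 2 for i in range(120)]
X = [[rng.gauss(2.0 * label, 1.0) for _ in range(3)] for label in y]
base_fits = [make_centroid_learner(c) for c in [(0,), (1, 2), (0, 1, 2)]]
predict = fit_stacking(X, y, base_fits)
accuracy = sum(predict(x) == t for x, t in zip(X, y)) / len(y)
print(f"training accuracy of the stacked sketch: {accuracy:.3f}")
```

In practice the stand-in learners would be replaced by real random forest, XGBoost and MLP implementations; the out-of-fold step is one common way to build the second-layer inputs and is an assumption here, since the description does not state how the first-layer outputs are produced.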

In the experiments, the eight machine learning models and the RF-XBM model were trained. Each experiment used the cerebral infarction patient data set for cross validation; the precision, recall and F1 value of each validation fold were recorded, and their respective averages were taken as the result of one experiment. Each model was run for five experiments, and the average over the five experiments was taken as the experimental result for that model.
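The averaging scheme above (per-fold metrics averaged within one experiment, then averaged over the experiments) can be sketched as follows. This is an illustrative sketch only: it assumes a binary label with class 1 as the positive class, whereas the description does not state the number of classes, and the function names `precision_recall_f1` and `average_metrics` are hypothetical.

```python
from statistics import mean

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one validation fold (binary case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_metrics(experiments):
    """experiments: list of experiments, each a list of (y_true, y_pred) folds."""
    per_experiment = []
    for folds in experiments:
        fold_scores = [precision_recall_f1(t, p) for t, p in folds]
        # Average each metric over the folds of this experiment.
        per_experiment.append(tuple(mean(col) for col in zip(*fold_scores)))
    # Average each metric over all experiments.
    return tuple(mean(col) for col in zip(*per_experiment))

# Toy predictions (made up): two folds per experiment, two identical experiments.
fold_a = ([1, 1, 0, 0], [1, 0, 1, 0])   # precision = recall = F1 = 0.5
fold_b = ([1, 1, 0, 0], [1, 1, 0, 0])   # precision = recall = F1 = 1.0
result = average_metrics([[fold_a, fold_b], [fold_a, fold_b]])
print(result)
```

With five experiments, the `experiments` list would simply hold five entries, one per cross-validation run.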

The experimental results are shown in Table I. The RF-XBM model proposed herein performs best, with a precision (and likewise recall and F1 value) of 0.8320, a significant performance improvement. This demonstrates that the proposed integrated model works well on this experimental problem based on cerebral infarction patient data.

Table I. Experimental results

In a preferred embodiment, the data processing module comprises a rejection module, a cleaning module and a transformation module;

the rejection module is used for removing features whose actual missing rate in the cerebral infarction patient data is greater than the standard missing rate; preferably, the standard missing rate is 30%;

the cleaning module is used for processing missing values in the cerebral infarction patient data. To improve the accuracy of survival prediction for cerebral infarction patients, experiments were carried out on the processing of the patient data; in these experiments, the cleaning of the data was divided into two stages, missing value processing and abnormal value processing;

missing value processing: the cerebral infarction patient data sets filled by different filling methods were each used for model testing, with the prediction result taken as the measure of model effect; the four filling methods and their effects under different models are summarized in Table II.

Table II. Four filling methods and their effects under different models

As Table II shows, the MissForest filling method achieves the best precision and is therefore selected as the missing value processing method;

abnormal value processing: abnormal value detection in the invention uses the box plot method, which summarizes a data set with only five points: the median, the upper and lower quartiles (Q3 and Q1), and the highest and lowest points of the distribution. The interquartile range is defined as IQR = Q3 - Q1, and data greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR is considered an outlier.

The experimental group treats the abnormal values detected by the box plot as missing values and fills them with the MissForest filling method, while the control group leaves abnormal values unprocessed. Models are trained on the processed data, and the model training results of the experimental group and the control group are shown in Table III.

Table III. Training results of the experimental group and control group models

As can be seen from the comparative experiment in Table III, the accuracy of the control group is higher than that of the experimental group, so the invention adopts the control group's approach to abnormal values, i.e. abnormal values are not processed.
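The box plot rule used for abnormal value detection can be sketched in a few lines. This is a minimal illustration, assuming the "inclusive" quartile definition of Python's `statistics.quantiles` (the description does not say which quartile convention was used); the function names and the sample values are made up. The second function mirrors what the experimental group did, namely marking detected outliers as missing so that they can be filled afterwards.

```python
from statistics import quantiles

def boxplot_bounds(values):
    """Return (low, high) fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def boxplot_outliers(values):
    """Values lying outside the box plot fences."""
    low, high = boxplot_bounds(values)
    return [v for v in values if v < low or v > high]

def mark_outliers_missing(values):
    """Experimental-group style: replace outliers with None (treated as missing)."""
    low, high = boxplot_bounds(values)
    return [v if low <= v <= high else None for v in values]

sample = [10, 12, 12, 13, 14, 15, 16, 90]   # made-up measurements
print(boxplot_outliers(sample))
print(mark_outliers_missing(sample))
```

Since the comparative experiment favoured the control group, a pipeline following the description would only use such detection for analysis and would leave the values unchanged.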

The invention also examines data imbalance processing, for which there are three main options: (1) oversampling; (2) undersampling; (3) no processing. The invention mainly compares oversampling with performing no data imbalance processing;

currently, commonly used oversampling methods include: (1) SMOTE; (2) ROS; (3) ADASYN; (4) SMOTE-Borderline; (5) SMOTE-SVM. Grouped experiments were carried out on these oversampling methods, with the groups set as follows: samples not processed, samples processed by the SMOTE method, samples processed by the ROS method, samples processed by the ADASYN method, samples processed by the SMOTE-Borderline method, samples processed by the SMOTE-SVM method, and samples processed by the new research method. In the experiments, the cerebral infarction patient data set was first divided into a training set and a test set; the training set was oversampled and the test set was left unprocessed; XGBoost was used for training; and the model effects of all groups were compared. The experimental results are shown in Table IV;

Table IV. Micro-precision of each group under the XGBoost model

As can be seen from Table IV, performing no data imbalance processing gives a better result than any of the oversampling methods. The likely reason is that oversampling relies on the structural information of the minority-class samples, and when the minority class is poorly represented it can even be counterproductive. Therefore, no data imbalance processing is performed on the cerebral infarction patient data.
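The experimental protocol above (split first, oversample only the training set, leave the test set untouched) can be illustrated with random oversampling (ROS), the simplest of the listed methods. This is a sketch under stated assumptions: the function name `random_oversample`, the fixed index split and the toy labels are all hypothetical, and SMOTE-style methods would synthesize new samples instead of duplicating existing ones.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data set (made up): class 1 is the minority.
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

# Split BEFORE oversampling so that no duplicated sample leaks into the test set.
X_train, y_train = X[:7], y[:7]
X_test, y_test = X[7:], y[7:]

X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_bal))
print("test set unchanged:", Counter(y_test))
```

Keeping the test set at its original class ratio means the measured precision reflects the real distribution, which is what makes the comparison between the oversampled groups and the unprocessed group meaningful.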

The data processing experiments show that: (1) comparing the four missing value processing methods, the MissForest filling method is superior to KNN filling, mean filling and iterative filling; (2) comparing the abnormal value processing approaches, leaving abnormal values unprocessed works better than processing them; (3) an approach of data integration, transformation and feature selection is proposed; (4) comparing the six data imbalance processing methods, performing no data imbalance processing gives the best result.

The transformation module is used for performing feature encoding and data normalization on the cerebral infarction patient data; the feature encoding specifically encodes the data using the one-hot encoding rule; the data normalization specifically processes the data using a method that combines standard deviation normalization with maximum value normalization.
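The rejection and transformation steps of the data processing module can be sketched together as follows. Several points are assumptions rather than the method of the invention: missing values are represented as `None`; the MissForest filling step between rejection and transformation is omitted; and "combining standard deviation normalization with maximum value normalization" is interpreted here as z-score standardization followed by division by the largest absolute value, which is only one possible reading. All function names, column names and values are hypothetical.

```python
from statistics import mean, pstdev

def drop_sparse_features(rows, max_missing_rate=0.30):
    """Rejection step: remove columns whose share of None values exceeds the threshold."""
    keep = []
    for name in rows[0]:
        missing_rate = sum(1 for r in rows if r[name] is None) / len(rows)
        if missing_rate <= max_missing_rate:
            keep.append(name)
    return [{name: r[name] for name in keep} for r in rows]

def one_hot(rows, column):
    """Feature encoding: replace a categorical column with 0/1 indicator columns."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}={c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded

def standardize_then_max_scale(values):
    """Assumed combination: z-score first, then divide by the largest |z|."""
    mu, sigma = mean(values), pstdev(values)
    z = [(v - mu) / sigma if sigma else 0.0 for v in values]
    peak = max(abs(v) for v in z)
    return [v / peak if peak else 0.0 for v in z]

# Made-up records; "lab" is missing in 2 of 3 rows, i.e. above the 30% threshold.
rows = [
    {"age": 60, "sex": "F", "lab": None},
    {"age": 70, "sex": "M", "lab": None},
    {"age": 80, "sex": "F", "lab": 1.2},
]
cleaned = drop_sparse_features(rows)
encoded = one_hot(cleaned, "sex")
scaled_age = standardize_then_max_scale([r["age"] for r in encoded])
print(encoded)
print(scaled_age)
```

Under this reading every numeric feature ends up centred on zero and bounded to the interval from -1 to 1; if the original method instead applies the two normalizations to different feature groups, only `standardize_then_max_scale` would need to change.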

In a preferred embodiment, the system further comprises an optimization module for optimizing the hyper-parameters of the prediction model, the optimization module comprising a screening module, a feature selection module and an assignment module;

the feature selection module is used for ranking the feature importance scores of the processed cerebral infarction patient data, screening out the features whose importance scores are greater than the standard score to form a data set, and dividing the data set into a test set and a training set, wherein the test set accounts for 30%; preferably, the importance score ranking is based on the output of the XGBoost model, and the standard score is 100;

the assignment module is used for putting the test set into the base models for hyper-parameter optimization to obtain the optimized base model hyper-parameters, and assigning the optimized hyper-parameters back to the base models; the hyper-parameters are divided into network structure related parameters and model training related parameters; the network structure related parameters include the number and type of intermediate network layers, the number of neurons in each layer and the activation function; the model training related parameters include the loss function, optimization method, batch size, number of iterations, learning rate, and regularization method and coefficient;

the hyper-parameter optimization uses a method combining a genetic algorithm with cross validation. The process comprises: setting the population to 50 chromosomes, each being one hyper-parameter combination; evaluating the population with 4-fold cross validation; in each offspring generation, selecting 10% of the hyper-parameters to take random values (mutation); having parent chromosomes select 50% of their genes for crossover; and each time directly copying the 3 hyper-parameter combinations with the best fitness from the previous generation.
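The genetic search with cross validation described above can be sketched as follows. The population size (50), 4-fold cross validation, 10% mutation, 50% crossover and the 3 directly copied best combinations come from the description; everything else is an assumption for illustration: the toy k-nearest-neighbour model and its search space `SPACE` stand in for the real base models, parents are drawn from the better half of the population (the description does not specify parent selection), the number of generations is arbitrary, and the mutation and crossover percentages are applied per gene as probabilities.

```python
import random

# Hypothetical search space for a toy model (not the base models of the invention).
SPACE = {"k": [1, 3, 5, 7, 9, 11], "power": [1, 2], "weighted": [False, True]}

def knn_predict(train, x, k, power, weighted):
    """Toy stand-in model: k-nearest-neighbour vote on 1-D inputs."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = {0: 0.0, 1: 0.0}
    for xi, yi in nearest:
        votes[yi] += 1.0 / (1e-9 + abs(xi - x) ** power) if weighted else 1.0
    return max(votes, key=votes.get)

def cv_accuracy(params, data, folds=4):
    """Fitness of one chromosome: mean accuracy over 4-fold cross validation."""
    scores = []
    for f in range(folds):
        train = [s for i, s in enumerate(data) if i % folds != f]
        held = [s for i, s in enumerate(data) if i % folds == f]
        hits = sum(knn_predict(train, x, **params) == y for x, y in held)
        scores.append(hits / len(held))
    return sum(scores) / folds

def evolve(data, pop_size=50, generations=3, elite=3, mutation=0.10, seed=0):
    rng = random.Random(seed)
    names = list(SPACE)
    pop = [{n: rng.choice(SPACE[n]) for n in names} for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: cv_accuracy(c, data), reverse=True)
        nxt = [dict(c) for c in ranked[:elite]]      # copy the 3 fittest unchanged
        parents = ranked[: pop_size // 2]            # assumed truncation selection
        while len(nxt) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each gene comes from parent b with probability 0.5.
            child = {n: (b[n] if rng.random() < 0.5 else a[n]) for n in names}
            for n in names:
                if rng.random() < mutation:          # 10% chance of a random value
                    child[n] = rng.choice(SPACE[n])
            nxt.append(child)
        pop = nxt
    return max(pop, key=lambda c: cv_accuracy(c, data))

# Synthetic 1-D two-class data (an assumption; no patient data is used here).
rng = random.Random(1)
data = [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(40)]
rng.shuffle(data)
best = evolve(data)
print(best, round(cv_accuracy(best, data), 3))
```

Copying the best combinations unchanged guarantees that the best fitness in the population never decreases from one generation to the next, which is the usual motivation for this kind of elitism.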

The optimization module further comprises a training module, which is used for putting the training set into the base models that have been assigned the optimized hyper-parameters and training them again; the logistic regression model keeps its default parameters.
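The feature selection and data set division performed by the feature selection module can be sketched as follows. The threshold of 100 and the 30% test share come from the description; the importance scores, feature names and the helper names `select_features` and `split_dataset` are made up for illustration, and in practice the scores would be taken from a fitted XGBoost model.

```python
import random

def select_features(importance, threshold=100):
    """Rank features by importance score and keep those above the standard score."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold]

def split_dataset(rows, test_ratio=0.30, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = round(len(rows) * test_ratio)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test

# Hypothetical importance scores (not results from the invention).
importance = {"age": 240, "onset_duration": 130, "sex": 60}
selected = select_features(importance)

# Made-up records restricted to the selected features.
rows = [{"age": 50 + i, "onset_duration": i, "sex": i % 2} for i in range(10)]
dataset = [{name: r[name] for name in selected} for r in rows]
train, test = split_dataset(dataset)
print(selected, len(train), len(test))
```

Fixing the seed makes the split reproducible, so the same test set can be reused when comparing different hyper-parameter settings.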

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A cerebral infarction surgery patient survival risk classification method based on machine learning, characterized in that: the method is implemented based on a survival prediction system for perioperative cerebral infarction patients, and the survival prediction system for perioperative cerebral infarction patients comprises an acquisition module, a prediction module and an output module; the acquisition module is used for inputting cerebral infarction patient data; the prediction module is used for inputting the cerebral infarction patient data into the prediction model for survival prediction, and the output module is used for outputting prediction results; the prediction model comprises first-layer base models and a second-layer logistic regression model; the base models are a first base model, a second base model and a third base model, wherein the first base model is a comprehensive random forest model, the second base model is an XGBoost model, and the third base model is an MLP model.

2. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 1, wherein the method comprises the following steps: the system also comprises a data processing module which is used for carrying out data change processing on the data of the cerebral infarction patient.

3. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 2, wherein the method comprises the following steps: the data processing module comprises a rejection module, a cleaning module and a transformation module;

4. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 3, wherein the method comprises the following steps: the missing value processing specifically includes filling the missing value by using a MissForest filling method.

5. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 3, wherein the method comprises the following steps: the feature encoding processing specifically comprises encoding the cerebral infarction patient data using the one-hot encoding rule.

6. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 3, wherein the method comprises the following steps: the data normalization processing specifically adopts a method of combining standard deviation normalization and maximum value normalization to process the data of the cerebral infarction patient.

7. The machine learning based method for classifying survival risk of a patient with cerebral infarction surgery according to claim 1, 2, 3, 4, 5 or 6, wherein the method comprises the following steps: the system also comprises an optimization module used for optimizing the hyper-parameters of the prediction model, wherein the optimization module comprises a screening module, a feature selection module and an assignment module;

the feature selection module is used for ranking the feature importance scores of the processed cerebral infarction patient data and screening out the features whose importance scores are greater than the standard score to form a test set;

8. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 7, wherein the method comprises the following steps: the hyper-parameter optimization processing adopts a method combining genetic algorithm and cross validation.

9. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 7, wherein: the feature selection module is specifically used for ranking the feature importance scores of the processed cerebral infarction patient data, screening out the features whose importance scores are greater than the standard score to form a data set, and dividing the data set into a test set and a training set;

10. The machine learning-based method for classifying survival risk of patients with cerebral infarction surgery according to claim 9, wherein: the hyper-parameter optimization processing adopts a method combining genetic algorithm and cross validation.