CN113270193A

CN113270193A - PICC thrombus risk prediction method based on machine learning

Info

Publication number: CN113270193A
Application number: CN202110434400.3A
Authority: CN
Inventors: 李莉; 谢超; 汪淑华; 程博
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-08-17

Abstract

The invention discloses a PICC thrombus risk prediction method based on machine learning, which comprises: collecting, for each patient, 30 features that influence PICC thrombus, and filling the missing values of the features; preprocessing the filled features; constructing a plurality of risk prediction models based on machine learning, and calculating the F1 index, accuracy, recall and AUC of each risk prediction model; determining the effect of each risk prediction model according to these four indexes; selecting the two risk prediction models with the best effect from the plurality of risk prediction models; and fusing the two selected models based on their AUC values and their probabilities of predicting class 1, and outputting a PICC thrombus risk prediction result. The method can accurately identify the risk of thrombus occurrence, so that treatment or nursing measures can be taken in advance to apply drug or physical intervention to high-risk patients, thereby reducing the probability of thrombus occurrence.

Description

PICC thrombus risk prediction method based on machine learning

Technical Field

The invention belongs to the technical field of machine learning and computer application, and particularly relates to a PICC thrombus risk prediction method based on machine learning.

Background

Machine learning is a generic term for a class of algorithms that aim to extract implicit rules from large amounts of data and use them for prediction or classification. More specifically, machine learning can be regarded as finding a function whose input is sample data and whose output is the desired result, with the model being constructed by different algorithms.

The random forest is widely applied to missing value filling. It can process data of high dimensionality (many features) and does not require explicit feature selection, since feature columns are sampled; after training, it can return the importance of the features; the trees are mutually independent during training, so parallelization is easy; and missing features can be handled.

Principal Component Analysis (PCA) is a widely used dimensionality reduction algorithm. The main idea of PCA is to map n-dimensional features onto k dimensions (k < n). These k dimensions are new orthogonal features, called principal components, reconstructed from the original n-dimensional features. The aim is to make the data easier to process in a low-dimensional form and to reduce algorithm overhead.

Logistic regression is a classical algorithm commonly used for binary classification problems. It is mainly applied to modeling class probabilities: it predicts the class and also yields the predicted probability, which is useful for tasks in which the probability supports decision making.

SMOTE (Synthetic Minority Oversampling Technique) is an improvement on random oversampling. Because random oversampling increases the minority class simply by copying samples, it easily leads to overfitting, that is, the information learned by the model is too specific and not sufficiently general. The basic idea of SMOTE is to analyze the minority-class samples and to synthesize new samples from them, which are then added to the data set.

XGBoost is short for "eXtreme Gradient Boosting". It is an ensemble learning algorithm and belongs to the boosting class among the three commonly used ensemble methods (bagging, boosting, stacking). It is an additive model whose base model is generally a tree model. Thanks to its use of first and second derivatives, parallel optimization, and the ability to specify a default branch direction for missing values and specified values, the algorithm is highly efficient and is widely used in prediction models.

The Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification of data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples. The SVM calculates the empirical risk using the hinge loss function and adds a regularization term to the optimization to control the structural risk, so it is a classifier with sparsity and robustness. An SVM can perform non-linear classification through the kernel method, which makes it one of the common kernel learning methods.

The peripherally inserted central catheter (PICC) is a deep vein catheterization technique in which a peripheral vein of the upper limb (basilic vein, cephalic vein or median cubital vein) is punctured and the tip of the catheter is positioned in the superior vena cava; it provides patients with a route for medium-term to long-term intravenous therapy. The PICC has the advantages of an easily mastered puncture technique, simple and safe operation and a long indwelling time; it reduces the number of punctures and the irritation of blood vessels by drugs. It is therefore widely used clinically in parenteral nutrition, tumor chemotherapy and other fields, and provides a safe and reliable channel for intravenous infusion. The PICC meets the needs of therapy with a lower operational risk and a good cost-benefit ratio, and is therefore widely adopted clinically.

With the continuous expansion of the range and number of PICC applications, the related complications and adverse effects have gradually become prominent. They mainly include catheter blockage, displacement, detachment, rupture, phlebitis, catheter-related bloodstream infection and thrombosis, among which PICC-related thrombosis is one of the most common and serious. PICC-related thrombus refers to the formation of blood clots, after PICC placement, on the inner wall of the blood vessel in which the PICC lies and on the catheter wall, caused by direct injury of the vascular intima by the puncture or the catheter, by the patient's own condition and by other factors. Thrombosis can easily induce pulmonary embolism, which can be life-threatening. Post-thrombotic syndrome (PTS), a late consequence of thrombosis, interferes with the function of the venous valves, causing pain, swelling and dysfunction of the affected limb and affecting the patient's quality of life. Thrombosis is also the main reason for unplanned removal of the PICC, which prolongs the hospital stay and increases hospitalization costs.

Scientific risk assessment is of great significance for the prevention of PICC thrombus. The important role of thrombus risk assessment in the prevention of catheter-related thrombosis is emphasized both by the guidelines on antithrombotic therapy for venous thromboembolism published by the American College of Chest Physicians (ACCP) and by the "Infusion Therapy Standards of Practice" published by the Infusion Nurses Society (INS).

Given the numerous advantages of these algorithms, data can be analyzed and predicted by fusing them. A scientific and practical PICC-related venous thrombosis risk assessment tool constructed on the basis of machine learning can accurately identify patients at high risk of thrombus and guide clinical medical staff to identify high-risk factors as early as possible, so that treatment or nursing measures can be taken in advance to apply drug or physical intervention to high-risk patients. This reduces the probability of thrombus occurrence, improves the patients' quality of life, and helps the patients' treatment and nursing proceed in an orderly way.

Disclosure of Invention

In order to overcome the defects of the prior art, a PICC thrombus risk prediction method based on machine learning is provided. The method can accurately identify the risk of thrombus occurrence, so that treatment or nursing measures can be taken in advance to apply drug or physical intervention to high-risk patients, thereby reducing the probability of thrombus occurrence.

The technical scheme adopted by the invention is as follows:

Step 1, collecting, for each patient, 30 features that influence PICC thrombus, and filling the missing values of the features; preprocessing the filled features;

Step 2, constructing a plurality of risk prediction models based on machine learning, and calculating the F1 index, accuracy, recall and AUC of each risk prediction model; determining the effect of each risk prediction model according to its F1 index, accuracy, recall and AUC; selecting the two risk prediction models with the best effect from the plurality of risk prediction models;

Step 3, fusing the two risk prediction models with the best effect based on their AUC values and their probabilities of predicting class 1, and outputting a PICC thrombus risk prediction result.

Further, the method for obtaining the PICC thrombus risk prediction result in step 3 comprises:

Step 3.1, dividing the preprocessed data: taking the 30 features of each patient as analysis data, and forming an input data set data_x from the analysis data of all patients; taking whether each patient has a thrombus as the judgment result, and forming a label data set data_y from the judgment results of all patients; taking the input data set data_x and the corresponding label data set data_y as the training data of the SMOTE algorithm;

Step 3.2, dividing the data_x and data_y processed by the SMOTE algorithm into a training set and a test set, and inputting the training set into each of the two risk prediction models with the best effect; obtaining the AUC value and the probability of predicting class 1 for each of the two risk prediction models;

Step 3.3, creating a prediction probability function from the AUC values and the class-1 probabilities of the two risk prediction models obtained in step 3.2, wherein the prediction probability function is expressed as follows:

where Predicted is the predicted probability; the two risk prediction models with the best effect are denoted X and Y; AUC_X is the AUC value of risk prediction model X, and 1_Probability is the probability with which risk prediction model X predicts class 1; AUC_Y is the AUC value of risk prediction model Y, and 1_Probability' is the probability with which risk prediction model Y predicts class 1.

Further, the 10 features whose missing values are to be filled are divided into two classes: white blood cells, neutrophils, hemoglobin and PLT belong to the first class of features, and C-reactive protein, plasma prothrombin time, INR, activated partial thromboplastin time, plasma fibrinogen and D-dimer belong to the second class of features.

Further, for missing values of the first class of features, the mode or mean of each first-class feature is computed over all the data, and the missing values of white blood cells, neutrophils, hemoglobin and PLT are filled with that mode or mean.

Further, for missing values of the second class of features, a random forest algorithm is used to fill the missing values.

Further, the features are preprocessed as follows: the filled data are standardized so that they are in a uniform format.

Further, the risk prediction models constructed in step 2 include a first risk prediction model composed of principal component analysis and logistic regression, a second risk prediction model composed of SMOTE and XGBoost algorithms, and a third risk prediction model composed of SVM, SMOTE and genetic algorithms.

Further, the first risk prediction model is constructed as follows:

dividing the standardized data: taking the 30 features of each patient as analysis data, and forming an input data set data_x from the analysis data of all patients; taking whether each patient has a thrombus as the judgment result, and forming a label data set data_y from the judgment results of all patients;

performing principal component analysis on the input data set data_x and the corresponding label data set data_y to reduce the number of features;

after the number of features is determined, dividing the input data set data_x and the corresponding label data set data_y into a training set and a test set, and constructing a logistic regression model from the training set to obtain the first risk prediction model.

Further, the second risk prediction model is constructed as follows:

taking the input data set data_x and the corresponding label data set data_y as the training data of the SMOTE algorithm; dividing the data processed by the SMOTE algorithm into a training set and a test set; and training the XGBoost algorithm on the processed training set, which completes the construction of the second risk prediction model.

Further, the third risk prediction model is constructed as follows:

taking the input data set data_x and the corresponding label data set data_y as the training data of the SMOTE algorithm;

optimizing the parameters C and gamma of the support vector machine with a genetic algorithm; constructing a support vector machine model with the optimized parameters C and gamma;

dividing the data processed by the SMOTE algorithm into a training set and a test set; and training the support vector machine model on the training set, which completes the construction of the third risk prediction model.

The invention has the beneficial effects that:

(1) When predicting the risk of PICC thrombus, some features are easily missing because patient information is entered manually, so different missing value filling methods are used for different features. For white blood cells, neutrophils, hemoglobin and PLT, missing values are filled with the mode or mean; for C-reactive protein, plasma prothrombin time, INR, activated partial thromboplastin time, plasma fibrinogen and D-dimer, missing values are filled with a random forest. On the one hand, this avoids the influence of missing feature data on prediction accuracy; on the other hand, using different filling methods for different classes of features allows the likely missing values to be estimated more accurately from the distribution of the data, so that the filling does not change the distribution of the different classes of features.

(2) The present application constructs multiple risk prediction models based on machine learning and fuses the models with the better effect, yielding a scientific and practical PICC-related venous thrombosis risk assessment model. The model can accurately identify patients at high risk of thrombus and guide clinical medical staff to identify high-risk factors early, so that treatment or nursing measures can be taken in advance to apply drug or physical intervention to high-risk patients. This reduces the probability of thrombus occurrence, improves the patients' quality of life, and helps the patients' treatment and nursing proceed in an orderly way.

Drawings

FIG. 1 is a flow chart of a PICC thrombus risk prediction method based on machine learning according to the present invention;

FIG. 2 is a schematic diagram showing the results of principal component analysis;

FIG. 3 is a schematic diagram of an iterative process of a genetic algorithm;

FIG. 4(a) is a table of the four indices (F1 index, accuracy, recall and AUC) of the three risk prediction models; FIG. 4(b) is a histogram of the four indices (F1 index, accuracy, recall and AUC) of the three risk prediction models;

FIG. 5 is a diagram showing an example of a PICC thrombus risk prediction result.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, the invention relates to a machine learning-based PICC thrombus risk prediction method, which comprises the following steps:

Step 1, data acquisition and preprocessing.

Step 1.1, collecting the relevant data of 625 patients, specifically 30 features affecting PICC thrombus for each case: sex, age, whether bedridden, primary tumor site, high risk, tumor metastasis, high risk of metastatic site, underlying disease, major surgery, history of deep vein thrombosis, smoking history, radiotherapy, drug properties, targeted drug, acute infection, BMI value, white blood cells, neutrophils, hemoglobin, PLT, C-reactive protein, plasma prothrombin time, INR, activated partial thromboplastin time, plasma fibrinogen, D-dimer, optimal tip position, number of catheterizations, catheterization liquid, and catheterization vein.

Step 1.2, because the project data are derived from clinical medical data and are incomplete, missing values are filled using the mode, the mean or a random forest algorithm, depending on the specific type of data. Among the 30 features affecting PICC thrombus, 20 features (sex, age, whether bedridden, primary tumor site, high risk, tumor metastasis, high risk of metastatic site, underlying disease, major surgery, history of deep vein thrombosis, smoking history, radiotherapy, drug properties, targeted drug, acute infection, BMI value, optimal tip position, number of catheterizations, catheterization liquid and catheterization vein) are easy to obtain and are well recorded and stored, so none of their values are missing. The other 10 features (white blood cells, neutrophils, hemoglobin, PLT, C-reactive protein, plasma prothrombin time, INR, activated partial thromboplastin time, plasma fibrinogen and D-dimer) have partially missing data because of unclear clinical records and similar reasons, so their missing values need to be filled.

Step 1.3, dividing the 10 features whose missing values need to be filled into two classes, wherein white blood cells, neutrophils, hemoglobin and PLT belong to the first class of features, and C-reactive protein, plasma prothrombin time, INR, activated partial thromboplastin time, plasma fibrinogen and D-dimer belong to the second class of features;

Step 1.4, using a different filling method for each of the two classes of features, specifically as follows:

For missing values of the first class of features, the mode or mean of each first-class feature is computed over all the data, and the missing values of white blood cells, neutrophils, hemoglobin and PLT are filled with that mode or mean. Taking white blood cells as an example, the mean white blood cell value in the data set is computed and the missing white blood cell values are filled with it; neutrophils, hemoglobin and PLT are filled in the same way.
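The mean or mode filling of the first class of features can be sketched as follows. This is a minimal illustration under stated assumptions: the toy values, the English column names and the helper fill_first_class are hypothetical and are not taken from the patent's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are English stand-ins for the
# first-class features named in the text.
data = pd.DataFrame({
    "white blood cells": [5.2, np.nan, 7.1, 6.3],
    "neutrophils": [3.1, 4.0, np.nan, 3.5],
    "hemoglobin": [120.0, 135.0, 128.0, np.nan],
    "PLT": [210.0, np.nan, 180.0, 240.0],
})
first_class = ["white blood cells", "neutrophils", "hemoglobin", "PLT"]


def fill_first_class(df, columns, strategy="mean"):
    """Return a copy of df with NaNs in `columns` replaced by the column mean or mode."""
    out = df.copy()
    for col in columns:
        if strategy == "mean":
            value = out[col].mean()          # NaNs are skipped by default
        else:
            value = out[col].mode().iloc[0]  # mode() returns a Series; take the first mode
        out[col] = out[col].fillna(value)
    return out


filled = fill_first_class(data, first_class, strategy="mean")
```

Passing strategy="mode" selects the most frequent value instead, which may suit features recorded on a coarse scale.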

For missing values of the second class of features, the missing values are filled using a random forest algorithm. The method for filling missing values with a random forest algorithm comprises the following steps:

S1, for each feature of the second class, dividing the cases into those in which the feature is present and those in which it is missing, storing the case data in which the feature is present in an X data set, and storing the case data in which the feature is missing in a y data set. Taking C-reactive protein as an example, the case data with a C-reactive protein value are separated from the case data without one; the case data with a C-reactive protein value are stored in the X data set, and the case data without a C-reactive protein value are stored in the y data set.

S2, training a random forest algorithm on the X data set to obtain a random forest model; inputting the y data set into the random forest model to obtain predicted values for the missing values, and filling the predicted values into the corresponding missing positions in the y data set. These steps are repeated to fill each feature of the second class.
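Steps S1 and S2 can be sketched with scikit-learn's RandomForestRegressor. The synthetic data, the helper rf_impute and the choice of 200 trees are assumptions made only to keep the example small and fast; the predictors are assumed to be complete columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rf_impute(df, target, exclude=()):
    """Fill NaNs in df[target] with random forest predictions.

    Predictors are all columns except `target` and those listed in `exclude`
    (for example other incomplete columns and the outcome label).
    """
    out = df.copy()
    mask = out[target].isnull()
    if not mask.any():
        return out
    predictors = [c for c in out.columns if c != target and c not in exclude]
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    # S1: rows where the feature is present are the training data.
    model.fit(out.loc[~mask, predictors].to_numpy(dtype=float),
              out.loc[~mask, target].to_numpy(dtype=float))
    # S2: predict the missing values and write them back.
    out.loc[mask, target] = model.predict(out.loc[mask, predictors].to_numpy(dtype=float))
    return out


# Hypothetical synthetic data standing in for the clinical records.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 80, n).astype(float),
    "BMI value": rng.normal(23, 3, n),
})
toy["C-reactive protein"] = 0.2 * toy["age"] + rng.normal(0, 1, n)
toy["thrombus"] = rng.integers(0, 2, n)
toy.loc[rng.choice(n, 30, replace=False), "C-reactive protein"] = np.nan

imputed = rf_impute(toy, "C-reactive protein", exclude=["thrombus"])
```

Calling rf_impute once per second-class feature mirrors the "repeat for each feature" instruction in S2.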

Step 1.5, standardizing the filled data so that they are in a uniform format.
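A minimal sketch of the standardization in step 1.5, assuming it means rescaling each feature to zero mean and unit variance with scikit-learn's scale function; the three example rows are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical rows: (age, BMI value) for three patients.
X = np.array([[60.0, 22.5],
              [45.0, 27.0],
              [72.0, 19.8]])

# Each column is centred on 0 and rescaled to unit variance.
X_std = scale(X)
```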

Step 2, constructing a risk prediction model to realize PICC thrombus risk prediction, wherein the specific process is as follows:

Step 2.1, forming three risk prediction models: the first risk prediction model is formed from principal component analysis and logistic regression, the second risk prediction model is formed from the SMOTE algorithm and the XGBoost algorithm, and the third risk prediction model is formed from the SVM algorithm, the SMOTE algorithm and a genetic algorithm.

The method for forming the first risk prediction model from principal component analysis and logistic regression is as follows:

The standardized data are divided: the 30 features of each patient are taken as analysis data, and the analysis data of the 625 patients form the input data set data_x; whether each patient has a thrombus is taken as the judgment result, and the judgment results of the 625 patients form the label data set data_y.

Since principal component analysis can improve the model by reducing the number of features, principal component analysis is performed on the input data set data_x and its corresponding label data set data_y. As shown in FIG. 2, when the number of principal components reaches 20, 85% of the information of the original data is retained, so values between 20 and 30 are considered for the number of principal components.
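The cumulative explained variance behind FIG. 2 can be computed as in the sketch below. The data are synthetic stand-ins with the same shape as the patent's data (625 patients, 30 features), so the resulting component count is illustrative only and will not reproduce the figure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Hypothetical correlated features: 625 patients x 30 features.
latent = rng.normal(size=(625, 12))
data_x = latent @ rng.normal(size=(12, 30)) + 0.5 * rng.normal(size=(625, 30))
data_x = scale(data_x)

pca = PCA().fit(data_x)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 85%.
n_components_85 = int(np.searchsorted(cumulative, 0.85) + 1)
```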

After the number of features is determined through principal component analysis, the input data set data_x and the corresponding label data set data_y are divided into a training set and a test set, and a logistic regression model is constructed from the training set. Principal component analysis causes some loss of information when it reduces the number of features, while the logistic regression algorithm needs to take collinearity among the features into account. The number of principal components is therefore selected by traversal: each number of principal components from 20 to 30 is substituted into the model and trained, yielding the first risk prediction model.
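The traversal over 20 to 30 principal components can be sketched as follows. Several choices here are assumptions: the data are synthetic, the AUC on the test set is used as the selection criterion, and PCA is fitted on the training portion only, which differs from the order of operations described in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
# Hypothetical stand-in data: 625 patients, 30 standardized features.
data_x = scale(rng.normal(size=(625, 30)))
logits = data_x[:, 0] + 0.8 * data_x[:, 1] - 0.85
data_y = (logits + rng.normal(scale=0.5, size=625) > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    data_x, data_y, test_size=0.2, random_state=9, stratify=data_y)

results = {}
for k in range(20, 31):
    pca = PCA(n_components=k).fit(x_train)
    # L2 regularization is the scikit-learn default; classes are reweighted.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(pca.transform(x_train), y_train)
    proba = model.predict_proba(pca.transform(x_test))[:, 1]
    results[k] = roc_auc_score(y_test, proba)

best_k = max(results, key=results.get)
```

Fitting PCA on the training set only avoids leaking information from the test set into the projection; selecting k on a separate validation split would be stricter still.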

The method for forming the second risk prediction model from the SMOTE algorithm and the XGBoost algorithm is as follows:

The input data set data_x and the corresponding label data set data_y are used as the training data of the SMOTE algorithm. SMOTE is in essence an interpolation algorithm: two samples of the minority class are selected at random, and a point on the line segment connecting them is taken as a new sample. The class ratio of the raw data is about 7:3; SMOTE is used to change this ratio to 1:1, so that the training data set is balanced.
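The interpolation principle described above can be sketched in NumPy. This is a simplified illustration, not the full SMOTE algorithm: standard SMOTE interpolates between a minority sample and one of its k nearest minority neighbours (as implemented, for example, by SMOTE in the imbalanced-learn package), whereas the hypothetical helper interpolate_minority below pairs minority samples at random, following the wording of the text.

```python
import numpy as np


def interpolate_minority(X, y, minority_label=1, random_state=0):
    """Oversample the minority class to a 1:1 ratio by interpolating between
    randomly chosen pairs of minority samples."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    i = rng.integers(0, len(X_min), n_new)
    j = rng.integers(0, len(X_min), n_new)
    lam = rng.random((n_new, 1))  # position along the connecting segment
    synthetic = X_min[i] + lam * (X_min[j] - X_min[i])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_bal, y_bal


# Hypothetical data with the roughly 7:3 class ratio mentioned in the text.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 70 + [1] * 30)
X_bal, y_bal = interpolate_minority(X, y)
```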

The data processed by the SMOTE algorithm are divided into a training set and a test set; the XGBoost algorithm is trained on the processed training set, which completes the construction of the second risk prediction model.

The method for forming the third risk prediction model from the SVM, SMOTE and the genetic algorithm is as follows:

The input data set data_x and its corresponding label data set data_y are used as the training data of the SMOTE algorithm to balance the training data.

The parameters C and gamma of the Support Vector Machine (SVM) are optimized with a genetic algorithm. The specific process is as follows: first, a problem interface is defined and the data are imported; the data are preprocessed with the SMOTE algorithm and normalized with the scale function; the parameters C and gamma to be tuned are then set as the variables, and the AUC is set as the objective function. The soea_SGA_templet function in geatpy is then called directly to obtain the result. FIG. 3 shows the iterative process of the genetic algorithm. It can be seen that the population moves closer and closer to the optimal value as the number of iterations increases, and that the population is optimal and gradually stabilizes once the number of generations exceeds 5. Finally, the values of the parameters C and gamma of the SVM are output, and a support vector machine model is constructed with the optimized parameters C and gamma.
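The idea of searching C and gamma with a genetic algorithm, using the AUC as the objective, can be sketched without geatpy. The hand-written genetic algorithm below is a swapped-in stand-in for the geatpy template named in the text; the population size, search bounds, operators (tournament selection, arithmetic crossover, Gaussian mutation, elitism) and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def fitness(ind, X, y):
    """Mean 3-fold cross-validated AUC of an RBF SVM; ind = (log10 C, log10 gamma)."""
    clf = SVC(C=10.0 ** ind[0], gamma=10.0 ** ind[1], kernel="rbf")
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()


def genetic_search(X, y, pop_size=8, generations=5, seed=0,
                   low=(-2.0, -4.0), high=(3.0, 1.0)):
    rng = np.random.default_rng(seed)
    low, high = np.array(low), np.array(high)
    pop = rng.uniform(low, high, size=(pop_size, 2))
    best_ind, best_fit, history = None, -np.inf, []
    for _ in range(generations):
        fits = np.array([fitness(ind, X, y) for ind in pop])
        if fits.max() > best_fit:
            best_fit, best_ind = fits.max(), pop[fits.argmax()].copy()
        history.append(best_fit)
        # Binary tournament selection.
        a, b = rng.integers(0, pop_size, (2, pop_size))
        parents = np.where((fits[a] >= fits[b])[:, None], pop[a], pop[b])
        # Arithmetic crossover with a shuffled partner.
        partners = parents[rng.permutation(pop_size)]
        w = rng.random((pop_size, 1))
        children = w * parents + (1 - w) * partners
        # Gaussian mutation, clipped to the search bounds.
        children += rng.normal(scale=0.3, size=children.shape)
        pop = np.clip(children, low, high)
        pop[0] = best_ind  # elitism: keep the best individual found so far
    return 10.0 ** best_ind[0], 10.0 ** best_ind[1], best_fit, history


# Hypothetical small data set so the search runs quickly.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
best_C, best_gamma, best_auc, history = genetic_search(X, y)
```

The history list plays the role of the iteration curve in FIG. 3: because of elitism, the best AUC never decreases from one generation to the next.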

Step 2.2, calculating the F1 index, accuracy, recall and AUC of each of the three risk prediction models on the test set, and determining the effect of the three models from these four indexes; selecting the two risk prediction models with the best effect from the three risk prediction models. As shown in FIG. 4(a) and FIG. 4(b), the four indexes of the three risk prediction models show that every index of the second risk prediction model and of the third risk prediction model exceeds 90%.
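The four indexes of step 2.2 can be computed with scikit-learn as sketched below. The labels and probabilities are hypothetical toy values, and ranking the models by the mean of the four indexes is an assumption, since the text does not state how the indexes are combined.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score


def evaluate(y_true, proba, threshold=0.5):
    """Return the four indexes for one model given its class-1 probabilities."""
    y_pred = (np.asarray(proba) >= threshold).astype(int)
    return {
        "F1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, proba),
    }


# Hypothetical test labels and predicted probabilities for three models.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = {
    "model_a": evaluate(y_true, np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.6, 0.2, 0.7])),
    "model_b": evaluate(y_true, np.array([0.2, 0.6, 0.7, 0.4, 0.3, 0.8, 0.1, 0.9])),
    "model_c": evaluate(y_true, np.array([0.6, 0.5, 0.4, 0.6, 0.7, 0.3, 0.5, 0.5])),
}

# Assumed ranking rule: mean of the four indexes, highest first.
ranking = sorted(scores, key=lambda m: np.mean(list(scores[m].values())), reverse=True)
top_two = ranking[:2]
```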

Step 3, fusing the second risk prediction model and the third risk prediction model, which have the best effect, to obtain the PICC thrombus risk prediction result, as follows:

Step 3.1, taking the input data set data_x and its corresponding label data set data_y as the training data of the SMOTE algorithm to balance the training data.

Step 3.2, dividing the data_x and data_y processed by the SMOTE algorithm into a training set and a test set, and inputting the training set into each of the constructed second and third risk prediction models; obtaining the AUC value and the probability of predicting class 1 for each of the second and third risk prediction models;

Step 3.3, creating a prediction probability function from the AUC values and the class-1 probabilities of the second and third risk prediction models obtained in step 3.2, wherein the prediction probability function is expressed as follows:

where Predicted is the predicted probability; AUC_XGBoost is the AUC value of the second risk prediction model, and 1_Probability is the probability with which the second risk prediction model predicts class 1; AUC_SVM is the AUC value of the third risk prediction model, and 1_Probability' is the probability with which the third risk prediction model predicts class 1. The predicted probability Predicted is taken as the prediction result of the patient's PICC thrombus risk, as shown in FIG. 5.
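One way to combine the two models from their AUC values and class-1 probabilities is an AUC-weighted average. The sketch below assumes that form purely for illustration; the expression of the prediction probability function is not reproduced in the text above, so the weighting, the helper name fuse and the numeric values are all assumptions.

```python
import numpy as np


def fuse(auc_x, proba_x, auc_y, proba_y):
    """AUC-weighted average of two models' class-1 probabilities (assumed form)."""
    proba_x = np.asarray(proba_x, dtype=float)
    proba_y = np.asarray(proba_y, dtype=float)
    return (auc_x * proba_x + auc_y * proba_y) / (auc_x + auc_y)


# Hypothetical AUC values and class-1 probabilities for two patients.
predicted = fuse(0.95, [0.80, 0.10], 0.93, [0.60, 0.30])
```

With this form, the model with the higher AUC contributes slightly more to the fused probability, and the result always lies between the two input probabilities.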

The PICC thrombus risk prediction method based on machine learning provided by the invention is realized based on Python, and the specific code realization process is as follows:

Step 1.1: The project data are derived from clinical medical data, and data preprocessing is needed because the data are incomplete. First, the Excel data are imported using Python's pandas library and named data, i.e. data = pd.read_excel(""). Then, taking "white blood cells" as an example, the missing values of the features "white blood cells", "neutrophils", "hemoglobin" and "PLT" are filled using data['white blood cells'].fillna(data['white blood cells'].mean(), inplace=True), with .mode() used in place of .mean() when the mode is preferred. Since mode() returns a Series, its first element, .mode()[0], is the value to pass to fillna.

Step 1.2: for data 'C reactive protein', 'plasma prothrombin time', 'INR', 'activated partial prothrombin time', 'plasma fibrinogen' and 'D-2 polymer' containing a missing value, a random forest algorithm is used for missing value filling, for example 'C reactive protein', the data are divided into two types, namely, the data containing the missing value and the data without the missing value, the data without the missing value are stored in X, the data needing to be filled in the missing value are stored in y, the data are trained by the random forest algorithm, a predicted value is obtained, and the missing value is filled in. The usage code is as follows: the nown _ data1 ═ data [ data.c reactive protein ], the notnull () function can find data that does not contain missing values, and store it in the nown _ data 1; the unknown _ data1 is data [ data.c. reactive protein ], and the isnull () function can find data containing missing values and store it in the unknown _ data 1; array (known _ data1[ ' C-reactive protein ' ], dtype ═ float), known _ data1[ ' C-reactive protein ', ' plasma prothrombin time ', ' INR ', ' activated partial prothrombin time ', ' plasma fibrinogen ', ' D-2 mer ', ' thrombus ', ' axis ═ 1) data containing C-reactive protein was put into y, and since training data in random forests could not contain missing values, data originally of known _ data1 was deleted for ' C-reactive protein ', ' plasma prothrombin time ', ' INR ', ' activated partial prothrombin time ', ' plasma fibrinogen ', ' D-2 mer ', ' thrombus ', which were put into X: arm (knock _ data1, dtype ═ float); finally, using random forest to train the data in the skearn _ ensemble, wherein rfr is random forest (random _ state is 0, n _ estimators is 2000, and n _ blobs is-1); where random _ state is the seed used by the random number generator, and n _ estimators is the maximum number of iterations of the weak learners, or the maximum number of weak learners, and is 10 by 
default. N _ estimators are generally too small to be under-fit, and too large to be over-fit, and a moderate number is generally chosen, n _ jobs ═ -1b indicates that all processors are used, and rfr.fit (X, y) trains X, and y using a random forest; prediction data is then filled in with the deletion value predicted ═ rfr.predicted (np. array (unknown _ data1.drop ([ ' C-reactive protein ', ' plasma prothrombin time ', ' INR ', ' activated partial prothrombin time ', ' plasma fibrinogen ', ' D-2 mer ', ' thrombus ', ' axis ═ 1))); data.loc [ (data.c-reactive protein., 'C-reactive protein' ] — predicted, rfr.predicted () is data that calls a previous random forest training, and complements a function unknown _ data1 that contains a missing value.

The code implementation of step 2 is as follows:

Step 2.1.1: First, the Excel data are imported using Python's pandas library and named data: data = pd.read_excel('', encoding='utf_8', index_col=0), where encoding sets the character encoding (UTF-8, which supports Chinese text) and index_col=0 uses the first column as the index. The data to be trained are divided into data_x and data_y, which hold the features from 'sex' through 'vein placement' and the 'thrombus' label, respectively: data_x = data.loc[:, 'sex':'vein placement']; data_y = data.loc[:, 'thrombus']. data_x is then standardized (made dimensionless): data_x = scale(data_x.values). Principal component analysis is performed on all the data in order to reduce the effect of the growing number of features on the model: the 31 features in data_x, from which 'thrombus' has been removed, are analyzed using the PCA module of sklearn.
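The splitting and standardization described in step 2.1.1 can be illustrated with a small in-memory table instead of the Excel file. The four columns and their values below are hypothetical; the real table has many more features between 'sex' and 'vein placement'.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical miniature of the patient table.
data = pd.DataFrame({
    "sex": [0, 1, 1, 0, 1],
    "age": [45.0, 62.0, 58.0, 71.0, 39.0],
    "vein placement": [1, 2, 1, 3, 2],
    "thrombus": [0, 1, 0, 1, 0],
})

# Label-based slicing with .loc includes both end columns.
data_x = data.loc[:, "sex":"vein placement"]
data_y = data.loc[:, "thrombus"]

# scale() centres each column to zero mean and unit variance.
data_x = scale(data_x.values)
```

Because every column ends up with zero mean and unit variance, features measured in different units contribute on a comparable scale to PCA, logistic regression and the SVM.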

Step 2.1.2: The principal component analysis result is plotted in order to find the optimal number of components. The cumulative proportion of explained variance is drawn with plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=3), where plt.plot() is the plotting function of the matplotlib.pylab library and np.cumsum() computes the cumulative sum of the elements along an axis and returns an array of the intermediate results; plt.xlabel('principal component number'); plt.ylabel('cumulative explained variance'); plt.grid(True). The x-axis represents the number of principal components and the y-axis the cumulative explained variance, and the figure is displayed using plt.
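The quantity plotted in step 2.1.2 can also be inspected numerically. The sketch below uses synthetic correlated data with 31 features (an assumption; the patient data are not available) and finds the smallest number of components whose cumulative explained variance reaches 85%. The name n_components_85 is introduced here for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 31 correlated features (hypothetical).
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 31))
data_x = scale(latent @ mixing + 0.5 * rng.normal(size=(200, 31)))

pca = PCA().fit(data_x)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance
# reaches 85%; searchsorted returns the first index with value >= 0.85.
n_components_85 = int(np.searchsorted(cumulative, 0.85) + 1)
```

Passing cumulative to plt.plot would reproduce the kind of curve described in the text; the threshold search simply reads the same curve without a figure.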

From the figure above, about 85% of the information in the original data is retained when the number of principal components is 20, so values between 20 and 30 are considered. Principal component analysis inevitably loses some information when it reduces the number of features, whereas logistic regression is sensitive to collinearity among the features; the number of principal components is therefore selected by traversal, training with each value from 20 to 30 in turn. data_x and the label data_y are randomly split into training data and test data using train_test_split from the sklearn.model_selection library: x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.2, random_state=9), where test_size is the proportion of the sample data used as test data and random_state fixes the random number seed so that the same split is obtained on every run. The model is then trained using logistic regression from sklearn.linear_model: model = LogisticRegression(penalty='l2', class_weight='balanced', multi_class='multinomial'); model.fit(x_train, y_train). penalty selects the regularization term, and its choice affects the choice of the loss-function optimization algorithm, that is, the solver parameter: with L2 regularization, four solvers can be selected, namely newton-cg, lbfgs, liblinear and sag. class_weight is the class weight parameter; when class_weight is 'balanced', the class weights are computed as n_samples / (n_classes * np.bincount(y)), where n_samples is the number of samples, n_classes is the number of classes, and np.bincount(y) gives the number of samples in each class; for example, for y = [1, 0, 0, 1, 1], np.bincount(y) is [2, 3]. multi_class selects the classification mode.
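The 'balanced' class weight formula can be checked directly on the example label vector from the text. Nothing beyond NumPy is assumed here; the variable names counts and weights are chosen for illustration.

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1])
n_samples = len(y)
n_classes = len(np.unique(y))

# Number of samples per class: two zeros and three ones.
counts = np.bincount(y)

# Balanced weights: the rarer class 0 gets the larger weight.
weights = n_samples / (n_classes * counts)
```

Here class 0 receives weight 5 / (2 * 2) = 1.25 and class 1 receives 5 / (2 * 3), about 0.83, so errors on the minority class cost more during training.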

Step 2.1.3: After model training, the F1 index, accuracy, recall and AUC are calculated using f1_score, accuracy_score, recall_score and roc_auc_score from the sklearn.metrics library: f1_score(y_test, model.predict(x_test)), accuracy_score(y_test, model.predict(x_test)), recall_score(y_test, model.predict(x_test)) and roc_auc_score(y_test, model.predict(x_test)) give the four indexes of the model. Among 20 to 30 principal components, the model performs best when the number of principal components is 21, with an F1 index of 0.6571, an accuracy of 0.8095, a recall of 0.8214 and an AUC of 0.8138.
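The four metric calls can be tried on a tiny hand-made example. The label vectors below are hypothetical and unrelated to the reported results; as in the text, the AUC is computed from hard predictions rather than from predicted probabilities.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

y_test = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0]   # hypothetical hard predictions

# Two true positives, one false positive, one false negative, two true negatives.
metrics = {
    "f1": f1_score(y_test, y_pred),
    "accuracy": accuracy_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_pred),
}
```

With these particular vectors all four values equal 2/3. With hard predictions the AUC reduces to the mean of sensitivity and specificity; passing predicted probabilities to roc_auc_score would generally give a different value.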

Step 2.2.1: First, the Excel data are imported using Python's pandas library and named data: data = pd.read_excel('', encoding='utf_8', index_col=0), where encoding sets the character encoding (UTF-8, which supports Chinese text) and index_col=0 uses the first column as the index. The data to be trained are divided into data_x and data_y, which hold the features from 'sex' through 'vein placement' and the 'thrombus' label, respectively: data_x = data.loc[:, 'sex':'vein placement']; data_y = data.loc[:, 'thrombus']. data_x is then standardized (made dimensionless): data_x = scale(data_x.values).

Step 2.2.2: SMOTE from the imblearn.over_sampling library is used to balance the training data set. The sampler is assigned to over_samples: over_samples = SMOTE(random_state=84), where random_state fixes the random number seed so that the program produces the same result on every run. data_x and data_y are resampled with over_samples to obtain over_samples_X and over_samples_y. The resampled data over_samples_X and over_samples_y are randomly split into training data and test data using train_test_split from the sklearn.model_selection library: x_train, x_test, y_train, y_test = train_test_split(over_samples_X, over_samples_y, test_size=0.3, random_state=36), where test_size is the proportion of the sample data used as test data and random_state fixes the random number seed so that the same split is obtained on every run.
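The role of random_state in train_test_split can be seen in a short sketch: two calls with the same seed return the same partition. The toy arrays are assumptions; only the test_size and random_state values come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # 20 toy samples with 2 features
y = np.array([0, 1] * 10)

# Two calls with the same seed yield identical partitions.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=36)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=36)

same_split = np.array_equal(a_test, b_test)
```

Fixing the seed makes the reported metrics reproducible; changing it would produce a different split and, in general, slightly different scores.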

Step 2.2.3: The parameters C and gamma of the SVM are optimized using a genetic algorithm. Using geatpy, a Python-based genetic algorithm library, a custom problem interface is first defined and the data are imported in the program MyProblem. The soea_SGA_templet function in geatpy is then called directly to obtain the result. Fig. 3 shows the iterative process of the genetic algorithm. It can be seen that the population moves closer and closer to the optimal value as the number of iterations increases, and that the population is optimal and gradually stabilizes once the number of generations exceeds 5. Finally, the values of the C and gamma parameters of the SVC module are output, providing the parameters needed to construct the SVM model.
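The idea of searching C and gamma with a genetic algorithm can be sketched without geatpy. The block below is a hand-written, simplified genetic algorithm (truncation selection, uniform crossover, Gaussian mutation), not the geatpy template used in the text; the synthetic data set, the search bounds, the population size and the fitness function are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=8, random_state=0)
rng = np.random.default_rng(0)

# Search box for (C, gamma); the bounds are illustrative assumptions.
low = np.array([0.1, 1e-3])
high = np.array([100.0, 1.0])


def fitness(individual):
    """Mean 3-fold cross-validated accuracy of an RBF SVM."""
    c, gamma = individual
    model = SVC(kernel="rbf", C=c, gamma=gamma, class_weight="balanced")
    return cross_val_score(model, X, y, cv=3).mean()


population = rng.uniform(low, high, size=(8, 2))
best, best_score = None, -np.inf
for generation in range(5):
    scores = np.array([fitness(ind) for ind in population])
    if scores.max() > best_score:
        best_score, best = scores.max(), population[scores.argmax()].copy()
    # Selection: keep the better half as parents.
    parents = population[np.argsort(scores)[-4:]]
    # Crossover: each child takes every gene from one of two random parents.
    pairs = rng.integers(0, 4, size=(8, 2))
    mask = rng.random((8, 2)) < 0.5
    children = np.where(mask, parents[pairs[:, 0]], parents[pairs[:, 1]])
    # Mutation: small Gaussian perturbation, clipped to the search box.
    children = children + rng.normal(0.0, 0.05, size=(8, 2)) * (high - low)
    population = np.clip(children, low, high)

best_C, best_gamma = best
```

The pair (best_C, best_gamma) would then be passed to SVC, in the same way that the optimized values are used in the following step.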

Step 2.2.4: The model is trained using SVC from the sklearn.svm library: model = SVC(kernel='rbf', gamma=0.156312, C=45, class_weight='balanced'); model.fit(x_train, y_train). kernel sets the kernel function to the radial basis function (RBF), and gamma is the parameter of that kernel once the RBF function has been selected. gamma implicitly determines the distribution of the data after mapping to the new feature space: the larger gamma is, the fewer the support vectors, and the smaller gamma is, the more the support vectors; the number of support vectors influences the speed of training and prediction. C is the penalty coefficient: a small value of C reduces the penalty for misclassification, tolerating errors by treating misclassified points as noise, so the generalization ability is stronger. class_weight is the weight of each class and can be passed as a dictionary.

Step 2.2.5: After model training is completed, the F1 index, accuracy, recall and AUC of the model are calculated using f1_score, accuracy_score, recall_score and roc_auc_score from the sklearn.metrics library: f1_score(y_test, model.predict(x_test)), accuracy_score(y_test, model.predict(x_test)), recall_score(y_test, model.predict(x_test)) and roc_auc_score(y_test, model.predict(x_test)) give the four indexes of the model. The F1 index is 0.9534, the accuracy is 0.9532, the recall is 0.9432 and the AUC is 0.9534.

Step 2.3.1: First, the Excel data are imported using Python's pandas library and named data: data = pd.read_excel('', encoding='utf_8', index_col=0), where encoding sets the character encoding (UTF-8, which supports Chinese text) and index_col=0 uses the first column as the index. The data to be trained are divided into data_x and data_y, which hold the features from 'sex' through 'vein placement' and the 'thrombus' label, respectively: data_x = data.loc[:, 'sex':'vein placement']; data_y = data.loc[:, 'thrombus']. data_x is then standardized (made dimensionless): data_x = scale(data_x.values).

Step 2.3.2: SMOTE from the imblearn.over_sampling library is used for oversampling. SMOTE is in essence an interpolation algorithm: two samples of the minority class are randomly selected, and a point on the line segment connecting them is taken as new data. The ratio of the two classes in the raw data is about 7:3, and the SMOTE algorithm is used to change it to 1:1, balancing the training data set. The sampler is assigned to over_samples: over_samples = SMOTE(random_state=84), where random_state fixes the random number seed so that the program produces the same result on every run. data_x and data_y are resampled with over_samples to obtain over_samples_X and over_samples_y. The resampled data over_samples_X and over_samples_y are randomly split into training data and test data using train_test_split from the sklearn.model_selection library: x_train, x_test, y_train, y_test = train_test_split(over_samples_X, over_samples_y, test_size=0.32, random_state=9), where test_size is the proportion of the sample data used as test data and random_state fixes the random number seed so that the same split is obtained on every run.
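The interpolation idea described above can be sketched in a few lines of NumPy. This is a simplified stand-in that follows the wording of the text (two randomly chosen minority samples), not the imblearn implementation, which interpolates between a sample and one of its nearest minority-class neighbours. The function name smote_like and the synthetic 70/30 data are assumptions.

```python
import numpy as np


def smote_like(minority, n_new, rng):
    """Create n_new synthetic points on segments between random minority pairs."""
    i = rng.integers(0, len(minority), size=n_new)
    j = rng.integers(0, len(minority), size=n_new)
    t = rng.random((n_new, 1))   # position along each segment, in [0, 1)
    return minority[i] + t * (minority[j] - minority[i])


rng = np.random.default_rng(84)
majority = rng.normal(0.0, 1.0, size=(70, 2))
minority = rng.normal(3.0, 1.0, size=(30, 2))

# Raise the minority class from 30 to 70 samples, i.e. from 7:3 to 1:1.
synthetic = smote_like(minority, len(majority) - len(minority), rng)
balanced_minority = np.vstack([minority, synthetic])
```

Every synthetic point lies on a segment between two existing minority samples, so it stays inside the region already occupied by that class.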

Step 2.3.3: The model is trained using XGBClassifier from the xgboost library: model = XGBClassifier(objective='binary:logitraw'), where objective is the objective function, for which the logitraw objective for binary classification is chosen; model.fit(x_train, y_train) trains the model.

Step 2.3.4: After model training is completed, the F1 index, accuracy, recall and AUC of the model are calculated using f1_score, accuracy_score, recall_score and roc_auc_score from the sklearn.metrics library: f1_score(y_test, model.predict(x_test)), accuracy_score(y_test, model.predict(x_test)), recall_score(y_test, model.predict(x_test)) and roc_auc_score(y_test, model.predict(x_test)) give the four indexes of the model. The F1 index is 0.9103, the accuracy is 0.9090, the recall is 0.9013 and the AUC is 0.9093.

Step 3 comprises the following steps:

Step 3.1: First, the Excel data are imported using Python's pandas library and named data through pd.read_excel, with the encoding set to UTF-8 (which supports Chinese text) and index_col set to 0, that is, with the first column used as the index. The data to be trained are divided into data_x and data_y, which hold the features from 'sex' through 'vein placement' and the 'thrombus' label, respectively: data_x = data.loc[:, 'sex':'vein placement']; data_y = data.loc[:, 'thrombus']. data_x is then standardized with scale(), the standardization function used for dimensionless processing: data_x = scale(data_x.values).

Step 3.2: The model trained in step 5 is saved with model.save_model('xgb.model'). A model_XGBoost model is created using the XGBClassifier() function from the xgboost library, model_XGBoost = XGBClassifier(), and the model trained in step 5 is loaded with model_XGBoost.load_model('xgb.model'). SMOTE from the imblearn.over_sampling library is then used to balance the training data set. The sampler is assigned to over_samples: over_samples = SMOTE(random_state=84), where random_state fixes the random number seed so that the program produces the same result on every run. data_x and data_y are resampled with over_samples to obtain over_samples_X and over_samples_y. The resampled data over_samples_X and over_samples_y are randomly split into training data and test data using train_test_split from the sklearn.model_selection library: x_train, x_test, y_train, y_test = train_test_split(over_samples_X, over_samples_y, test_size=0.3, random_state=36), where test_size is the proportion of the sample data used as test data and random_state fixes the random number seed so that the same split is obtained on every run.

Step 3.3: The model is trained using SVC from the sklearn.svm library: model = SVC(kernel='rbf', gamma=0.156312, C=45, class_weight='balanced', probability=True); model.fit(x_train, y_train). kernel sets the kernel function to the radial basis function (RBF), and gamma is the parameter of that kernel once the RBF function has been selected. gamma implicitly determines the distribution of the data after mapping to the new feature space: the larger gamma is, the fewer the support vectors, and the smaller gamma is, the more the support vectors. The number of support vectors influences the speed of training and prediction. C is the penalty coefficient: a small value of C reduces the penalty for misclassification, tolerating errors by treating misclassified points as noise, so the generalization ability is stronger. class_weight is the weight of each class and can be passed as a dictionary; probability specifies whether probability estimates are enabled.
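Setting probability=True is what makes the class-1 probability available for the fusion in the next step. The sketch below reuses the SVC parameters from the text on a synthetic data set (an assumption); the names model_svm and p_positive are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=36
)

model_svm = SVC(kernel="rbf", gamma=0.156312, C=45,
                class_weight="balanced", probability=True)
model_svm.fit(x_train, y_train)

# Column 1 of predict_proba holds the estimated probability of class 1.
p_positive = model_svm.predict_proba(x_test)[:, 1]
```

Without probability=True, SVC exposes only hard labels and decision-function values, and predict_proba is not available.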

Step 3.4: A prediction function predicted_probability(x_predicted) is created, where x_predicted holds the features of the object to be predicted. The data to be predicted are first standardized with scale. The probability is then calculated by the following rule: the probability that XGBoost predicts 1 is multiplied by the AUC value of the XGBoost model, the probability that the SVM predicts 1 is multiplied by the AUC value of the SVM model, and the sum of the two products is divided by the sum of the AUC values of the XGBoost and SVM models to obtain the probability of the fused prediction model: P = (P_XGB × AUC_XGB + P_SVM × AUC_SVM) / (AUC_XGB + AUC_SVM).
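The fusion rule amounts to an AUC-weighted average of the two class-1 probabilities, which a short function makes concrete. The function name fused_probability and the two example probabilities are assumptions; the AUC values are the ones reported in the text for the XGBoost and SVM models.

```python
def fused_probability(p_xgb, auc_xgb, p_svm, auc_svm):
    """AUC-weighted average of two predicted probabilities of class 1."""
    return (p_xgb * auc_xgb + p_svm * auc_svm) / (auc_xgb + auc_svm)


# AUC values as reported for the two models; the probabilities are hypothetical.
risk = fused_probability(p_xgb=0.80, auc_xgb=0.9093, p_svm=0.60, auc_svm=0.9534)
```

The fused value always lies between the two input probabilities and leans towards the model with the higher AUC, so the better-performing model has slightly more influence on the final risk estimate.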

The above embodiments only illustrate the design idea and features of the present invention; their purpose is to enable those skilled in the art to understand the content of the present invention and implement it accordingly, and the protection scope of the present invention is not limited to these embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims

1. A PICC thrombus risk prediction method based on machine learning is characterized by comprising the following steps:

2. The method of predicting risk of PICC thrombosis according to claim 1, wherein the method of obtaining a PICC thrombosis risk prediction result in step 3 comprises:

step 3.2, dividing the data _ x and the data _ y processed by the SMOTE algorithm into a training set and a testing set, and respectively inputting the training sets into two risk prediction models with optimal effects; respectively obtaining the AUC values and the predicted probability of 1 of the two risk prediction models;

3. The method for predicting thrombus risk of PICC according to claim 1, wherein the 10 features to be filled in the missing values are classified into two types, wherein the leukocyte, neutrophil, hemoglobin, PLT belong to the first type of features, and the C-reactive protein, plasma prothrombin time, INR, activated partial prothrombin time, plasma fibrinogen, D-2 mer belong to the second type of features.

4. The PICC thrombus risk prediction method of claim 3, wherein for the missing corresponding values of the first type features, the mode or average of each feature in the first type features is extracted separately from all data, and the missing values of leukocytes, neutrophils, hemoglobin, PLT are filled up with the mode or average.

5. The PICC thrombus risk prediction method of claim 3, wherein for the second type of feature missing corresponding values, a random forest algorithm is used to fill the missing values.

6. The PICC thrombus risk prediction method of claim 1, 2, 3, 4 or 5, wherein the features are preprocessed by standardizing the filled data so that the data are in a uniform format.

7. The PICC thrombus risk prediction method of claim 6, wherein the plurality of risk prediction models constructed in step 2 include a first risk prediction model consisting of principal component analysis and logistic regression, a second risk prediction model consisting of the SMOTE and XGBoost algorithms, and a third risk prediction model consisting of SVM, SMOTE, and genetic algorithms.

8. The PICC thrombus risk prediction method of claim 7, wherein a first risk prediction model is constructed by:

9. The PICC thrombus risk prediction method of claim 7, wherein a second risk prediction model is constructed by:

10. The PICC thrombus risk prediction method of claim 7, wherein a third risk prediction model is constructed by: