CN117198508A - Disease auxiliary prediction system based on S-NStackingV balance optimization integrated framework


Info

Publication number: CN117198508A
Application number: CN202311160717.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 王旭春, 仇丽霞, 崔宇, 乔宇超, 任浩
Applicant/Assignee: Shanxi Medical University
Legal status: Pending


Abstract

The invention relates to disease prediction for hepatic encephalopathy (HE), in particular to a disease auxiliary prediction system based on an S-NStackingV balance optimization integration framework. In S-NStackingV, S denotes the SMOTE algorithm and several of its improved variants; N denotes NSGA-II, the selection-integration part, in which a multi-objective optimization method selects a subset of base learners to participate in integration so that the selected base learners lead the Stacking ensemble to a better solution; V denotes a voting strategy that performs a secondary integration over multiple meta-classifiers, avoiding the unstable performance of a single meta-classifier. The classification result output by the ensemble model serves only as a reference index for clinicians in identifying HE; the clinician diagnoses and evaluates HE from multiple angles, combining the patient's neuropsychiatric manifestations, imaging examinations, laboratory tests of liver function, and the like.

Description

Disease auxiliary prediction system based on S-NStackingV balance optimization integrated framework
Technical Field
The invention relates to an auxiliary prediction method for hepatic encephalopathy, in particular to a disease auxiliary prediction system based on an S-NStackingV balance optimization integrated framework.
Background
Hepatic encephalopathy (hepatic encephalopathy, HE), also known as hepatic coma, is a neuropsychiatric complication resulting from acute or chronic liver failure. Its clinical manifestations are complex and varied, and it is classified into overt HE and covert HE. Overt HE has obvious clinical manifestations, mainly reversible impairment of motor function, cognition, emotion/affect regulation and behavior patterns; covert HE has no obvious clinical symptoms and can be detected only by neuropsychological tests [1-2].
HE is a very serious complication of cirrhosis, one of the most common causes of death across liver diseases, and carries a low long-term survival rate. Most patients with cirrhosis develop HE to varying degrees at some stage of the disease course, with an incidence of up to 80%, and the rate of concurrent HE increases with the degree of liver function damage [1]. The incidence of concurrent HE in cirrhosis patients abroad is at least 30-45% [3-4], and more than 60% of cirrhosis patients have had a history of mild HE [5]. Domestic survey data show that the incidence of HE in China ranges from 10% to 50%, with the incidence of minimal HE reaching 39.9% [1]. Studies show that the survival rate of cirrhosis patients complicated with HE is markedly lower than that of patients with cirrhosis alone [6]. Other studies show that after concurrent HE, 1-year survival in patients with chronic liver disease falls to 42% and 3-year survival is only 23% [7].
Since HE usually occurs on the basis of an existing liver disease, its clinical manifestation often carries the character of the original liver disease and is related to the degree of hepatocyte damage and to precipitating factors, which makes it complex and diverse [1]. Moreover, HE patients with severely impaired liver function often also experience gastrointestinal bleeding, jaundice, hepatorenal syndrome, cerebral edema and various infections, which make the clinical manifestations of HE more occult and difficult to detect in clinical examinations [8]. There is still no clinical "gold standard" for the diagnosis and identification of HE; to identify HE, a clinician can only diagnose and evaluate it from multiple angles, such as abnormal neuropsychiatric manifestations, imaging examinations and laboratory tests of liver function. For covert HE, which shows no obvious clinical signs, a patient's cognitive dysfunction can be identified with precise neurophysiological or psychological examinations; however, neuropsychological tests take a long time, involve a complicated process, and are easily influenced by the patient's education level, cultural background and disease state, while neurophysiological detection suffers from limited accuracy, complex instrumentation and restricted clinical applicability [9,10]. At present, the treatment of hepatic encephalopathy is complex and its effect unsatisfactory, and finding and reducing the precipitating factors remains the basis for preventing, controlling and treating HE. It is therefore important to recognize the population at high risk of hepatic encephalopathy as early as possible, take targeted prevention and control measures, and reduce the harm of HE.
In recent years, with the popularization and rapid development of health informatization, medical big data such as electronic medical records, genome data and proteome data have become main components of China's integrated medical information system [11]. How to effectively mine valuable information from massive medical big data to assist clinical application has become a research hotspot in the biomedical field and an increasingly prominent need [12]. Researchers at home and abroad have focused on exploring the factors that induce hepatic encephalopathy and the indexes most valuable for predicting its occurrence. In 2001, Garc MS et al. studied hepatic encephalopathy with a COX survival analysis model and found that a diagnosis of minimal hepatic encephalopathy effectively predicts the occurrence of overt hepatic encephalopathy [13]. In 2006, Takikawa Y et al. constructed a predictive model of HE for severe acute hepatitis patients using Logistic regression and found that elevated total bilirubin, age, prothrombin time and non-hepatitis-A status are risk factors for concurrent HE [14]. A 2010 investigation by Company et al. found that cirrhosis patients with mild extrapyramidal symptoms, such as facial blunting, slow movement and resting tremor, were highly likely to develop HE, so these symptoms can serve as predictors of concurrent HE in cirrhosis patients [15]. A 2013 meta-analysis of HE by Bai M et al. found that patients with a history of HE and higher Child-Pugh scores have an increased risk of HE after TIPS surgery [16]. In 2015, Jiao Yun, Wang Xunheng et al. applied a default-network Bayesian network to the discriminant analysis of minimal hepatic encephalopathy and obtained high accuracy [17]. In 2017, Liu Baorong, Fang Jiankai et al. applied an artificial neural network (ANN) to evaluate the risk factors for HE in hepatitis B-related acute-on-chronic liver failure, showing that elevated white blood cells (WBC), elevated international normalized ratio (INR), lowered hemoglobin (Hb) and advanced age are the main risk factors for concurrent HE in these patients, and that the ANN model had higher predictive value for risk factor evaluation than a Logistic regression model [18]. In 2018, Zhang Yijun applied a BP neural network optimized by particle swarm optimization to the classified prediction of cirrhosis complicated with upper gastrointestinal hemorrhage, and the results showed that the method outperformed traditional Logistic regression analysis [19]. A member of the subject team, Wei Zhen, also used Bayesian networks to study the classification and identification of factors related to cirrhosis and HE, obtaining better predictive performance than traditional Logistic regression.
In conclusion, predictive classification research on cirrhosis complicated with hepatic encephalopathy is sparse, and the methods are concentrated on single classifiers such as Logistic regression [20], neural networks [21] and naive Bayes [22], together with improvements based on them. However, owing to the complexity and inconsistency of data distributions and the biases of traditional learning algorithms, each traditional single classification algorithm has a relatively limited range of applicability, relatively low generalization ability and relatively unstable classification performance, and is also prone to overfitting. Researchers therefore proposed the concept of ensemble learning: unlike a conventional learning method, which constructs only one classifier for a given training set, ensemble learning builds a set of classifiers and integrates their outputs with a combination strategy. Because it can exploit both the diversity and the accuracy of multiple different classifiers, ensemble learning effectively improves the generalization ability of the model [23,24]. Moreover, under the same conditions, the computational cost of constructing an ensemble classifier is not much higher than that of constructing a single classifier, which is one reason ensemble learning is widely used.
The Stacking algorithm is an ensemble method with a particularly strong integration effect: the predictions produced by several different individual learners are used as the input of a next-level learning algorithm [25], and each individual learner is trained with K-fold cross-validation (K is generally 5), yielding a model that is more robust and generalizes better. The primary layer of a Stacking model usually combines several classification algorithms of different types into a heterogeneous ensemble, integrating their learning mechanisms, so that the accuracy of the heterogeneous ensemble obtained after Stacking is higher than that of any of its component learners. An empirical study comparing other common ensemble methods (5 algorithms including AdaBoost and random forest) with the Stacking algorithm on 36 real data sets and 2 simulated data sets showed that Stacking outperforms the other common ensemble methods, has stronger generalization performance and is suitable for large samples [26]. However, when too many base learners participate in the ensemble, their results become similar or their accuracy deteriorates, which harms the prediction accuracy of the ensemble model, and the excessive number of base learners reduces its running efficiency. For the prediction performance of the ensemble to be optimal, the base learners must satisfy requirements of both diversity and accuracy, and the integration strategy must combine them to best effect. In past research on the Stacking heterogeneous ensemble method, however, the base classifiers were mainly specified manually or taken from the literature; different scholars often chose different base classifiers, and comparability between studies was lacking.
References
[1] Chinese Society of Gastroenterology and Chinese Society of Hepatology, Chinese Medical Association. Consensus on the diagnosis and treatment of hepatic encephalopathy in China (2013, Chongqing) [J]. Chinese Journal of the Frontiers of Medicine, 2014, 6(2): 81-83.
[2] Ding Kai, Hu Pingfang, Xie Weifen. Diagnosis and treatment of hepatic encephalopathy [J]. Gastroenterology, 2015, 20(2): 65-66.
[3] Bismuth M, Funakoshi N, Cadranel JF, Blanc P. Hepatic encephalopathy: from pathophysiology to therapeutic management [J]. Eur J Gastroenterol Hepatol, 2011, 23(1): 8-22.
[4] Khungar V, Poordad F. Management of overt hepatic encephalopathy [J]. Clin Liver Dis, 2012, 16(1): 73-89.
[5] Wakim-Fleming J. Hepatic encephalopathy: suspect it early in patients with cirrhosis [J]. Cleve Clin J Med, 2011, 78(9): 597-605.
[6] Increased toll-like receptor 4 in cerebral endothelial cells contributes to the astrocyte swelling and brain edema in acute hepatic encephalopathy [J]. Journal of Neurochemistry, 2014, 128(6): 890-903.
[7] Jiang Huiqing, Yao Dongmei, Yao Xixian. Hepatic encephalopathy [J]. Chinese General Medicine, 2003, 6(6): 452-454.
[8] Deng Changsheng, Zhang Youcai. Diagnosis and treatment of hepatic encephalopathy [J]. Utility Medicine, 2002, 22(11): 648-651.
[9] Yang Yaqi, Zhang Zhaolan, Shi Feng, Li Mingming, Zhao Lingling. Research progress in the treatment of mild and minimal hepatic encephalopathy in Western and traditional Chinese medicine [J]. Clinical Research in TCM, 2014, (25): 142-144.
[10] Zhang Yingxue, Zhao Xinxiang, Sun Yong. Progress in the study of various magnetic resonance functional imaging techniques in mild hepatic encephalopathy [J]. Medical Review, 2016, 22(7): 1340-1342.
[11] Dai Mingfeng, Meng Qun. Opportunities and challenges facing health care big data mining and analysis [J]. Journal of Chinese Health Information Management, 2017, 14(02): 126-130.
[12] Liu Yi, Huang Zhenghang, Sai Wei, Duan Junlong. Current status and prospects of clinical medical big data research [J]. Medical Equipment, 2017, 38(03): 112-115.
[13] Garc MS, Boza F, Garcia-Valdecasas MS, Garcia E, Aguilar-Reina J. Subclinical hepatic encephalopathy predicts the development of overt hepatic encephalopathy [J]. American Journal of Gastroenterology, 2001, 96(9): 2718-2723.
[14] Takikawa Y, Endo R, Suzuki K, Omata M. Prediction of hepatic encephalopathy development in patients with severe acute hepatitis [J]. Digestive Diseases and Sciences, 2006, 51(2): 359-364.
[15] Company L, Zapater P, Pérez-Mateo M, Jover R. Extrapyramidal signs predict the development of overt hepatic encephalopathy in patients with liver cirrhosis [J]. European Journal of Gastroenterology & Hepatology, 2010, 22(5): 519-525.
[16] Bai M, Qi X, Yang Z, Han G. Predictors of hepatic encephalopathy after transjugular intrahepatic portosystemic shunt in cirrhotic patients: a systematic review [J]. Journal of Gastroenterology & Hepatology, 2011, 26(6): 943-951.
[17] Jiao Yun, Wang Xunheng, Shang Tianyu, Zhu Xiqi, Teng Gaojun. Method for discriminating minimal hepatic encephalopathy based on a Bayesian model of the default network (in English) [J]. Journal of Southeast University (English Edition), 2015, 31(04): 582-587.
[18] Liu Baorong, Fang Jiankai, Lin Minghua, Gao Haibing, Pan Chen. An artificial neural network to evaluate the risk factors for hepatic encephalopathy in hepatitis B-related acute-on-chronic liver failure [J]. Liver, 2017, 22(12): 1085-1089+1093.
[19] Zhang Yijun. Prediction of liver cirrhosis with concomitant upper gastrointestinal hemorrhage based on a particle swarm optimized BP neural network [D]. Shanxi Medical University, 2018.
[20] Hai ND, Giang NL. Anomaly detection with multinomial logistic regression and Bayesian [C]. Lecture Notes in Electrical Engineering, 2013, 240: 1129-1136.
[21] Kavzoglu T. Increasing the accuracy of neural network classification using refined training data [J]. Environmental Modelling & Software, 2009, 24(7): 850-858.
[22] Wikipedia F. Naive Bayes classifier. 2016.
[23] Dietterich TG. Machine-learning research [J]. AI Magazine, 1997.
[24] Gams M, Bohanec M, Cestnik B, editors. A schema for using multiple knowledge [C]. Conference on Learning Theory, 1994.
[25] Wolpert DH. Stacked generalization [J]. Neural Networks, 1992, 5(2): 241-259.
[26] Breiman L. Randomizing outputs to increase prediction accuracy [J]. Machine Learning, 2000, 40(3): 229-242.
Disclosure of Invention
Aiming at the problem that current research on HE predictive recognition is mostly based on traditional models and a few common homogeneous ensemble models, the invention provides a disease auxiliary prediction system based on an S-NStackingV balance optimization integration framework, which facilitates early diagnosis of patients with HE lesions and the timely adoption of effective treatment measures by medical staff.
The invention is realized by the following technical scheme: a disease auxiliary prediction system based on an S-NStackingV balance optimization integration framework, which predicts by the following method:

Data collection and preprocessing: collect and preprocess medical record data of patients with liver cirrhosis, the data comprising m variables, and divide the data into a training set, a validation set and a test set;

Feature screening: adopt multiple feature screening methods and, by voting, select the variables chosen by half or more of the screening methods as model predictors to be included in later model prediction, addressing the high dimensionality of the features in the medical record data set;

Balance optimization integration framework: construct the S-NStackingV balance optimization integration framework in three stages, I, II and III;

Stage I: base learner generation. Divide the training set following the idea of cross-validation, balance each divided training subset, and train each balanced training subset with different base learning algorithms to obtain multiple base learners;

Stage II: base learner selection. Select the base learners participating in integration from the multiple base learners with the multi-objective optimization algorithm NSGA-II;

Stage III: base learner integration. Obtain the output of a meta-classifier with the Stacking integration strategy, swap in different meta-classifiers to obtain the outputs of multiple meta-classifiers, and perform a secondary voting integration over these outputs, yielding the improved integration strategy S-NStackingV. Integrate the finally selected base learners with the S-NStackingV integration strategy, evaluate the fitness of the ensemble model with the validation set to determine the optimal set, and evaluate the performance of the optimal set with the test data.
In the above disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework, the specific process of selecting base learners participating in integration from the multiple base learners with the multi-objective optimization algorithm NSGA-II in stage II is as follows:

First, the objective functions of the multi-objective optimization algorithm are defined, specifically: the objective function $Z_1$ is the accuracy index of the ensemble model and $Z_2$ its complexity index, with $Z_1 = F1(\text{Stacking})$ and $Z_2 = n_{sc}/n_{tc}$. $Z_1$ is computed from the four quantities describing classification states: true positives TP, true negatives TN, false positives FP and false negatives FN. To construct $Z_1$, four indicator variables $(X_{i1}, X_{i2}, X_{i3}, X_{i4})$ are used to compute TP, TN, FP and FN over all data instances; they are described as $X_{i1} = I\{AL_i = PL_i = C^+\}$; $X_{i2} = I\{AL_i = PL_i = C^-\}$; $X_{i3} = I\{AL_i \ne PL_i = C^+\}$; $X_{i4} = I\{AL_i \ne PL_i = C^-\}$, where the actual positive (+) and actual negative (-) labels are denoted $C^+$ and $C^-$, $PL_i$ denotes the predicted label of the $i$-th data instance and $AL_i$ its actual label.

For the two-class problem on a data set of N instances, of which $n_1$ are positive and $n_2$ negative, TP, TN, FP and FN are obtained by summing the corresponding indicator variables over all instances.

The objective function is defined as the pair O:

$$O = \{\max Z_1,\ \min Z_2\}$$

where $n_{sc}$ is the number of base classifiers selected by NSGA-II and $n_{tc}$ the total number of base classifiers; evolution searches for the optimal minimal set of base classifiers that drives $Z_1$ to a maximum.
In the above disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework, the preprocessing of the data set is as follows: if the proportion of records missing a given variable is greater than or equal to 30% of the whole data set, that variable is deleted; records missing greater than or equal to 30% of the m variables are deleted; variables or records with a missing proportion below 30% are filled with the missForest method; the categorical variables in the data are one-hot encoded and the continuous variables normalized, thereby mitigating the influence of missing data and distribution differences on model performance.
In the above disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework, when the fitness of the ensemble model is evaluated with the validation set, a particle swarm optimization algorithm is used for hyper-parameter optimization of the ensemble model.
In the above disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework, a weighted Voting algorithm is adopted in stage III: weights are set according to the test accuracy of each meta-classifier, the class probabilities estimated by each meta-classifier are weighted and averaged, and the class with the highest score is taken as the classification result.
In the above disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework, the SMOTE method and its improved variants are used for the balancing in stage I.
The classification result output by the ensemble model serves only as a reference index for clinicians in identifying HE; the clinician diagnoses and evaluates HE from multiple angles, combining the patient's neuropsychiatric manifestations, imaging examinations, laboratory tests of liver function, and the like.
Drawings
FIG. 1 is a diagram of the feature screening strategy.
FIG. 2 is a schematic diagram of chromosome structure.
FIG. 3 is a flowchart of NSGA-II algorithm.
FIG. 4 is a diagram of the overall framework of a Stacking integrated classification model for voting by multiple meta-classifiers.
Detailed Description
A disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework predicts by the following method:
1. Data collection and preprocessing:

Data were collected from hospitalized cirrhosis patients with complete medical records in the gastroenterology department of a hospital from January 2006 through 2015. After preliminary collation, 950 cirrhosis cases were valid, including 68 cases complicated with hepatic encephalopathy and 882 uncomplicated cirrhosis cases. For each cirrhosis patient, 24 index variables were recorded, covering basic demographic information, clinical manifestations, other clinical complications and biochemical indexes; each biochemical index was taken from the first test result within 24 hours after admission. If the proportion of records missing a given variable was greater than or equal to 30% of the whole data set, the variable was deleted; records missing greater than or equal to 30% of the m variables were deleted; variables or records with a missing proportion below 30% were filled with the missForest method; the categorical variables in the data set were one-hot encoded and the continuous variables normalized.
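A minimal preprocessing sketch in Python is given below, assuming the records are held in a pandas DataFrame; the 30% thresholds follow the text, and scikit-learn's IterativeImputer with a random-forest estimator is used as a stand-in for the R missForest method (an approximation, not the identical algorithm).

```python
# Sketch only: column lists and thresholds follow the text above; the
# random-forest IterativeImputer approximates missForest, and only the
# continuous columns are imputed here for simplicity.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

def preprocess(df: pd.DataFrame, categorical_cols, continuous_cols):
    df = df.loc[:, df.isna().mean() < 0.30]            # drop variables missing >= 30%
    df = df.loc[df.isna().mean(axis=1) < 0.30].copy()  # drop records missing >= 30% of m variables
    cont = [c for c in continuous_cols if c in df.columns]
    imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                               max_iter=10, random_state=0)
    df[cont] = imputer.fit_transform(df[cont])         # missForest-style imputation
    cat = [c for c in categorical_cols if c in df.columns]
    df = pd.get_dummies(df, columns=cat)               # one-hot encode categorical variables
    df[cont] = MinMaxScaler().fit_transform(df[cont])  # normalize continuous variables
    return df
```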
2. Feature screening:

Because high dimensionality may reduce the accuracy and efficiency of a model, researchers often reduce the dimension of the data through variable screening, selecting from the original data set the feature subset with the greatest influence on the target variable HE, with the aims of improving the accuracy and interpretability of the model and reducing the computational cost and noise caused by processing irrelevant features.

Current feature screening methods fall mainly into three classes: filtering, wrapping and embedding. A filtering method selects features based on their general properties, without considering the model; a wrapping method selects a "customized" feature subset for a given classifier by using the performance of the classifier as the evaluation criterion for the subset, which is more targeted than filtering and helps improve model performance; an embedding method embeds the feature selection step into the classifier training process, which saves time compared with wrapping and hands feature selection over to the model to learn. Considering that each method has advantages and disadvantages, that a single screening method can hardly describe all correlation relations among variables accurately, and that the methods measure correlation with different indexes, 10 linear and nonlinear feature screening methods (covering filtering, wrapping and embedding) are used to select jointly from the original feature set, and the optimal feature subset is chosen by considering all 10 analysis methods together. The following feature screening strategy is established:
(1) perform feature selection on the original feature data set with 10 methods: filtering (variance filtering, maximal information coefficient (Maximal Information Coefficient, MIC), Spearman correlation coefficient), wrapping (SVM_RFE, LR_RFE) and embedding (SVM, LR, extremely randomized trees, random forest, GBDT);
(2) pool the 10 feature subsets selected by the 10 feature screening methods;
(3) record, by voting, the number of times each feature is selected;
(4) include the variables selected 4 or more times (i.e., chosen by 4 or more methods) in the model as the final predictors. See the feature screening strategy (FIG. 1) and the sketch following this list.
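A minimal sketch of this voting strategy follows, with a reduced, illustrative set of selectors standing in for the 10 methods of step (1); everything except the 4-vote cutoff is an assumption.

```python
# Sketch only: six stand-in selectors represent the ten methods of step (1);
# a feature enters the final set once chosen by >= min_votes selectors.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)

def vote_features(X: pd.DataFrame, y, min_votes=4):
    votes = pd.Series(0, index=X.columns)
    vt = VarianceThreshold().fit(X)                            # filtering
    votes[X.columns[vt.get_support()]] += 1
    for est in (LogisticRegression(max_iter=1000), LinearSVC(dual=False)):
        rfe = RFE(est, n_features_to_select=max(1, X.shape[1] // 2)).fit(X, y)  # wrapping
        votes[X.columns[rfe.get_support()]] += 1
    for est in (RandomForestClassifier(), ExtraTreesClassifier(),
                GradientBoostingClassifier()):
        est.fit(X, y)                                          # embedding
        votes[X.columns[est.feature_importances_ > est.feature_importances_.mean()]] += 1
    return votes[votes >= min_votes].index.tolist()            # final predictors
```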
3. Class imbalance handling:

Current methods for handling class imbalance mainly include resampling, cost-sensitive analysis and ensemble learning. The invention mainly adopts oversampling and comprehensive resampling algorithms for the later construction of the balanced base classifiers of the Stacking heterogeneous ensemble (base classifiers are trained on several balanced data sets produced by different oversampling methods). Regarding the oversampling and comprehensive resampling methods adopted: SMOTE is currently the most widely used oversampling algorithm. Its main idea is to add new sample points by interpolating new, previously non-existent points near similar minority samples instead of simply copying existing points, which, compared with simple random oversampling, effectively avoids overfitting, so SMOTE is adopted to address the imbalance problem. However, the new samples generated by SMOTE consider only the clustering property of the samples and neglect their distribution, so the synthesis of new samples is somewhat blind. In view of this, the invention combines SMOTE with resampling variants built around it (SMOTE-IPF, NRSBoundary-SMOTE, Safe-Level-SMOTE, KMeans-SMOTE, MWMOTE), six resampling methods in all, for the later combination.
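A sketch of the balancing step follows. imbalanced-learn provides SMOTE and KMeans-SMOTE directly; the remaining variants named above (SMOTE-IPF, NRSBoundary-SMOTE, Safe-Level-SMOTE, MWMOTE) are not in that package, and Borderline-SMOTE appears below purely as a stand-in for them.

```python
# Sketch only: each resampler yields one balanced copy of the same training
# subset; Borderline-SMOTE stands in for the variants not in imbalanced-learn.
from imblearn.over_sampling import SMOTE, KMeansSMOTE, BorderlineSMOTE

RESAMPLERS = {
    "SMOTE": SMOTE(random_state=0),
    "KMeans-SMOTE": KMeansSMOTE(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),  # placeholder for SMOTE-IPF, MWMOTE, ...
}

def balanced_views(X_train, y_train):
    # One balanced training subset per resampling method
    return {name: r.fit_resample(X_train, y_train) for name, r in RESAMPLERS.items()}
```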
4. Hyper-parameter optimization algorithm

Hyper-parameter optimization in machine learning aims to find the hyper-parameters that let a machine learning algorithm perform optimally on a validation data set. The setting of the hyper-parameters directly affects model performance. The most commonly used tuning method is grid search; however, as the number and range of tuned parameters grow, grid search becomes time-consuming and unsuitable for tuning multiple models or complex hyper-parameter spaces, so this research adopts the particle swarm optimization algorithm, a swarm intelligence optimization algorithm, for parameter optimization.
(1) Particle swarm optimization algorithm (Particle Swarm Optimization, PSO)
The PSO algorithm is a swarm intelligence optimization algorithm inspired by the foraging behavior of bird flocks and has global iterative optimization ability. It has a simple structure and good robustness, and is often used to search for optimal solutions.
In a multidimensional space, the PSO algorithm assigns each particle $x_i$ in the population S one value in each dimension, and each particle has a velocity attribute with which it updates its own values in the different dimensions toward a better direction. During iteration, the algorithm records the individual and population optima as the update direction for each individual. The algorithm flow is as follows:
step 1: the parameters of the particle population are initialized, and position attributes and velocity attributes are assigned to each particle within the population.
Step 2: the fitness value of each particle is obtained by a fitness function F, and the global optimum and the individual optimum are obtained by comparing the fitness value magnitudes.
Step 3: updating the speed and position of each particle in the population by the global optimum, expressed by equations (1) to (2), respectively:
$$v_{i(d+1)} = \omega \cdot v_{id} + c_1 r_1 (p_{id} - x_{id}) + c_2 r_2 (p_{gd} - x_{id}) \qquad (1)$$

$$x_{i(d+1)} = x_{id} + v_{i(d+1)} \qquad (2)$$

where $\omega$ is the inertia weight, which balances the local and global search abilities of the algorithm; $v_{id}$ is the velocity of particle $i$ in dimension $d$; $x_{id}$ is the position of particle $i$ in dimension $d$; $c_1$ and $c_2$ are acceleration factors, usually set to 2; $r_1$ and $r_2$ are random numbers in $[0,1]$; $p_{id}$ and $p_{gd}$ denote the individual optimum of particle $i$ and the global optimum in dimension $d$; $v_{i(d+1)}$ is the velocity of particle $i$ after being updated by the above variables; and $x_{i(d+1)}$ is the position of particle $i$ updated from the previous position $x_{id}$ and velocity $v_{i(d+1)}$.
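A compact sketch of the PSO loop implementing updates (1) and (2) follows; the fitness function, search bounds and maximization convention are placeholders to be replaced by the ensemble's validation score.

```python
# Sketch only: f is a placeholder fitness function to be maximized; bounds,
# swarm size and iteration count are illustrative assumptions.
import numpy as np

def pso(f, dim, n_particles=30, iters=100, omega=0.7, c1=2.0, c2=2.0):
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))       # positions
    v = np.zeros((n_particles, dim))                     # velocities
    p_best = x.copy()                                    # individual optima
    p_val = np.array([f(xi) for xi in x])
    g_best = p_best[p_val.argmax()].copy()               # global optimum
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = omega * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # eq. (1)
        x = x + v                                                         # eq. (2)
        val = np.array([f(xi) for xi in x])
        better = val > p_val
        p_best[better], p_val[better] = x[better], val[better]
        g_best = p_best[p_val.argmax()].copy()
    return g_best, p_val.max()
```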
5. Balance optimization integration framework (S-NStackingV ensemble learning method)
Ensemble learning has been widely applied in predictive modeling: a large number of base learners are trained and their results are combined through an integration strategy, which can effectively improve the prediction accuracy of the model. However, when too many base learners participate in the ensemble, their results become similar or their accuracy deteriorates, which harms the prediction accuracy of the ensemble model, and the excessive number of base learners reduces its running efficiency. For the prediction performance of the ensemble to be optimal, the base learners must satisfy requirements of both diversity and accuracy, and the integration strategy must combine them to best effect. In addition, the class imbalance typical of clinical data often degrades the predictive performance of classification models. The invention therefore proposes an improved ensemble learning method, S-NStackingV, where S is the SMOTE algorithm and several of its improved variants; N is NSGA-II, the selection-integration part, in which a multi-objective optimization method selects a subset of base learners to participate in integration, ensuring that the selected base learners lead the Stacking ensemble to a better solution; and V is a voting strategy that performs a secondary integration over multiple meta-classifiers, avoiding the unstable performance of a single meta-classifier. The construction proceeds as follows:

Stage I: base learner generation. Divide the training set following the idea of cross-validation and balance each divided training subset (the 6 resampling methods are applied to each divided training subset separately, i.e., one training subset corresponds to 6 balancing methods and yields 6 different balanced training subsets); train each balanced training subset with different base learning algorithms to obtain multiple base learners.

Stage II: base learner selection. Select the base learners participating in integration from the new base learner set with the multi-objective optimization algorithm NSGA-II.

Stage III: base learner integration. Obtain the output of a meta-classifier with the traditional Stacking integration strategy, swap in different meta-classifiers to obtain the outputs of multiple meta-classifiers, and perform a secondary voting integration over these outputs, yielding the improved integration strategy S-NStackingV; integrate the finally selected base learners with the S-NStackingV integration strategy.
5.1 Base learner generation
The generation of base learners is the most basic step in ensemble learning; all subsequent work builds on it. To optimize the predictive effect of the ensemble model, a large number of diverse base learners must be generated. To construct diverse base learners on a limited data set with existing classification learning algorithms, the training data set can be partitioned, and each time one training subset is used to train a base learning algorithm, yielding multiple base learners. To make full use of the sample points in the training data, the training samples are divided in the same way as 5-fold cross-validation: the training data are split into 5 mutually disjoint subsets of essentially equal size, and each group of 4 of the 5 subsets forms a training subset used to train a base learning algorithm. In addition, to address the class imbalance of the data set, a class balancing step is added to the generation of the base classifiers: each training subset is balanced before training the classification algorithm, while the validation subset is left untouched so as to obtain the classification algorithm's output.
To construct an ensemble with better prediction performance and guarantee the diversity of the base learners, base learners are built from classifiers with different algorithmic principles: LR, SVM, MLP, RF, XGBoost, CatBoost, NGBoost and LightGBM. Initially, the entire data set is randomly split into two parts in a 7:3 ratio. The 70% portion serves as training and validation sets, for model construction and for model selection via prediction error estimation, respectively; the remaining 30% serves as a test set to evaluate the generalization error of the selected model. The first portion then undergoes five-fold cross-validation, dividing the data in an 80:20 (training/validation) ratio. The five folds yield five bootstrap samples, or packets, representing five different training data sets. These five training data sets are further balanced (with the 6 resampling methods described above), producing 6 × 5 balanced training subsets, and to each of these 30 balanced training subsets the 8 base learning algorithms are applied, creating 240 base learning models; each algorithm thus builds 30 base learning models. The fitness of the ensemble model is evaluated with the validation set to determine the optimal set, and the performance of the optimal set is evaluated with the test data.
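A sketch of this stage-I generation loop follows, assuming numpy arrays and externally supplied dictionaries of the 8 algorithms and 6 resamplers; all names are illustrative.

```python
# Sketch only: 5 folds x 6 resamplers x 8 algorithms = 240 base learners;
# balancing is applied to the training subset only, never to validation data.
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

def build_base_learners(X, y, algorithms, resamplers, n_splits=5):
    """algorithms: dict name -> estimator; resamplers: dict name -> fit_resample object."""
    models = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fold, (tr, va) in enumerate(skf.split(X, y)):
        for r_name, resampler in resamplers.items():
            X_bal, y_bal = resampler.fit_resample(X[tr], y[tr])  # balance this fold's training data
            for a_name, algo in algorithms.items():
                models.append({"fold": fold, "resampler": r_name, "algorithm": a_name,
                               "model": clone(algo).fit(X_bal, y_bal), "val_idx": va})
    return models  # 240 entries for 5 folds, 6 resamplers, 8 algorithms
```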
5.2 Base learner selection
The base learners are selected from the original base learner set with an optimization algorithm. For the prediction performance of the final ensemble to be guaranteed, the base learners participating in integration must have both diversity and prediction accuracy; on the basis of preserving the accuracy of the final ensemble, the invention reduces the number of base learners as far as possible, i.e., achieves the relatively best integration effect with as few base learners as possible, so the base learner selection problem can be converted into a multi-objective optimization problem. Many multi-objective optimization algorithms exist; among them NSGA-II, which is based on non-dominated sorting and improves on NSGA, is a milestone algorithm in the field of multi-objective evolutionary optimization. Its basic idea is: (1) randomly generate an initial population of size N and, after non-dominated sorting, obtain the first generation of offspring through the three basic genetic operations of selection, crossover and mutation; (2) from the second generation onward, merge the parent and offspring populations, perform fast non-dominated sorting, compute the crowding of individuals in each non-dominated layer, and select suitable individuals to form a new parent population according to the non-dominated relations and the individual crowding; (3) generate a new offspring population by the basic genetic operations, and so on, until the termination condition is met. The invention selects the base learners with the NSGA-II algorithm. First, the objective functions of the optimization algorithm must be defined, as follows:

The objective function $Z_1$ is the accuracy index of the ensemble model and $Z_2$ its complexity index, as expressed in formulas (3) and (4):

$$Z_1 = F1(\text{Stacking}) \qquad (3)$$

$$Z_2 = n_{sc} / n_{tc} \qquad (4)$$

The multi-objective function O of the invention can then be defined as equation (5):

$$O = \{\max Z_1,\ \min Z_2\} \qquad (5)$$

$Z_1$ is computed from the four quantities describing classification states: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
To construct the objective function $Z_1$, four indicator variables $(X_{i1}, X_{i2}, X_{i3}, X_{i4})$ are used to compute TP, TN, FP and FN over all data instances. They are described as:

$$X_{i1} = I\{AL_i = PL_i = C^+\};\quad X_{i2} = I\{AL_i = PL_i = C^-\} \qquad (6)$$

$$X_{i3} = I\{AL_i \ne PL_i = C^+\};\quad X_{i4} = I\{AL_i \ne PL_i = C^-\} \qquad (7)$$

where the actual positive (+) and actual negative (-) labels are denoted $C^+$ and $C^-$, $PL_i$ is the predicted label of the $i$-th data instance and $AL_i$ its actual label. For each instance exactly one indicator equals 1, i.e., the sum of all indicator variable values equals 1:

$$X_{i1} + X_{i2} + X_{i3} + X_{i4} = 1 \qquad (8)$$

For the two-class problem on a disease data set of N samples, of which $n_1$ are positive and $n_2$ negative, the classification counts are obtained by summation over all instances:

$$TP = \sum_{i=1}^{N} X_{i1},\quad TN = \sum_{i=1}^{N} X_{i2},\quad FP = \sum_{i=1}^{N} X_{i3},\quad FN = \sum_{i=1}^{N} X_{i4} \qquad (9)$$

Thus, the objective function used in the invention is the pair O defined in equation (5), where $n_{sc}$ is the number of base classifiers selected by the NSGA-II procedure and $n_{tc}$ the total number of base classifiers.

The object of the invention is to find, by evolution, the optimal minimal set of base classifiers that drives $F1$ to a maximum. Here the number of base classifiers (NOB) represents the integration complexity of the selected optimal ensemble.
5.2.1 Multi-objective evolutionary component
Ensembles have proven more effective than individual models. However, building an ensemble requires examining two factors: (1) model selection and (2) model combination. The proposed multi-objective evolutionary component of the S-NStackingV method provides both. Training data are used to construct candidate ensembles, while validation data are used to evaluate the fitness of all candidate ensembles in each iteration of the evolutionary process. The model selection and model combination methods are detailed below.
5.2.1.1 NSGA-II based model selection
This section describes the NSGA-II algorithm used to optimize the number of base classifiers for classification of the HE data set so as to maximize $Z_1$. In the method of the invention, model selection is treated as a bi-objective optimization problem: searching the generated ensemble models for the ensemble with the best accuracy and the minimum number of base classifiers. The optimization process aims to find the best combination of base learners, so a binary encoding scheme is adopted when encoding the base learning models. Each bit indicates whether the corresponding model is selected: a "1" means the corresponding model is selected and a "0" means it is not. The length of the encoding therefore equals the number of base learners. Models are selected so that the first objective function ($F1$) is maximized and the second (NOB) is minimized.
In recent years, evolutionary algorithms have mainly been used to solve various multi-objective optimization problems and are then known as multi-objective evolutionary algorithms (MOEA). NSGA-II [16] is one of the most popular and effective MOEA techniques. By examining the space of candidate solutions, it generates a Pareto-optimal set rather than a single solution, finding non-dominated solutions through the two mechanisms of diversity preservation and elitism. The algorithm adopts fast non-dominated sorting and crowding-distance computation in place of the sharing parameter required in the original NSGA; the two techniques are used to rank solutions and compute the Pareto front of the overall non-dominated solution set. The offspring of the current population and the previous generation are then combined with crossover, mutation and selection operators to generate a new population, and finally solutions are selected with regard to diversity and non-dominance. NSGA-II has repeatedly served as a benchmark algorithm in optimization. Since model selection is considered a combinatorial optimization problem, NSGA-II is used here to generate a suitable ensemble.
Many real-world situations can be modeled as single-objective or multi-objective optimization problems (MOOPs). MOOPs involve multiple objectives that must be optimized simultaneously. Since these objectives usually conflict, progress on one objective can only be achieved at the cost of degrading at least one other, so the best compromise among the competing objectives is generally sought. For such problems the optimal solutions are defined as Pareto-optimal solutions, i.e., the non-dominated solution set of the entire feasible decision space. These solution sets are considered optimal in that, taking all objectives into account, no other solution is better than them; the objective values corresponding to the Pareto-optimal solutions constitute the Pareto-optimal front. In this study, NSGA-II is used to optimize two objectives: accuracy and the number of base classifiers. The three key steps of the NSGA-II algorithm are as follows:
1. Based on the concept of dominance, fronts are generated by comparing decision vectors in the objective space: decision vector x dominates decision vector y (denoted x ≺ y) if and only if

$$\forall i \in \{1,\dots,M\}: f_i(x) \le f_i(y) \quad \text{and} \quad \exists i \in \{1,\dots,M\}: f_i(x) < f_i(y) \qquad (12)$$

i.e., x is no worse than y on all objectives and strictly better on at least one, where $f_i(x)$ denotes the value of the $i$-th objective function for decision vector x and M the number of objectives.

2. The crowding distance $d_i$ of a decision vector is computed from its two nearest neighbors $i-1$ and $i+1$ on the same front. The boundary solutions of each front are assigned an infinite distance, and for interior solutions the normalized distances between the neighbors are summed over the objectives:

$$d_1 = d_N = \infty \qquad (13)$$

$$d_i = \sum_{m=1}^{M} \frac{f_m(i+1) - f_m(i-1)}{f_m^{\max} - f_m^{\min}}, \quad 1 < i < N \qquad (14)$$
in the evolutionary algorithm, each chromosome represents a coded solution in the search space. To search for the optimal set of 240 base learners by NSGA-II, a fitness function is defined by encoding the chromosome and the search space is explored with genetic operators, creating a given problem. The chromosomes are encoded in a 48-set 5-bit string format, each set being associated with a resampling method and a classifier, one bit representing a data packet. The presence of a 1 in a bit indicates that the appropriate resampling method, classifier and associated data packet are used. For example, in FIG. 2, the first group contains 1, representing an SVM classifier balanced using the SMOTE method, and D1-5 are the data packets associated therewith.
The chromosome structure used here thus has gene size 5 (data packets) and gene dimension 48 (resampling method-classifier pairs), giving a binary-coded string of length 240. Each dimension of the chromosome corresponds to one data packet-resampling method-classifier triple and indicates its presence in or absence from the ensemble.
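A sketch of how the two objectives might be evaluated for one chromosome follows, assuming the 240 base learners' 0/1 validation predictions are precomputed; the Stacking combiner is simplified here to majority voting, which the actual method replaces with a trained meta-classifier.

```python
# Sketch only: mask is a 240-bit chromosome; Z1 approximates F1(Stacking) via a
# simplified majority-vote combiner, Z2 = n_sc / n_tc is the complexity term.
import numpy as np
from sklearn.metrics import f1_score

def objectives(mask, base_preds, y_val):
    """mask: 0/1 array of length n_tc; base_preds: (n_tc, n_val) 0/1 predictions."""
    n_tc = len(mask)
    n_sc = int(mask.sum())
    if n_sc == 0:
        return 0.0, 1.0                                   # empty ensemble: worst case
    combined = (base_preds[mask.astype(bool)].mean(axis=0) >= 0.5).astype(int)
    z1 = f1_score(y_val, combined)                        # to be maximized
    z2 = n_sc / n_tc                                      # to be minimized
    return z1, z2
```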
NSGA-II algorithm pseudo code table
5.3 Base learner integration
In the Stacking-based ensemble classification model described above, the meta-classifier layer uses only one meta-classifier, which is also the most common arrangement in Stacking models. However, when only one meta-classifier makes the final decision, the particularity of that classifier may make the classification effect unstable. To reduce the deviation a single meta-classifier may show when diagnosing on imbalanced monitoring data, this section proposes to build, on top of the optimized ensemble, a Stacking classification model with multiple meta-classifiers (Multi-Meta-Classifier Stacking, MMC-Stacking): LR, MLP and LightGBM are selected as the meta-classifiers of the meta layer, forming a symmetric two-layer stacked framework; the final diagnosis is obtained by voting over the classification results of the 3 meta-classifiers, and the diversity of the meta layer further strengthens the stability of the whole ensemble classification model. The overall framework of the Stacking ensemble classification model with voting by multiple meta-classifiers is shown in FIG. 4.
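A sketch of this MMC-Stacking arrangement with scikit-learn follows, assuming lightgbm is installed and the NSGA-II-selected base learners are supplied as (name, estimator) pairs; hyper-parameters are placeholders.

```python
# Sketch only: three Stacking models sharing the selected base learners but
# differing in meta-classifier (LR, MLP, LightGBM), combined by soft voting.
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier

def mmc_stacking(selected_base_learners, weights=None):
    """weights: optional per-meta-classifier weights, e.g. their test accuracies."""
    metas = [("lr", LogisticRegression(max_iter=1000)),
             ("mlp", MLPClassifier(max_iter=500)),
             ("lgbm", LGBMClassifier())]
    stacks = [(name, StackingClassifier(estimators=selected_base_learners,
                                        final_estimator=meta, cv=5))
              for name, meta in metas]
    # Soft voting averages the three meta-classifiers' class probabilities
    return VotingClassifier(estimators=stacks, voting="soft", weights=weights)
```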
In summary, the improved selection-integration algorithm of the invention has three parts. The first is the construction of the base learners: training sets are balanced with several oversampling and comprehensive resampling algorithms while validation and test sets are left unprocessed, producing 30 (6 × 5) balanced training subsets, 5 validation subsets (1 validation subset corresponds to 6 balanced training subsets) and 1 test set; combining the 30 training subsets with 8 machine learning classification algorithms yields 240 base learners in total. The second is base learner selection: the base learners participating in integration are selected from the new base learner set with the multi-objective optimization algorithm NSGA-II, and this optimized selection guarantees the diversity and accuracy of the final ensemble. The third is the secondary weighted integration of the optimized ensemble: to reduce the deviation a single meta-classifier may show when diagnosing on imbalanced monitoring data, a Stacking classification model with voting by multiple meta-classifiers is built on top of the optimized ensemble. The flowchart and pseudocode of the joint balance optimization integration framework are shown in the table below:
S-NStackingV algorithm pseudo code table
6. Evaluation index:
(1) single evaluation index:
in the present invention, various commonly used performance metrics such as accuracy, specificity, sensitivity (recall), F1 score, G-mean, and AUC are used to examine the overall performance of the classifier.
(2) Comprehensive evaluation:

Considering that different models may perform inconsistently on different indexes (for example, some models have high recall but low specificity, while others are stable and balanced across all indexes without being optimal on any), and in order to weigh all indexes together and find the joint strategy with the best overall performance, the invention computes a composite score for each joint model by rank-based scoring and summation. Each index is scored by the rank of the model's value on that index: the higher the rank, the higher the score (for example, with 66 joint models the top rank scores 66 points, descending in order to 1 point for the lowest). The final performance score of each model is computed as "total score = comprehensive index scores (AUC, F1 score, G-mean) × 2 + recall score × 1.5 + accuracy score", and the highest-scoring model is taken as the optimal joint strategy; a sketch of this computation follows.
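A sketch of this rank-and-sum scoring with pandas follows, assuming one row per joint model; reading the ×2 weight as applying to each comprehensive index individually is an interpretation of the formula above.

```python
# Sketch only: every metric column is ranked across models (best value gets the
# highest score) and the ranks are combined with the stated weights.
import pandas as pd

def composite_score(metrics: pd.DataFrame) -> pd.Series:
    """metrics: one row per joint model; columns AUC, F1, G_mean, recall, accuracy."""
    ranks = metrics.rank(method="min")   # larger metric value -> larger rank score
    total = ((ranks["AUC"] + ranks["F1"] + ranks["G_mean"]) * 2
             + ranks["recall"] * 1.5 + ranks["accuracy"])
    return total.sort_values(ascending=False)  # top entry = optimal joint strategy
```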
7. Model interpretability analysis method:
①SHAP:
SHAP (Shapley Additive Explanations) is a machine learning model interpretation method proposed by Lundberg et al. in 2017, with both local and global interpretability. Based on the Shapley value principle of game theory, it builds a linear explanation model and can be used to explain the output of any machine learning model. The specific calculation formula is as follows:
the above formula is phi i Representing characteristic x (i,j) M represents a feature number, S is a feature x removed from m (i,j) Is a subset of f (S) representing the model output predicted using the feature set S, f (S (i,j) }) represents the use of the feature set S%x i,j Model output of the prediction. The summation in the formula represents summing all possible subsets S, calculating the feature x for each subset S (i,j) Is a contribution of (a).
②LIME:
LIME (Local Interpretable Model-agnostic Explanations) is a concrete method for the local interpretability of machine learning models proposed by Ribeiro et al. in 2016. It is model-agnostic and focuses on explaining the prediction of a machine learning model for a single sample. The basic idea of LIME is to approximate the black-box model to be explained locally with an interpretable simple model, and to give the explanation of the sample's prediction through that model.
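A minimal usage sketch with the lime package follows; `X_train`, `feature_names`, `model` and the single sample `x` are assumptions carried over from the pipeline above.

```python
# Sketch only: explains one sample's prediction with a local surrogate model;
# all variable names are assumed from the surrounding pipeline.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                 class_names=["no HE", "HE"], mode="classification")
exp = explainer.explain_instance(x, model.predict_proba, num_features=10)
print(exp.as_list())                     # local feature weights for this sample
```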

Claims (6)

1. A disease auxiliary prediction system based on an S-NStackingV balance optimization integration framework, characterized in that the system predicts by the following method:

data collection and preprocessing: collecting and preprocessing medical record data of patients with liver cirrhosis, the data comprising m variables, and dividing the data into a training set, a validation set and a test set;

feature screening: adopting multiple feature screening methods and, by voting, selecting the variables chosen by half or more of the screening methods as model predictors to be included in later model prediction, addressing the high dimensionality of the features in the data set;

balance optimization integration framework: constructing the S-NStackingV balance optimization integration framework in three stages, I, II and III;

stage I: base learner generation, dividing the training set following the idea of cross-validation, balancing each divided training subset, and training each balanced training subset with different base learning algorithms to obtain multiple base learners;

stage II: base learner selection, selecting the base learners participating in integration from the multiple base learners with the multi-objective optimization algorithm NSGA-II;

stage III: base learner integration, obtaining the output of a meta-classifier with the Stacking integration strategy, swapping in different meta-classifiers to obtain the outputs of multiple meta-classifiers, and performing a secondary voting integration over these outputs, yielding the improved integration strategy S-NStackingV; integrating the finally selected base learners with the S-NStackingV integration strategy, evaluating the fitness of the ensemble model with the validation set to determine the optimal set, and evaluating the performance of the optimal set with the test data.
2. The disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework according to claim 1, wherein the specific process in stage II of selecting, with the multi-objective optimization algorithm NSGA-II, the basic learners that participate in integration is as follows:
firstly, the objective functions of the multi-objective optimization algorithm are defined: objective function $Z_1$ is the accuracy index of the integration model and objective function $Z_2$ is the complexity index of the integration model, with $Z_1=F1(\text{Stacking})$ and $Z_2=n_{sc}/n_{tc}$; $Z_1$ is computed from the four quantities that describe the classification state, namely true positives TP, true negatives TN, false positives FP and false negatives FN; to construct $Z_1$, four indicator variables $(X_{i1},X_{i2},X_{i3},X_{i4})$ are defined, from which TP, TN, FP and FN are calculated over all data instances; these indicator variables are

$$X_{i1}=I\{AL_i=PL_i=C_+\};\quad X_{i2}=I\{AL_i=PL_i=C_-\};\quad X_{i3}=I\{AL_i\neq PL_i=C_+\};\quad X_{i4}=I\{AL_i\neq PL_i=C_-\},$$

where the actual positive (+) and actual negative (-) labels are denoted by $C_+$ and $C_-$ respectively, $PL_i$ denotes the predicted label of the $i$-th data instance, and $AL_i$ denotes its actual label;
for a two-class classification problem on a dataset of $N$ instances, with $n_1$ positive samples and $n_2$ negative samples, the following quantities are calculated:

$$TP=\sum_{i=1}^{N}X_{i1},\quad TN=\sum_{i=1}^{N}X_{i2},\quad FP=\sum_{i=1}^{N}X_{i3},\quad FN=\sum_{i=1}^{N}X_{i4},\qquad F1=\frac{2\,TP}{2\,TP+FP+FN};$$

the objective is then defined as the bi-objective problem $O$:

$$O:\;\max Z_1,\;\min Z_2,$$

where $n_{sc}$ is the number of basic classifiers selected by NSGA-II and $n_{tc}$ is the total number of classifiers; the evolutionary search finds the optimal, minimal number of basic classifiers such that $Z_1$ reaches a maximum (a worked sketch of these objectives is given after the claims).
3. The disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework according to claim 1 or 2, wherein the data-set preprocessing procedure is: if the records missing a given variable amount to 30% or more of the whole data set, that variable is deleted; any record in which the number of missing variables is 30%·m or more is deleted; if the missing proportion of a variable or of a sample is below 30%, the missing values are imputed with the missForest method; the categorical variables in the data are one-hot encoded and the continuous variables are normalized, thereby mitigating the influence of missing data and of distribution differences on model performance.
4. The disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework according to claim 1 or 2, wherein, when the fitness of the integration model is evaluated on the validation set, a particle swarm optimization algorithm is used for the hyper-parameter optimization of the integration model.
5. The disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework according to claim 1 or 2, wherein in stage III a weighted Voting algorithm is adopted: a weight is assigned to each meta-classifier according to the accuracy of its test results; each meta-classifier estimates the probability of each class; a weighted average of these probabilities is computed; and the class with the highest score is finally taken as the classification result (a voting sketch is given after the claims).
6. The disease auxiliary prediction system based on the S-NStackingV balance optimization integration framework according to claim 1 or 2, wherein the balancing treatment in stage I uses the SMOTE method and its improved variants.
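By way of illustration only (not part of the claims), the following Python sketch evaluates a candidate subset of basic learners against the claim-2 objectives, Z1 = F1 of the stacked ensemble and Z2 = n_sc/n_tc, and filters the Pareto-optimal candidates. In the patent these masks are evolved with NSGA-II; here a plain Pareto filter over random masks stands in for the evolutionary search, and all data names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def z1_z2(mask, P_meta, y_meta, P_val, y_val):
    """Claim-2 objectives for one candidate subset of basic learners.
    mask : boolean vector (length n_tc), True = learner participates.
    P_*  : (n_samples, n_tc) matrices of basic-learner probabilities.
    Returns (Z1, Z2): Z1 = F1 of the Stacking ensemble, Z2 = n_sc / n_tc."""
    if not mask.any():
        return 0.0, 1.0                           # empty subset: worst case
    meta = LogisticRegression(max_iter=1000)      # stand-in meta-classifier
    meta.fit(P_meta[:, mask], y_meta)             # level-1 training data
    z1 = f1_score(y_val, meta.predict(P_val[:, mask]))
    return z1, mask.sum() / mask.size

def pareto_front(objs):
    """Indices of candidates not dominated under (max Z1, min Z2)."""
    return [i for i, (z1, z2) in enumerate(objs)
            if not any(a >= z1 and b <= z2 and (a, b) != (z1, z2)
                       for a, b in objs)]

# Synthetic stand-in data: 10 basic learners with noisy probability outputs.
rng = np.random.default_rng(0)
y_meta, y_val = rng.integers(0, 2, 300), rng.integers(0, 2, 150)
P_meta = 0.6 * y_meta[:, None] + 0.4 * rng.random((300, 10))
P_val = 0.6 * y_val[:, None] + 0.4 * rng.random((150, 10))

# Random 0/1 masks stand in for the population NSGA-II would evolve.
masks = [rng.random(10) < 0.5 for _ in range(40)]
objs = [z1_z2(m, P_meta, y_meta, P_val, y_val) for m in masks]
print("Pareto-optimal subsets:", pareto_front(objs))
```

In practice the masks would be refined by NSGA-II's selection, crossover and mutation operators; the filter above only illustrates how Z1 and Z2 trade off.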
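Similarly, a minimal sketch of the claim-5 secondary integration, assuming weights proportional to each meta-classifier's held-out accuracy; all names and numbers are hypothetical:

```python
import numpy as np

def weighted_vote(proba_list, accuracies):
    """Second-stage integration: accuracy-weighted average of the class
    probabilities predicted by each meta-classifier.
    proba_list : list of (n_samples, n_classes) probability arrays.
    accuracies : accuracy of each meta-classifier on held-out data.
    Returns the index of the highest-scoring class per sample."""
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()                                    # normalize the weights
    avg = sum(wi * p for wi, p in zip(w, proba_list))  # weighted average
    return np.argmax(avg, axis=1)                      # highest score wins

# Usage sketch: three meta-classifiers, two samples, binary classes.
p1 = np.array([[0.7, 0.3], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.3, 0.7]])
p3 = np.array([[0.8, 0.2], [0.5, 0.5]])
print(weighted_vote([p1, p2, p3], accuracies=[0.90, 0.85, 0.88]))  # -> [0 1]
```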
CN202311160717.8A 2023-09-11 2023-09-11 Disease auxiliary prediction system based on S-NStackingV balance optimization integrated framework Pending CN117198508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311160717.8A CN117198508A (en) 2023-09-11 2023-09-11 Disease auxiliary prediction system based on S-NStackingV balance optimization integrated framework

Publications (1)

Publication Number Publication Date
CN117198508A true CN117198508A (en) 2023-12-08

Family

ID=88990108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311160717.8A Pending CN117198508A (en) 2023-09-11 2023-09-11 Disease auxiliary prediction system based on S-NStackingV balance optimization integrated framework

Country Status (1)

Country Link
CN (1) CN117198508A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035692A (en) * 2024-04-12 2024-05-14 宫羽(沈阳)医疗科技有限公司 Intelligent emotion adjustment auxiliary system
CN118035692B (en) * 2024-04-12 2024-06-11 宫羽(沈阳)医疗科技有限公司 Intelligent emotion adjustment auxiliary system

Similar Documents

Publication Publication Date Title
Masethe et al. Prediction of heart disease using classification algorithms
Jabbar et al. Alternating decision trees for early diagnosis of heart disease
Ramesh et al. Ensemble method based predictive model for analyzing disease datasets: a predictive analysis approach
CN117198508A (en) Disease auxiliary prediction system based on S-NStackingV balance optimization integrated framework
Nordon et al. Building causal graphs from medical literature and electronic medical records
Dhar Multistage ensemble learning model with weighted voting and genetic algorithm optimization strategy for detecting chronic obstructive pulmonary disease
Wang et al. Predicting circRNA-disease associations using deep generative adversarial network based on multi-source fusion information
Ansari et al. Performance evaluation of machine learning techniques (MLT) for heart disease prediction
Wisaeng Predict the diagnosis of heart disease using feature selection and k-nearest neighbor algorithm
Zhou et al. Enhanced differential evolution algorithm for feature selection in tuberculous pleural effusion clinical characteristics analysis
Nguyen et al. Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults: insights from a comparative explainable study
Dwivedi et al. K-means under-sampling for hypertension prediction using NHANES DATASET
Raja Sree et al. Hubness weighted SVM ensemble for prediction of breast cancer subtypes
Raghunath et al. Predicting heart disease using machine learning techniques
Kumar et al. Evaluation of machine learning techniques for heart disease prediction using multi-criteria decision making
Yu et al. FPSC-DTI: drug–target interaction prediction based on feature projection fuzzy classification and super cluster fusion
VenkateswaraRao Cheekati et al. Ensemble approaches can aid in the early detection of coronary heart disease
Soualihou et al. An ensemble learning-based machine learning with voting mechanism for chronic disease prediction
CN118609823B (en) Glioma risk prediction method and glioma risk prediction system based on multi-modal information
Karpagam et al. Predictive Models of Alzheimer's Disease Using Machine Learning Algorithms–An Analysis
Shanmugavalli et al. Data mining based predictive analysis of diabetic diagnosis in health care: overview
Promtan et al. Breast Cancer Prediction of Benign and Malignant Tumors by Classification Algorithms
Sarraju et al. Impact of Variable Sample Size on the Efficiency of Support Vector Machines in Cardiovascular Disease Detection
Eben et al. Comparative Analysis of Supervised Machine Learning Algorithms for Diabetes Prediction
Shah et al. Gut Microbiome based Cardiovascular Prediction using Random Forest Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination