CN113470816A

CN113470816A - Machine learning-based diabetic nephropathy prediction method, system and prediction device

Info

Publication number: CN113470816A
Application number: CN202110737504.1A
Authority: CN
Inventors: 董哲毅; 柯雨景; 王倩; 苏仕斌; 段姝伟; 陈香美
Original assignee: First Medical Center of PLA General Hospital
Current assignee: First Medical Center of PLA General Hospital
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-10-01

Abstract

The invention discloses a machine learning-based diabetic nephropathy prediction method, system and prediction device. The prediction method comprises the following steps: collecting data and sorting out the features of the data; establishing a training set according to the features, wherein the labels of the training set samples comprise diabetic nephropathy and non-diabetic nephropathy; training on the training set with a machine learning method to obtain a prediction model; and predicting the data to be predicted with the prediction model to obtain a label value. The machine learning method requires no assumptions about the input variables or their relationship to the output, is relatively free of the limitations of traditional statistical analysis, and can analyze a large number of features at once, which helps improve prediction accuracy and is of great significance for the early detection and timely intervention and treatment of disease.

Description

Machine learning-based diabetic nephropathy prediction method, system and prediction device

Technical Field

The invention relates to the technical field of machine learning, in particular to a diabetic nephropathy prediction method, a diabetic nephropathy prediction system and a diabetic nephropathy prediction device based on machine learning.

Background

Diabetes is an increasingly serious global health problem. According to data published by the International Diabetes Federation, there are 425 million people with diabetes worldwide; by 2045, this figure is expected to rise to 693 million. Diabetic nephropathy (DN) is a common complication of diabetes and one of the leading causes of end-stage renal disease (ESRD). However, kidney disease in patients with diabetes is not always DN; many cases of non-diabetic renal disease (NDRD, also called non-diabetic nephropathy below) are found on renal biopsy. DN is difficult to reverse, whereas some NDRD (such as IgA nephropathy and membranous nephropathy) is often treatable or even reversible. The prognosis of NDRD patients is better than that of DN patients because some glomerular and tubular diseases classified as NDRD can be cured or alleviated, so differential diagnosis is of considerable importance.

At present, renal biopsy is the "gold standard" for diagnosis in patients with diabetes and kidney disease, but it is highly invasive, and its acceptance by patients and its feasibility remain limited. As a result, NDRD is often misjudged as DN, and the opportunity for active treatment is missed. Therefore, to properly manage patients with type II diabetes mellitus complicated by kidney disease, a non-invasive prediction method and prediction device for identifying DN are needed to optimize the treatment strategy.

Existing prediction methods use logistic regression for analysis and prediction, but such methods assume a linear relationship between the features (variables) and the outcome, so features with a non-linear relationship to the outcome are ignored or excluded. As a result, fewer features are available and variables that could improve predictive ability are missed, so that accuracy, specificity and sensitivity cannot all be good at the same time.

Disclosure of Invention

In view of the above technical problems in the prior art, the present invention provides a method, a system and a device for predicting diabetic nephropathy based on machine learning.

The invention discloses a machine learning-based diabetic nephropathy prediction method, which comprises the following steps: collecting data and sorting out the features of the data; establishing a training set according to the features, wherein the labels of the training set samples comprise diabetic nephropathy and non-diabetic nephropathy; training on the training set with a machine learning method to obtain a prediction model; and predicting the data to be predicted with the prediction model to obtain a label value.

Preferably, the method for predicting diabetic nephropathy further comprises a method for constructing a simplified model:

ranking the importance of the features of the prediction model;

sorting and screening features according to the importance;

establishing a second training set based on the screened features and the training set;

and training through a second training set based on a machine learning method to obtain a simplified prediction model.

Preferably, the method for screening the features according to importance comprises:

taking the first n features in the importance ranking to construct a learning model, wherein n is a positive integer;

checking the accuracy of the learning model on a test set;

establishing a learning curve from a plurality of learning models and their accuracies;

and screening features using the learning curve.

Preferably, the screened features include any one or a combination of the following: diabetic retinopathy (DR), diabetes course, systolic blood pressure, blood urea nitrogen, glycated hemoglobin, serum creatinine, urine red blood cell count, and serum haptoglobin.

Preferably, the machine learning method includes any one of the following algorithms: CatBoost, random forest, XGBoost, neural network, support vector machine, decision tree, and logistic regression.

Preferably, the diabetic nephropathy prediction method further includes a method of predicting the probability of diabetic nephropathy from SHAP (SHapley Additive exPlanations) values:

obtaining the SHAP values of the features in the prediction model;

and predicting the probability of diabetic nephropathy according to the SHAP values of the features in the data to be predicted.

Preferably, the method of sorting out the features of the data comprises:

after preprocessing and cleaning the collected data, screening features using any one of the following algorithms: recursive feature elimination, analysis of variance and principal component analysis.

Preferably, the features of the data include any one or a combination of the following variables:

acute phase protein, diabetes course, body mass index, blood pressure, smoking, drinking, history of hypertension, fatty liver, diabetic retinopathy, diabetic peripheral neuropathy, hyperlipidemia, hyperuricemia, coronary heart disease, nephrotic syndrome, haptoglobin, C-reactive protein, ceruloplasmin, serum transferrin, serum albumin, hemoglobin, red blood cell count, urine red blood cell count, glycated hemoglobin, plasma fibrinogen, α1-acid glycoprotein, blood urea nitrogen, urinary creatinine, glomerular filtration rate, urine protein quantification, blood phosphorus, blood potassium, blood calcium, blood sodium, glutamic-pyruvic transaminase, glutamic-oxaloacetic transaminase, total protein, total cholesterol, and triglyceride.

Preferably, the invention also comprises a system for implementing the diabetic nephropathy prediction method, which comprises a feature module, a training set module, a model construction module and a prediction module; the feature module is used for sorting out the features of the collected data; the training set module is used for establishing a training set according to the features, the labels of the training set comprising diabetic nephropathy and non-diabetic nephropathy; the model construction module is used for training on the training set with a machine learning method to obtain a prediction model; the prediction module is used for predicting the data to be predicted with the prediction model to obtain a label value.

The invention also provides a prediction device, which comprises a memory and a processor;

the memory is used for storing instructions and a prediction model obtained by the diabetic nephropathy prediction method, and the instructions are used for predicting data to be predicted through the prediction model to obtain a label value;

the processor is to execute the instructions.

Compared with the prior art, the invention has the following beneficial effects: the machine learning method requires no assumptions about the input variables or their relationship to the output, is relatively free of the limitations of traditional statistical analysis, and can analyze a large number of features at once, which helps improve the accuracy of medical prediction and is of great significance for the early detection and timely intervention and treatment of disease; machine learning methods are able to analyze different types of data and incorporate them into the prediction of disease risk, prognosis and appropriate treatment.

Drawings

FIG. 1 is a flow chart of a machine learning based diabetic nephropathy prediction method of the present invention;

FIG. 2 is a logical block diagram of the system of the present invention;

FIG. 3 is a schematic illustration of a learning curve;

FIG. 4 is a schematic diagram of ROC curves for a simplified predictive model;

FIG. 5 is a SHAP overview;

FIG. 6 is a graphical representation of the SHAP values of a full sample;

FIG. 7 is a diagram of single sample SHAP values;

FIG. 8 is a single-feature SHAP dependency graph.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention is described in further detail below with reference to the attached drawing figures:

A method for predicting diabetic nephropathy based on machine learning, as shown in FIG. 1, comprises:

Step 101: collecting data and sorting out the features of the data. The data may also be preprocessed; preprocessing includes handling missing values, handling outliers, and the like. Specifically, data of patients diagnosed with DN or NDRD are collected.

Step 102: establishing a training set according to the features, wherein the labels of the training set samples comprise diabetic nephropathy and non-diabetic nephropathy.

Step 103: training on the training set with a machine learning method to obtain a prediction model. The machine learning method comprises any one of the following algorithms: CatBoost, random forest, XGBoost, neural network, support vector machine, decision tree, and logistic regression.

Step 104: predicting the data to be predicted with the prediction model to obtain a label value.
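Steps 101 to 104 can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: it uses logistic regression (one of the listed algorithms) fitted by plain gradient descent, and the toy feature values, the label encoding (1 = DN, 0 = NDRD) and the names `train_logistic` and `predict_label` are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias

def predict_label(model, x, threshold=0.5):
    """Return 1 (DN) if the predicted probability reaches the threshold, else 0 (NDRD)."""
    weights, bias = model
    p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
    return 1 if p >= threshold else 0

# Step 102: toy training set with two already-scaled, hypothetical features per sample
train_rows = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
train_labels = [1, 1, 1, 0, 0, 0]

model = train_logistic(train_rows, train_labels)  # Step 103
label = predict_label(model, [0.85, 0.75])        # Step 104
```

Any of the other listed algorithms could take the place of `train_logistic`; only the training and prediction calls would change.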

The machine learning method requires no assumptions about the input variables or their relationship to the output, is relatively free of the limitations of traditional statistical analysis, and can analyze a large number of features at once, which helps improve the accuracy of medical prediction and is of great significance for the early detection and timely intervention and treatment of disease. Machine learning methods are capable of analyzing different types of data, such as demographic data, laboratory test data and imaging data, and incorporating them into the prediction of disease risk, prognosis and appropriate treatment.

The method for predicting diabetic nephropathy of the present invention may further comprise a method for constructing a simplified prediction model:

Step 201: ranking the features of the prediction model by importance. Ranking the features of a prediction model by importance is prior art and is not described further here.

Step 202: screening features according to the importance ranking. In one embodiment, the screened features include any one or a combination of the following: DR, diabetes course, systolic blood pressure, blood urea nitrogen, glycated hemoglobin, serum creatinine, urine red blood cell count, and serum haptoglobin.

Step 203: based on the filtered features and the training set, a second training set is established.

Step 204: training on the second training set with a machine learning method to obtain a simplified prediction model.

The importance of a feature represents its weight in the model: the higher the importance, the greater the role the feature plays in the differential prediction of DN. Screening the more important features to construct a simplified prediction model reduces the amount of computation and improves running efficiency. It should be noted that the accuracy of the simplified prediction model should be kept at a high level.
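Steps 202 and 203 amount to keeping the highest-ranked columns of the original training set. The sketch below illustrates this under stated assumptions: the importance scores, feature names and sample values are hypothetical, and `top_n_features` and `build_second_training_set` are illustrative helper names, not part of the invention.

```python
def top_n_features(importance, n):
    """Return the n feature names with the highest importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n]

def build_second_training_set(samples, selected):
    """Keep only the selected features of every sample."""
    return [{name: row[name] for name in selected} for row in samples]

# Hypothetical importances and samples, for illustration only
importance = {"DR": 30.0, "diabetes_course": 12.0, "SBP": 8.0, "total_protein": 1.5}
samples = [
    {"DR": 1, "diabetes_course": 10, "SBP": 150, "total_protein": 62},
    {"DR": 0, "diabetes_course": 3, "SBP": 125, "total_protein": 70},
]

selected = top_n_features(importance, 3)
second_training_set = build_second_training_set(samples, selected)
```

The labels are unchanged; the simplified model of Step 204 would then be trained on `second_training_set` with the same algorithm.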

In step 202, the method for screening the features according to importance includes:

Step 211: taking the first n features in the importance ranking to construct a learning model, wherein n is a positive integer.

Step 212: checking the accuracy of the learning model on a test set.

Step 213: establishing a learning curve from a plurality of learning models and their accuracies.

Step 214: screening features using the learning curve.

By constructing and training learning models and checking their accuracy, a learning curve is established from the resulting sets of test results. Analyzing the learning curve allows features to be screened more effectively. The learning curve is shown in FIG. 3, where the abscissa is the number of features and the ordinate is the accuracy; the learning curve makes it possible to simplify the prediction model while keeping its accuracy well under control.
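The learning-curve procedure can be sketched as a loop over n. In the sketch below, `toy_evaluate` is a made-up stand-in for "train a model on these features and measure its accuracy on the test set", and the accuracy target of 0.85 is an arbitrary assumption for the example; neither comes from the invention.

```python
def learning_curve(ranked_features, evaluate):
    """Return [(n, accuracy)] for models built from the first n ranked features."""
    return [(n, evaluate(ranked_features[:n]))
            for n in range(1, len(ranked_features) + 1)]

def smallest_n_reaching(curve, target):
    """Smallest feature count whose accuracy meets the target, or None."""
    for n, accuracy in curve:
        if accuracy >= target:
            return n
    return None

def toy_evaluate(features):
    # Toy accuracy that rises with the number of features (not real results)
    return 1.0 - 0.5 / len(features)

ranked = ["f1", "f2", "f3", "f4", "f5"]  # features already sorted by importance
curve = learning_curve(ranked, toy_evaluate)
n_keep = smallest_n_reaching(curve, target=0.85)
```

Plotting `curve` with n on the abscissa and accuracy on the ordinate gives a figure of the kind described for FIG. 3.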

The method of the present invention may further comprise a method of predicting the probability of diabetic nephropathy from SHAP values:

Step 301: obtaining the SHAP values (Shapley values) of the features in the prediction model. The Shapley value method is an allocation method in which what each participant receives equals its contribution. It is widely used for problems such as the fair distribution of benefits in economic activities. It was originally proposed by Professor Lloyd Shapley of the University of California, Los Angeles, USA. The Shapley value method was an important breakthrough in cooperative game theory and had a great influence on its later development. In the invention, SHAP values are used to describe the contribution and influence of the features. The method of calculating SHAP values is prior art and is not described in detail here.

Step 302: predicting the probability of diabetic nephropathy according to the SHAP values of the features in the data to be predicted.

As shown in FIG. 5, negative values on the left of the abscissa indicate a lower probability of being predicted as DN and positive values on the right indicate a higher probability; the ordinate lists the features (variables); dark colors indicate a high value of the variable and light colors a low value; and the features are arranged from top to bottom in descending order of contribution. Thus, the probability of diabetic nephropathy can be predicted from a single feature or by summing the contributions of multiple features.
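The idea of summing feature contributions can be illustrated with an exact Shapley computation on a very small model. The sketch below is an assumption-laden toy: the linear log-odds model, its weights, the baseline patient and the patient values are all hypothetical, and exhaustive enumeration of coalitions is only practical for a handful of features (dedicated SHAP software uses faster algorithms).

```python
import math
from itertools import combinations

def shapley_values(value_fn, features):
    """Exact Shapley value of each feature for a coalition value function."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain f
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

# Hypothetical log-odds model and values, for illustration only
WEIGHTS = {"DR": 1.5, "course_years": 0.1, "HP": -0.8}
BIAS = -1.0
baseline = {"DR": 0, "course_years": 5, "HP": 1.0}  # reference patient
patient = {"DR": 1, "course_years": 12, "HP": 0.5}  # patient to explain

def log_odds(x):
    return BIAS + sum(WEIGHTS[k] * x[k] for k in WEIGHTS)

def value_fn(coalition):
    # Features in the coalition take the patient's value, the rest the baseline
    x = {k: (patient[k] if k in coalition else baseline[k]) for k in WEIGHTS}
    return log_odds(x)

phi = shapley_values(value_fn, list(WEIGHTS))
base_value = value_fn(frozenset())
# Base value plus all contributions recovers the model output for this patient
prob_dn = 1.0 / (1.0 + math.exp(-(base_value + sum(phi.values()))))
```

Positive entries of `phi` push the prediction toward DN and negative entries push it away, which mirrors the right and left sides of the abscissa described for FIG. 5.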

It should be noted that the prediction results of the present invention are not diagnostic results; they serve as a reference for clinical and testing strategies. For example, when the predicted probability of diabetic nephropathy is greater than a threshold, the patient may be advised to undergo a renal biopsy for medical diagnosis.

In step 101, the method for sorting out the features of the data may include: after preprocessing and cleaning the collected data, screening features using any one of the following algorithms: recursive feature elimination, analysis of variance, and principal component analysis. Feature screening is a very important step in data processing: after the data are preprocessed and cleaned, significant features are retained and non-significant features are discarded, which reduces unnecessary computation when building the model. Feature screening focuses on finding the features that most improve model performance, thereby improving the overall performance and stability of the model. Data preprocessing and cleaning are common techniques and are not described in detail here.

The invention also provides a system for implementing the diabetic nephropathy prediction method. As shown in FIG. 2, the system comprises a feature module 1, a training set module 2, a model construction module 3 and a prediction module 4; the feature module 1 is used for sorting out the features of the collected data; the training set module 2 is used for establishing a training set according to the features, the labels of the training set comprising diabetic nephropathy and non-diabetic nephropathy; the model construction module 3 is used for training on the training set with a machine learning method to obtain a prediction model; the prediction module 4 is used for predicting the data to be predicted with the prediction model to obtain a label value.

The system also comprises a reconstruction module 5 and a SHAP analysis module 6. The reconstruction module 5 is used for ranking the features of the prediction model by importance; screening features according to the importance ranking; establishing a second training set based on the screened features and the training set; and training on the second training set with a machine learning method to obtain a simplified prediction model. The SHAP analysis module 6 is used for obtaining the SHAP values of the features in the prediction model and predicting the probability of diabetic nephropathy according to the SHAP values of the features in the data to be predicted.

The invention also provides a prediction device, which comprises a storage module and a calculator. The storage module is used for storing instructions and the prediction model or simplified prediction model obtained by the diabetic nephropathy prediction method; the instructions are used for predicting the data to be predicted with the prediction model or the simplified prediction model to obtain a label value. The calculator is used for executing the instructions. The device is not limited to these components and may further include an inputter for obtaining the data to be predicted and an outputter for outputting the label value. Specifically, the prediction device may be a computer.

Examples

1. Data acquisition:

The data of the invention come from the big-data research platform of a hospital and were retrieved from the records of patients with type II diabetes mellitus who underwent renal biopsy during 10/1/2019/12/30; the results comprise DN and NDRD. In total, 652 samples and 49 features were included in the study. The data structure is complex (continuous features, discrete features and missing values are all present), with 38 continuous features and 11 discrete features. Some of the features are: acute phase protein, diabetes course, body mass index, blood pressure, smoking, drinking, history of hypertension, fatty liver, diabetic retinopathy, diabetic peripheral neuropathy, hyperlipidemia, hyperuricemia, coronary heart disease, nephrotic syndrome, haptoglobin, C-reactive protein, ceruloplasmin, serum transferrin, serum albumin, hemoglobin, glycated hemoglobin, plasma fibrinogen, blood urea nitrogen, urinary creatinine, glomerular filtration rate, urine protein quantification, blood phosphorus, blood potassium, blood calcium, blood sodium, glutamic-pyruvic transaminase, glutamic-oxaloacetic transaminase, total protein, total cholesterol, and triglyceride. The features are not limited to these. The data for some of the features are shown in the following table:

2. Data preprocessing:

Features and samples with a missing proportion of more than 10% are deleted. However, based on a feature-by-feature analysis and on the reasons for the missing values, features of clinical significance, such as glycated hemoglobin, are retained where appropriate. Missing values are filled using a random forest. Outliers are values of a feature that are not reasonable. For example, a BMI of -20 or an age of 300 is an outlier, and such abnormal data are deleted. Outliers are identified as follows. Simple identification: for example, BMI should be positive, so a negative BMI can be judged to be an outlier. 3-sigma rule: if a continuous variable follows a normal distribution, its values fall within 3 standard deviations (3 sigma) of the mean with high probability, so a sample whose distance from the mean exceeds this value is very likely an outlier.
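The missing-proportion check and the two outlier rules can be sketched as follows. The helper names and the toy BMI values are assumptions for illustration; missing entries are represented by `None`, and the random forest imputation step is not shown.

```python
import statistics

def missing_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def invalid_bmi(values):
    """Simple identification: indices of BMI values that are not positive."""
    return [i for i, v in enumerate(values) if v is not None and v <= 0]

def three_sigma_outliers(values):
    """Indices of values farther than 3 standard deviations from the mean."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.pstdev(present)
    return [i for i, v in enumerate(values)
            if v is not None and abs(v - mean) > 3 * sd]

# Toy BMI column: twenty plausible values and one extreme value
bmi = [22.0, 24.0] * 10 + [80.0]
outlier_idx = three_sigma_outliers(bmi)

# A column with 25% missing entries exceeds the 10% limit
drop_column = missing_ratio([1.0, None, 3.0, 4.0]) > 0.10
```

Note that the 3-sigma rule needs a reasonably large sample: with very few values, a single extreme value inflates the standard deviation so much that it can never lie more than 3 sigma from the mean.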

3. Establishing a data set and screening characteristics:

In the modeling phase, the data set is first randomly split at a ratio of 3:1, with 75% assigned to the training set and the remaining 25% assigned to the test set. The variables used for analysis include demographic characteristics, laboratory tests and diabetic complications, 44 variables in total, such as diabetes course, DR, 24-hour urine protein quantification, serum creatinine and serum haptoglobin (HP). The DN and NDRD diagnoses serve as labels.
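A random 75%/25% split can be sketched as below. The function name, the fixed seed and the placeholder sample IDs and labels are assumptions for illustration; only the sample count of 652 and the split proportions come from the text.

```python
import random

def split_train_test(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle the indices once and cut off the test fraction."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

samples = list(range(652))         # placeholder sample IDs
labels = [i % 2 for i in samples]  # placeholder labels (1 = DN, 0 = NDRD)
train, test = split_train_test(samples, labels)
```

With 652 samples this gives 489 training samples and 163 test samples.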

Feature screening is a very important step in data processing. After the data are preprocessed and cleaned, significant features are retained and non-significant features are discarded; deleting redundant features reduces unnecessary computation when building the model. Feature screening focuses on finding the small number of features that most improve model performance, and good feature screening can improve the overall performance and stability of the model, sometimes allowing a simple model to outperform a complex one. Common feature screening methods include recursive feature elimination (RFE), analysis of variance (ANOVA) and principal component analysis (PCA); recursive feature elimination is used in this study.
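Recursive feature elimination can be sketched as a loop that refits, drops the weakest feature, and repeats. In the sketch below, `toy_rank` is a made-up stand-in for "fit a model on the remaining features and return their importances"; its scores and feature names are hypothetical and chosen only to show why the ranking is recomputed after every removal.

```python
def recursive_feature_elimination(features, rank_fn, n_keep):
    """Repeatedly re-rank and drop the least important feature until n_keep remain."""
    remaining = list(features)
    while len(remaining) > n_keep:
        importance = rank_fn(remaining)  # refit on the remaining features only
        weakest = min(remaining, key=lambda f: importance[f])
        remaining.remove(weakest)
    return remaining

def toy_rank(remaining):
    base = {"a": 0.5, "b": 0.2, "b_copy": 0.2, "noise": 0.25}
    scores = {f: base[f] for f in remaining}
    # "b" and "b_copy" carry the same signal: once one is gone, the other is worth 0.4
    if ("b" in scores) != ("b_copy" in scores):
        for f in ("b", "b_copy"):
            if f in scores:
                scores[f] = 0.4
    return scores

kept = recursive_feature_elimination(["a", "b", "b_copy", "noise"], toy_rank, n_keep=2)
```

A single one-shot ranking would have kept `a` and `noise`; re-ranking after each removal keeps `a` and `b_copy` instead, because the redundant pair stops splitting its importance once one copy is removed.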

4. Constructing a prediction model:

the prediction model is constructed by respectively adopting algorithms of Catboost, random forest, XGboost, neural network, support vector machine, decision tree and logistic regression.

Logistic regression is a classification method commonly used in data mining and machine learning. It is suitable for categorical outcome variables, and the explanatory variables can be of any data type. It is often used to interpret variables in multivariable regression models, but it is easily affected by collinearity among the explanatory variables. The random forest classification algorithm, proposed by Leo Breiman in 2001, applies a set of decision trees to samples of the training data drawn by bootstrap resampling and splits the branches of each tree; it extends the conventional decision tree classifier with an ensemble technique and is essentially an upgrade of the decision tree algorithm. The support vector machine (SVM) is a machine learning method first proposed by Vapnik in 1995 [129]; it can handle machine learning problems such as non-linear classification and regression. The artificial neural network is inspired by biological neural networks and is a mathematical processing method built from nodes and connections that loosely simulate the inner workings of a biological brain. It is a network of interconnected neurons that forms a large-scale, non-linear, adaptive dynamic system. Its principle and function resemble those of the human brain; it does not require the variables to be normally distributed or independent, and it is well suited to problems whose rules are unclear and complex.

XGBoost (eXtreme Gradient Boosting) is a machine learning library focused on gradient boosting that was released in February 2014. It is a boosting-based ensemble algorithm that improves on the traditional GBDT (gradient boosting decision tree) algorithm in several respects, including regularization, the way split points are chosen and support for custom loss functions, and it can solve a wider range of more complex problems. CatBoost is a gradient boosting library that handles categorical features well; the library provides a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm. A decision tree (DT) is a hierarchical model that recursively partitions the data set based on the Gini index or entropy. It is a relatively simple classification algorithm, so named because the graph formed by its decision branches resembles a tree: each sample attribute is represented by a node of the tree, each outcome of testing the attribute is represented by a branch, and each leaf node corresponds to a final class.

5. Predictive model evaluation

The invention adopts AUC as the model evaluation index. AUC is an evaluation index for measuring the quality of a binary classification model; it represents the probability that a randomly chosen positive case is ranked ahead of a randomly chosen negative case.
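This ranking interpretation of AUC can be computed directly by comparing every positive case with every negative case. The sketch below does so for a handful of made-up labels and scores; the function name and the convention that ties count as one half are assumptions for the example.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy labels (1 = positive) and predicted scores, for illustration only
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc = auc_by_ranking(labels, scores)
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6.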

The confusion matrix is commonly used to measure the prediction performance of a classification model. Accuracy, recall, F1 score and precision, which are derived from the confusion matrix, and AUC are common model evaluation indexes in machine learning. The structure of the confusion matrix is as follows:

Accuracy is the proportion of correctly classified samples among all samples. Precision is the proportion of correctly classified positive samples among all samples predicted to be positive; it is suitable when the correctness of positive predictions matters most. Recall is the proportion of correctly classified positive samples among all truly positive samples. The F1 score is the harmonic mean of precision and recall; it can be used when neither precision nor recall can be given priority, and its value lies closer to the smaller of the two. The area under the receiver operating characteristic curve (AUC) is the area under the ROC curve and is an index of the performance of the learner. The ROC curve describes the relationship between the true positive rate and the false positive rate of the classification model. AUC values normally range from 0.5 to 1, and the closer the value is to 1, the better the prediction performance of the model.
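The definitions above translate into a few lines of code. In the sketch below the label vectors are toy values and the function names are illustrative; the final line applies the F1 formula to the precision (0.889) and recall (0.877) that this document reports for CatBoost.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)

catboost_f1 = round(f1_score(0.889, 0.877), 3)
```

The value of `catboost_f1` agrees with the F1 score of 0.883 reported for CatBoost, which confirms that the reported F1 is the harmonic mean of the reported precision and recall.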

The evaluation comparison results of the seven machine learning models are shown in the following table:

Model	AUC	Accuracy	Recall	F1 score	Precision
CatBoost	0.957	0.894	0.877	0.883	0.889
Random forest	0.944	0.875	0.849	0.861	0.873
XGBoost	0.938	0.8625	0.863	0.851	0.840
Logistic regression	0.917	0.8625	0.877	0.853	0.831
Neural network	0.911	0.850	0.767	0.824	0.889
Support vector machine	0.828	0.769	0.671	0.726	0.790
Decision tree	0.803	0.788	0.753	0.764	0.775

CatBoost shows the largest AUC (0.957), with an accuracy of 0.894, a recall of 0.877, an F1 score of 0.883 and a precision of 0.889; its overall performance is clearly better than that of the random forest, XGBoost, logistic regression, artificial neural network, support vector machine and decision tree methods.

To verify the stability and prediction performance of the model and the stability of the training data, five-fold cross-validation is performed on the training set, and the optimal model is confirmed according to the model evaluation indexes. The five-fold cross-validation results are shown in the following table:

6. Screening features according to feature importance

The 44 variables are ranked by feature importance. Feature importance represents the weight of a feature in the overall model; the higher a feature ranks, the greater its role in DN prediction. The top 10 variables in the feature importance ranking are DR, diabetes course, systolic blood pressure, blood urea nitrogen, glycated hemoglobin, serum creatinine, urine red blood cell count, red blood cell count and serum haptoglobin. The feature importances of the top 5 sum to more than 40, i.e., these features account for more than 40% of the weight in the DN and NDRD prediction. CatBoost is used to build a learning curve over the 44 variables in order of feature importance, to determine how many variables need to be included to construct the optimal model. As shown in FIG. 3, the learning curve indicates that the model accuracy exceeds 0.86 when 6 variables are included in the newly developed model, approaches 0.9 when 10 variables are included, and exceeds 0.9 when 37 variables are included.

7. Optimizing development of simplified predictive models

The top 10 variables by feature importance were finally included: DR, diabetes course, systolic blood pressure, blood urea nitrogen, glycated hemoglobin, serum creatinine, urine red blood cell count, and haptoglobin. CatBoost was used to develop a simplified prediction model whose performance is comparable to that obtained with the full feature set. Of the 652 patients, 75% were assigned to the training set and the remaining 25% to the test set. The following table compares the performance of the CatBoost-based prediction model and the simplified prediction model:

The area under the ROC curve (AUC) of the simplified prediction model is shown in FIG. 4.

8. SHAP value interpretation model

The invention builds a SHAP summary plot and feature plots using the SHAP method. FIG. 5 describes the relationship between the feature values (high or low) and the SHAP values in the training data set. Negative values on the left of the abscissa indicate a lower probability of the predicted outcome (DN) and positive values on the right indicate a higher probability; the ordinate on the left lists the features (variables); dark colors indicate a high value of the variable and light colors a low value; and the features are arranged from top to bottom in descending order of contribution. The larger the SHAP value of a feature, the greater the probability of DN. The figure shows that DR contributes most to the model, and patients with DR have a higher probability of being predicted as having diabetic nephropathy. Next is the diabetes course: the longer the course, the higher the probability that the patient is judged to be a positive sample, i.e., predicted as DN. HP behaves in the opposite way: as the HP concentration decreases, the probability that a sample is predicted as positive, i.e., as DN, increases. In the simplified prediction model, the top 5 features by importance are likewise DR, diabetes course (duration of diabetes), systolic blood pressure (SBP), blood urea nitrogen (BUN) and glycated hemoglobin (HbA1c), which still account for over 40% of the weight in the DN and NDRD prediction.

The present invention uses full-sample SHAP values to present the probability of DN across all patients and single-sample SHAP values to present the probability of DN for an individual patient. As shown in fig. 6, dark color represents a positive contribution of a feature to the prediction of DN, light color represents a negative contribution, and the vertical length represents the magnitude of the feature's influence. For a single sample, as shown in fig. 7, the main dark-colored contributing features are duration of diabetes, blood urea nitrogen, systolic blood pressure, HP, etc., which increase the probability that the sample is predicted as positive, i.e., as DN; the main light-colored contributing features are Hb, SBP, RBC, and BUN, which reduce the probability that the sample is predicted as DN. Overall, the dark portion is much larger than the light portion, so the probability that this individual is predicted as DN is high.
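The single-sample reading described above can be sketched as follows, assuming the SHAP values are expressed in log-odds units (a common convention for tree-based binary classifiers, but an assumption here). The function name `explain_single_sample`, the base value, and all contribution values are hypothetical and chosen only to illustrate how positive and negative contributions combine into one predicted probability.

```python
from math import exp


def explain_single_sample(base_value, contributions):
    """Split per-feature SHAP contributions into those pushing toward DN
    (positive) and away from DN (negative), then convert the summed
    log-odds into a probability with the logistic function."""
    positive = {k: v for k, v in contributions.items() if v > 0}
    negative = {k: v for k, v in contributions.items() if v < 0}
    log_odds = base_value + sum(contributions.values())
    probability = 1.0 / (1.0 + exp(-log_odds))
    return positive, negative, probability


# Hypothetical contributions for one patient (illustrative numbers only).
contributions = {
    "duration_of_diabetes": 0.9,
    "HP": 0.4,
    "Hb": -0.2,
    "RBC": -0.1,
}
positive, negative, probability = explain_single_sample(0.0, contributions)

print(sorted(positive))    # features pushing the prediction toward DN
print(sorted(negative))    # features pushing the prediction away from DN
print(probability > 0.5)   # True: positive contributions outweigh negative
```

When the positive contributions outweigh the negative ones, as in this made-up example, the summed log-odds is above zero and the predicted probability of DN exceeds 0.5, which corresponds to the dark portion dominating the light portion in the plot.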

The present invention also employs SHAP dependence plots to understand how individual features affect the output of the prediction model. A SHAP value above 0 for a given feature indicates an increased probability of predicting DN. As shown in fig. 8, patients with DR (panel a), a longer duration of diabetes (panel b), a higher SBP (panel c), a higher BUN (panel d), a higher HbA1c value, a lower RBC, a lower Hb concentration, a higher Cr, or a lower blood HP have a higher probability of being predicted as DN. SHAP values can be used to interpret both the prediction model and the simplified prediction model.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A diabetic nephropathy prediction method based on machine learning, the method comprising:

collecting data and organizing the features of the data;

establishing a training set according to the characteristics, wherein sample labels of the training set comprise diabetic nephropathy and non-diabetic nephropathy;

training through a training set based on a machine learning method to obtain a prediction model;

and predicting the data to be predicted through the prediction model to obtain a label value.

2. The method of predicting diabetic nephropathy according to claim 1, further comprising a method of constructing a simplified model:

ranking the importance of the features of the prediction model;

sorting and screening features according to the importance;

3. The method of predicting diabetic nephropathy according to claim 2, wherein screening the features according to the importance ranking comprises:

evaluating the accuracy of the learning model on a test set;

screening the features through learning curves.

4. The method of predicting diabetic nephropathy according to claim 2, wherein the screened features include any one or a combination of the following: diabetic retinopathy, duration of diabetes, systolic blood pressure, blood urea nitrogen, glycated hemoglobin, serum creatinine, urine red blood cell count, and blood haptoglobin.

5. The diabetic nephropathy prediction method of claim 1, wherein the machine learning method comprises any one of the following algorithms: CatBoost, random forest, XGBoost, neural network, support vector machine, decision tree, and logistic regression.

6. The method of predicting diabetic nephropathy according to claim 1, further comprising a method of predicting the probability of diabetic nephropathy by Shapley values:

obtaining Shapley values of the features in the prediction model;

predicting the probability of diabetic nephropathy based on the Shapley values.

7. The method of predicting diabetic nephropathy according to claim 1, wherein organizing the features of the data comprises:

after preprocessing and cleaning the collected data, selecting features by any one of the following algorithms: recursive feature elimination, analysis of variance, and principal component analysis.

8. The method of claim 1, wherein the features of the data include any one or a combination of the following variables:

acute phase protein, duration of diabetes, body mass index, blood pressure, smoking, drinking, history of hypertension, fatty liver, diabetic retinopathy, diabetic peripheral neuropathy, hyperlipidemia, hyperuricemia, coronary heart disease, nephrotic syndrome, haptoglobin, C-reactive protein, ceruloplasmin, serum transferrin, serum albumin, hemoglobin, glycated hemoglobin, plasma fibrinogen, blood urea nitrogen, urinary creatinine, glomerular filtration rate, urinary protein quantification, blood phosphorus, blood potassium, blood calcium, blood sodium, glutamic-pyruvic transaminase, glutamic-oxaloacetic transaminase, total protein, total cholesterol, and triglyceride.

9. A system for implementing the diabetic nephropathy prediction method of any one of claims 1 to 8, comprising a feature module, a training set module, a model building module and a prediction module;

the feature module is used for organizing the features of the collected data;

the training set module is used for establishing a training set according to the characteristics, and the labels of the training set comprise diabetic nephropathy and non-diabetic nephropathy;

the model construction module is used for training through a training set based on a machine learning method to obtain a prediction model;

the prediction module is used for predicting data to be predicted through the prediction model to obtain a label value.

10. A prediction apparatus comprising a memory and a processor;

the memory is configured to store the prediction model obtained by the diabetic nephropathy prediction method according to any one of claims 1 to 8, and instructions for predicting data to be predicted through the prediction model to obtain a label value;

the processor is configured to execute the instructions.