CN115116615A - Method and system for analyzing and predicting risk of non-alcoholic fatty liver disease - Google Patents

Info

Publication number
CN115116615A
Authority
CN
China
Prior art keywords
data set
data
module
training
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210810842.8A
Other languages
Chinese (zh)
Inventor
周卫红
李康
陈亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Yahuan Software Co ltd
Original Assignee
Jiangsu Yahuan Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Yahuan Software Co ltd filed Critical Jiangsu Yahuan Software Co ltd
Priority to CN202210810842.8A priority Critical patent/CN115116615A/en
Publication of CN115116615A publication Critical patent/CN115116615A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides a method and a system for analyzing and predicting the risk of non-alcoholic fatty liver disease (NAFLD). Feature data of a plurality of users are retrieved from an electronic medical record information system to form a data set; the feature data are normalized, and the data set is divided into a training data set, a test data set and an original validation data set; feature selection is performed on the data set using the MRMR module and the Gini importance in the ETC module; and a preset machine learning algorithm is trained on the selected feature data and used to predict whether the user corresponding to the feature data has NAFLD. By using a patient data set and a more comprehensive feature selection process to identify the demographic, physical and blood features of the user, the risk of non-alcoholic fatty liver disease is analyzed and predicted, erroneous data that may affect the accuracy and robustness of the machine learning model are removed, and the accuracy of the prediction is improved.

Description

Method and system for analyzing and predicting risk of non-alcoholic fatty liver disease
Technical Field
The invention relates to the technical field of fatty liver prediction, in particular to an analysis and prediction method and system for non-alcoholic fatty liver risk.
Background
Nonalcoholic fatty liver disease (NAFLD) is becoming a global disease burden. NAFLD is characterized by fat accumulation exceeding 5% of the weight of the liver in the absence of extensive alcohol consumption. NAFLD is a generic term that includes simple steatosis and nonalcoholic steatohepatitis (NASH), which involves inflammation of the liver. NAFLD typically progresses through simple steatosis, NASH, fibrosis and cirrhosis, and finally to hepatocellular carcinoma. Patients with simple steatosis are at lower risk of progression, so early diagnosis of NAFLD helps prevent disease progression or even reverse the disease.
Early screening for NAFLD has been challenging because patients often lack symptoms. Invasive procedures such as liver biopsy are hampered by their cost, sampling error, and procedure-related complications. Less invasive alternatives have been used, such as blood biomarkers (serum transaminase levels and specific cytokines as markers of apoptosis) and imaging tests (ultrasound, computed tomography and magnetic resonance imaging), but these methods are less reliable than liver biopsy. Recent attention has therefore focused on developing models that screen for NAFLD using surrogate biomarkers, such as the Fatty Liver Index (FLI), the hepatic steatosis index (HSI), the NAFLD liver fat score (NAFLD-LFS) and the Zhejiang University index (ZJU), with promising results. However, these models are constrained by the traditional statistical techniques on which they are built: they have limited ability to capture nonlinear relationships and variable-variable interactions, and they rely on assumptions that may not hold in human biological systems.
Disclosure of Invention
The invention provides a method for analyzing and predicting the risk of non-alcoholic fatty liver disease, which avoids the limitations that traditional statistical techniques impose on prediction models and improves the accuracy of analysis and prediction.
The method for analyzing and predicting the risk of non-alcoholic fatty liver disease comprises the following steps:
step one, retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
step two, normalizing the feature data, and dividing the data set into a training data set, a test data set and an original validation data set;
step three, performing feature selection on the data set using the MRMR module and the Gini importance in the ETC module;
step four, training on the selected feature data and making predictions using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
It should be further noted that step one further includes: setting screening conditions for the user feature data, the screening conditions being as follows:
blood pressure is limited to between 20 and 300 mmHg;
height is limited to between 100 and 210 centimeters;
weight is limited to between 20 and 200 kg.
It should be further noted that, in step two, the z-score method is used to normalize all the feature data:
$$z = \frac{x - \mu}{\sigma}$$
where x is the sample, μ is the overall mean, and σ is the overall standard deviation.
Further, in step two, 80% of the feature data in the data set is assigned to the training data set, which is used to train the model;
10% of the feature data in the data set is assigned to the test data set, which is used for internal model testing during training;
10% of the feature data in the data set is assigned to the original validation data set, which is used for external validation of the trained model.
It should be further noted that, in step three, two data sets are configured;
one data set is processed using the MRMR module, and the top 20 features in the data set are recorded;
one data set is processed using the ETC module, and the top 20 features in the data set are recorded;
wherein the top features in the MRMR module are ranked according to the MIQ score, and the top features in the ETC module are ranked according to the Gini importance;
the features selected by the MRMR module and the ETC module are put together with all feature data in the data set;
the top 20 ranked features are listed in the appendix.
It should be further noted that step three further includes: training models on the top 5, 10 and 15 features obtained from the feature selection on the data set;
to see the general trend of accuracy versus number of features, model training is performed for every number of features between 1 and 20.
It should be further noted that, in step four, machine learning training is performed on the training data set and verified on the test data set;
during training, once a model is obtained, the training and test data sets are recombined and split into new training and test data sets;
the split is such that the training data set accounts for 80% of the whole data set and the test data set accounts for 10% of the whole data set;
the recombination is performed 50 times to obtain 50 different models;
the 50 models are tested against the validation data set, and the accuracy metrics are evaluated and recorded.
It should be further noted that step four further includes calculating the accuracy metrics.
The accuracy score is calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives;
the area under the curve (AUC) score of the receiver operating characteristic (ROC) curve measures the diagnostic capability of a binary classifier system by plotting the true positive rate (TPR) against the false positive rate (FPR) under various threshold settings;
ROC is a probability curve, and AUC represents the degree of separability;
the F1 score is the harmonic mean of precision and recall, and measures the classification capability of the model on both positive and negative cases; F1 is calculated as:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
all results are then put together for comparison, so that the machine learning algorithm whose performance meets the preset requirement is determined according to the accuracy metrics;
and feature importance analysis is performed using the SHAP values of the machine learning model that meets the preset requirement.
It is further noted that the preset machine learning algorithms include: the Gaussian naive Bayes (GNB) algorithm, the logistic regression (LR) algorithm, the random forest (RF) algorithm, the extreme gradient boosting decision tree (XGB) algorithm, the support vector machine (SVM) algorithm, the multi-layer perceptron (MLP) algorithm and the LASSO ensemble algorithm.
The invention also provides a system for analyzing and predicting the risk of non-alcoholic fatty liver disease, which comprises: a feature acquisition module, a data preprocessing module, a feature selection module and a machine learning prediction module;
the feature acquisition module is used for retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
the data preprocessing module is used for normalizing the feature data and dividing the data set into a training data set, a test data set and an original validation data set;
the feature selection module is used for performing feature selection on the data set using the MRMR module and the Gini importance in the ETC module;
the machine learning prediction module is used for training on the selected feature data and making predictions using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
According to the technical scheme, the invention has the following advantages:
the method for analyzing and predicting the risk of non-alcoholic fatty liver disease uses various machine learning techniques to predict the risk of NAFLD and to study their performance. A predictive model is constructed using demographic data, physical data and comprehensive blood biomarker panels from a health check center (for example, a population of N = 81,552) to predict the diagnosis of NAFLD, and the accuracy of the model is quantified on an independent portion of the data set. The present invention uses feature selection techniques, namely minimum redundancy maximum relevance (MRMR) and Gini importance, to select a set of top features. Several models are then created and optimized, including logistic regression, random forests, multi-layer perceptrons, and XGBoost. Finally, the invention examines whether a LASSO ensemble of these models yields an accuracy that exceeds that of the best-performing individual model.
Compared with existing studies, the goal of the present invention is to identify the most important demographic, physical and blood features by using a broader set of patient data and a more comprehensive feature selection process. This enables the analysis and prediction of the risk of non-alcoholic fatty liver disease while avoiding the constraints of traditional statistical techniques, which have limited ability to capture nonlinear relationships and variable-variable interactions and which rely on assumptions that may not hold in human biological systems.
Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for analyzing and predicting the risk of non-alcoholic fatty liver disease;
FIG. 2 is a graph of validation accuracy measurements of a baseline logistic regression model;
FIG. 3 is a schematic diagram of a system for analyzing and predicting the risk of non-alcoholic fatty liver disease.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for analyzing and predicting the risk of non-alcoholic fatty liver disease creates a model and predicts the risk of NAFLD using various machine learning techniques.
Machine learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied throughout all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The method for analyzing and predicting the risk of non-alcoholic fatty liver disease provided by the embodiments of the present invention can be applied to computer equipment, which can be a terminal or a server.
The terminal can be but not limited to a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an internet of things device and a portable wearable device. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform.
In the related art, for example, the risk of nonalcoholic fatty liver disease can be analyzed and predicted by a machine learning method, a deep learning method, or the like. Fig. 1 schematically shows a flow chart of an analytical prediction method for risk of non-alcoholic fatty liver disease according to an embodiment of the present disclosure.
The method comprises the following steps:
S101, retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
S102, normalizing the feature data, and dividing the data set into a training data set, a test data set and an original validation data set;
S103, performing feature selection on the data set using the MRMR module and the Gini importance in the ETC module;
S104, training on the selected feature data and making predictions using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
It will be appreciated that the present invention creates a model to predict the risk of NAFLD and studies the performance of various machine learning techniques. A predictive model is built using demographic data, physical data and comprehensive blood biomarker panels from a health check center (for example, a population of N = 81,552) to predict the diagnosis of NAFLD, and the accuracy of the model is quantified on an independent portion of the data set. The present invention uses feature selection techniques, namely minimum redundancy maximum relevance (MRMR) and Gini importance, to select a set of top features. Several models are then created and optimized, including logistic regression, random forests, multi-layer perceptrons, and XGBoost. Finally, the invention examines whether a LASSO ensemble of these models yields an accuracy that exceeds that of the best-performing individual model. Compared with existing studies, the goal of the present invention is to identify the most important demographic, physical and blood features by using a broader set of patient data and a more comprehensive feature selection process. This enables the analysis and prediction of the risk of non-alcoholic fatty liver disease while avoiding the constraints of traditional statistical techniques, which have limited ability to capture nonlinear relationships and variable-variable interactions and which rely on assumptions that may not hold in human biological systems.
One possible embodiment of the present invention is given below as a non-limiting illustration of a specific implementation.
S201, retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
In particular, the data set may come from an Electronic Medical Record (EMR) information system used for monitoring chronic diseases. There are 33 features in the data set, including various demographic, physical and blood measurements, and the target is a column indicating liver disease status (0 for healthy, other values for liver disease).
The raw data set contains 120,376 samples or patient entries. However, to remove erroneous data that may affect the accuracy and robustness of the machine learning model, the data set is filtered according to several conditions:
1. Blood pressure is limited to between 20 and 300 mmHg.
2. Height is limited to between 100 and 210 centimeters.
3. Weight is limited to between 20 and 200 kg.
4. Liver disease status is limited to 0 (healthy) and 1 (NAFLD).
This reduced the number of samples to 81,552, with 42,049 being males and 39,503 being females. There were 62,553 healthy samples (0) and 18,999 NAFLD (1) samples. The mean age of the participants was 40 years with a standard deviation of 14.7.
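The four screening conditions above can be sketched as a simple record filter. This is a minimal illustration, not the patent's implementation: the field names (`systolic_bp_mmhg`, `liver_status`, and so on) are hypothetical, the bounds are assumed to be inclusive, and the blood pressure limit is assumed to apply to both systolic and diastolic readings.

```python
# Hypothetical field names; the patent does not name the EMR columns.
LIMITS = {
    "systolic_bp_mmhg": (20, 300),   # assumption: limit applies to both readings
    "diastolic_bp_mmhg": (20, 300),
    "height_cm": (100, 210),
    "weight_kg": (20, 200),
}
ALLOWED_LABELS = {0, 1}  # 0 = healthy, 1 = NAFLD


def passes_screening(record: dict) -> bool:
    """Return True if every measurement lies within its allowed range."""
    for field, (low, high) in LIMITS.items():
        value = record.get(field)
        # Missing values are rejected here; this is an assumption.
        if value is None or not (low <= value <= high):
            return False
    return record.get("liver_status") in ALLOWED_LABELS


def screen(records: list) -> list:
    """Keep only the records that satisfy all screening conditions."""
    return [r for r in records if passes_screening(r)]
```

Applied to the raw data, a filter of this kind is what would reduce the 120,376 entries to the 81,552 retained samples.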
S202, carrying out normalization processing on the characteristic data, and dividing the data set into a training data set, a test data set and an original verification data set;
All categorical features, such as gender, are converted to numerical values (e.g., 0 for female and 1 for male).
Then, all scalar features (e.g., age) are normalized using the z-score method:
$$z = \frac{x - \mu}{\sigma}$$
where x is the sample, μ is the overall mean, and σ is the overall standard deviation.
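As a minimal sketch of the z-score formula, the function below standardizes one feature column using the population mean and population standard deviation. Returning zeros for a constant column is an assumption made here to avoid division by zero; the patent does not discuss that case.

```python
from statistics import fmean, pstdev


def z_score(values: list) -> list:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]
```

For example, the column `[2, 4, 4, 4, 5, 5, 7, 9]` has mean 5 and population standard deviation 2, so the value 9 maps to 2.0.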
Finally, the features and labels are separated, and the data set is divided into training, test and original validation data sets.
In the present invention, 80% of the data set (65,242 samples) is used for training the model, 10% (8,155 samples) is used for internal model testing during training, and the remaining 10% (8,155 samples) is held out and used for external validation of the model after training is complete. The training set and the internal test set are reshuffled and repartitioned throughout the training process to ensure model consistency. The original external validation set is never used for training or testing. The performance of each model is then evaluated against the original external validation set. This ensures that the findings of the present invention generalize to new data sets.
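The 80/10/10 partition can be sketched as follows. The function name, the fixed seed and the rounding rule are assumptions for illustration; with 81,552 samples this rounding reproduces the stated sizes of 65,242, 8,155 and 8,155.

```python
import random


def split_80_10_10(samples: list, seed: int = 0):
    """Shuffle and split samples into (train, test, validation) at 80/10/10."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = round(0.8 * n)
    n_test = round(0.1 * n)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # the remainder, held out
    return train, test, validation
```

The validation slice is set aside once and is not touched by any later reshuffling of the training and test portions.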
S203, performing feature selection on the data set using the MRMR module and the Gini importance in the ETC module;
To find the most important features that can efficiently and accurately determine whether a patient has NAFLD, feature selection is performed on the patient data set. Two feature selection techniques are used: (1) minimum redundancy maximum relevance (MRMR) and (2) Gini importance in the extra trees classifier (ETC).
The MRMR module is an unsupervised feature selection technique: MRMR considers only the relative importance among the features, not the labels. The criterion used by MRMR is the Mutual Information Quotient (MIQ) score. MRMR selects the most relevant features based on the pairwise correlation or mutual information of each pair of variables in the data set, while minimizing the redundancy between variables.
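The pairwise mutual information mentioned above can be sketched for discrete variables as shown below. This illustrates only the building block; the MIQ criterion is commonly described as a quotient of a relevance term and a redundancy term built from such values, and since the patent does not spell out how it forms that quotient, it is not reproduced here. Continuous features would need to be discretized first, which is also an assumption.

```python
from collections import Counter
from math import log


def mutual_information(xs: list, ys: list) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )
```

Two identical balanced binary variables share ln 2 nats of information, while two independent ones share none, which is why a highly redundant pair of features scores high on this measure.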
The ETC module is a supervised feature selection technique, meaning that both the features and the labels are taken into account. It fits a number of randomized decision trees (also called extra trees) on subsamples of the data set and averages the results. The importance of a feature is obtained by computing the normalized total reduction of the splitting criterion brought about by that feature, which is called the Gini importance. Gini importance, also known as Mean Decrease in Impurity (MDI), computes the importance of each feature as the sum over the number of splits, across all decision trees, that include the feature, in proportion to the number of samples it splits.
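The quantity that Gini importance accumulates can be sketched for a single split. The function names are illustrative; a full extra-trees implementation would sum this weighted decrease over every split that uses a given feature, across all trees, and then normalize the totals.

```python
from collections import Counter


def gini_impurity(labels: list) -> float:
    """Gini impurity, 1 - sum_k p_k^2, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent: list, left: list, right: list, n_total: int) -> float:
    """Impurity decrease of one split, weighted by the share of samples it splits."""
    n = len(parent)
    children = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / n
    return (n / n_total) * (gini_impurity(parent) - children)
```

A node holding two healthy and two NAFLD samples has impurity 0.5; a split that separates them perfectly removes all of it, and the weight `n / n_total` scales that gain by how many samples reach the node.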
First, both the MRMR module and the ETC module are applied to both the raw and the processed data sets. Two data sets are used so that differences in the features caused by preprocessing can be accounted for. The top 20 features selected by MRMR and by ETC on the two data sets are then recorded. The top features in MRMR are ranked by MIQ score, while for ETC the top features are ranked according to their Gini importance. The ranking of the features is also recorded for reference.
Then, all important features from both methods and both data sets are put together. The present invention ranks the top 20 features according to their number of occurrences in order to select them: the more often a feature occurs, the more important it is. The maximum number of occurrences is 3, because results are difficult to obtain with MRMR on the unprocessed data; the minimum number of occurrences considered by the present invention is 2. Since many features occur twice, features that occur in both the MRMR module and the ETC module are prioritized, and the remaining features are ranked according to feature importance.
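The occurrence-based merge can be sketched as a sort with three keys: number of occurrences, whether the feature was selected by both methods, and feature importance. The data layout (a dictionary keyed by hypothetical `(method, dataset)` pairs) and the single `importance` lookup used for tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def merge_rankings(rankings: dict, importance: dict, top_k: int = 20) -> list:
    """Merge per-(method, dataset) feature lists into one ranked list."""
    counts = Counter()
    methods = {}
    for (method, _dataset), features in rankings.items():
        for f in set(features):
            counts[f] += 1
            methods.setdefault(f, set()).add(method)

    def sort_key(f):
        in_both_methods = len(methods[f]) >= 2
        # More occurrences first, then features seen by both methods,
        # then higher importance; the name breaks any remaining tie.
        return (-counts[f], not in_both_methods, -importance.get(f, 0.0), f)

    return sorted(counts, key=sort_key)[:top_k]
```

With three result lists available (MRMR on processed data, ETC on raw and processed data), a feature can appear at most three times, matching the maximum stated above.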
The top 20 features are selected and ranked according to these criteria and are listed in appendix N. After the machine learning step described below, the top features are identified by taking the SHAP (SHapley Additive exPlanations) values from the best-performing model. This step identifies the relative importance of the features once more and checks which values of particular features contribute most to the overall prediction. The SHAP value is the Shapley value of a conditional expectation function of the model. The Shapley value originates from cooperative game theory, in which each cooperative game is assigned a unique distribution (among the players) of the total surplus generated by the coalition of all players. This helps determine the relative contribution of each player to the total.
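The game-theoretic definition behind SHAP can be illustrated with an exact Shapley computation on a toy cooperative game. This is not how SHAP values are computed for a trained model in practice (that relies on approximations and model-specific algorithms); the sketch only shows how the total payoff is distributed by averaging each player's marginal contribution over all orderings.

```python
from itertools import permutations
from math import factorial


def shapley_values(players: list, value) -> dict:
    """Exact Shapley values; `value` maps a frozenset of players to a payoff."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}
```

In a game where players `a` and `b` earn 10 only together and `c` adds 2 on its own, the result is 5 for `a`, 5 for `b` and 2 for `c`, and the shares sum to the payoff of the full coalition. The enumeration grows factorially, so it is only usable for a handful of players.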
Illustratively, when performing feature selection, the present invention uses a total of 35 different features to identify the most important features for predicting NAFLD status. The feature selection process is carried out on the processed data set using minimum redundancy maximum relevance (MRMR) and the Gini importance of the extra trees classifier. The top 20 features selected by both methods on each data set can be found in Table 1.
Table 1
Rank  Feature name                                 Feature importance score
1     Height                                       0.290
2     Critical point of fasting blood glucose 5    5.865
3     Hemoglobin                                   5.805
4     Eosinophil count                             3.359
5     High density lipoprotein cholesterol         4.127
6     Weight                                       3.045
7     Total bilirubin                              2.761
8     Erythrocyte count                            2.600
9     Systolic blood pressure                      2.182
10    Leucine aminopeptidase                       2.211
11    Mean hemoglobin concentration                2.184
12    Alkaline phosphatase                         1.999
13    Alanine aminotransferase                     1.994
14    Platelet count                               1.687
15    Cholinesterase                               1.404
16    Body mass index                              1.374
17    Diastolic blood pressure                     1.259
18    Globulin                                     1.141
19    Triglycerides                                1.157
20    Albumin                                      1.072
The most important features are selected according to the number of times they appear in the feature importance ranking list. The most important features appearing in both lists are: { weight, diastolic blood pressure, triglyceride, platelet count, systolic blood pressure, height, High Density Lipoprotein (HDL) cholesterol, hemoglobin, BMI, erythrocytes, alanine aminotransferase, fasting plasma glucose }.
Less relevant features that appear in only one of the lists are: { leukocyte count, age, neutrophil count, lymphocyte count, aspartate aminotransferase, mean hemoglobin concentration, sex }.
After merging the two lists, the first 20 features are in turn: { body weight, diastolic blood pressure, triglyceride, platelet count, systolic blood pressure, height, High Density Lipoprotein (HDL) cholesterol, hemoglobin, BMI, erythrocytes, alanine aminotransferase, fasting plasma glucose, leukocyte count, age, neutrophil count, lymphocyte count, aspartate aminotransferase, mean hemoglobin concentration, sex, eosinophil count }.
S204, training on the selected feature data and making predictions using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
The following machine learning algorithms are used to train models that predict whether a person has NAFLD:
Gaussian Naive Bayes (GNB): applies Bayes' theorem under the naive assumption of independence between features. Gaussian naive Bayes assumes that the likelihood of each feature is Gaussian.
Logistic Regression (LR): a statistical model that uses a logistic function to model a categorical dependent variable, typically a binary one.
Random Forest (RF): an ensemble learning method for classification, regression and other tasks that constructs a large number of decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Extreme gradient boosting decision tree (XGB): a variant of the gradient boosting technique designed to improve performance. It combines decision trees with stochastic gradient boosting and regularized gradient boosting.
Support Vector Machine (SVM): a classification technique that uses support vectors to determine a set of hyperplanes separating the different classes.
Multilayer perceptron (MLP): a feedforward artificial neural network (ANN) composed of multiple layers of nodes with biases and activation thresholds, connected by weighted edges.
LASSO ensemble (a stacked generalization ensemble that combines all of the above methods using a LASSO regression classifier): ensembling aims to combine the predictions of multiple base estimators to improve generalization and robustness over a single estimator.
For each algorithm, models were trained on the top 5, 10 and 15 features obtained from feature selection. Then, to see the overall trend of accuracy versus number of features, models were trained for every feature count from 1 to 20: starting with the 20 most important features, the least important remaining feature is removed at each step and the model is retrained.
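The feature-count sweep described above can be sketched as follows. This is only an illustration: `RANKED_FEATURES` copies the merged 20-feature list quoted earlier in this section and assumes it is ordered from most to least important, and `train_and_score` is a hypothetical callback standing in for whichever model is being trained and scored.

```python
RANKED_FEATURES = [
    "body weight", "diastolic blood pressure", "triglyceride", "platelet count",
    "systolic blood pressure", "height", "HDL cholesterol", "hemoglobin", "BMI",
    "erythrocytes", "alanine aminotransferase", "fasting plasma glucose",
    "leukocyte count", "age", "neutrophil count", "lymphocyte count",
    "aspartate aminotransferase", "mean hemoglobin concentration", "sex",
    "eosinophil count",
]

def nested_feature_subsets(ranked):
    """Yield the top-k features for k = len(ranked) down to 1.

    Each step drops the least important remaining feature.
    """
    for k in range(len(ranked), 0, -1):
        yield ranked[:k]

def sweep_feature_counts(ranked, train_and_score):
    """Map each feature count to the score returned by the hypothetical callback."""
    return {len(subset): train_and_score(subset)
            for subset in nested_feature_subsets(ranked)}
```

Plotting the returned scores against the feature count gives the accuracy trend from which a suitable number of features can be read off.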
For all model training, the data set was divided into training, testing and raw validation data sets, accounting for 80%, 10% and 10% of the entire data set, respectively. Machine learning training is done entirely on the training data set and verified on the test data set. For a fair comparison, the training, testing, and raw validation data sets for all different algorithms remain the same.
During training, once a model is obtained, the training and test data sets are recombined and split into new training and test data sets. They are split in the same proportions as before: the training data set accounts for 80% of the entire data set and the test data set for 10%. This recombination is performed 50 times to obtain 50 different models. The 50 recombinations of the data set are also kept identical across the different algorithms.
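A minimal sketch of the splitting scheme described above, using only the Python standard library. The function name `make_splits`, the use of integer row indices, and the fixed seed are assumptions for illustration; the fixed seed is one way to keep the 50 recombinations identical across algorithms.

```python
import random

def make_splits(n_samples, n_repeats=50, seed=0):
    """Hold out 10% for validation once, then reshuffle the remaining 90%
    into train (80% of all samples) and test (10% of all samples) n_repeats times."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_holdout = n_samples // 10
    validation = indices[:n_holdout]   # fixed for every model
    pool = indices[n_holdout:]         # recombined train + test samples

    splits = []
    for _ in range(n_repeats):
        shuffled = pool[:]
        rng.shuffle(shuffled)
        test = shuffled[:n_holdout]
        train = shuffled[n_holdout:]
        splits.append((train, test))
    return validation, splits

validation_idx, all_splits = make_splits(1000)
```

Because the validation indices are drawn once and never enter the pool, every one of the 50 models is evaluated on data it has not seen during training or internal testing.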
These 50 models were then evaluated on the validation data set. The following accuracy indices were calculated and recorded.
Accuracy score:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.
Area under the curve (AUC) score for the Receiver Operating Characteristic (ROC) curve: measures the diagnostic ability of a binary classifier system by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. ROC is a probability curve and AUC represents the degree of separability.
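F1 score: the harmonic mean of precision and recall, which measures the classification ability of the model in both positive and negative cases.
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).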
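The three accuracy indices named above can be computed as in the following sketch. The function names are hypothetical, labels are assumed to be 0 (no NAFLD) and 1 (NAFLD), and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy_score(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half. Quadratic in the sample size, fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In the procedure described here, these functions would be applied to each of the 50 models' predictions on the validation data set.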
All results are then put together for direct comparison to determine the best performing machine learning algorithm based on the accuracy index. After selecting the best performing algorithm, the present invention determines the best number of features to use. The optimal feature number is determined by examining the accuracy trend.
Finally, the present invention performs feature importance analysis using SHAP (SHapley Additive exPlanations) values from the best-performing model, where higher values indicate greater feature importance. The model is interpreted by examining its features.
In the machine learning training process of the present invention, 3 groups of models are trained using the top-ranked features obtained from feature selection. The first group uses the top 5 features, the second group the top 10 features, and the last group the top 15 features. This is used to estimate how many features a machine learning model needs in order to predict the presence of NAFLD with sufficient accuracy. For these groups of models, the default hyper-parameter settings in scikit-learn are kept, for a fair and direct comparison between the different machine learning techniques.
As can be seen from table 2, the accuracy gap between the top 10 and 15 features is less than the gap between the top 5 and 10 features for all models, indicating a diminishing return for increasing the number of features above 10. Among machine learning techniques, LASSO Ensemble and random forests perform best; however, this is due to the default stopping criteria of these models, which makes the model size very large relative to other models. As expected, the LASSO Ensemble model performs best.
Table 2: validation AUC ROC scores for various baseline machine learning techniques, spanning different numbers of the most important features.
Figure BDA0003740746680000141
To further determine how many features are needed to predict NAFLD status accurately, the present invention plots the accuracy measures of each baseline model over an increasing number of the most important features (from 1 feature to 20 features). Figure 2 shows logistic regression as an example: accuracy starts to decline after 6 features, and after 11 features it stops dropping and remains approximately constant. Other models show similar trends, so for the rest of the machine learning work the present invention uses only the 10 most important features or fewer.
Since the default machine learning model does not maximize performance and there is no agreement between models, the present invention adjusts the hyper-parameters of each model to maximize its accuracy, regardless of model size. The results are shown in Table 3.
Table 3: Validation AUC ROC scores for various tuned machine learning techniques on the 10 most important features.
Figure BDA0003740746680000151
This confirms that the object of the present invention, namely LASSO integration technique, is the best overall technique, since it can combine the supplementary information of the individual models and improve the overall performance.
Table 4 shows the results of the SHAP (SHapley Additive exPlanations) analysis, giving the average SHAP value of each of the top 20 features of the LASSO integration model. According to the SHAP analysis, BMI is the most important factor in predicting non-alcoholic fatty liver disease, followed by alanine aminotransferase and triglycerides. Age is also a very important factor.
Table 4: Average SHAP (SHapley Additive exPlanations) values and 95% confidence intervals for each of the top 20 features of the LASSO integration model.
Figure BDA0003740746680000152
Figure BDA0003740746680000161
The present invention uses 7 advanced machine learning techniques to predict the presence of NAFLD over the 10 most important features. Using MRMR and ETC, the first 10 most important features are: body weight, BMI, alanine aminotransferase, HDL cholesterol, fasting glucose, triglycerides, diastolic blood pressure, hemoglobin, systolic blood pressure, and red blood cell count. The results of the present invention indicate that the LASSO integration technique has the best performance in predicting NAFLD.
The present invention uses a LASSO regression classifier (LASSO Ensemble) to build a stacked generalization ensemble of the various models for predicting NAFLD status. The advantage of using an ensemble model in the present invention is that predictions from numerous base estimators are combined to improve on the generalization and robustness of the individual estimators and to minimize bias between individual machine learning models. The resulting accuracy measures are superior to those of any single machine learning model.
Based on the above method for analyzing and predicting risk of non-alcoholic fatty liver disease, the present invention further provides a system for analyzing and predicting risk of non-alcoholic fatty liver disease. As shown in fig. 3, the system includes: a feature acquisition module, a data preprocessing module, a feature selection module and a machine learning prediction module;
the feature acquisition module is used for retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
the data preprocessing module is used for normalizing the feature data and dividing the data set into a training data set, a test data set and an original verification data set;
the feature selection module is used for performing feature selection on the data set using the MRMR module and the Gini importance from the ETC module;
the machine learning prediction module is used for training and predicting on the feature data after feature selection using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
An object of the present invention is to identify the most important demographic, physical and pedigree characteristics by using a broader set of patient data and a more comprehensive feature selection process. This enables analysis and prediction of the risk of non-alcoholic fatty liver disease with machine learning models rather than traditional statistical techniques, which are limited in their ability to capture nonlinear relationships and variable-variable interactions and which rely on assumptions that may be incorrect for human biological systems.
The elements and algorithm steps of the various examples described in the embodiments disclosed in the method and system for analyzing and predicting risk of non-alcoholic fatty liver disease provided by the present invention can be implemented in electronic hardware, computer software, or a combination of both, and in the above description, the components and steps of the various examples have been generally described in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The block diagram shown in the attached drawing of the system for analyzing and predicting the risk of the nonalcoholic fatty liver is only a functional entity and does not necessarily correspond to a physically independent entity. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The present invention provides a method and system for analyzing and predicting risk of non-alcoholic fatty liver disease, which is implemented by electronic hardware, computer software, or a combination thereof, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for analyzing and predicting risk of non-alcoholic fatty liver disease, characterized by comprising the following steps:
step one, retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
step two, normalizing the feature data, and dividing the data set into a training data set, a test data set and an original verification data set;
step three, performing feature selection on the data set using the MRMR module and the Gini importance from the ETC module;
and step four, training and predicting on the feature data after feature selection using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
2. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 1, characterized in that,
step one further comprises setting screening conditions for the user feature data, the screening conditions being:
blood pressure is limited to between 20 and 300 mmHg;
height is limited to between 100 and 210 centimeters;
weight is limited to between 20 and 200 kg.
3. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 1, characterized in that,
in step two, the z-score method is used to normalize all the feature data:
$$z = \frac{x - \mu}{\sigma}$$
where x is the sample, μ is the overall mean, and σ is the overall standard deviation.
4. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 1, characterized in that,
in step two, 80% of the feature data in the data set is divided into a training data set for training the model;
10% of the feature data in the data set is divided into a test data set for internal testing of the model during training;
10% of the feature data in the data set is divided into an original verification data set for external verification of the trained model.
5. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 1, characterized in that,
in step three, two data sets are configured;
one data set is preprocessed using the MRMR module, and the top 20 features in the data set are recorded;
the other data set is preprocessed using the ETC module, and the top 20 features in the data set are recorded;
wherein the top-ranked features in the MRMR module are ranked according to the MIQ score, and the top-ranked features in the ETC module are ranked according to the Gini importance;
the features preprocessed by the MRMR module and the ETC module are put together with all feature data in the data set;
the top 20 ranked features are listed in the appendix.
6. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 5, characterized in that,
step three further comprises: training models on the top 5, 10 and 15 features obtained from feature selection on the data set;
and, to see the general trend of accuracy versus number of features, training models for every feature count from 1 to 20.
7. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 6, characterized in that,
in step four, machine learning training is performed on the training data set and verified on the test data set;
during training, once a model is obtained, the training and test data sets are recombined and split into new training and test data sets;
the split is such that the training data set accounts for 80% of the entire data set and the test data set accounts for 10% of the entire data set;
the recombination is performed 50 times to obtain 50 different models;
the 50 models are tested for accuracy against the verification data set, and the accuracy indices are evaluated and recorded.
8. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 7, characterized in that step four further comprises calculating the accuracy indices as follows:
the accuracy score is:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
wherein TP is true positive, TN is true negative, FP is false positive and FN is false negative;
the area under the curve (AUC) score for the receiver operating characteristic (ROC) curve measures the diagnostic capability of the binary classifier system by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings;
ROC is a probability curve, and AUC represents the degree of separability;
the F1 score is the harmonic mean of precision and recall, and measures the classification capability of the model in both positive and negative cases; F1 is calculated as:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
all results are then put together for comparison, so that a machine learning algorithm whose performance meets the preset requirement is determined according to the accuracy indices;
and feature importance analysis is performed using the SHAP values of the machine learning model that meets the preset requirement.
9. The method for analyzing and predicting risk of non-alcoholic fatty liver disease according to claim 1, characterized in that,
the preset machine learning algorithm comprises: a Gaussian naive Bayes (GNB) algorithm, a logistic regression (LR) algorithm, a random forest (RF) algorithm, an extreme gradient boosting decision tree (XGB) algorithm, a support vector machine (SVM) algorithm, a multilayer perceptron (MLP) algorithm and a LASSO integration algorithm.
10. A system for analyzing and predicting risk of non-alcoholic fatty liver disease, characterized in that the system adopts the method for analyzing and predicting risk of non-alcoholic fatty liver disease according to any one of claims 1 to 9;
the system comprises: a feature acquisition module, a data preprocessing module, a feature selection module and a machine learning prediction module;
the feature acquisition module is used for retrieving feature data of a plurality of users from an electronic medical record information system to form a data set;
the data preprocessing module is used for normalizing the feature data and dividing the data set into a training data set, a test data set and an original verification data set;
the feature selection module is used for performing feature selection on the data set using the MRMR module and the Gini importance from the ETC module;
the machine learning prediction module is used for training and predicting on the feature data after feature selection using a preset machine learning algorithm, and determining whether the user corresponding to the feature data has NAFLD.
CN202210810842.8A 2022-07-11 2022-07-11 Method and system for analyzing and predicting risk of non-alcoholic fatty liver disease Pending CN115116615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210810842.8A CN115116615A (en) 2022-07-11 2022-07-11 Method and system for analyzing and predicting risk of non-alcoholic fatty liver disease

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210810842.8A CN115116615A (en) 2022-07-11 2022-07-11 Method and system for analyzing and predicting risk of non-alcoholic fatty liver disease

Publications (1)

Publication Number Publication Date
CN115116615A true CN115116615A (en) 2022-09-27

Family

ID=83333038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210810842.8A Pending CN115116615A (en) 2022-07-11 2022-07-11 Method and system for analyzing and predicting risk of non-alcoholic fatty liver disease

Country Status (1)

Country Link
CN (1) CN115116615A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117577214A (en) * 2023-05-19 2024-02-20 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm
CN117577214B (en) * 2023-05-19 2024-04-12 广东工业大学 Compound blood brain barrier permeability prediction method based on stack learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination