CN112825275A

CN112825275A - Method for predicting health state through physical examination indexes based on machine learning

Info

Publication number: CN112825275A
Application number: CN202011311946.1A
Authority: CN
Inventors: 黄璐琳; 帅平; 刘玉萍; 邓燕辉; 王海鑫
Original assignee: Sichuan Provincial Peoples Hospital
Current assignee: Sichuan Provincial Peoples Hospital
Priority date: 2019-11-21
Filing date: 2020-11-20
Publication date: 2021-05-21
Also published as: WO2021098842A1

Abstract

The invention discloses a method for predicting health state through physical examination indexes based on machine learning. The method predicts the health state of sampled subjects from their physical examination indexes using a random forest algorithm. For the healthy and unhealthy state samples, a down-sampling strategy is applied to randomly drawn samples to obtain the sampled set, and a random under-sampling method balances the data by randomly selecting a data subset of a target class. The invention has good prediction performance, can help predict the actual condition of patients with disease deterioration, identify 'high-risk' patients and provide the most relevant follow-up examination for the affected individuals.

Description

Method for predicting health state through physical examination indexes based on machine learning

Technical Field

The invention relates to the field of medical treatment, in particular to a method for predicting health status through physical examination indexes based on machine learning.

Background

Current physical examination conclusions generally give suggestions based on one or a few relatively independent leading indicators, so the results given are often ambiguous.

Because the correlation between Physical Examination Indicators (PEI) has not been studied systematically, most indicators are currently used independently for disease forewarning. This leaves general physical examinations with very limited diagnostic value.

Compared to clinical medical treatment, the overall basic medical system has a greater impact on human health. Health checks can help healthy people gain insight into their own bodily functions, maintain health, and improve health by changing unhealthy habits and avoiding risk factors that may lead to disease [2]. Physical examination can minimize the disruption caused by disease. With the growing size and age of the population, the need for healthcare is increasing, and healthcare services are becoming more and more sophisticated and costly.

Health checks are a common element of healthcare in developed countries. These tests include general blood tests, urine tests, blood sugar tests, blood lipid tests, renal function tests, and the like. However, physical examination reports are currently evaluated mainly on the basis of one or two independent Physical Examination Indicators (PEI), which provide only very limited information about the health status or disease diagnosis of the examinees [6]. Although it is desirable to provide valuable information for public health care by defining a small number of easily measured PEI, the correlation between PEI in different physical states (e.g., healthy, hypertension, diabetes) has not been systematically studied for use in accurately diagnosing diseases before they occur.

The amount of available health data has recently proliferated, and improving the quality of care is expected to improve population health while containing cost increases. Health check centers generate large volumes of data that may reveal potential health issues not otherwise discovered. Clinically, more and more investment is being made in developing medical big data applications, such as Artificial Intelligence (AI)-based applications for diagnosing diseases from clinical images. While AI can save costs and improve efficiency, particularly for early diagnosis and prevention of chronic diseases, no PEI-based model for predicting physical condition has been produced so far, owing to the lack of systematic analysis of PEI under different conditions.

Disclosure of Invention

The invention aims to provide a method for predicting health status through physical examination indexes based on machine learning, which can help predict the actual situation of a patient with disease deterioration, identify 'high-risk' patients and provide the most relevant follow-up examination for affected individuals.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

The invention discloses a method for predicting health states through physical examination indexes based on machine learning, in which a random forest algorithm is used to predict the health state from the physical examination indexes.

Preferably, for the healthy and unhealthy state samples, a down-sampling strategy is applied to randomly drawn samples to obtain the sampled set, and a random under-sampling method is adopted to balance the data by randomly selecting a data subset of a target class.

Preferably, a physical examination index feature extraction strategy is adopted to extract the physical examination index which contributes most to each healthy and unhealthy state.

Preferably, the physical examination index feature extraction strategy is a univariate statistics strategy, and automatic feature selection is performed using feature_selection in scikit-learn.

Preferably, in each of the healthy and unhealthy states, the top 15% or 16% of representative physical examination indexes are extracted by feature extraction for prediction.

Preferably, there are 30 representative examination indexes.

Preferably, the random forest algorithm model is established by randomly grouping the data: 30% of the data form a test set, and the remaining 70% are randomly grouped again, with 70% used as a training set for training the model and 30% used as a validation set for evaluating the model.

Preferably, in the process of improving the model generalization performance by adjusting the parameters, a cross validation method of grid search is adopted.

Preferably, the cross-validation method is implemented using GridSearchCV provided by scikit-learn.
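The two-stage grouping described in the preferred embodiments above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `nested_split`, the seed, and the toy size of 1,000 records are assumptions made for illustration and do not come from the source.

```python
import random

def nested_split(items, test_frac=0.30, val_frac=0.30, seed=0):
    """Hypothetical helper: hold out a test set, then split the rest again.

    30% of all records form the test set; the remaining 70% are shuffled
    into 70% training and 30% validation, as the text describes.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Toy example with 1,000 record identifiers (illustrative size only).
train, val, test = nested_split(range(1000))
print(len(train), len(val), len(test))  # 490 210 300
```

Under this scheme the training, validation and test sets hold 49%, 21% and 30% of the records respectively.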

The invention has good prediction performance, can help to predict the real situation of patients with disease deterioration, identify 'high-risk' patients and provide the most relevant follow-up examination for the affected individuals.

The present invention enables the determination of correlations between PEI in a healthy state and in an unhealthy state (i.e., a state with underlying chronic disease); elucidating the relationship between chronic disease and normal individuals of these PEI's to find candidate disease markers; with the machine learning model of the present invention, the health of an individual can be predicted using only a complete set of PEI's without the need for detailed clinical examination information.

Drawings

FIG. 1 shows the machine learning prediction results of 35 body states based on the random forest algorithm.

FIG. 2 is a graph showing the correlation of PEI detected in a healthy population.

FIG. 3 shows representative candidate markers for an unhealthy physical condition.

FIG. 4 is a graph showing the ratio of the plasma HDL-C concentration in patients with normal physical conditions to that in patients with diabetes.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.

Building on the general physical examination, the method provides an algorithm that can predict the onset of common diseases. The prediction performance of physical examination indexes was obtained by testing three machine learning models: a kernel Support Vector Machine (SVM), a multilayer perceptron (MLP) and a random forest. The random forest algorithm was then selected and further optimized. First, because the numbers of healthy and unhealthy samples are unevenly distributed, and in view of the law of large numbers, the inventors adopted a down-sampling strategy on randomly drawn samples. Because the data have severe class imbalance, a random under-sampling method is adopted, which balances the data by randomly selecting a data subset of a target class. Second, a physical examination index feature extraction strategy is adopted to extract the physical examination indexes that contribute most to each healthy and unhealthy state. The feature extraction uses a univariate statistical strategy to select features automatically: univariate statistics select the features with high confidence according to the statistical significance of the relationship between each feature and the target. This process can be implemented using feature_selection in scikit-learn. In each of the healthy and unhealthy states, the top 15% or 16% of representative examination indexes (about 30) are extracted by feature extraction for prediction. The advantage of this approach is that it is typically very fast and completely independent of the model applied after feature selection. Finally, the data are randomly grouped: 30% form the test set, and the remaining 70% are randomly grouped again, with 70% used as the training set for training the model and 30% as the validation set for evaluating the model. In the process of improving the model's generalization performance by adjusting parameters, cross-validated grid search is adopted, implemented using GridSearchCV provided by scikit-learn.
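The feature selection, grouping and grid search steps described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the inventors' implementation: the data are synthetic (`make_classification`), the univariate score function `f_classif`, the `SelectPercentile` selector, the parameter grid and the 3-fold setting are all choices made for this sketch, since the source names only the `feature_selection` module and `GridSearchCV`. Placing the selector inside a `Pipeline` is also a choice of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for one balanced healthy-vs-unhealthy table (assumption).
X, y = make_classification(n_samples=600, n_features=40,
                           n_informative=8, random_state=0)

# 30% test set; the remaining 70% is split again into 70% train / 30% validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.30, stratify=y_rest, random_state=0)

# Univariate selection of the top 15% of features, followed by a random forest.
pipe = Pipeline([
    ("select", SelectPercentile(score_func=f_classif, percentile=15)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Illustrative grid only; the source does not list the tuned parameters.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out validation set.
val_auc = roc_auc_score(y_val, search.predict_proba(X_val)[:, 1])
n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(f"validation AUC: {val_auc:.3f}, features kept: {n_selected}")
```

Keeping the selector inside the pipeline means the feature scores are recomputed on each cross-validation fold, which avoids leaking validation data into the selection step.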

As shown in FIG. 1 and Table 1, the AUC reached 66%-99% (mean 87.6%) in the random forest predictions for each pair of healthy and unhealthy physical states. For classification, AUC values above 90% indicate excellent performance, and values from 80% to 90% indicate good performance. The inventors' algorithm provides high-accuracy prediction for 18 of the 34 unhealthy physical states (AUC > 90%) and good performance for another 9 unhealthy physical states (90% > AUC > 80%). Predictions for patients with heart-related diseases showed excellent performance. These results show that, by extracting 15%-16% of the 221 physical examination indexes as features and using only a small number of samples, the inventors' random forest algorithm provides good performance for the prediction of most unhealthy physical states.

Because current physical examination conclusions generally give suggestions based on one or a few relatively independent leading indicators, many of the results given are ambiguous and their value for judging the health condition of examinees is very limited. A more accurate index system and method are urgently needed to judge the health condition of examinees. The inventors developed a random forest machine learning algorithm that can predict diseases from 15%-16% of the 221 physical examination indexes with good prediction performance (AUC: 66%-99%, average 86%). For each disease, the inventors defined about 30 contributing physical examination indexes by feature extraction. In most of the inventors' predictive algorithms, only a few hundred samples were needed to provide good predictive performance for many chronic diseases. This finding suggests that machine learning using physical examination index data can help predict the true condition of a patient with an exacerbation, identify "high-risk" patients, and provide the most relevant follow-up examinations for affected individuals. The machine learning algorithm developed by the inventors can be applied immediately in clinical practice to assist in interpreting the results of a physical examination.

TABLE 1. Prediction performance of the model. ROC: receiver operating characteristic; AUC: area under the curve.

As medical reform has made remarkable progress in expanding insurance coverage, the general physical examination industry is now accumulating large amounts of data. Using a very large general health check data set from the Chinese population, the present invention has three main goals: determining the correlations between PEI in healthy and unhealthy individuals (i.e. patients with underlying chronic disease); elucidating how these PEI's differ between chronic disease and the normal state, in order to find candidate disease markers; and developing a set of machine learning models that can predict the health of an individual using a complete set of PEI's. To address these goals, the inventors included physical examination data from 803,614 individuals who visited a health check center between 2013 and 2018. The inventors included data for 221 PEI's associated with 35 physical conditions, most of which were unhealthy conditions due to chronic disease.

As shown in FIGS. 2, 3 and 4, participants represented 35 physical conditions, classified by healthy condition or underlying disease condition (unhealthy condition). Specifically, the study population included 711,928 healthy participants, 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 participants in other unhealthy states. The correlation between each pair of PEI's was calculated in R using the Pearson Correlation Coefficient (PCC) method. Linear regression models (lm in R) were used to compare PEI between the healthy condition and unhealthy conditions, adjusted for gender and age. A random forest machine learning algorithm was used for health state prediction. There is a wide correlation between PEI's in different body conditions. Abundant PEI differences were observed between healthy and unhealthy physical conditions. Machine learning algorithms can be used to predict physical state from the set of PEI's collected in a routine physical examination. The inventors found abundant correlations between PEI's in the healthy physical condition (7,662 significant correlations, accounting for 31.5% of all correlations). However, PEI associations change under disease conditions. The inventors further focused on the PEI differences between the healthy state and the 34 unhealthy states and found 1,239 significant PEI differences, suggesting that these are candidate disease markers. Finally, the inventors developed a machine learning algorithm that predicts health state using 15%-16% of the PEI's selected through feature extraction, with an AUC of 66%-99% depending on the physical state.

This new PEI encyclopedia provides rich information for the diagnosis of chronic diseases. The machine learning algorithm developed by the inventors can have a far-reaching influence on routine physical examination practice.

The inventors included 803,614 individuals who visited the health management center and physical examination center between 2013 and 2018. Participants represented 35 physical conditions, classified by healthy condition or underlying disease condition (unhealthy condition). Specifically, the study population included 711,928 healthy participants, 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 participants with other unhealthy conditions (mainly chronic disease) (Table 1). The inventors included 221 PEI's in the analysis, including patient demographic information (age and gender) and lifestyle indicators (alcohol consumption, tobacco usage, etc.).

PEI correlation of participants in a healthy condition

The primary goal of the inventors was to explore PEI correlations under healthy conditions in order to map the overall landscape. Among the 24,322 PEI pairs formed from the 221 PEI's, the inventors found 7,662 significant correlations (31.5%) in participants in a healthy condition (P < 0.05/24,322 PEI pairs = 2 × 10^-6; N = 711,928, average age 41.4, 45.7% female). This finding indicates that there is a wide correlation between PEI. The top 50 correlated PEI's include gender, age, red blood cell count, Prealbumin (PAB), history of alcohol consumption (drinking), alkaline phosphatase level (ALP), tobacco usage (smoking), etc. The number of significantly correlated PEI's among the 221 PEI's also indicated that there was a rich correlation between the PEI's. Some of these correlations are consistent with the reported literature on PEI in healthy people, but most are newly discovered in this study.
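The significance threshold and the correlation measure used above can be illustrated with a short sketch. The source computed Pearson correlations in R; the Python below is only an analogue for illustration. The helper `pearson_r` and the toy blood pressure values are assumptions of this sketch, while the pair count and the number of significant correlations are taken as stated in the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Figures as stated in the text.
N_PAIRS = 24_322        # PEI pairs tested
N_SIGNIFICANT = 7_662   # significant correlations in the healthy group
ALPHA = 0.05

# Bonferroni-corrected threshold: alpha divided by the number of tests.
threshold = ALPHA / N_PAIRS
significant_share = N_SIGNIFICANT / N_PAIRS
print(f"threshold = {threshold:.2e}, share = {significant_share:.1%}")

# Toy values for two indicators (made up for illustration only).
sbp = [110, 118, 125, 131, 140, 152]
dbp = [70, 75, 80, 84, 90, 95]
r = pearson_r(sbp, dbp)
print(f"r = {r:.3f}")
```

A p-value for each coefficient would come from a statistics library; only pairs with a p-value below the corrected threshold would be counted as significant.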

Demographic PEI showed high correlation with each other and with other PEI. For example, gender showed the most abundant PEI associations (151 PEI pairs, male versus female), including hemoglobin (Hb), creatinine, Uric Acid (UA), alcohol consumption, smoking, Body Mass Index (BMI), etc., reflecting differences in body shape, constitution and lifestyle between men and women. Age also showed strong PEI correlations (125 PEI pairs), such as estimated glomerular filtration rate (eGFR), Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), albumin (Alb), and low density lipoprotein (LDL-C). These findings indicate that there is a systemic change in body function with age (FIG. 1, FIG. 2, supplementary table 1). The inventors also found that 124 PEI's were correlated with BMI, including UA, high density lipoprotein (HDL-C), SBP and DBP, reflecting a strong influence of body shape on PEI. Blood Pressure (BP) has many physiological implications, and the inventors identified a group of PEI's that correlate with BP, including 125 PEI's for DBP and 124 PEI's for SBP (FIG. 1, FIG. 2, supplementary table 1). Intraocular pressure (IOP) is an important factor in the diagnosis of glaucoma [12]. The inventors found that 79 PEI's were weakly associated with IOP in the left eye (IOP-L), including IOP in the right eye (IOP-R), SBP, DBP, Alb, BMI, TG, ApoB, alcohol consumption and TC. Similar to IOP-L, 73 PEI's were weakly associated with IOP-R.

As expected, the lipid PEI showed many correlations. For example, 119 PEI's are associated with Triglycerides (TG). The inventors discovered 122 HDL-C-related PEI's, with many negative correlations including TG, UA and BMI (FIG. 1, FIG. 2, supplementary table 2). The correlation patterns of LDL-C and HDL-C show opposite trends. Unexpectedly, lifestyle had a profound effect on the body. The inventors detected 130 drinking-related PEI's, such as gender, smoking, Hb and UA. Likewise, 128 PEI's are associated with smoking, including drinking, gender and age. The inventors also detected 58 PEI's weakly associated with exercise habits (E habits), including age, eGFR and SBP. Expression of tumor markers may indicate the development and progression of tumors.

PEI correlation in persons with unhealthy physical condition

Next, the inventors examined PEI correlations in the 34 unhealthy physical states. In this analysis, the inventors also found rich correlations in these unhealthy physical states. The number of significant PEI correlations was lower in unhealthy physical states than in the healthy physical state, probably due to sample size effects (Table 1, supplementary Tables 2-35). Each unhealthy physical state has its own correlation profile, and most of these correlations are newly discovered in this study. For example, in the hypertensive population, the inventors found 4,413 significant correlations (18.3%) among the 24,322 PEI pairs formed from the 221 PEI's. PEI with an increased number of correlations included monocytes (MON) (70 in hypertension vs 6 in the healthy state; the same format is used below), quantitative detection of hepatitis B virus DNA (HBV-DNA) (76 vs 33), quantitative detection of hepatitis C virus RNA (HCV-RNA) (49 vs 8), and the like (supplementary Table 2). The number of RH blood group correlations was higher in people with hypertension and coronary heart disease (hypertension + coronary heart disease) than in healthy people (41 vs 9). In contrast, the number of homocysteine (Hcy) correlations was greatly reduced in unhealthy compared with healthy participants (2 vs 120). In diabetes, the number of correlations increased for 10 PEI's and decreased for the remaining 195. PEI's with increased correlations include MON (41 vs 6), HCV-RNA (42 vs 8), anti-Scl-70 (59 vs 31) and HCV-cAg (35 vs 10) (supplementary Table 17). These results indicate that PEI undergo systemic changes in unhealthy conditions. Each disease has its specific PEI profile.

Next, the inventors explored the correlation networks between PEI using qgraph [13], which displays the links between PEI. In the healthy state, the inventors found that PEI showed abundant interactions in both the positive and negative directions. Each unhealthy physical condition shows its own network of PEI interactions. These results indicate that, for each physical state, there are dependencies between multiple indicators, which can be used jointly in physical health assessment.

Candidate PEI markers for unhealthy body conditions

To verify known candidate biomarkers and lifestyle habits, and to discover new ones, that bear on early diagnosis of disease, the inventors next calculated the differences in the 221 PEI's between healthy and unhealthy states. In summary, the inventors found 1,239 significant PEI differences between the healthy state and the 34 unhealthy states (P < 0.05/34 = 0.0014, adjusted for 34 unhealthy states). For example, 112 PEI's differed between hypertensive and healthy persons, 100 PEI's differed between persons with hypertension + diabetes and healthy persons, and 91 PEI's differed between diabetic and healthy persons. Some of these differences are consistent with previous findings, while others are newly discovered.
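The comparison described above (a linear model of each PEI on health status, adjusted for gender and age, with a threshold of 0.05/34) can be sketched as follows. The source fitted these models with lm in R; this NumPy version is an illustrative analogue. All data are synthetic, and the effect size of 0.8, the noise level and the variable names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic covariates (assumptions for illustration).
age = rng.uniform(20, 80, n)
sex = rng.integers(0, 2, n)          # 0 = female, 1 = male
unhealthy = rng.integers(0, 2, n)    # 0 = healthy, 1 = unhealthy state

# Hypothetical PEI that depends on age, sex and health status plus noise.
pei = 5.0 + 0.03 * age + 0.5 * sex + 0.8 * unhealthy + rng.normal(0, 0.2, n)

# Design matrix: intercept, health status, then the adjustment covariates.
design = np.column_stack([np.ones(n), unhealthy, age, sex])
coef, *_ = np.linalg.lstsq(design, pei, rcond=None)

# Coefficient on health status = PEI difference adjusted for age and sex.
status_effect = coef[1]

# Bonferroni threshold for 34 unhealthy states, as stated in the text.
bonferroni_threshold = 0.05 / 34
print(f"adjusted difference = {status_effect:.3f}, "
      f"threshold = {bonferroni_threshold:.4f}")
```

In a full analysis the p-value of the status coefficient would be compared with the threshold for each PEI and each unhealthy state.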

For many of the 221 PEI's, the inventors detected differences between healthy and unhealthy states, especially in PEI's related to physique, lifestyle and blood lipids (supplementary table 36). For BMI, for example, the inventors found differences from the healthy state in 16 of the 34 unhealthy physical conditions, including hypertension (P = 0) and gout (P = 6.48 × 10^-90). Exercise habits (E habits) showed differences in 19 unhealthy states, including hyperlipidemia (P = 1.28 × 10^-277) and diabetes (P = 4.20 × 10^-29). Dietary habits showed differences in 10 unhealthy conditions, including chronic pharyngitis (P = 2.59 × 10^-19) and cholecystolithiasis (P = 9.43 × 10^-18). The inventors detected differences in alcohol intake habits in 20 unhealthy conditions, including hyperlipidemia (P = 0), coronary heart disease (P = 4.06 × 10^-24), diabetes (P = 1.09 × 10^-22), and Parkinson's syndrome (P = 1.43 × 10^-17). The inventors also observed differences in smoking habits in 18 unhealthy conditions compared with the healthy condition, including hypertension (P = 2.74 × 10^-114), hyperlipidemia (P = 2.69 × 10^-62), and Parkinson's syndrome (P = 5.12 × 10^-29). The inventors found that IOP-R differed from the healthy state in five unhealthy states, including hypertension (P = 3.63 × 10^-85) and diabetes (P = 2.01 × 10^-73); similar findings were made for IOP-L. For lipid PEI, the inventors also observed differences between the 34 unhealthy states and the healthy state. For example, LDL-C differed in 21 unhealthy states, including hypertension (P = 0) and diabetes (P = 2.95 × 10^-212). HDL-C differed in 17 unhealthy states, including diabetes (P = 1.92 × 10^-177). Further detailed analysis of HDL-C and diabetes by the inventors found that populations with low HDL-C are at significantly higher risk for diabetes than the average of this population (1.26-1.75 mmol/L).

Tumor-associated antigens also show significant differences between healthy and unhealthy states. For example, CYFRA21-1 differed in 10 unhealthy states, including hypertension + diabetes (P = 3.71 × 10^-97) and diabetes (P = 4.52 × 10^-70). CEA1 differed in 12 unhealthy states, including hypertension + coronary heart disease (P = 9.59 × 10^-29) and diabetes (P = 1.73 × 10^-18). Alpha-fetoprotein (AFP) differed in liver disease (P = 1.08 × 10^-28). C-PSA differed in hypertension + coronary heart disease (P = 8.38 × 10^-20). Finally, the carbohydrate antigen CA724 (CA 72-4) differed in asthma (P = 9.92 × 10^-13), gout (P = 3.53 × 10^-7) and coronary heart disease + diabetes (P = 4.06 × 10^-5) (supplementary table 36). For other PEI's, the inventors also found significant differences between the healthy and unhealthy states. For example, urine glucose levels (U-GLU) differed in 9 unhealthy states, including diabetes and its related diseases. Eosinophil percentage (EO%) differed in five unhealthy states, including asthma (P = 1.38 × 10^-129) and nasal allergy (P = 4.05 × 10^-18). Whole blood iron levels (WB-Fe) differed in 11 unhealthy states, including hypertension (P = 2.52 × 10^-69). The inventors detected differences in pH in 11 unhealthy states, including diabetes (P = 1.97 × 10^-239), hypertension (P = 2.41 × 10^-166), hypertension + diabetes (P = 9.90 × 10^-32) and gout (P = 9.82 × 10^-15). Potassium (K+) differed in five unhealthy states, including hypertension (P = 1.98 × 10^-119) and hepatitis B (P = 3.13 × 10^-10). The inventors also detected differences in magnesium (Mg2+) in hypertension + diabetes (P = 3.14 × 10^-58) and diabetes (P = 5.10 × 10^-52). Hcy (an index of cardiovascular disease) differed in eight unhealthy states, including hypertension (P = 1.97 × 10^-136) and Parkinson's syndrome (P = 1.76 × 10^-7) (supplementary table 36).
These results provide a set of candidate markers for early diagnosis of chronic disease.

A key objective of the present invention is to apply PEI data and machine learning techniques to develop algorithms that can predict the onset of common illnesses based on routine physical examination. The inventors tried three machine learning models: kernel Support Vector Machines (SVMs), multi-layer perceptrons (MLPs) and random forests. Because the SVM and MLP prediction models gave only very low accuracy and sensitivity on the inventors' initial training data, the inventors excluded these models from further training. In the initial training, random forests performed better than SVMs and MLPs, but did not perform well in multi-class classification across all body conditions. The inventors therefore classified each pair of healthy and unhealthy body conditions (e.g., hypertension versus healthy people; Parkinson's syndrome versus healthy people) using binary classification, and obtained better performance than with multi-class classification. The inventors then optimized this prediction algorithm. Because the data have severe class imbalance, a random under-sampling method is employed that balances the data by randomly selecting a subset of data of a target class. In each physical state, the top 15% or 16% of representative PEI's were extracted for prediction by feature extraction. The advantage of this approach is that it is usually very fast and completely independent of the model applied after feature selection.
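Random under-sampling as described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy arrays and a hypothetical helper name `random_undersample`; the source does not state which implementation was used, and the 950-versus-50 toy imbalance is made up.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Hypothetical helper: keep a random subset of each class so that
    every class ends up with as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning the rows grouped by class
    return X[keep], y[keep]

# Toy imbalanced data: 950 "healthy" (0) rows and 50 "unhealthy" (1) rows.
X = np.arange(1000 * 3).reshape(1000, 3)
y = np.array([0] * 950 + [1] * 50)

Xb, yb = random_undersample(X, y)
print(Xb.shape, np.bincount(yb))  # (100, 3) [50 50]
```

Applied to one healthy-versus-unhealthy pair, this yields the balanced table on which a binary classifier is then trained.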

Finally, in the random forest algorithm prediction for each pair of healthy and unhealthy body conditions, the area under the curve (AUC) of the receiver operating characteristic curve depended on the unhealthy body condition (87.6% on average). For classification, AUC values above 90% indicate excellent performance, and AUC values from 80% to 90% indicate good performance. The inventors' algorithm provides high-accuracy predictions for 18 of the 34 unhealthy body conditions (AUC > 90%) and good predictions for another 9 unhealthy body conditions (90% > AUC > 80%). Predictions for patients with heart disease showed excellent performance. For example, by extracting 30 PEI features (age, white blood cell count, monocytes, MON%, mean red blood cell volume, red blood cell count, red blood cell distribution width, lymphocyte rate, platelet count, low density lipoprotein, high density lipoprotein, total cholesterol, carcinoembryonic antigen 1, albumin globulin, cystatin C, glucose, urine glucose, urinary creatinine, estimated glomerular filtration rate, creatinine, urea, waist circumference, waist-hip ratio, body mass index, surgical history, systolic blood pressure, height, neck size, and medical history) and using only 909 training samples and 387 validation samples, the algorithm provided 99% AUC for hypertension + diabetes + coronary heart disease (F1-score (95% CI): 0.96 (0.95-0.96); accuracy (95% CI): 0.95 (0.94-0.97); specificity (95% CI): 0.95 (0.94-0.95); recall (sensitivity) (95% CI): 0.95 (0.94-0.97)). For Parkinson's syndrome, the inventors' algorithm provided 97% AUC using 192 training samples and 83 validation samples (F1-score (95% CI): 0.91 (0.90-0.91); accuracy (95% CI): 0.90 (0.89-0.90); specificity (95% CI): 0.87 (0.79-0.94); recall (95% CI): 0.90 (0.89-0.91)).
For liver fat infiltration, the inventors' algorithm also provided good predictive performance using 803 training samples and 115 validation samples (F1-score (95% CI): 0.82 (0.78-0.87); accuracy (95% CI): 0.81 (0.76-0.86); specificity (95% CI): 0.75 (0.67-0.82); recall (95% CI): 0.82 (0.77-0.87); AUC (95-0.92): 0.94-0.94). The lowest predictive performance obtained in this study was an AUC (95% CI) of 0.66 (0.60-0.72). The inventors' algorithm also provides a good prediction when all unhealthy physical conditions are classified as one "unhealthy" condition (F1-score (95% CI): 0.83 (0.83-0.83); accuracy (95% CI): 0.82 (0.82-0.82); specificity (95% CI): 0.81 (0.81-0.81); sensitivity (95% CI): 0.84 (0.84-0.84); AUC (95% CI): 0.90 (0.90-0.90)). These results show that, by using feature extraction of PEI (15%-16% of all 221 PEI's) with only a small number of samples, the inventors' random forest algorithm provided good performance for the prediction of most unhealthy body conditions.
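The point metrics reported above (F1-score, accuracy, specificity and recall) all follow from the four cells of a binary confusion matrix. The sketch below shows the definitions; the helper name `binary_metrics` and the counts are made up for illustration and are not taken from the study. The source does not state how its 95% confidence intervals were computed, so none are derived here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fn: unhealthy cases predicted unhealthy / healthy
    tn/fp: healthy cases predicted healthy / unhealthy
    """
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "f1": f1,
        "accuracy": accuracy,
        "specificity": specificity,
        "recall": recall,
    }

# Made-up validation counts for one healthy-vs-unhealthy pair.
metrics = binary_metrics(tp=90, fp=15, tn=85, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is computed separately from the predicted probabilities rather than from these counts.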

The present invention plots 221 conventional PEI's using physical examination data obtained from 803,614 individuals in china with 35 healthy or unhealthy physical conditions (primarily chronic disease). The inventors have detected a large number of correlations between PEI in healthy or unhealthy physical states; furthermore, these correlations differ depending on the 34 unhealthy physical conditions analyzed. Most of the correlations were newly discovered in this study. The inventors found that there is a wide range of associations between PEI, such as gender, age, BMI, blood lipids, blood pressure, cancer related indicators, lifestyle (including drinking, smoking, electronic habits). Increasing the understanding of these PEI interactions by the inventors will help explain the mechanisms and pathogenesis of the disease. The inventors' results fill the gap in systematic PEI analysis and provide rich information on how PEI reflects basic health conditions. These findings provide abundant information for further improvement of healthcare research and clinical practice.

One of the unexpected findings in the inventors' analysis is that hypertensive patients show a higher correlation between HBV-DNA and HCV-RNA with other PEI compared to healthy people. Also, the inventors found that there was a strong correlation between hepatitis c virus and other PEI in diabetes, indicating that patients infected with hepatitis c may be more susceptible to diabetes. This finding suggests a phenomenon whereby viral infection may make individuals more susceptible to chronic disease. For these people, antiviral therapy is considered in the treatment of hypertension and diabetes.

The discovery and development of biomarkers for clinical research, diagnosis and therapy monitoring in clinical trials is a key area of medicine and healthcare [14 ]. In this study, the inventors propose a number of candidate markers for chronic diseases. For example, the inventors have found that IOP markers are considered to be a relatively independent marker of glaucoma [15], and are closely associated with hypertension, diabetes and diabetic hypertension. These results indicate that intraocular pressure may be affected to some extent by systemic diseases and may be used as one of the clinical markers for early diagnosis of these diseases. The inventors' results demonstrate that low levels of HDL-C are a risk factor for diabetes, particularly in women [16 ]. This result suggests that increasing HDL-C levels through dietary supplementation may be an effective method for preventing diabetes in patients with low HDL-C levels. However, according to the inventors' results, over-supplementation of HDL-C is also a risk factor. Therefore, supplementation of HDL-C should be aimed at bringing HDL-C levels within the normal range [17 ]. When comparing healthy populations, the inventors found a significant increase in AFP in liver disease, confirming that increased AFP is an increased risk factor for primary liver cancer in liver disease [18 ]. Potassium ions have a significant effect on hypertension [19] and chloride ions, while magnesium ions have a significant effect on diabetes, suggesting that modulation of these ions may have an effect on these diseases. The living habits of sports, smoking and drinking have a deeper impact on the body than the inventor expects. For example, the history of exercise, alcohol consumption or smoking has a strong impact on hyperlipidemia [20], as evidenced by comparison to health. This finding suggests that hyperlipidemia should be improved by adjusting these lifestyle habits.

Because current physical examination conclusions are typically based on a single indicator or a few relatively independent leading indicators, many of the results presented are ambiguous and their value for judging the health status of the examinee is very limited [21]. There is a pressing need for a more accurate index system and method for determining the health condition of an examinee. In the final part of the study, the inventors developed a random forest machine learning algorithm that predicts disease from the top 15%-16% of all 221 PEI with good predictive performance (AUC: 66%-99%; average 86%). For each disease, the inventors identified about 30 contributing PEI by feature extraction. For most of the inventors' prediction algorithms, only a few hundred samples are required to provide good prediction performance for many chronic diseases. This finding suggests that machine learning based on PEI data can be used to help predict the true condition of the examinee, identify "high risk" patients and indicate the follow-up physical examination most relevant to the affected individual.

In summary, the inventors systematically explored various PEI's and their relationship to chronic disease and established a machine learning predictive model to predict health. This study provides rich information to better understand the physiological and pathological characteristics of the human body as a system. Importantly, the inventors have determined modifiable factors and directions for the prediction, diagnosis and treatment of disease. The machine learning algorithm developed by the inventor may be affected

PEI data were obtained from 803,614 Han Chinese patients who visited the Health Management Center and Physical Examination Center of Sichuan Provincial People's Hospital between 2013 and 2018. The cohort covered a total of 35 different health conditions and included 711,928 healthy participants and 91,686 unhealthy participants. The unhealthy population included 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 patients in other unhealthy states.

Regarding the PEI measured, the study included only PEI recorded by the same method. A total of 229 PEI were initially collected; 8 rarely measured PEI were excluded, leaving 221 PEI for further analysis (Table 1). These PEI include biochemical marker levels and blood test results. The patients' lifestyle and disease status were also recorded during the physical examination.

Data processing

PEI with string variables were converted to integer variables for data analysis. Categorical variables were numerically encoded for further calculation. Mean imputation was used for missing data. For individuals who underwent more than one physical examination, the average value of each PEI was used for data analysis.
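The preprocessing steps above can be sketched in Python with pandas. This is a minimal illustration, not the inventors' code: the column names, the toy records, the `preprocess` helper and the order of the steps (encode, average repeated visits, then fill gaps with the column mean) are all assumptions, since the text does not specify them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records: patient 1 has two examinations, patients 2 and 3 one each.
records = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "sex": ["F", "F", "M", "M"],        # string-valued PEI
    "hdl_c": [1.2, 1.4, np.nan, 1.0],   # numeric PEI with a missing value
    "glucose": [5.0, 5.4, 6.1, np.nan],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Encode strings as integers, average repeat visits, mean-fill gaps."""
    out = df.copy()
    # Convert every non-numeric column (except the identifier) to integer codes.
    for col in out.columns:
        if col != "patient_id" and not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = pd.factorize(out[col])[0]
    # One row per individual: average each PEI over that person's examinations.
    out = out.groupby("patient_id").mean()
    # Replace remaining missing values with the column mean.
    return out.fillna(out.mean())


clean = preprocess(records)
```

Averaging before filling keeps a single imputed value from being counted once per visit; the reverse order would be an equally plausible reading of the text.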

Statistical analysis

The Pearson correlation coefficient (PCC) method was used in R to calculate the correlation between two PEI (e.g., x and y); the method measures the linear correlation between two variables. The PCC (r) (1) and P value (2) were calculated using the following equations [22]:

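(1) $$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

(2) $$P = 1 - F_{1,\,df}\!\left(\frac{(n-2)\,r^2}{1-r^2}\right)$$

where $F_{1,\,df}$ is the cumulative distribution function of the F distribution with 1 and $df$ degrees of freedom (F.DIST), $df = n-2$, and $n$ is the number of x-y data pairs.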
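The calculation of r and its P value can be sketched in Python as follows. This is an illustration only, since the text states that the analysis was done in R; the function name `pearson_r_and_p` and the sample data are assumptions, and `scipy.stats.f.sf` is used as the upper tail (1 minus the cumulative distribution) of the F distribution.

```python
import numpy as np
from scipy import stats


def pearson_r_and_p(x, y):
    """Return Pearson's r and the P value from the F distribution with (1, n-2) df."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    dy = y - y.mean()
    # Pearson correlation coefficient.
    r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    # F statistic (n-2) * r^2 / (1 - r^2); undefined when |r| == 1.
    f_stat = (n - 2) * r ** 2 / (1 - r ** 2)
    p = stats.f.sf(f_stat, 1, n - 2)  # survival function = 1 - CDF
    return r, p


# Hypothetical paired measurements of two PEI.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 1.9, 3.5, 3.9, 5.2, 5.8]
r, p = pearson_r_and_p(x, y)
```

The resulting P value agrees with the two-sided value reported by `scipy.stats.pearsonr`, because the squared t statistic with n-2 degrees of freedom follows the same F distribution.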

The total sample size required to detect a correlation coefficient r was determined with a two-sided α of 0.05 and a β of 0.20. If r is 0.05, 3,134 samples are required; if r is 0.10, 782 samples are required; if r is 0.25, 123 samples are required; if r is 0.5, 29 samples are required. The general formula for calculating the sample size for a correlation is as follows (3) [23]:

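(3) $$N = \left[\frac{Z_\alpha + Z_\beta}{C}\right]^2 + 3$$

where $r$ is the expected correlation coefficient, $C = 0.5 \times \ln\left[(1+r)/(1-r)\right]$, $N$ is the total number of subjects required, and $Z_\alpha$ and $Z_\beta$ are the standard normal deviates for α and β.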
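As a worked check of formula (3), the short Python sketch below recomputes the sample sizes quoted above. The rounded deviates 1.96 (two-sided α = 0.05) and 0.84 (β = 0.20), the rounding to the nearest integer, and the function name `correlation_sample_size` are assumptions; they were chosen because they reproduce the figures given in the text.

```python
import math

Z_ALPHA = 1.96  # assumed rounded standard normal deviate for two-sided alpha = 0.05
Z_BETA = 0.84   # assumed rounded standard normal deviate for beta = 0.20


def correlation_sample_size(r, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Total sample size N = [(Z_alpha + Z_beta) / C]^2 + 3, rounded to nearest integer."""
    c = 0.5 * math.log((1 + r) / (1 - r))  # C = 0.5 * ln[(1 + r) / (1 - r)]
    return round(((z_alpha + z_beta) / c) ** 2 + 3)


sizes = {r: correlation_sample_size(r) for r in (0.05, 0.10, 0.25, 0.5)}
```

With these assumptions, `sizes` contains 3134, 782, 123 and 29 for r = 0.05, 0.10, 0.25 and 0.5, matching the values stated above.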

In R, linear regression models (lm) adjusted for gender and age were used to compare PEI between reported healthy and unhealthy states [23]. The odds ratio for HDL-C levels was calculated in R using a generalized linear model (glm) adjusted for age [24]. The correlation interaction network was constructed using qgraph [25].

Machine learning

Three machine learning models, namely kernel support vector machines (SVM) [22, 26], multilayer perceptrons (MLP) [23, 27-29] and random forests [30], were tested to evaluate the predictive performance of PEI. Multi-class prediction of the healthy state and each of the 34 unhealthy states with the MLP neural network algorithm did not work well. The inventors further attempted to distinguish the healthy state from each unhealthy state by binary classification, but the F1 value for each outcome was very close to zero. With the SVM algorithm for multi-class prediction, the highest F1 value, obtained for cholecystolithiasis, was 0.70, while the highest F1 value for most other types of disease was 0.00. The inventors also tried binary classification methods, but all the results were relatively poor. With the random forest algorithm for multi-class prediction (the healthy state and each of the 34 unhealthy states), the F1 value for the healthy state reached 0.80-0.90, but the F1 value for the unhealthy states was only about 0.00-0.40. The inventors therefore selected the random forest algorithm and optimized it. First, because the numbers of samples in the healthy and unhealthy states were unevenly distributed, and in view of the law of large numbers [30], the inventors adopted a downsampling strategy for the randomly used samples. Because the data are severely class-imbalanced, a random undersampling method was employed, which balances the data by randomly selecting a subset of the data of the target class. Second, the inventors used a PEI feature extraction strategy to extract the most contributing PEI for each healthy and unhealthy state. Feature extraction adopted a univariate statistical strategy for automatic feature selection: features with high confidence are selected based on the statistical significance of the relationship between each feature and the target. This can be done using the feature_selection module in scikit-learn.
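Random undersampling followed by univariate feature selection can be sketched as below. This is a hedged illustration on synthetic data, not the inventors' implementation: the helper `random_undersample`, the synthetic data set, and the choice of `f_classif` (ANOVA F-test) as the scoring function are assumptions, since the text names only the scikit-learn feature_selection module.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif


def random_undersample(X, y, rng):
    """Randomly keep as many samples of every class as the smallest class has."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return X[keep], y[keep]


# Synthetic stand-in for the PEI matrix: 221 features, about 10% "unhealthy" samples.
X, y = make_classification(n_samples=2000, n_features=221, n_informative=20,
                           weights=[0.9, 0.1], random_state=0)

rng = np.random.default_rng(0)
X_bal, y_bal = random_undersample(X, y, rng)

# Univariate selection: keep the top 15% of features ranked by F-test score.
selector = SelectPercentile(score_func=f_classif, percentile=15)
X_sel = selector.fit_transform(X_bal, y_bal)
selected_columns = selector.get_support(indices=True)
```

Because each feature is scored independently of the others, the selection does not depend on the classifier that is trained afterwards.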
Finally, the top 15% or 16% of representative PEI were extracted by feature extraction for prediction in each of the healthy and unhealthy states. The advantage of this approach is that it is usually very fast and completely independent of the model applied after feature selection. The data were then randomly partitioned so that 30% formed the test set; the remaining 70% was again randomly partitioned, with 70% as the training set for training the model and 30% as the validation set for evaluating the model. To improve the generalization performance of the model by tuning parameters, cross-validation with grid search was adopted, which can be implemented with GridSearchCV provided by scikit-learn.
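The data partitioning and grid-search cross-validation described above can be sketched as follows. The synthetic data, the parameter grid, the 3-fold setting and the AUC scoring are assumptions made for illustration; the text specifies only the 30%/70% splits, the random forest model and GridSearchCV.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a balanced data set restricted to the selected PEI.
X, y = make_classification(n_samples=600, n_features=33, n_informative=8,
                           random_state=0)

# 30% of the data form the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
# The remaining 70% is split again: 70% training set, 30% validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.30, random_state=0)

# Hypothetical parameter grid; the text does not list the tuned parameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the validation and test sets.
val_auc = roc_auc_score(y_val, search.predict_proba(X_val)[:, 1])
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
```

Keeping the test set out of both the grid search and the validation step means that `test_auc` is computed on data the tuned model has not seen.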

The present invention is capable of other embodiments, and various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention.

Claims

1. A method for predicting the health state through physical examination indexes based on machine learning, characterized by comprising the following step: predicting the health state of the sampled samples from the physical examination indexes by adopting a random forest algorithm.

2. The machine learning-based method of predicting health status via physical examination indicators of claim 1, wherein: for samples in healthy and unhealthy states, the randomly used samples are sampled by adopting a down-sampling strategy to obtain the sampled samples, and the data are balanced by randomly selecting a data subset of a target class using a random under-sampling method.

3. The machine learning-based method of predicting health status via physical examination indicators of claim 2, wherein: the physical examination indexes that contribute most to each healthy and unhealthy state are extracted by adopting a physical examination index feature extraction strategy.

4. The machine learning-based method of predicting health status via physical examination indicators of claim 3, wherein: the physical examination index feature extraction strategy is a univariate statistical strategy in which automatic feature selection is performed using feature_selection in scikit-learn.

5. The machine learning-based method of predicting health status via physical examination indicators of claim 4, wherein: under each healthy and unhealthy state, the first 15% or 16% of the representative physical examination indexes are extracted through feature extraction for prediction.

6. The machine learning-based method of predicting health status via physical examination indicators of claim 5, wherein: the number of representative physical examination indexes is 30.

7. The machine learning-based method of predicting health status via physical examination indicators of claim 1, wherein: the random forest algorithm model is established by randomly partitioning the data, wherein 30% of the data form a test set, and the remaining 70% of the data are randomly partitioned again, wherein 70% are used as a training set for training the model and 30% are used as a validation set for evaluating the model.

8. The machine learning-based method of predicting health status via physical examination indicators of claim 7, wherein: in the process of improving the generalization performance of the model by adjusting the parameters, a cross validation method of grid search is adopted.

9. The machine learning-based method of predicting health status via physical examination indicators of claim 8, wherein: the cross-validation method is implemented using GridSearchCV provided by scikit-learn.