CN112825275A - Method for predicting health state through physical examination indexes based on machine learning - Google Patents

Info

Publication number
CN112825275A
Authority
CN
China
Prior art keywords
physical examination
pei
machine learning
unhealthy
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011311946.1A
Other languages
Chinese (zh)
Inventor
黄璐琳
帅平
刘玉萍
邓燕辉
王海鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Provincial Peoples Hospital
Original Assignee
Sichuan Provincial Peoples Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Provincial Peoples Hospital

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a machine-learning-based method for predicting health state from physical examination indexes. A random forest algorithm predicts the health state of the sampled samples from their physical examination indexes. For the samples in healthy and unhealthy states, a down-sampling strategy is applied to the randomly used samples to obtain the sampled samples: a random under-sampling method balances the data by randomly selecting a data subset of the target class. The invention has good prediction performance, can help predict the actual condition of patients whose disease is worsening, identify 'high-risk' patients and provide the most relevant follow-up examinations for affected individuals.

Description

Method for predicting health state through physical examination indexes based on machine learning
Technical Field
The invention relates to the field of medical treatment, in particular to a method for predicting health status through physical examination indexes based on machine learning.
Background
Current physical examination conclusions generally give suggestions based on one or a few relatively independent leading indicators, so the results given are often ambiguous.
Because the correlations between Physical Examination Indicators (PEI) have not been systematically studied, most indicators are currently used independently for disease forewarning. As a result, the diagnostic value of general physical examinations is very limited.
Compared with clinical medical treatment, the overall basic medical system has a greater impact on human health. Health checks can help healthy people gain insight into their own bodily functions, maintain their health, and improve it by changing unhealthy habits and avoiding risk factors that may lead to disease [2]. Physical examination can minimize the disruption caused by disease. As the population grows and ages, the need for healthcare is increasing, and healthcare services are becoming more sophisticated and costly.
Health checks are a common element of healthcare in developed countries. These tests include general blood tests, urine tests, blood sugar tests, blood lipid tests, renal function tests, and the like. Currently, however, physical examination reports are mainly evaluated on the basis of one or two independent Physical Examination Indicators (PEI), which provide only very limited information about the health status or disease diagnosis of the examinees [6]. Although it would be desirable to provide valuable information for public health care by defining a small number of easily measured PEI that can be used to diagnose diseases accurately before they occur, the correlations between PEI in different physical states (e.g. healthy, hypertension, diabetes) have not been systematically studied.
The volume of available health data has recently grown rapidly, and improving the quality of care is expected to improve population health while curbing cost increases. Health check centers generate large amounts of data that may reveal potential health issues not otherwise discovered. Clinically, more and more is being invested in developing medical big data applications, such as Artificial Intelligence (AI)-based applications that diagnose diseases from clinical images. Although AI can save costs and improve efficiency, particularly for the early diagnosis and prevention of chronic diseases, no model for predicting physical condition from PEI has been produced so far, owing to the lack of systematic analysis of PEI under different conditions.
Disclosure of Invention
The invention aims to provide a method for predicting health status through physical examination indexes based on machine learning, which can help predict the actual condition of patients whose disease is worsening, identify 'high-risk' patients and provide the most relevant follow-up examinations for affected individuals.
To achieve this purpose, the invention adopts the following technical scheme:
The invention discloses a method for predicting health status through physical examination indexes based on machine learning, in which a random forest algorithm predicts the health status of the sampled samples from their physical examination indexes.
Preferably, for the samples in healthy and unhealthy states, a down-sampling strategy is applied to the randomly used samples to obtain the sampled samples: a random under-sampling method balances the data by randomly selecting a data subset of the target class.
Preferably, a physical examination index feature extraction strategy is adopted to extract the physical examination indexes that contribute most to each healthy and unhealthy state.
Preferably, the physical examination index feature extraction strategy uses univariate statistics to perform automatic feature selection, implemented with the feature_selection module of scikit-learn.
Preferably, for each healthy and unhealthy state, the top 15% or 16% of representative physical examination indexes are extracted by feature extraction for prediction.
Preferably, there are 30 representative examination indexes.
Preferably, the random forest model is built by randomly splitting the data: 30% of the data form the test set, and the remaining 70% are randomly split again, with 70% used as the training set for training the model and 30% as the validation set for evaluating the model.
Preferably, a grid search with cross-validation is used when tuning parameters to improve the generalization performance of the model.
Preferably, the cross-validation is implemented using GridSearchCV provided by scikit-learn.
The invention has good prediction performance, can help to predict the real situation of patients with disease deterioration, identify 'high-risk' patients and provide the most relevant follow-up examination for the affected individuals.
The present invention makes it possible to determine the correlations between PEI in a healthy state and in an unhealthy state (i.e., a state with underlying chronic disease), and to elucidate how these PEI differ between individuals with chronic disease and normal individuals in order to find candidate disease markers. With the machine learning model of the present invention, the health of an individual can be predicted using only a complete set of PEI, without the need for detailed clinical examination information.
Drawings
FIG. 1 shows the machine learning prediction results of 35 body states based on the random forest algorithm.
FIG. 2 is a graph showing the correlation of PEI detected in a healthy population.
FIG. 3 shows representative candidate markers for unhealthy physical conditions.
FIG. 4 is a graph showing the ratio of the plasma HDL-C concentration in patients with normal physical conditions to that in patients with diabetes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
On the basis of general physical examination, the method provides an algorithm that can predict the onset of common diseases. The prediction performance of the physical examination indexes was evaluated by testing three machine learning models: a kernel Support Vector Machine (SVM), a multilayer perceptron (MLP) and a random forest. The random forest algorithm was then selected and further optimized. First, because the numbers of healthy and unhealthy samples are unevenly distributed, and in view of the law of large numbers, the inventors adopted a down-sampling strategy for the randomly used samples. Because the data are severely class-imbalanced, a random under-sampling method is used, which balances the data by randomly selecting a data subset of the target class. Second, a physical examination index feature extraction strategy is adopted to extract the physical examination indexes that contribute most to each healthy and unhealthy state. Feature extraction uses a univariate statistical strategy to select features automatically: univariate statistics select the features with high confidence according to the statistical significance of the relationship between each feature and the target. This process can be implemented with the feature_selection module in scikit-learn. Finally, for each healthy and unhealthy state, the top 15% or 16% of representative examination indexes (about 30) are extracted by feature extraction for prediction. The advantage of this approach is that it is typically very fast and completely independent of the model applied after feature selection. The data are then randomly split: 30% constitute the test set, and the remaining 70% are randomly split again, with 70% used as the training set for training the model and 30% as the validation set for evaluating the model.
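The feature selection and data split described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the inventors' code: the data are synthetic (`make_classification`), the scoring function `f_classif` is one possible univariate statistic, and the selector is fitted on the training portion only, which is a choice made for this sketch rather than something the text specifies.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a balanced healthy-vs-unhealthy data set with
# 221 indicators (assumption: the real PEI data are not available here).
X, y = make_classification(n_samples=1000, n_features=221,
                           n_informative=20, random_state=0)

# First split: hold out 30% of all samples as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Second split: 70% of the remainder for training, 30% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.30, stratify=y_rest, random_state=0)

# Univariate selection: keep the top 15% of features ranked by the
# ANOVA F-statistic between each feature and the class label.
selector = SelectPercentile(score_func=f_classif, percentile=15)
X_train_sel = selector.fit_transform(X_train, y_train)
X_val_sel = selector.transform(X_val)
X_test_sel = selector.transform(X_test)

n_selected = X_train_sel.shape[1]  # roughly 15% of 221 features
```

Keeping 15% of 221 indicators leaves a little over 30 features, in line with the "about 30" indexes mentioned in the text.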
When tuning parameters to improve the generalization performance of the model, a grid search with cross-validation is used, implemented with GridSearchCV provided by scikit-learn.
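A grid search of this kind might look as follows. The parameter grid, the number of folds and the synthetic data are assumptions made for illustration; the text does not state which hyperparameters were tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for one healthy-vs-unhealthy pair
# after feature selection (assumption: 30 selected features).
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=8, random_state=0)

# Hypothetical grid; the tuned parameters are not given in the text.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated AUC
    cv=3,               # 3-fold cross-validation (assumption)
)
search.fit(X, y)

best_params = search.best_params_  # best combination found in the grid
best_auc = search.best_score_      # its mean cross-validated AUC
```

After fitting, `search.best_estimator_` holds a random forest refitted on all of `X` with the best parameters, which can then be scored on the held-out validation and test sets.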
As shown in FIG. 1 and Table 1, the AUC reached 66%-99% (mean 87.6%) in the random forest predictions for each pair of healthy and unhealthy physical states. For classification, AUC values above 90% indicate excellent performance, and values from 80% to 90% indicate good performance. The inventors' algorithm provides high-accuracy prediction for 18 of the 34 unhealthy physical states (AUC > 90%) and good performance for another 9 (90% > AUC > 80%). Patients with heart-related diseases were predicted with excellent performance. These results show that, using feature extraction to select 15-16% of the 221 physical examination indexes and only a small number of samples, the inventors' random forest algorithm provides good performance for the prediction of most unhealthy physical states.
Because current physical examination conclusions generally give suggestions based on one or a few relatively independent leading indicators, many of the results given are ambiguous and of very limited value for judging the health of examinees; a more accurate index system and method for judging the health of examinees is urgently needed. The inventors developed a random forest machine learning algorithm that can predict diseases from 15%-16% of the 221 physical examination indexes with good prediction performance (AUC: 66%-99%, average 86%). For each disease, the inventors defined about 30 contributing physical examination indexes by feature extraction. In most of the inventors' prediction algorithms, only a few hundred samples were enough to provide good predictive performance for many chronic diseases. This finding suggests that machine learning on physical examination index data can help predict the actual condition of patients whose disease is worsening, identify "high-risk" patients, and provide the most relevant follow-up examinations for affected individuals. The machine learning algorithm developed by the inventors can be applied immediately in clinical practice to assist in interpreting physical examination results.
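To make the AUC figures concrete, the sketch below computes AUC directly from its definition (the probability that a randomly chosen unhealthy sample receives a higher score than a randomly chosen healthy one) and maps it onto the bands named above. The function names and the label for AUC values of 80% or less are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def performance_band(auc):
    """Map an AUC in [0, 1] onto the bands used in the text."""
    if auc > 0.90:
        return "excellent"
    if auc > 0.80:
        return "good"
    return "below good"  # hypothetical label; the text names no lower band


# Toy example: two healthy (0) and two unhealthy (1) samples.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise definition is quadratic in the number of samples, which is fine for illustration; a library routine such as scikit-learn's `roc_auc_score` is the practical choice for large data sets.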
[Table 1 appears as images in the original publication: Figures RE-GDA0002988961680000051, RE-GDA0002988961680000061 and RE-GDA0002988961680000071.]
TABLE 1. Prediction performance of the model. ROC: receiver operating characteristic; AUC: area under the curve.
As healthcare reform has made remarkable progress in expanding insurance coverage, the general physical examination industry is now accumulating large amounts of data. Using a very large general health check data set from the Chinese population, the present invention has three main goals: determining the correlations between PEI in healthy and unhealthy individuals (i.e. patients with underlying chronic disease); elucidating how these PEI differ between chronic disease and the normal state, in order to find candidate disease markers; and developing a set of machine learning models that can predict the health of an individual from a complete set of PEI. To address these goals, the inventors included physical examination data from 803,614 individuals who visited a health check center between 2013 and 2018. The inventors included data for 221 PEI associated with 35 physical conditions, most of which were unhealthy conditions due to chronic disease.
As shown in FIGS. 2, 3 and 4, participants represented 35 physical conditions, defined by health status or underlying disease (unhealthy conditions). Specifically, the study population included 711,928 healthy participants, 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 participants in other unhealthy states. The correlation between each pair of PEI was calculated in R using the Pearson Correlation Coefficient (PCC). Linear regression models (lm in R) were used to compare PEI between the healthy condition and the unhealthy conditions, adjusted for gender and age. A random forest machine learning algorithm was used for health state prediction. There are extensive correlations between PEI in different physical conditions, and abundant PEI differences were observed between healthy and unhealthy physical conditions. Machine learning algorithms can be used to predict physical state from a set of PEI from routine physical examination. The inventors found abundant correlations between PEI in the healthy physical condition (7,662 significant correlations, accounting for 31.5% of all correlations). However, the PEI correlations change under disease conditions. The inventors further focused on the PEI differences between the healthy state and the 34 unhealthy states and found 1,239 significant PEI differences, suggesting that these PEI are candidate disease markers. Finally, the inventors developed a machine learning algorithm that predicts health state using 15%-16% of the PEI, selected through feature extraction, with a prediction performance of 66%-99% depending on the physical state.
This new encyclopedia of PEI relationships provides rich information for the diagnosis of chronic diseases. The machine learning algorithm developed by the inventors may have a far-reaching influence on general physical examination practice.
The inventors included 803,614 individuals who attended the health management center and physical examination center between 2013 and 2018. Participants represented 35 physical conditions, defined by health status or underlying disease (unhealthy conditions). Specifically, the study population included 711,928 healthy participants, 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 participants in other unhealthy conditions (mainly chronic disease) (Table 1). The inventors included 221 PEI in the analysis, including patient demographic information (age and gender) and lifestyle indicators (alcohol consumption, tobacco use, etc.).
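The pairwise Pearson correlation step can be illustrated as follows. The analysis described above was carried out in R; this Python sketch only shows the computation, and the indicator names and values are made up for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(table):
    """Correlate every unordered pair of columns in a {name: values} table."""
    names = sorted(table)
    return {
        (a, b): pearson_r(table[a], table[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Made-up toy values for three indicators measured on four people.
toy = {
    "BMI": [20.0, 22.0, 24.0, 26.0],
    "SBP": [110.0, 120.0, 130.0, 140.0],
    "HDL-C": [1.8, 1.6, 1.4, 1.2],
}
correlations = pairwise_correlations(toy)
```

Each coefficient would then be tested for significance against a multiple-testing threshold before being counted as a significant correlation.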
PEI correlation of participants in a healthy condition
The inventors' primary goal was to explore PEI correlations under healthy conditions, in the hope of mapping their overall landscape. Among the 24,322 PEI pairs formed from the 221 PEI, the inventors found 7,662 significant correlations (31.5%) in all persons in a healthy condition (P < 0.05/24,322 PEI pairs = 2 × 10^-6; N = 711,928, average age 41.4, 45.7% female). This finding indicates that there are extensive correlations between PEI. The top 50 correlated PEI include gender, age, red blood cell count, prealbumin (PAB), history of alcohol consumption (drinking), alkaline phosphatase level (ALP), tobacco use (smoking), etc. The number of significantly correlated PEI among the 221 PEI also indicated that correlations between PEI are abundant. Some of these correlations are consistent with those reported in the literature for healthy individuals, but most are newly discovered in this study.
Demographic PEI showed high correlation with each other and with other PEI. For example, gender showed the most abundant PEI associations (151 PEI pairs, male versus female), including hemoglobin (Hb), creatinine, uric acid (UA), alcohol consumption, smoking, body mass index (BMI), etc., reflecting differences in body shape, physical constitution and lifestyle between men and women. Age also showed strong PEI correlations (125 PEI pairs), such as estimated glomerular filtration rate (eGFR), systolic blood pressure (SBP), diastolic blood pressure (DBP), albumin (Alb), and low-density lipoprotein (LDL-C). These findings indicate that body function changes systemically with age (FIG. 1, FIG. 2, supplementary Table 1). The inventors also found 124 PEI correlated with BMI, including UA, high-density lipoprotein (HDL-C), SBP and DBP, reflecting a strong influence of body shape on PEI. Blood pressure (BP) has many physiological implications, and the inventors identified a group of PEI that correlate with BP, including 125 PEI for DBP and 124 PEI for SBP (FIG. 1, FIG. 2, supplementary Table 1). Intraocular pressure (IOP) is an important factor in the diagnosis of glaucoma [12]. The inventors found that 79 PEI were weakly associated with IOP in the left eye (IOP-L), including IOP in the right eye (IOP-R), SBP, DBP, Alb, BMI, TG, ApoB, alcohol consumption and TC. Similar to IOP-L, 73 PEI were weakly associated with IOP-R.
As expected, the lipid PEI showed many correlations. For example, 119 PEI are associated with triglycerides (TG). The inventors found 122 PEI correlated with HDL-C, with many negative correlations, including TG, UA and BMI (FIG. 1, FIG. 2, supplementary Table 2). The correlation patterns of LDL and HDL show distinctly opposite trends. Unexpectedly, lifestyle also has a profound effect on the body. The inventors detected 130 drinking-related PEI, such as gender, smoking, Hb and UA. Similarly, 128 PEI are associated with smoking, including drinking, gender and age. The inventors also detected 58 PEI weakly associated with exercise habits (E habits), including age, eGFR and SBP. Expression of tumor markers may indicate the development and progression of tumors.
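The significance threshold quoted above is a Bonferroni-style correction: the nominal level of 0.05 is divided by the number of PEI pairs tested. The arithmetic can be checked with a few lines; the variable names are hypothetical and the counts are taken from the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


n_pairs_tested = 24_322  # PEI pairs reported in the text
n_significant = 7_662    # significant correlations reported in the text

threshold = bonferroni_threshold(0.05, n_pairs_tested)
fraction_significant = n_significant / n_pairs_tested

print(f"per-pair threshold: {threshold:.2e}")
print(f"share of significant pairs: {fraction_significant:.1%}")
```

The threshold comes out at about 2 × 10^-6 and the share of significant pairs at 31.5%, matching the figures given in the paragraph.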
PEI correlation in persons with unhealthy physical condition
Next, the inventors examined PEI correlations in the 34 unhealthy physical states. In this analysis, the inventors also found abundant correlations in these unhealthy physical states. The number of significant PEI correlations was lower in unhealthy physical states than in the healthy physical state, probably because of sample size effects (Table 1, supplementary Table 2-S35). Each unhealthy physical state has its own unique correlation profile, and most of these correlations are newly discovered in this study. For example, in the hypertensive population, the inventors found 4,413 significant correlations (18.3%) among the 24,322 PEI pairs formed from the 221 PEI. PEI with an increased number of correlations included monocytes (MON) (70 in hypertension vs 6 in the healthy state; the same format is used below), quantitative detection of hepatitis B virus DNA (HBV-DNA) (76 vs 33), quantitative detection of hepatitis C virus RNA (HCV-RNA) (49 vs 8), and the like (supplementary Table 2). The number of RH blood group correlations was increased in people with hypertension and coronary heart disease (hypertension + coronary heart disease) compared with healthy people (41 vs 9). In contrast, the number of homocysteine (Hcy) correlations was greatly reduced in unhealthy compared with healthy individuals (2 vs 120). In diabetes, the number of correlations increased for 10 PEI and decreased for the remaining 195. The PEI with increases include MON (41 vs 6), HCV-RNA (42 vs 8), anti-Scl-70 (59 vs 31) and HCV-cAg (35 vs 10) (supplementary Table 17). These results indicate that PEI undergo systemic changes in unhealthy conditions, and that each disease has its specific PEI profile.
Next, the inventors explored the correlation networks between PEI using qgraph [13], which shows the patterns of links between PEI. In the healthy state, the inventors found that PEI showed abundant interactions in both the positive and negative directions. Each unhealthy physical condition showed its own unique network of PEI interactions. These results indicate that, for each physical state, there are dependencies between multiple indicators, which can be used jointly in assessing physical condition.
Candidate PEI markers for unhealthy body conditions
To verify and discover the impact of new candidate biomarkers or lifestyle habits on the early diagnosis of disease, the inventors next calculated the differences in the 221 PEI between healthy and unhealthy states. In summary, the inventors found 1,239 significant PEI differences between the healthy state and the 34 unhealthy states (P < 0.05/34 = 0.0014, adjusted for 34 unhealthy states). For example, 112 PEI differed between hypertensive and healthy persons, 100 between persons with hypertension plus diabetes and healthy persons, and 91 between diabetic and healthy persons. Some of these differences are consistent with previous findings, while others are newly discovered.
For many of the 221 PEI, the inventors detected differences between healthy and unhealthy states, especially in PEI related to physique, lifestyle and blood lipids (supplementary Table 36). For BMI, for example, the inventors found differences from the healthy state in 16 of the 34 unhealthy physical conditions, including hypertension (P = 0) and gout (P = 6.48 × 10^-90). Exercise habits (E habits) showed differences between healthy and unhealthy states in 19 conditions, including hyperlipidemia (P = 1.28 × 10^-277) and diabetes (P = 4.20 × 10^-29). Dietary habits also showed differences in 10 unhealthy conditions, including chronic pharyngitis (P = 2.59 × 10^-19) and cholecystolithiasis (P = 9.43 × 10^-18). The inventors detected differences in alcohol intake habits in 20 unhealthy conditions, including hyperlipidemia (P = 0), coronary heart disease (P = 4.06 × 10^-24), diabetes (P = 1.09 × 10^-22) and Parkinson's disease (P = 1.43 × 10^-17). The inventors also observed differences in smoking habits in 18 unhealthy conditions compared with the healthy condition, including hypertension (P = 2.74 × 10^-114), hyperlipidemia (P = 2.69 × 10^-62) and Parkinson's syndrome (P = 5.12 × 10^-29). IOP-R differed from the healthy state in five unhealthy states, including hypertension (P = 3.63 × 10^-85) and diabetes (P = 2.01 × 10^-73); similar findings were made for IOP-L. For lipid PEI, the inventors also observed differences between the 34 unhealthy states and the healthy state. For example, differences in LDL-C were detected in 21 unhealthy states, including hypertension (P = 0) and diabetes (P = 2.95 × 10^-212). Differences in HDL-C were detected in 17 unhealthy states, including diabetes (P = 1.92 × 10^-177). Further detailed analysis of HDL-C and diabetes by the inventors found that people with low HDL-C are at significantly higher risk of diabetes than those within the average range for this population (1.26-1.75 mmol/L).
Tumor-associated antigens also show significant differences between healthy and unhealthy states. For example, differences in CYFRA21-1 were detected in 10 unhealthy states, including hypertension + diabetes (P = 3.71 × 10^-97) and diabetes (P = 4.52 × 10^-70). Differences in CEA1 were detected in 12 unhealthy states, including hypertension + coronary heart disease (P = 9.59 × 10^-29) and diabetes (P = 1.73 × 10^-18). A difference in alpha-fetoprotein (AFP) was detected in liver disease (P = 1.08 × 10^-28), and a difference in C-PSA in hypertension + coronary heart disease (P = 8.38 × 10^-20). Finally, differences in the carbohydrate antigen CA724 (CA 72-4) were detected in asthma (P = 9.92 × 10^-13), gout (P = 3.53 × 10^-7) and coronary heart disease + diabetes (P = 4.06 × 10^-5) (supplementary Table 36). For other PEI, the inventors also found significant differences between the healthy and unhealthy states. For example, urine glucose levels (U-GLU) differed in 9 unhealthy states, including diabetes and its related diseases. Eosinophil rate (EO%) differed in five unhealthy states, including asthma (P = 1.38 × 10^-129) and nasal allergy (P = 4.05 × 10^-18). Whole blood iron levels (WB-Fe) differed in 11 unhealthy states, including hypertension (P = 2.52 × 10^-69). The inventors detected differences in pH in 11 unhealthy states, including diabetes (P = 1.97 × 10^-239), hypertension (P = 2.41 × 10^-166), hypertension + diabetes (P = 9.90 × 10^-32) and gout (P = 9.82 × 10^-15). Potassium (K+) differed in five unhealthy states, including hypertension (P = 1.98 × 10^-119) and hepatitis B (P = 3.13 × 10^-10). The inventors also detected differences in magnesium (Mg2+) in hypertension + diabetes (P = 3.14 × 10^-58) and diabetes (P = 5.10 × 10^-52). Differences in Hcy (an index of cardiovascular disease) were detected in eight unhealthy states, including hypertension (P = 1.97 × 10^-136) and Parkinson's syndrome (P = 1.76 × 10^-7) (supplementary Table 36).
These results provide a set of candidate markers for early diagnosis of chronic disease.
A key objective of the present invention is to apply PEI data and machine learning techniques to develop algorithms that can predict the onset of common illnesses from routine physical examination. The inventors tried three machine learning models: kernel Support Vector Machines (SVMs), multilayer perceptrons (MLPs) and random forests. Because the SVM and MLP prediction models gave only very low accuracy and sensitivity on the inventors' initial training data, the inventors excluded these models from further training. In the initial training, random forests performed better than SVMs and MLPs, but they did not perform well in multi-class classification across all body conditions. The inventors therefore used binary classification for each pair of healthy and unhealthy body conditions (e.g., hypertension versus healthy people; Parkinson's syndrome versus healthy people) and obtained better performance than with multi-class classification. The inventors then optimized this prediction algorithm. Because the data are severely class-imbalanced, a random under-sampling method is employed, which balances the data by randomly selecting a subset of the data of the target class. For each physical state, the top 15% or 16% of representative PEI were extracted for prediction by feature extraction. The advantage of this approach is that it is usually very fast and completely independent of the model applied after feature selection.
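The pairwise set-up with random under-sampling can be sketched in plain Python. This is an illustration under assumptions: records are dictionaries with a hypothetical `label` key, and every class is shrunk to the size of the smallest class in the pair, which is one common way to realise random under-sampling.

```python
import random
from collections import Counter


def undersample(records, label_key="label", seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(record[label_key], []).append(record)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


def healthy_vs_condition(records, condition, healthy="healthy", seed=0):
    """Build a balanced binary data set: one unhealthy condition vs healthy."""
    pair = [r for r in records if r["label"] in (healthy, condition)]
    return undersample(pair, seed=seed)


# Toy cohort with a strong imbalance, loosely mirroring the situation above.
cohort = (
    [{"id": i, "label": "healthy"} for i in range(100)]
    + [{"id": 100 + i, "label": "hypertension"} for i in range(10)]
    + [{"id": 200 + i, "label": "diabetes"} for i in range(5)]
)
balanced_pair = healthy_vs_condition(cohort, "hypertension")
class_counts = Counter(r["label"] for r in balanced_pair)
```

One balanced data set is built per unhealthy condition, so each binary classifier sees equal numbers of healthy and unhealthy samples.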
Finally, in the random forest algorithm prediction for each pair of healthy and unhealthy body conditions, the area under the curve (AUC) of the receiver operating characteristic curve depends on the unhealthy body condition (87.6% on average). For classification, AUC values above 90% indicate excellent performance, and AUC values from 80% to 90% indicate good performance. The inventors' algorithm provides high accuracy predictions for 18 of the 34 unhealthy body conditions (AUC > 90%) and good predictions for the other 9 unhealthy body conditions (90% > AUC > 80%). In the inventors' algorithm, patients with heart disease showed excellent performance. For example, by extracting 30 PEI features (age, white blood cell count, monocytes, Mon%, mean red blood cell volume, red blood cell count, red blood cell distribution width, lymphocyte rate, platelet count, low density lipoprotein, high density lipoprotein, total cholesterol, carcinoembryonic antigen 1, albumin globulin, cystatin c, glucose, urine glucose, urinary creatinine, estimated glomerular filtration rate, creatinine, urea, waist circumference, aaist-hip ratio, body mass index, surgical history, systolic blood pressure, height, neck size, and medical history), using only 909 training samples and 387 validation samples (f 1-score (95% CI), 0.96(0.95-0.96), accuracy (95% CI)), hypertension + diabetes + coronary heart disease can provide 99% AUC: 0.95 (0.94-0.97); specificity (95% CI): 0.95 (0.94-0.95); recall (sensitivity) (95% CI): 0.95(0.94-0.97). In our algorithm, Parkinson's syndrome patients provided 97% AUC (f 1-score (95% CI), 0.91 (0.90-0.91)) using 192 training samples and 83 validation samples, accuracy (95% CI): 0.90 (0.89-0.90), specificity (95% CI): 0.87(0.79-0.94), recall (95% CI): 0.90 (0.89-0.91). 
For liver fat infiltration, the algorithm also provided good predictive performance using 803 training samples and 115 validation samples (F1-score (95% CI): 0.82 (0.78-0.87); accuracy (95% CI): 0.81 (0.76-0.86); specificity (95% CI): 0.75 (0.67-0.82); recall (95% CI): 0.82 (0.77-0.87); and AUC (95-0.92): 0.94-0.94). The lowest predictive performance in this study was an AUC (95% CI) of 0.66 (0.60-0.72). The inventors' algorithm also provided good predictions when all unhealthy physical conditions were grouped into a single "unhealthy" class: F1-score (95% CI): 0.83 (0.83-0.83); accuracy (95% CI): 0.82 (0.82-0.82); specificity (95% CI): 0.81 (0.81-0.81); sensitivity (95% CI): 0.84 (0.84-0.84); and AUC (95% CI): 0.90 (0.90-0.90). These results show that, by using feature extraction to select 15-16% of all 221 PEIs and only a small number of samples, the inventors' random forest algorithm provided good predictive performance for most unhealthy body conditions.
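The metrics reported above (accuracy, specificity, recall, F1-score, and AUC) can be illustrated with a minimal plain-Python sketch. The function names and the toy labels are hypothetical and not taken from the patent; the label "below good" for AUC values under 80% is an assumption, since the text only names the "excellent" and "good" bands. The patent does not state how the 95% confidence intervals were obtained, so they are not computed here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, specificity, recall and F1 for labels coded 1 (unhealthy) / 0 (healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "specificity": specificity,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the share of (unhealthy, healthy) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grade_auc(auc):
    """Bands from the text: above 0.90 excellent, 0.80 to 0.90 good."""
    if auc > 0.90:
        return "excellent"
    if auc >= 0.80:
        return "good"
    return "below good"  # assumed label, not named in the text


# Toy example (hypothetical data)
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for validation sets of a few hundred samples like those described above.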
The present invention maps 221 routine PEIs using physical examination data obtained from 803,614 individuals in China with 35 healthy or unhealthy physical conditions (primarily chronic diseases). The inventors detected a large number of correlations between PEIs in healthy and unhealthy physical states; furthermore, these correlations differ among the 34 unhealthy physical conditions analyzed. Most of the correlations were newly discovered in this study. The inventors found a wide range of associations among PEIs such as gender, age, BMI, blood lipids, blood pressure, cancer-related indicators, and lifestyle (including drinking, smoking, and electronic habits). A better understanding of these PEI interactions will help explain the mechanisms and pathogenesis of disease. The inventors' results fill the gap in systematic PEI analysis and provide rich information on how PEIs reflect basic health conditions. These findings provide abundant information for further improvement of healthcare research and clinical practice.
One of the unexpected findings in the inventors' analysis is that, compared with healthy people, hypertensive patients show higher correlations of HBV-DNA and HCV-RNA with other PEIs. The inventors also found a strong correlation between hepatitis C virus and other PEIs in diabetes, indicating that patients infected with hepatitis C may be more susceptible to diabetes. This finding suggests that viral infection may make individuals more susceptible to chronic disease. For these people, antiviral therapy should be considered in the treatment of hypertension and diabetes.
The discovery and development of biomarkers for clinical research, diagnosis, and therapy monitoring in clinical trials is a key area of medicine and healthcare [14]. In this study, the inventors propose a number of candidate markers for chronic diseases. For example, the inventors found that intraocular pressure (IOP), which is considered a relatively independent marker of glaucoma [15], is closely associated with hypertension, diabetes, and diabetic hypertension. These results indicate that intraocular pressure may be affected to some extent by systemic diseases and may serve as one of the clinical markers for early diagnosis of these diseases. The inventors' results demonstrate that a low level of HDL-C is a risk factor for diabetes, particularly in women [16]. This result suggests that raising HDL-C levels through dietary supplementation may be an effective way to prevent diabetes in people with low HDL-C levels. However, according to the inventors' results, over-supplementation of HDL-C is also a risk factor. Therefore, supplementation should aim to bring HDL-C levels within the normal range [17]. Compared with the healthy population, the inventors found a significant increase in AFP in liver disease, confirming that elevated AFP is a risk factor for primary liver cancer in patients with liver disease [18]. Potassium and chloride ions have a significant effect on hypertension [19], while magnesium ions have a significant effect on diabetes, suggesting that modulating these ions may influence these diseases. Lifestyle habits such as exercise, smoking, and drinking have a deeper impact on the body than the inventors expected. For example, a history of exercise, alcohol consumption, or smoking has a strong impact on hyperlipidemia [20], as shown by comparison with the healthy population. This finding suggests that hyperlipidemia could be improved by adjusting these lifestyle habits.
Because current physical examination conclusions are typically based on a single indicator or a few relatively independent indicators, many of the reported results are ambiguous, and their value for judging the health status of the examinee is very limited [21]. There is a pressing need for a more accurate index system and method for determining the health condition of examinees. In the final part of the study, the inventors developed a random forest machine learning algorithm that predicts disease from 15%-16% of all 221 PEIs with good predictive performance (AUC: 66%-99%; average 86%). For each disease, the inventors identified about 30 contributing PEIs by feature extraction. Most of the inventors' prediction algorithms required only a few hundred samples to provide good predictive performance for many chronic diseases. This finding suggests that machine learning based on PEI data can be used to help predict the true condition of the examinee, identify "high risk" patients, and indicate the follow-up physical examinations most relevant to the affected individual.
In summary, the inventors systematically explored various PEIs and their relationships to chronic disease and established a machine learning model to predict health status. This study provides rich information for better understanding the physiological and pathological characteristics of the human body as a system. Importantly, the inventors have identified modifiable factors and directions for the prediction, diagnosis, and treatment of disease. The machine learning algorithm developed by the inventor may be affected
PEI data were obtained from 803,614 Han Chinese individuals who visited the Health Management Center and Physical Examination Center of Sichuan Provincial People's Hospital between 2013 and 2018. The cohort covered a total of 35 different health conditions and included 711,928 healthy participants and 91,686 unhealthy participants. The unhealthy population included 46,981 hypertensive patients, 11,745 diabetic patients, and 32,960 patients with other unhealthy conditions.
PEIs measured: the study included only PEIs recorded by the same method. A total of 229 PEIs were initially collected; 8 rarely measured PEIs were excluded, leaving 221 PEIs for further analysis (Table 1). These PEIs include biochemical marker levels and blood test results. The patients' lifestyle and disease status were also surveyed during the physical examination.
Data processing
PEIs recorded as string variables were converted to integer variables for data analysis, and categorical variables were numerically encoded for further calculation. Missing data were filled in by mean imputation. For individuals who took part in more than one physical examination, the average value of each PEI was used for data analysis.
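The processing steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the record layout, the `person_id` key, and the function names are hypothetical, and the order of operations (averaging repeat examinations first, then imputing remaining gaps with the column mean) is an assumption, since the text does not specify it.

```python
from collections import defaultdict
from statistics import mean


def encode_categories(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        encoded.append(None if v is None else codes.setdefault(v, len(codes)))
    return encoded, codes


def average_repeat_exams(records, id_key="person_id"):
    """Average each PEI over all examinations of the same individual (None = missing)."""
    observed = defaultdict(lambda: defaultdict(list))
    names = set()
    for rec in records:
        person = observed[rec[id_key]]
        for name, value in rec.items():
            if name == id_key:
                continue
            names.add(name)
            if value is not None:
                person[name].append(value)
    return {
        pid: {n: (mean(vals[n]) if vals[n] else None) for n in sorted(names)}
        for pid, vals in observed.items()
    }


def impute_column_means(table):
    """Replace missing values with the mean of the observed values of that PEI."""
    names = {n for row in table.values() for n in row}
    col_means = {}
    for n in names:
        seen = [row[n] for row in table.values() if row[n] is not None]
        col_means[n] = mean(seen) if seen else None
    return {
        pid: {n: (v if v is not None else col_means[n]) for n, v in row.items()}
        for pid, row in table.items()
    }


# Toy records (hypothetical values): person "a" attended two examinations
records = [
    {"person_id": "a", "sbp": 120, "glucose": 5.0},
    {"person_id": "a", "sbp": 130, "glucose": None},
    {"person_id": "b", "sbp": None, "glucose": 7.0},
]
averaged = average_repeat_exams(records)
imputed = impute_column_means(averaged)
smoking_codes, smoking_map = encode_categories(["never", "current", "never", None])
```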
Statistical analysis
The Pearson correlation coefficient (PCC) method was used in R to calculate the correlation between two PEIs (e.g., x and y); the method measures the linear correlation between two variables. The correlation coefficient (r) (1) and P value (2) were calculated using the following equations [22]:
(1) $r = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$
(2) P = 1 - F.DIST(((n-2)*r^2)/(1-r^2), 1, n-2)
where df = n - 2 and n is the number of x-y data pairs.
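As a hedged illustration of equations (1) and (2), the sketch below computes r from its definition and the P value from the upper tail of an F distribution with 1 and n - 2 degrees of freedom. The patent states that the calculation was done in R; NumPy and SciPy are substituted here purely for illustration, and the function name and toy data are hypothetical.

```python
import numpy as np
from scipy import stats


def pearson_with_p(x, y):
    """Pearson r (equation 1) and its P value via the F distribution (equation 2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    dy = y - y.mean()
    r = float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))
    # F statistic with (1, n - 2) degrees of freedom; undefined when |r| == 1
    f_stat = (n - 2) * r ** 2 / (1 - r ** 2)
    # 1 - F.DIST(...) is the survival function of the F distribution
    p = float(stats.f.sf(f_stat, 1, n - 2))
    return r, p


# Toy data (hypothetical values for two PEIs)
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [2, 1, 4, 3, 7, 5, 8, 6]
r_demo, p_demo = pearson_with_p(x_demo, y_demo)
```

The F-based P value is equivalent to the usual two-sided t-test for a correlation, so it should agree with `scipy.stats.pearsonr` on the same data.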
The total sample size required to detect a correlation coefficient r was determined for a two-sided α of 0.05 and a β of 0.20. If r is 0.05, 3,134 samples are required; if r is 0.10, 782 samples; if r is 0.25, 123 samples; and if r is 0.5, 29 samples. The general formula for the correlation sample size calculation is as follows (3) [23]:
(3) $N = \left[\dfrac{Z_{\alpha} + Z_{\beta}}{C}\right]^{2} + 3$
where r is the expected correlation coefficient, $C = 0.5 \times \ln[(1+r)/(1-r)]$, N is the total number of subjects required, and $Z_{\alpha}$ and $Z_{\beta}$ are the standard normal deviates for α and β.
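Formula (3) can be checked numerically with a short sketch. The rounded deviates 1.96 (two-sided α = 0.05) and 0.84 (β = 0.20) and rounding to the nearest integer are assumptions, chosen because they reproduce the four sample sizes quoted in the text; the function name is hypothetical.

```python
import math

Z_ALPHA = 1.96  # standard normal deviate for two-sided alpha = 0.05 (rounded, assumed)
Z_BETA = 0.84   # standard normal deviate for beta = 0.20 (rounded, assumed)


def required_sample_size(r, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Total sample size N = [(Z_alpha + Z_beta) / C]^2 + 3 for expected correlation r."""
    c = 0.5 * math.log((1 + r) / (1 - r))
    return round(((z_alpha + z_beta) / c) ** 2 + 3)


sizes = {r: required_sample_size(r) for r in (0.05, 0.10, 0.25, 0.5)}
```

Weaker expected correlations need far larger samples because C shrinks roughly in proportion to r, so N grows roughly with 1/r².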
In R, linear regression models (lm) were used to compare PEIs between the reported healthy and unhealthy states, adjusted for gender and age [23]. The odds ratios for HDL-C levels were calculated in R using a generalized linear model (glm) adjusted for age [24]. The correlation interaction network was constructed using qgraph [25].
Machine learning
Three machine learning models, kernel support vector machines (SVM) [22, 26], multilayer perceptrons (MLP) [23, 27-29], and random forests [30], were tested for their performance in predicting health states from PEIs. Multi-class prediction of the healthy state and each of the 34 unhealthy states with the MLP neural network did not work well. The inventors further attempted to distinguish the healthy state from each unhealthy state by binary classification, but the F1 value for each outcome was very close to zero. With the SVM algorithm for multi-class prediction, the highest F1 value, obtained for cholecystolithiasis, was 0.70, while the F1 value for most other types of disease was 0.00. The inventors also tried binary classification with SVM, but all the results were relatively poor. With the random forest algorithm for multi-class prediction (healthy and each of the 34 unhealthy conditions), the F1 value for the healthy condition reached 0.80-0.90, but the F1 values for the unhealthy states were only about 0.00-0.40. The inventors therefore selected the random forest algorithm and optimized it. First, because the numbers of samples in the healthy and unhealthy states are unevenly distributed, and in view of the law of large numbers [30], the inventors adopted a downsampling strategy to randomly draw the samples used. Since the data are severely class-imbalanced, a random undersampling method was employed, which balances the data by randomly selecting a subset of the data of a target class. Second, the inventors used a PEI feature extraction strategy to extract the PEIs contributing most to each healthy and unhealthy condition. Feature extraction adopts a univariate statistical strategy for automatic feature selection: univariate statistics select the features with high confidence based on the statistical significance of the relationship between each feature and the target. This can be done using the feature_selection module in scikit-learn.
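The two optimization steps described above, random undersampling and univariate feature selection, might be sketched as follows. This is a sketch under assumptions rather than the inventors' implementation: the text names only scikit-learn's `feature_selection` module, so the choice of `SelectPercentile` with the ANOVA F-test (`f_classif`), the helper names, and the synthetic data are all illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif


def random_undersample(X, y, seed=0):
    """Randomly keep as many samples of every class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


def select_top_peis(X, y, percentile=15):
    """Keep the top `percentile` percent of features ranked by a univariate F-test."""
    selector = SelectPercentile(score_func=f_classif, percentile=percentile)
    X_selected = selector.fit_transform(X, y)
    return X_selected, selector.get_support(indices=True)


# Synthetic, imbalanced data (hypothetical): 1000 healthy (0) vs 100 unhealthy (1),
# 20 features, of which columns 3, 7 and 11 carry signal.
rng = np.random.default_rng(42)
y_all = np.array([0] * 1000 + [1] * 100)
X_all = rng.normal(size=(1100, 20))
informative = [3, 7, 11]
X_all[:, informative] += 2.0 * y_all[:, None]

X_bal, y_bal = random_undersample(X_all, y_all)
X_top, selected = select_top_peis(X_bal, y_bal, percentile=15)
```

Because the selection scores each feature against the target on its own, it runs before and independently of whichever classifier is trained afterwards.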
Finally, for each healthy and unhealthy state, the top 15% or 16% most representative PEIs were selected for prediction by feature extraction. The advantage of this approach is that it is usually very fast and completely independent of the model applied after feature selection. The data were then randomly partitioned: 30% formed the test set, and the remaining 70% was randomly partitioned again, with 70% used as the training set for training the model and 30% as the validation set for evaluating the model. To improve the generalization performance of the model through parameter tuning, cross-validation with grid search was adopted, which can be implemented with GridSearchCV provided by scikit-learn.
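The partitioning and tuning procedure described above can be sketched with scikit-learn as follows. The synthetic data, the parameter grid, the 3-fold setting, the AUC scoring choice, and the random seeds are assumptions made for illustration; the text specifies only the 30%/70% split, the second 70%/30% split, the random forest model, and GridSearchCV.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic balanced data (hypothetical): 200 healthy (0), 200 unhealthy (1), 10 PEIs
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
X = rng.normal(size=(400, 10))
X[:, :3] += 1.5 * y[:, None]  # the first three columns carry signal

# 30% held out as the test set; the remaining 70% is split again 70/30
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.30, random_state=0
)

# Grid search with cross-validation on the training set (grid values are assumed)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the validation set
val_auc = roc_auc_score(y_val, search.predict_proba(X_val)[:, 1])
```

Keeping the test set out of both the grid search and the validation step leaves it available for a final, untouched estimate of generalization performance.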
The present invention is capable of other embodiments, and various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention.

Claims (9)

1. A method for predicting health state through physical examination indexes based on machine learning, characterized by comprising the following step: predicting the health state of sampled samples from physical examination indexes by adopting a random forest algorithm.
2. The machine learning-based method of predicting health status via physical examination indicators of claim 1, wherein: for healthy and unhealthy state samples, a down-sampling strategy is adopted to randomly draw the samples used, thereby obtaining the sampled samples, and a random under-sampling method is adopted to balance the data by randomly selecting a data subset of a target class.
3. The machine learning based method of predicting health status by physical examination indicators of claim 2, wherein: a physical examination index feature extraction strategy is adopted to extract the physical examination indexes that contribute most to each healthy and unhealthy state.
4. The machine learning based method of predicting health status by physical examination indicators of claim 3, wherein: the physical examination index feature extraction strategy is a univariate statistical strategy that performs automatic feature selection using feature_selection in scikit-learn.
5. The machine learning-based method of predicting health status via physical examination indicators of claim 4, wherein: under each healthy and unhealthy state, the first 15% or 16% of the representative physical examination indexes are extracted through feature extraction for prediction.
6. The machine learning-based method of predicting health status via physical examination indicators of claim 5, wherein: the number of representative physical examination indexes is 30.
7. The machine learning-based method of predicting health status via physical examination indicators of claim 1, wherein: the random forest algorithm model is established by randomly grouping the data, wherein 30% of the data form a test set, and the remaining 70% of the data are randomly grouped again, wherein 70% are used as a training set for training the model and 30% are used as a validation set for evaluating the model.
8. The machine learning-based method of predicting health status via physical examination indicators of claim 7, wherein: in the process of improving the generalization performance of the model by adjusting the parameters, a cross validation method of grid search is adopted.
9. The machine learning-based method of predicting health status via physical examination indicators of claim 8, wherein: the cross-validation method is implemented using GridSearchCV provided by scikit-learn.
CN202011311946.1A 2019-11-21 2020-11-20 Method for predicting health state through physical examination indexes based on machine learning Pending CN112825275A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911151420 2019-11-21
CN2019111514209 2019-11-21

Publications (1)

Publication Number Publication Date
CN112825275A true CN112825275A (en) 2021-05-21

Family

ID=75906556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011311946.1A Pending CN112825275A (en) 2019-11-21 2020-11-20 Method for predicting health state through physical examination indexes based on machine learning

Country Status (2)

Country Link
CN (1) CN112825275A (en)
WO (1) WO2021098842A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109671507A (en) * 2018-12-24 2019-04-23 万达信息股份有限公司 A kind of obstetrics' disease that calls for specialized treatment coupling index method for digging based on Electronic Health Record
CN109785976A (en) * 2018-12-11 2019-05-21 青岛中科慧康科技有限公司 A kind of goat based on Soft-Voting forecasting system by stages
CN110097975A (en) * 2019-04-28 2019-08-06 湖南省蓝蜻蜓网络科技有限公司 A kind of nosocomial infection intelligent diagnosing method and system based on multi-model fusion

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5438983A (en) * 1993-09-13 1995-08-08 Hewlett-Packard Company Patient alarm detection using trend vector analysis
US8377031B2 (en) * 2007-10-23 2013-02-19 Abbott Diabetes Care Inc. Closed loop control system with safety parameters and methods
CN107194138B (en) * 2016-01-31 2023-05-16 北京万灵盘古科技有限公司 Fasting blood glucose prediction method based on physical examination data modeling
CN107403072A (en) * 2017-08-07 2017-11-28 北京工业大学 A kind of diabetes B prediction and warning method based on machine learning
US20190108912A1 (en) * 2017-10-05 2019-04-11 Iquity, Inc. Methods for predicting or detecting disease
CN109119130A (en) * 2018-07-11 2019-01-01 上海夏先机电科技发展有限公司 A kind of big data based on cloud computing is health management system arranged and method
CN109119167B (en) * 2018-07-11 2020-11-20 山东师范大学 Sepsis mortality prediction system based on integrated model
CN109378072A (en) * 2018-10-13 2019-02-22 中山大学 A kind of abnormal fasting blood sugar method for early warning based on integrated study Fusion Model
CN110299205A (en) * 2019-07-23 2019-10-01 上海图灵医疗科技有限公司 Biomedicine signals characteristic processing and evaluating method, device and application based on artificial intelligence

Also Published As

Publication number Publication date
WO2021098842A1 (en) 2021-05-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210521