CN112825275A - Method for predicting health state through physical examination indexes based on machine learning
- Publication number: CN112825275A
- Application number: CN202011311946.1A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for calculating health indices; for individual health risk assessment
- A61B5/00: Measuring for diagnostic purposes; Identification of persons
- G06N3/00: Computing arrangements based on biological models
- G06N3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
- G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
The invention discloses a method for predicting health state through physical examination indexes based on machine learning. The method uses a random forest algorithm to predict the health state of sampled subjects from their physical examination indexes. Samples in healthy and unhealthy states are drawn from the randomly used samples with a down-sampling strategy, and the data are balanced with a random under-sampling method that randomly selects a data subset of the target class. The invention has good prediction performance, can help predict the actual condition of patients whose disease is worsening, identify "high-risk" patients, and provide the most relevant follow-up examinations for affected individuals.
Description
Technical Field
The invention relates to the medical field, and in particular to a method for predicting health status through physical examination indexes based on machine learning.
Background
Current physical examination conclusions generally give advice based on one or a few relatively independent leading indicators, so the results given are often ambiguous.
Because the correlations between physical examination indicators (PEIs) have not been studied systematically, most PEIs are currently used independently for disease warning. As a result, the diagnostic value of general physical examinations is very limited.
Compared with clinical medical treatment, the basic medical system as a whole has a greater impact on human health. Health checks can help healthy people understand their own bodily functions and maintain their health by changing unhealthy habits and avoiding risk factors that may lead to disease [2]. Physical examination can minimize the burden of disease. As the population grows and ages, the need for healthcare is increasing, and healthcare services are becoming more sophisticated and costly.
Health checks are a common element of healthcare in developed countries. They include general blood tests, urine tests, blood sugar tests, blood lipid tests, renal function tests, and the like. Currently, however, physical examination reports are mainly evaluated based on one or two independent PEIs, which provide only very limited information about the health status or disease diagnosis of the examinee [6]. Although it would be valuable for public health care to define a small number of easily measured PEIs, the correlations between PEIs in different physical states (e.g. healthy, hypertension, diabetes) have not been systematically studied with a view to diagnosing diseases accurately before they occur.
Recently, the volume of available health data has grown rapidly, and improving the quality of care is expected to improve population health while restraining cost increases. Health check centers can generate large data sets that may reveal potential health issues that would otherwise go undiscovered. Clinically, more and more investment is going into medical big data applications, such as artificial intelligence (AI)-based applications that diagnose diseases from clinical images. Although AI can save costs and improve efficiency, particularly for early diagnosis and prevention of chronic diseases, no model for predicting physical condition from PEIs has been produced so far, owing to the lack of systematic analysis of PEIs under different conditions.
Disclosure of Invention
The invention aims to provide a method for predicting health status through physical examination indexes based on machine learning, which can help predict the actual condition of patients whose disease is worsening, identify "high-risk" patients, and provide the most relevant follow-up examinations for affected individuals.
To achieve this purpose, the invention adopts the following technical scheme:
The invention discloses a method for predicting health states through physical examination indexes based on machine learning.
Preferably, for the healthy and unhealthy state samples, a down-sampling strategy is applied to the randomly used samples to obtain the sampled samples, and a random under-sampling method is used to balance the data by randomly selecting a data subset of the target class.
Preferably, a physical examination index feature extraction strategy is used to extract the physical examination indexes that contribute most to each healthy and unhealthy state.
Preferably, the physical examination index feature extraction strategy is a univariate statistics strategy that performs automatic feature selection using the feature_selection module of scikit-learn.
Preferably, in each of the healthy and unhealthy states, the top 15% or 16% of representative physical examination indexes are extracted by feature extraction for prediction.
Preferably, the number of representative examination indexes is 30.
Preferably, the random forest algorithm model is established by randomly grouping the data: 30% of the data form the test set, and the remaining 70% are randomly grouped again, with 70% used as the training set for training the model and 30% as the validation set for evaluating the model.
Preferably, when the generalization performance of the model is improved by adjusting parameters, grid search with cross-validation is used.
Preferably, the cross-validation method is implemented using GridSearchCV provided by scikit-learn.
The invention has good prediction performance, can help predict the actual condition of patients whose disease is worsening, identify "high-risk" patients, and provide the most relevant follow-up examinations for affected individuals.
The invention makes it possible to determine the correlations between PEIs in a healthy state and in an unhealthy state (i.e., a state with underlying chronic disease), and to clarify how these PEIs differ between individuals with chronic disease and normal individuals in order to find candidate disease markers. With the machine learning model of the invention, the health of an individual can be predicted using only a complete set of PEIs, without detailed clinical examination information.
Drawings
FIG. 1 shows the machine learning prediction results for 35 physical states based on the random forest algorithm.
FIG. 2 is a graph showing the correlations of PEIs detected in a healthy population.
FIG. 3 shows representative candidate markers for unhealthy physical conditions.
FIG. 4 is a graph showing the ratio of the plasma HDL-C concentration in individuals in normal physical condition to that in patients with diabetes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
Based on general physical examination data, the method provides an algorithm for predicting the onset of common diseases. The predictive performance of the physical examination indexes was first evaluated with three machine learning models: a kernelized support vector machine (SVM), a multilayer perceptron (MLP) and a random forest. The random forest algorithm was then selected and further optimized.
First, because the numbers of healthy and unhealthy samples are unevenly distributed, and in view of the law of large numbers, the inventors adopted a down-sampling strategy for the randomly used samples. Because the data are severely class-imbalanced, a random under-sampling method is used, which balances the data by randomly selecting a data subset of the target class.
Secondly, a physical examination index feature extraction strategy is used to extract the physical examination indexes that contribute most to each healthy and unhealthy state. Feature extraction uses a univariate statistical strategy for automatic feature selection: univariate statistics select the features with high confidence according to the statistical significance of the relationship between each feature and the target. This process can be implemented with the feature_selection module in scikit-learn. Finally, in each of the healthy and unhealthy states, the top 15% or 16% of representative examination indexes (about 30) are extracted by feature extraction for prediction. The advantage of this approach is that it is typically very fast and completely independent of the model applied after feature selection.
The data are then randomly grouped: 30% form the test set, and the remaining 70% are randomly grouped again, with 70% used as the training set for training the model and 30% as the validation set for evaluating the model.
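The sampling and splitting steps described above can be sketched in plain Python. This is only an illustrative sketch, not the inventors' code: the function names `undersample` and `nested_split`, the toy class sizes and the fixed seed are all assumptions, and under-sampling is applied here to every class larger than the smallest one.

```python
import random
from collections import defaultdict


def undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        # rng.sample draws n_min distinct rows without replacement
        for row in rng.sample(members, n_min):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return balanced


def nested_split(items, test_frac=0.30, val_frac=0.30, seed=0):
    """Hold out test_frac as the test set, then val_frac of the remainder for validation."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# Toy data (hypothetical): 900 "healthy" (0) and 100 "unhealthy" (1) records.
rows = list(range(1000))
labels = [0] * 900 + [1] * 100
balanced = undersample(rows, labels)
train, val, test = nested_split(balanced)
print(len(balanced), len(train), len(val), len(test))
```

With the nested 70/30 split, the training set ends up at roughly 49% of the balanced data, the validation set at 21% and the test set at 30%.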
When the generalization performance of the model is improved by adjusting parameters, grid search with cross-validation is used, implemented with GridSearchCV provided by scikit-learn.
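A minimal sketch of how univariate feature selection, a random forest and grid search with cross-validation might be combined in scikit-learn follows. The source names only the `feature_selection` module and `GridSearchCV`; the choice of `SelectPercentile` with `f_classif`, the parameter grid, the 3-fold setting and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a balanced healthy/unhealthy table with 221 indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 221))
y = rng.integers(0, 2, size=400)
X[:, :5] += y[:, None] * 1.5  # make the first five indicators informative

# Hold out 30% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Keep the top 15% of indicators by univariate F-test, then fit a random forest.
pipeline = Pipeline([
    ("select", SelectPercentile(score_func=f_classif, percentile=15)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Hypothetical grid; the source does not state which parameters were tuned.
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(n_selected, search.best_params_, search.score(X_test, y_test))
```

Placing the selector inside the pipeline means it is refitted on each cross-validation fold, so the held-out fold does not influence which indicators are kept.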
As shown in FIG. 1 and Table 1, the AUC reached 66%-99% (mean 87.6%) in the random forest predictions for each pair of healthy and unhealthy physical states. For classification, AUC values above 90% indicate excellent performance, and values from 80% to 90% indicate good performance. The inventors' algorithm provides high-accuracy prediction for 18 of the 34 unhealthy physical states (AUC > 90%) and good performance for another 9 unhealthy physical states (90% > AUC > 80%). The algorithm performed especially well for heart-related diseases. These results show that, using a small number of samples and only the 15%-16% of the 221 physical examination indexes retained by feature extraction, the inventors' random forest algorithm provides good performance for the prediction of most unhealthy physical states.
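For readers unfamiliar with the metric, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen unhealthy sample receives a higher score than a randomly chosen healthy one. The sketch below is illustrative only; the function names, the band labels and the four toy scores are assumptions, and the thresholds follow the 90% and 80% cut-offs quoted in the text.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def performance_band(auc):
    """Bands using the cut-offs quoted in the text (labels are illustrative)."""
    if auc > 0.90:
        return "high accuracy"
    if auc > 0.80:
        return "good"
    return "below 0.80"


# Hypothetical scores for two healthy (0) and two unhealthy (1) samples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc, performance_band(auc))
```

The pairwise loop is quadratic in the number of samples, which is fine for an illustration; library implementations use sorting instead.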
Current physical examination conclusions generally give advice based on one or a few relatively independent leading indicators, so many of the results given are ambiguous and of very limited value for judging the health of examinees. A more accurate index system and method for judging the health of examinees is urgently needed. The inventors developed a random forest machine learning algorithm that can predict diseases using 15%-16% of the 221 physical examination indexes, with good prediction performance (AUC: 66%-99%, average 86%). For each disease, the inventors defined about 30 contributing physical examination indexes by feature extraction. In most of the inventors' prediction algorithms, only a few hundred samples were enough to provide good predictive performance for many chronic diseases. This suggests that machine learning on physical examination index data can help predict the actual condition of patients whose disease is worsening, identify "high-risk" patients, and provide the most relevant follow-up examinations for affected individuals. The machine learning algorithm developed by the inventors can be applied immediately in clinical practice to assist in interpreting the results of a physical examination.
TABLE 1. Prediction performance of the model. ROC: receiver operating characteristic; AUC: area under the curve.
As healthcare reform has made remarkable progress in expanding insurance coverage, the general physical examination industry is now accumulating large amounts of data. Using a very large general health check data set from the Chinese population, the invention has three main goals: to determine the correlations between PEIs in healthy and unhealthy individuals (i.e. patients with underlying chronic disease); to clarify how these PEIs differ between chronic disease and normal states in order to find candidate disease markers; and to develop a set of machine learning models that can predict the health of an individual from a complete set of PEIs. To address these issues, the inventors included physical examination data from 803,614 individuals who visited a health check center between 2013 and 2018. The inventors included data for 221 PEIs associated with 35 physical conditions, most of which were unhealthy conditions due to chronic disease.
As shown in FIGS. 2, 3 and 4, participants represented 35 health conditions, defined by healthy condition or underlying disease condition (unhealthy condition). Specifically, the study population included 711,928 healthy participants, 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 participants in other unhealthy states. The correlation between each pair of PEIs was calculated in R using the Pearson correlation coefficient (PCC). Linear regression models (lm in R), adjusted for gender and age, were used to compare PEIs between healthy and unhealthy conditions. A random forest machine learning algorithm was used for health state prediction. There are extensive correlations between PEIs in different physical conditions, and abundant PEI differences were observed between healthy and unhealthy physical conditions. Machine learning algorithms can be used to predict physical state from a set of PEIs collected in routine physical examination. The inventors found abundant correlations between PEIs in the healthy physical condition (7,662 significant correlations, accounting for 31.5% of all correlations). However, the PEI correlations change under disease conditions. The inventors further examined the PEI differences between healthy and 35 unhealthy states and found 1,239 significant PEI differences, suggesting that these PEIs are candidate disease markers. Finally, the inventors developed a machine learning algorithm that predicts health using the 15%-16% of PEIs retained by feature extraction, with prediction performance of 66%-99% across physical states.
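The source reports that the Pearson correlation coefficient was computed in R. Purely as an illustration of the statistic itself, a plain Python sketch is given below; the function name `pearson` and the five pairs of BMI and SBP readings are hypothetical and are not taken from the study data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered cross-products and squares; the 1/(n-1) factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings for two indicators across five participants.
bmi = [19.0, 22.5, 24.0, 27.5, 31.0]
sbp = [105.0, 112.0, 118.0, 126.0, 139.0]
r = pearson(bmi, sbp)
print(round(r, 3))
```

Applying such a function to every pair of indicators yields the correlation matrix whose entries are then tested for significance.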
This new encyclopedia of PEI correlations provides rich information for the diagnosis of chronic diseases. The machine learning algorithm developed by the inventors could have a far-reaching influence on routine physical examination practice.
The inventors included 803,614 individuals who attended the health management center and physical examination center between 2013 and 2018. Participants represented 35 health conditions, defined by healthy condition or underlying disease condition (unhealthy condition). Specifically, the study population included 711,928 healthy participants, 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 participants with other unhealthy conditions (mainly chronic disease) (Table 1). The inventors included 221 PEIs in the analysis, including patient demographic information (age and gender) and lifestyle indicators (alcohol consumption, tobacco usage, etc.).
PEI correlation of participants in a healthy condition
The primary goal of the inventors was to explore PEI correlations under healthy conditions, in the hope of mapping the overall landscape. Among the 24,322 PEI pairs formed from the 221 PEIs, the inventors found 7,662 significant correlations (31.5%) among all persons in a healthy condition (P < 0.05/24,322 PEI pairs = 2 × 10^-6; N = 711,928, average age 41.4, 45.7% female). This finding indicates that there is a wide correlation between PEIs. The top 50 correlated PEIs include gender, age, red blood cell count, prealbumin (PAB), history of alcohol consumption (drinking), alkaline phosphatase level (ALP), tobacco usage (smoking), etc. The number of significantly correlated PEIs among the 221 PEIs also indicates rich correlation between the PEIs. Some of these correlations are consistent with the published literature on healthy PEIs, but most are newly discovered in this study.
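The significance threshold quoted above is a Bonferroni correction: the nominal level of 0.05 is divided by the number of PEI pairs tested. The arithmetic can be checked with a few lines of Python; the constant and function names are illustrative, while the counts are those reported in the text.

```python
ALPHA = 0.05
N_PAIRS = 24_322        # PEI pairs tested, as reported in the text
N_SIGNIFICANT = 7_662   # significant correlations reported for healthy participants

threshold = ALPHA / N_PAIRS          # per-test significance level
fraction = N_SIGNIFICANT / N_PAIRS   # share of pairs that passed the threshold


def is_significant(p_value, n_tests=N_PAIRS, alpha=ALPHA):
    """Bonferroni rule: compare each p-value with alpha divided by the number of tests."""
    return p_value < alpha / n_tests


print(f"{threshold:.2e}", f"{fraction:.1%}")
```

The threshold comes out at about 2.06 × 10^-6, which the text rounds to 2 × 10^-6, and the share of significant pairs matches the reported 31.5%.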
Census PEI showed high correlation with each other or other PEI. For example, gender showed the most abundant PEI association (151 PEI pairs, male versus female), including hemoglobin (Hb), creatinine, Uric Acid (UA), alcohol consumption, smoking, Body Mass Index (BMI), etc., reflecting differences in physical form, physical constitution and lifestyle among men and women. Age also showed strong PEI correlations (125 PEI pairs), such as estimated glomerular filtration rate (egfr b), Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), albumin (Alb), and low density lipoprotein (LDL-C). These findings indicate that with age, there is a systemic change in body function (fig. 1, fig. 2, supplementary table 1). The inventors also found that the correlation of 124 PEI's with BMI reflects a strong influence of body shape on PEI, including UA, high density lipoprotein (HDL-C), SBP and DBP. Blood Pressure (BP) has many physiological implications, and the inventors have identified a group of PEI's that correlate with Blood Pressure (BP), including 125 PEI's for DBP and 124 PEI's for SBP (fig. 1, fig. 2, supplementary table 1). . Intraocular pressure (IOP) is an important factor in the diagnosis of glaucoma [12 ]. The inventors found that 79 PEI's were weakly associated with IOP (IOP-L) in the left eye, including IOP (IOP-R) SBP, DBP, Alb, BMI, TG, ApoB, alcohol and TC in the right eye. Similar to IOP-L, 73 PEI's were weakly associated with IOP-R.
As expected, the lipid PEIs showed many correlations. For example, 119 PEIs are associated with triglycerides (TG). The inventors found 122 HDL-C-related PEIs, many of them negatively correlated, including TG, UA and BMI (fig. 1, fig. 2, supplementary table 2). The correlation patterns of LDL and HDL show opposite trends. Unexpectedly, lifestyle has a profound effect on the body. Consistent with this, the inventors detected 130 drinking-related PEIs, such as gender, smoking, Hb, and UA. Likewise, 128 PEIs are associated with smoking, including drinking, gender, and age. The inventors also detected 58 PEIs weakly associated with exercise habits (E habits), including age, eGFR, and SBP. Expression of tumor markers may indicate the development and progression of tumors.
PEI correlation in persons with unhealthy physical condition
Next, the inventors examined PEI correlations in 34 unhealthy physical states and again found rich correlations. The number of significant PEI correlations was lower in unhealthy physical states than in the healthy state, probably due to sample size effects (Table 1, supplementary Tables 2-35). Each unhealthy physical state has its own correlation profile, and most of these correlations are newly discovered in this study. For example, in the hypertensive population, the inventors found 4,413 significant correlations (18.3%) among the 24,322 PEI pairs of the 221 PEIs. PEIs with an increased number of correlations included monocytes (MON) (70 in hypertension vs. 6 in the healthy state; the same order is used below), quantitative detection of hepatitis B virus DNA (HBV-DNA) (76 vs. 33), quantitative detection of hepatitis C virus RNA (HCV-RNA) (49 vs. 8), and the like (supplementary Table 2). The number of RH blood group correlations was higher in people with hypertension and coronary heart disease (hypertension + coronary heart disease) than in healthy people (41 vs. 9). In contrast, the number of homocysteine (Hcy) correlations was greatly reduced in unhealthy versus healthy persons (2 vs. 120). In diabetes, 10 PEIs showed increased correlations, while the remaining 195 showed decreased correlations. PEIs with increased correlations include MON (41 vs. 6), HCV-RNA (42 vs. 8), anti-Scl-70 (59 vs. 31) and HCV-cAg (35 vs. 10) (supplementary Table 17). These results indicate that PEIs undergo systemic changes in unhealthy conditions and that each disease has its own specific PEI profile.
Next, the inventors explored the correlation networks between PEIs using qgraph [13], which displays the links between PEIs. In the healthy state, PEIs showed abundant interactions in both the positive and negative directions. Each unhealthy physical condition showed its own unique PEI interaction network. These results indicate that multiple indicators depend on one another in each physical state, and that this dependency can be used in assessing physical condition.
Candidate PEI markers for unhealthy body conditions
To verify known and discover new candidate biomarkers and lifestyle habits relevant to early diagnosis of disease, the inventors next calculated the differences in the 221 PEIs between healthy and unhealthy states. In total, the inventors found 1,239 significant PEI differences between the healthy state and the 34 unhealthy states (P < 0.05/34 = 0.0014, adjusted for 34 unhealthy states). For example, 112 PEIs differed between hypertensive and healthy persons, 100 between persons with hypertension plus diabetes and healthy persons, and 91 between diabetic and healthy persons. Some of these differences are consistent with previous findings, while others are newly discovered.
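The two significance thresholds quoted in this description (0.05 divided by the 24,322 PEI pairs, and 0.05 divided by the 34 unhealthy states) have the form of a Bonferroni-style correction. The following minimal sketch only reproduces that arithmetic; the function name is hypothetical and the label "Bonferroni" is an interpretation, not a term used in the original text.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold: alpha divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Threshold for the pairwise PEI correlations (24,322 pairs, as stated in the text).
pair_threshold = bonferroni_threshold(0.05, 24_322)

# Threshold for healthy-versus-unhealthy comparisons (34 unhealthy states).
state_threshold = bonferroni_threshold(0.05, 34)

print(f"pairwise threshold: {pair_threshold:.2e}")
print(f"per-state threshold: {state_threshold:.5f}")
```

A P value would then be called significant only if it falls below the corresponding threshold.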
For many of the 221 PEIs, the inventors detected differences between healthy and unhealthy states, especially in PEIs related to physique, lifestyle and blood lipids (supplementary table 36). For BMI, for example, the inventors found differences from the healthy state in 16 of the 34 unhealthy physical conditions, including hypertension (P = 0) and gout (P = 6.48 × 10^-90). Exercise habits (E habits) differed in 19 unhealthy states, including hyperlipidemia (P = 1.28 × 10^-277) and diabetes (P = 4.20 × 10^-29). Dietary habits differed in 10 unhealthy conditions, including chronic pharyngitis (P = 2.59 × 10^-19) and cholecystolithiasis (P = 9.43 × 10^-18). Alcohol intake habits differed in 20 unhealthy conditions, including hyperlipidemia (P = 0), coronary heart disease (P = 4.06 × 10^-24), diabetes (P = 1.09 × 10^-22), and Parkinson's disease (P = 1.43 × 10^-17). Smoking habits differed from the healthy condition in 18 unhealthy conditions, including hypertension (P = 2.74 × 10^-114), hyperlipidemia (P = 2.69 × 10^-62), and Parkinson's syndrome (P = 5.12 × 10^-29). IOP-R differed from the healthy state in five unhealthy states, including hypertension (P = 3.63 × 10^-85) and diabetes (P = 2.01 × 10^-73); similar findings were made for IOP-L. For lipid PEIs, the inventors also observed differences between the 34 unhealthy states and the healthy state. For example, LDL-C differed in 21 unhealthy states, including hypertension (P = 0) and diabetes (P = 2.95 × 10^-212), and HDL-C differed in 17 unhealthy states, including diabetes (P = 1.92 × 10^-177). Further detailed analysis of HDL-C and diabetes found that populations with low HDL-C are at significantly higher risk for diabetes than those in the average range for this population (1.26-1.75 mmol/L).
Tumor-associated antigens also show significant differences between healthy and unhealthy states. For example, CYFRA21-1 differed in 10 unhealthy states, including hypertension + diabetes (P = 3.71 × 10^-97) and diabetes (P = 4.52 × 10^-70). CEA1 differed in 12 unhealthy states, including hypertension + coronary heart disease (P = 9.59 × 10^-29) and diabetes (P = 1.73 × 10^-18). Alpha-fetoprotein (AFP) differed in liver disease (P = 1.08 × 10^-28). C-PSA differed in hypertension + coronary heart disease (P = 8.38 × 10^-20). Finally, the carbohydrate antigen CA724 (CA 72-4) differed in asthma (P = 9.92 × 10^-13), gout (P = 3.53 × 10^-7) and coronary heart disease + diabetes (P = 4.06 × 10^-5) (supplementary table 36). The inventors also found significant differences between healthy and unhealthy states in other PEIs. For example, urine glucose levels (U-GLU) differed in 9 unhealthy states, including diabetes and its related diseases. Eosinophil rate (EO%) differed in five unhealthy states, including asthma (P = 1.38 × 10^-129) and nasal allergy (P = 4.05 × 10^-18). Whole blood iron levels (WB-Fe) differed in 11 unhealthy states, including hypertension (P = 2.52 × 10^-69). PH differed in 11 unhealthy states, including diabetes (P = 1.97 × 10^-239), hypertension (P = 2.41 × 10^-166), hypertension + diabetes (P = 9.90 × 10^-32) and gout (P = 9.82 × 10^-15). Potassium (K+) differed in five unhealthy states, including hypertension (P = 1.98 × 10^-119) and hepatitis B (P = 3.13 × 10^-10). Magnesium (Mg2+) differed in hypertension + diabetes (P = 3.14 × 10^-58) and diabetes (P = 5.10 × 10^-52). Hcy (an index of cardiovascular disease) differed in eight unhealthy states, including hypertension (P = 1.97 × 10^-136) and Parkinson's syndrome (P = 1.76 × 10^-7) (supplementary table 36).
These results provide a set of candidate markers for early diagnosis of chronic disease.
A key objective of the present invention is to apply PEI data and machine learning techniques to develop algorithms that can predict the onset of common illnesses from routine physical examinations. The inventors tried three machine learning models: kernel support vector machines (SVM), multilayer perceptrons (MLP) and random forests. Since the SVM and MLP prediction models gave only very low accuracy and sensitivity on the inventors' initial training data, these models were excluded from further training. In the initial training, random forests performed better than SVM and MLP but did not perform well in multi-class classification of all body conditions. The inventors therefore classified each pair of healthy and unhealthy body conditions (e.g., hypertension versus healthy; Parkinson's syndrome versus healthy) using binary classification, and obtained better performance than with multi-class classification. The inventors then optimized this prediction algorithm. Since the data are severely class-imbalanced, a random undersampling method is employed that balances the data by randomly selecting a subset of the data of a target class. For each physical state, the top 15% or 16% most representative PEIs were extracted by feature extraction for prediction. The advantage of this approach is that it is usually very fast and completely independent of the model applied after feature selection.
Finally, in the random forest predictions for each pair of healthy and unhealthy body conditions, the area under the receiver operating characteristic curve (AUC) varied with the unhealthy body condition (87.6% on average). For classification, AUC values above 90% indicate excellent performance, and AUC values from 80% to 90% indicate good performance. The inventors' algorithm provides high-accuracy predictions for 18 of the 34 unhealthy body conditions (AUC > 90%) and good predictions for another 9 (90% > AUC > 80%). Heart disease was predicted with excellent performance. For example, by extracting 30 PEI features (age, white blood cell count, monocytes, Mon%, mean red blood cell volume, red blood cell count, red blood cell distribution width, lymphocyte rate, platelet count, low density lipoprotein, high density lipoprotein, total cholesterol, carcinoembryonic antigen 1, albumin globulin, cystatin C, glucose, urine glucose, urinary creatinine, estimated glomerular filtration rate, creatinine, urea, waist circumference, waist-hip ratio, body mass index, surgical history, systolic blood pressure, height, neck size, and medical history), hypertension + diabetes + coronary heart disease reached an AUC of 99% using only 909 training samples and 387 validation samples (F1 score (95% CI): 0.96 (0.95-0.96); accuracy (95% CI): 0.95 (0.94-0.97); specificity (95% CI): 0.95 (0.94-0.95); recall (sensitivity) (95% CI): 0.95 (0.94-0.97)). Parkinson's syndrome reached an AUC of 97% using 192 training samples and 83 validation samples (F1 score (95% CI): 0.91 (0.90-0.91); accuracy (95% CI): 0.90 (0.89-0.90); specificity (95% CI): 0.87 (0.79-0.94); recall (95% CI): 0.90 (0.89-0.91)).
For liver fat infiltration, the algorithm also provided good predictive performance using 803 training samples and 115 validation samples (F1 score (95% CI): 0.82 (0.78-0.87); accuracy (95% CI): 0.81 (0.76-0.86); specificity (95% CI): 0.75 (0.67-0.82); recall (95% CI): 0.82 (0.77-0.87); and AUC (95-0.92): 0.94-0.94). The lowest predictive performance observed in this study was an AUC (95% CI) of 0.66 (0.60-0.72). The inventors' algorithm also provides a good prediction when all unhealthy physical conditions are grouped into a single "unhealthy" condition: F1 score (95% CI): 0.83 (0.83-0.83); accuracy (95% CI): 0.82 (0.82-0.82); specificity (95% CI): 0.81 (0.81-0.81); sensitivity (95% CI): 0.84 (0.84-0.84); and AUC (95% CI): 0.90 (0.90-0.90). These results show that, using feature extraction of PEIs (15-16% of all 221 PEIs) and only a small number of samples, the inventors' random forest algorithm performed well for most unhealthy body condition predictions.
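The metrics reported above (accuracy, specificity, recall/sensitivity, F1 score and AUC) can be illustrated with a small pure-Python sketch. It assumes labels coded as 1 for the unhealthy state and 0 for the healthy state; the function names and toy data are hypothetical, and the confidence intervals reported in the text are not computed here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, specificity, recall and F1 from binary labels (1 = unhealthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)          # sensitivity
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": specificity,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def auc_from_scores(y_true, scores):
    """AUC as the fraction of (unhealthy, healthy) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 of 4 unhealthy and 3 of 4 healthy samples are classified correctly.
metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(metrics, auc)
```

AUC is computed from predicted scores rather than hard labels, which is why it is reported separately from the threshold-dependent metrics.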
The present invention maps 221 routine PEIs using physical examination data obtained from 803,614 individuals in China covering 35 healthy or unhealthy physical conditions (primarily chronic diseases). The inventors detected a large number of correlations between PEIs in healthy and unhealthy physical states; furthermore, these correlations differ among the 34 unhealthy physical conditions analyzed. Most of the correlations were newly discovered in this study. The inventors found a wide range of associations between PEIs, such as gender, age, BMI, blood lipids, blood pressure, cancer-related indicators and lifestyle (including drinking, smoking and exercise habits). A better understanding of these PEI interactions will help explain the mechanisms and pathogenesis of disease. The inventors' results fill a gap in systematic PEI analysis and provide rich information on how PEIs reflect underlying health conditions. These findings provide abundant information for further improving healthcare research and clinical practice.
One of the unexpected findings in the inventors' analysis is that, in hypertensive patients, HBV-DNA and HCV-RNA show more correlations with other PEIs than in healthy people. Likewise, the inventors found strong correlations between hepatitis C virus and other PEIs in diabetes, suggesting that patients infected with hepatitis C may be more susceptible to diabetes. These findings suggest that viral infection may make individuals more susceptible to chronic disease. For such patients, antiviral therapy may be considered in the treatment of hypertension and diabetes.
The discovery and development of biomarkers for clinical research, diagnosis and therapy monitoring in clinical trials is a key area of medicine and healthcare [14]. In this study, the inventors propose a number of candidate markers for chronic diseases. For example, the inventors found that IOP, which is considered a relatively independent marker of glaucoma [15], is closely associated with hypertension, diabetes, and hypertension with diabetes. These results indicate that intraocular pressure may be affected to some extent by systemic diseases and may be used as one of the clinical markers for early diagnosis of these diseases. The inventors' results confirm that low levels of HDL-C are a risk factor for diabetes, particularly in women [16]. This result suggests that raising HDL-C levels through dietary supplementation may be an effective way to prevent diabetes in persons with low HDL-C levels. However, according to the inventors' results, excessive HDL-C is also a risk factor. Therefore, supplementation should aim to bring HDL-C levels within the normal range [17]. Compared with the healthy population, the inventors found a significant increase in AFP in liver disease, consistent with elevated AFP being a risk factor for primary liver cancer in liver disease [18]. Potassium ions [19] and chloride ions have a significant effect on hypertension, while magnesium ions have a significant effect on diabetes, suggesting that modulation of these ions may affect these diseases. Exercise, smoking and drinking habits have a deeper impact on the body than the inventors expected. For example, a history of exercise, alcohol consumption or smoking has a strong impact on hyperlipidemia [20], as shown by comparison with healthy persons. This finding suggests that hyperlipidemia could be improved by adjusting these lifestyle habits.
Because current physical examination conclusions are typically based on single or a few relatively independent indicators, many of the reported results are ambiguous and their value for judging the health status of the examinee is very limited [21]. There is a pressing need for a more accurate index system and method for determining the health condition of an examinee. In the final part of the study, the inventors developed a random forest machine learning algorithm that predicts disease from 15%-16% of all 221 PEIs with good predictive performance (AUC: 66%-99%; average 86%). For each disease, the inventors identified about 30 contributing PEIs by feature extraction. For most of the inventors' prediction algorithms, only a few hundred samples are required to provide good prediction performance for many chronic diseases. This finding suggests that machine learning based on PEI data can be used to help predict the true condition of the examinee, identify "high risk" patients and indicate the follow-up examinations most relevant to the affected individual.
In summary, the inventors systematically explored various PEI's and their relationship to chronic disease and established a machine learning predictive model to predict health. This study provides rich information to better understand the physiological and pathological characteristics of the human body as a system. Importantly, the inventors have determined modifiable factors and directions for the prediction, diagnosis and treatment of disease. The machine learning algorithm developed by the inventor may be affected
PEI data were obtained from 803,614 Han Chinese individuals who visited the Health Management Center and Physical Examination Center of Sichuan Provincial People's Hospital between 2013 and 2018. The cohort covered a total of 35 different health conditions, including 711,928 healthy participants and 91,686 unhealthy participants. The unhealthy population included 46,981 hypertensive patients, 11,745 diabetic patients and 32,960 patients in other unhealthy states.
Regarding the PEIs measured, the study included only PEIs recorded by the same method. A total of 229 PEIs were initially collected; 8 rarely measured PEIs were excluded, leaving 221 PEIs for further analysis (table 1). These PEIs include biochemical marker levels and blood test results. The patients' lifestyle and disease status were also surveyed during physical examination.
Data processing
PEIs recorded as string variables are converted to integer variables for data analysis. Categorical variables are numerically encoded for further calculation. Mean imputation is used for missing data. For individuals who underwent more than one physical examination, the average value of each PEI was used for data analysis.
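The data processing steps above can be sketched with pandas. This is an illustrative sketch only: the column names (`person_id`, `sex`, `sbp`), the use of `pd.factorize` for integer encoding, and the order of operations (encode, impute, then average per person) are assumptions, since the original text does not specify the tooling for this step.

```python
import pandas as pd


def preprocess(records: pd.DataFrame, id_col: str = "person_id") -> pd.DataFrame:
    """Encode non-numeric PEIs, impute missing values by column mean,
    and average repeated examinations of the same person."""
    df = records.copy()
    value_cols = [c for c in df.columns if c != id_col]

    # 1) Encode string/categorical PEIs as integer codes; missing stays missing.
    for col in value_cols:
        if not pd.api.types.is_numeric_dtype(df[col]):
            codes, _ = pd.factorize(df[col])  # -1 marks missing values
            encoded = pd.Series(codes, index=df.index, dtype="float64")
            df[col] = encoded.where(codes >= 0)

    # 2) Mean imputation: replace missing values with the column mean.
    df[value_cols] = df[value_cols].fillna(df[value_cols].mean())

    # 3) One row per person: average each PEI over repeated examinations.
    return df.groupby(id_col)[value_cols].mean()


# Hypothetical toy records: person 1 was examined twice, person 2 lacks an SBP value.
records = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "sex": ["M", "M", "F", "F"],
    "sbp": [120.0, 130.0, None, 110.0],
})
clean = preprocess(records)
print(clean)
```

Averaging integer codes is only meaningful when a categorical PEI is constant across a person's examinations, which is the case in this toy example.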
Statistical analysis
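The Pearson correlation coefficient (PCC) method is used in R to calculate the correlation between two PEIs (e.g., x and y); the method measures the linear correlation between two variables. The PCC correlation (r) (1) and P value (2) were calculated using the following equations [22]:

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} \tag{1} $$

$$ P = 1 - F_{1,\,n-2}\!\left(\frac{(n-2)\,r^{2}}{1-r^{2}}\right) \tag{2} $$

where $F_{1,\,n-2}$ is the cumulative distribution function of the F distribution with 1 and $df = n-2$ degrees of freedom (F.DIST in spreadsheet notation), and $n$ is the number of x-y data pairs.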
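Equations (1) and (2) can be checked numerically. The sketch below is written in Python with NumPy and SciPy rather than R (an assumption made for illustration; the original analysis was done in R), uses synthetic data, and compares the result against `scipy.stats.pearsonr`. The survival function `f.sf` is used in place of `1 - cdf` because it is numerically more stable for very small P values.

```python
import numpy as np
from scipy import stats


def pearson_r_and_p(x, y):
    """Pearson r per equation (1) and its P value per equation (2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    dy = y - y.mean()
    r = float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))
    # F statistic with 1 and n - 2 degrees of freedom.
    f_stat = (n - 2) * r ** 2 / (1 - r ** 2)
    p = float(stats.f.sf(f_stat, 1, n - 2))  # equals 1 - F.DIST(...)
    return r, p


# Synthetic example: two weakly correlated variables.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

r, p = pearson_r_and_p(x, y)
r_ref, p_ref = stats.pearsonr(x, y)
print(r, p)
```

The F-based P value agrees with the two-sided t-test P value because the square of a t statistic with n - 2 degrees of freedom follows an F distribution with 1 and n - 2 degrees of freedom.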
For a given correlation coefficient (r), the total sample size required with two-sided α = 0.05 and β = 0.20 is as follows: if r is 0.05, 3,134 samples are required; if r is 0.10, 782 samples are required; if r is 0.25, 123 samples are needed; if r is 0.5, 29 samples are needed. The general formula for the correlation sample size calculation is as follows (3) [23]:

$$ N = \left[\frac{Z_{\alpha} + Z_{\beta}}{C}\right]^{2} + 3, \qquad C = 0.5 \times \ln\!\left[\frac{1+r}{1-r}\right] \tag{3} $$

where $r$ is the expected correlation coefficient, $N$ is the total number of subjects required, $Z_{\alpha}$ is the standard normal deviate for α, and $Z_{\beta}$ is the standard normal deviate for β.
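Formula (3) can be evaluated directly with the Python standard library. This is a minimal sketch assuming a two-sided α and a one-sided β, as stated in the text; the function name is hypothetical. Values computed this way can differ by a few units from tabulated figures, depending on the rounding convention used.

```python
import math
from statistics import NormalDist


def correlation_sample_size(r: float, alpha: float = 0.05, beta: float = 0.20) -> float:
    """Total sample size N = [(Z_alpha + Z_beta) / C]^2 + 3 for an expected r."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    c = 0.5 * math.log((1 + r) / (1 - r))          # Fisher z-transform of r
    return ((z_alpha + z_beta) / c) ** 2 + 3


for r in (0.10, 0.25, 0.5):
    print(r, correlation_sample_size(r))
```

Because C grows with r, the required sample size falls quickly as the expected correlation becomes stronger.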
In R, linear regression models (lm) adjusted for gender and age were used to compare PEIs between healthy and unhealthy states [23]. The odds ratio of HDL-C levels was calculated in R using a generalized linear model (glm) adjusted for age [24]. The correlation interaction networks were drawn using qgraph [25].
Machine learning
Three machine learning models, namely kernel support vector machines (SVM) [22, 26], multilayer perceptrons (MLP) [23, 27-29] and random forests [30], were tested for their performance in predicting physical states from PEIs. Multi-class prediction of the healthy state and each of the 34 unhealthy states with the MLP neural network algorithm did not work well. The inventors further attempted to distinguish the healthy state from each unhealthy state by binary classification, but the F1 value for each outcome was very close to zero. With the SVM algorithm for multi-class prediction, the highest F1 value, for cholecystolithiasis, was 0.70, while the F1 value for most other types of disease was 0.00. The inventors also tried binary classification with SVM, but all the results were relatively poor. With the random forest algorithm for multi-class prediction (healthy and each of 34 unhealthy conditions), the F1 value for the healthy condition reached 0.80-0.90, but the F1 value for unhealthy states was about 0.00-0.40. The inventors therefore selected the random forest algorithm and optimized it further. First, because the numbers of samples for healthy and unhealthy states are unevenly distributed, and in view of the law of large numbers [30], the inventors adopted a downsampling strategy with randomly drawn samples: since the data are severely class-imbalanced, a random undersampling method is employed that balances the data by randomly selecting a subset of the data of a target class. Second, the inventors used a PEI feature extraction strategy to extract the PEIs contributing most to each pair of healthy and unhealthy conditions. Feature extraction adopts the univariate statistics strategy of automatic feature selection. Univariate statistics selects the features with high confidence based on the statistical significance of the relationship between each feature and the target. This can be done using the feature_selection module in scikit-learn.
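The two optimization steps described above, random undersampling and univariate feature selection, can be sketched as follows. The text names only the scikit-learn `feature_selection` module; the specific choice of `SelectPercentile` with the `f_classif` score, the helper `random_undersample`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif


def random_undersample(X, y, seed=0):
    """Balance classes by randomly keeping as many samples of each class
    as there are in the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Synthetic, imbalanced data: 2000 "healthy" (0) and 100 "unhealthy" (1) samples.
rng = np.random.default_rng(42)
y = np.array([0] * 2000 + [1] * 100)
X = rng.normal(size=(y.size, 50))
X[y == 1, :3] += 2.0  # only the first three features carry signal

X_bal, y_bal = random_undersample(X, y)

# Keep the top 16% of features ranked by a univariate ANOVA F-test.
selector = SelectPercentile(score_func=f_classif, percentile=16)
X_sel = selector.fit_transform(X_bal, y_bal)
selected = np.flatnonzero(selector.get_support())
print(X_bal.shape, X_sel.shape, selected)
```

Because the selection scores each feature against the target independently, it can run before any classifier is chosen.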
Finally, the top 15% or 16% most representative PEIs were extracted by feature extraction for prediction in each pair of healthy and unhealthy states. The advantage of this approach is that it is usually very fast and completely independent of the model applied after feature selection. The data were then randomly partitioned: 30% formed the test set, and the remaining 70% were again randomly partitioned, with 70% as the training set for training the model and 30% as the validation set for evaluating the model. To improve the generalization performance of the model by tuning parameters, cross-validation with grid search was adopted, which can be implemented with GridSearchCV provided by scikit-learn.
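The data partitioning and grid-search cross-validation described above might look like the following sketch. The parameter grid, the number of folds, the `roc_auc` scoring choice, the stratified splits and the synthetic data are all assumptions for illustration; the text specifies only the 30%/70% partitioning and the use of GridSearchCV with a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, balanced binary data standing in for selected PEI features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# 30% test set; the remaining 70% is split again into 70% training / 30% validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.30, stratify=y_rest, random_state=0)

# Hypothetical parameter grid, tuned with 5-fold cross-validation on the training set.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the validation and test sets.
val_auc = roc_auc_score(y_val, search.predict_proba(X_val)[:, 1])
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, val_auc, test_auc)
```

Keeping the test set out of the grid search means the final AUC is measured on data that played no part in parameter tuning.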
The present invention is capable of other embodiments, and various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention.
Claims (9)
1. A method for predicting the health state through physical examination indexes based on machine learning, characterized by comprising the following step: predicting the health state of sampled samples from physical examination indexes by adopting a random forest algorithm.
2. The machine learning-based method of predicting health status via physical examination indicators of claim 1, wherein: for healthy and unhealthy state samples, the sampled samples are obtained by applying a down-sampling strategy to randomly drawn samples, and the data are balanced by a random under-sampling method that randomly selects a data subset of a target class.
3. The machine learning based method of predicting health status by physical examination indicators of claim 2, wherein: the physical examination indexes which contribute most to each healthy and unhealthy state are extracted by adopting a physical examination index feature extraction strategy.
4. The machine learning based method of predicting health status by physical examination indicators of claim 3, wherein: the physical examination index feature extraction strategy is a univariate statistics strategy that carries out automatic feature selection by using feature_selection in scikit-learn.
5. The machine learning-based method of predicting health status via physical examination indicators of claim 4, wherein: under each healthy and unhealthy state, the first 15% or 16% of the representative physical examination indexes are extracted through feature extraction for prediction.
6. The machine learning-based method of predicting health status via physical examination indicators of claim 5, wherein: the representative physical examination indexes are 30.
7. The machine learning-based method of predicting health status via physical examination indicators of claim 1, wherein: the random forest algorithm model is established by randomly grouping data, wherein 30% of the data form a test set, and the remaining 70% of the data are randomly grouped again, wherein 70% of the data are used as a training set for training the model, and 30% of the data are used as a validation set for evaluating the model.
8. The machine learning-based method of predicting health status via physical examination indicators of claim 7, wherein: in the process of improving the generalization performance of the model by adjusting the parameters, a cross validation method of grid search is adopted.
9. The machine learning-based method of predicting health status via physical examination indicators of claim 8, wherein: the cross-validation method is implemented using GridSearchCV provided by scikit-learn.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911151420 | 2019-11-21 | ||
CN2019111514209 | 2019-11-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112825275A true CN112825275A (en) | 2021-05-21 |
Family
ID=75906556
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011311946.1A Pending CN112825275A (en) | 2019-11-21 | 2020-11-20 | Method for predicting health state through physical examination indexes based on machine learning |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112825275A (en) |
WO (1) | WO2021098842A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109671507A (en) * | 2018-12-24 | 2019-04-23 | 万达信息股份有限公司 | A kind of obstetrics' disease that calls for specialized treatment coupling index method for digging based on Electronic Health Record |
CN109785976A (en) * | 2018-12-11 | 2019-05-21 | 青岛中科慧康科技有限公司 | A kind of goat based on Soft-Voting forecasting system by stages |
CN110097975A (en) * | 2019-04-28 | 2019-08-06 | 湖南省蓝蜻蜓网络科技有限公司 | A kind of nosocomial infection intelligent diagnosing method and system based on multi-model fusion |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5438983A (en) * | 1993-09-13 | 1995-08-08 | Hewlett-Packard Company | Patient alarm detection using trend vector analysis |
US8377031B2 (en) * | 2007-10-23 | 2013-02-19 | Abbott Diabetes Care Inc. | Closed loop control system with safety parameters and methods |
CN107194138B (en) * | 2016-01-31 | 2023-05-16 | 北京万灵盘古科技有限公司 | Fasting blood glucose prediction method based on physical examination data modeling |
CN107403072A (en) * | 2017-08-07 | 2017-11-28 | 北京工业大学 | A kind of diabetes B prediction and warning method based on machine learning |
US20190108912A1 (en) * | 2017-10-05 | 2019-04-11 | Iquity, Inc. | Methods for predicting or detecting disease |
CN109119130A (en) * | 2018-07-11 | 2019-01-01 | 上海夏先机电科技发展有限公司 | A kind of big data based on cloud computing is health management system arranged and method |
CN109119167B (en) * | 2018-07-11 | 2020-11-20 | 山东师范大学 | Sepsis mortality prediction system based on integrated model |
CN109378072A (en) * | 2018-10-13 | 2019-02-22 | 中山大学 | A kind of abnormal fasting blood sugar method for early warning based on integrated study Fusion Model |
CN110299205A (en) * | 2019-07-23 | 2019-10-01 | 上海图灵医疗科技有限公司 | Biomedicine signals characteristic processing and evaluating method, device and application based on artificial intelligence |
-
2020
- 2020-11-20 CN CN202011311946.1A patent/CN112825275A/en active Pending
- 2020-11-20 WO PCT/CN2020/130585 patent/WO2021098842A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2021098842A1 (en) | 2021-05-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210521 |