CN116313086A - Sub-health prediction model construction method, device, equipment and storage medium - Google Patents

Info

Publication number
CN116313086A
CN116313086A CN202310220390.2A CN202310220390A CN116313086A CN 116313086 A CN116313086 A CN 116313086A CN 202310220390 A CN202310220390 A CN 202310220390A CN 116313086 A CN116313086 A CN 116313086A
Authority
CN
China
Prior art keywords
sub
health
initial
random forest
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310220390.2A
Other languages
Chinese (zh)
Inventor
党晓兵
杨志敏
邹佩芸
杨小波
黄鹂
原嘉民
陈贤帅
杜如虚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Jianchi Biotechnology Co ltd
Original Assignee
Guangdong Jianchi Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Jianchi Biotechnology Co ltd
Priority to CN202310220390.2A
Publication of CN116313086A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a sub-health prediction model construction method, device, equipment and storage medium. The method comprises the following steps: acquiring initial sub-health sample data; preprocessing and labeling the initial sub-health sample data to obtain training data; training and optimizing a prediction model with the training data according to a random forest algorithm to obtain an initial random forest model; performing feature selection on the initial random forest model and selecting an optimal feature variable combination; and performing random forest modeling and optimization again on the optimal feature variable combination to obtain a sub-health prediction model. The invention ensures the accuracy and rationality of the training data, prevents objective sample data from being influenced by subjective judgment, improves the accuracy of sub-health prediction through the objective judgment of the model, and reduces the complexity of sub-health prediction and assessment.

Description

Sub-health prediction model construction method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of medical treatment, in particular to a sub-health prediction model construction method, a device, equipment and a storage medium.
Background
Sub-health refers to a state in which the human body is between health and disease. As social pressure increases, sub-health has become a serious problem affecting people's lives. However, sub-health and disease are judged by different methods: a person in a sub-health state does not meet the standard of health and shows reduced vitality, reduced function and reduced adaptability over a certain period of time, yet does not meet the clinical or sub-clinical diagnostic criteria for disease in modern medicine.
In the prior art, sub-health is judged with existing medical diagnostic methods, such as medical history taking, evaluation of neuropsychiatric condition and overall function, and imaging and laboratory examination. Following the comprehensive sub-health evaluation procedure, once diseases that modern medicine can explain have been excluded, a patient who has had medically unexplained symptoms for 3 months or more can be judged to be sub-healthy. The main disadvantages of the prior art are: (1) the criteria for judging sub-health mainly follow the criteria for judging disease and combine several means such as medical imaging and laboratory examination, so the evaluation is complex; (2) there is no unified method for predicting sub-health, and the result is strongly influenced by the subjective judgment of doctors.
Disclosure of Invention
The invention provides a sub-health prediction model construction method, device, equipment and storage medium, to solve the technical problems in the prior art that sub-health prediction has low accuracy and is strongly influenced by subjective judgment.
In order to solve the above technical problems, an embodiment of the present invention provides a method for constructing a sub-health prediction model, including:
acquiring initial sub-health sample data;
preprocessing and labeling the initial sub-health sample data to obtain training data;
according to a random forest algorithm, training and optimizing a prediction model by utilizing the training data to obtain an initial random forest model;
performing feature selection on the initial random forest model, and selecting an optimal feature variable combination;
and carrying out random forest modeling and optimization again on the optimal characteristic variable combination to obtain a sub-health prediction model.
It can be appreciated that, compared with the prior art, the method obtains training data by preprocessing and labeling the acquired initial sub-health sample data, which ensures the accuracy and rationality of the training data. An initial random forest model is obtained by training and optimizing a prediction model with the training data, and feature selection on the initial random forest model yields an objective optimal feature variable combination, so that objective sample data is not influenced by subjective judgment. Random forest modeling and optimization are then performed again on the optimal feature variable combination to obtain the sub-health prediction model, which avoids the complex evaluation of the prior art that relies on several means such as medical imaging and laboratory examination.
As a preferred scheme, the preprocessing and labeling operations are performed on the initial sub-health sample data to obtain training data, specifically:
removing irrelevant data in the initial sub-health sample data, and performing numerical conversion on text data in the initial sub-health sample data after the removing operation;
filling the feature missing value and combining the feature variables of the initial sub-health sample data after the numerical conversion to obtain initial training data;
and performing secondary elimination on the initial training data, and labeling the initial training data subjected to secondary elimination, so as to obtain training data.
It can be understood that irrelevant data is removed from the initial sub-health sample data and the text data remaining after the removal is converted to numerical values, after which feature missing values are filled and feature variables are merged to obtain the initial training data. This prevents a large amount of irrelevant data from degrading the accuracy and lengthening the training time of subsequent model training, and avoids wasting computing resources. The secondary elimination and labeling of the initial training data then yield accurate training data, so that the model can be trained accurately.
As a preferred scheme, the labeling is performed on the initial training data after the secondary rejection, so as to obtain training data, which specifically includes:
classifying the initial training data after the secondary rejection into two types of sub-health and non-sub-health, and performing sub-health and non-sub-health labeling operation on the initial training data after the secondary rejection according to the classification result, thereby obtaining the training data.
It can be understood that the initial training data after the secondary elimination is classified into sub-health and non-sub-health and each of the two classes is labeled accordingly, which makes it convenient to divide the data into a training set and a test set in subsequent model training and improves the efficiency and accuracy of model training.
As a preferred scheme, the training data is utilized to perform prediction model training and optimization according to a random forest algorithm to obtain an initial random forest model, which specifically comprises the following steps:
splitting the training data into a training set and a testing set according to a preset proportion, and constructing a plurality of decision trees by using the training set according to a random forest algorithm;
verifying a plurality of decision trees according to the test set, thereby completing construction and training of a random forest model;
and searching for the maximum tree depth and the maximum number of features considered at each split of the random forest model through grid search and five-fold cross-validation, and performing model optimization to obtain an initial random forest model.
It can be understood that the training data is split into a training set and a test set, a plurality of decision trees is constructed from the training set according to the random forest algorithm, and the trees are verified on the test set, which yields a constructed and trained random forest model whose accuracy is checked against the test set. Grid search and five-fold cross-validation are then used to search for the maximum tree depth and the maximum number of features considered at each split, giving the optimal parameters and hence a random forest model that meets the standard, which ensures that the optimal feature variables are selected accurately.
As a preferred scheme, the feature selection is performed on the initial random forest model, and an optimal feature variable combination is selected, specifically:
and selecting a new feature set as the optimal feature variable combination through feature importance ranking.
It can be understood that ranking the features by importance and selecting a new feature set gives the optimal feature variable combination. The feature parameters that have the greatest influence on the random forest model can thus be selected accurately, which avoids the loss of efficiency and accuracy in model construction caused by excessive and complicated feature parameters.
As a preferred scheme, the optimal characteristic variable combination is subjected to random forest modeling and optimization again to obtain a sub-health prediction model, which is specifically as follows:
establishing and training a final random forest model according to the optimal characteristic variable combination;
and optimizing and cross-verifying the trained final random forest model until the final random forest model reaches a preset standard, and obtaining a sub-health prediction model.
It can be understood that the final random forest model is constructed and trained through the optimal characteristic variable combination, so that the final random forest model has stronger applicability compared with the previous random forest model, and the sub-health prediction model meeting the preset standard can be accurately obtained through optimizing and cross-verifying the trained final random forest model, and the model for accurately predicting the sub-health can be obtained without complex equipment or evaluation modes.
Preferably, the initial sub-health sample data comprises: personal basic information, health status information, lifestyle information, stress response information, and family history information.
Correspondingly, the invention also provides a sub-health prediction model construction device, which comprises: a data acquisition module, a data processing module, an initial modeling optimization module, a feature selection module and a final modeling optimization module;
the data acquisition module is used for acquiring initial sub-health sample data;
the data processing module is used for preprocessing and labeling the initial sub-health sample data to obtain training data;
the initial modeling optimization module is used for carrying out prediction model training and optimization by utilizing the training data according to a random forest algorithm to obtain an initial random forest model;
the feature selection module is used for carrying out feature selection on the initial random forest model and selecting an optimal feature variable combination;
and the final modeling optimization module is used for carrying out random forest modeling and optimization on the optimal characteristic variable combination again to obtain a sub-health prediction model.
As a preferred scheme, the preprocessing and labeling operations are performed on the initial sub-health sample data to obtain training data, specifically:
removing irrelevant data in the initial sub-health sample data, and performing numerical conversion on text data in the initial sub-health sample data after the removing operation;
filling the feature missing value and combining the feature variables of the initial sub-health sample data after the numerical conversion to obtain initial training data;
and performing secondary elimination on the initial training data, and labeling the initial training data subjected to secondary elimination, so as to obtain training data.
As a preferred scheme, the labeling is performed on the initial training data after the secondary rejection, so as to obtain training data, which specifically includes:
classifying the initial training data after the secondary rejection into two types of sub-health and non-sub-health, and performing sub-health and non-sub-health labeling operation on the initial training data after the secondary rejection according to the classification result, thereby obtaining the training data.
As a preferred scheme, the training data is utilized to perform prediction model training and optimization according to a random forest algorithm to obtain an initial random forest model, which specifically comprises the following steps:
splitting the training data into a training set and a testing set according to a preset proportion, and constructing a plurality of decision trees by using the training set according to a random forest algorithm;
verifying a plurality of decision trees according to the test set, thereby completing construction and training of a random forest model;
and searching for the maximum tree depth and the maximum number of features considered at each split of the random forest model through grid search and five-fold cross-validation, and performing model optimization to obtain an initial random forest model.
As a preferred scheme, the feature selection is performed on the initial random forest model, and an optimal feature variable combination is selected, specifically:
and selecting a new feature set as the optimal feature variable combination through feature importance ranking.
As a preferred scheme, the optimal characteristic variable combination is subjected to random forest modeling and optimization again to obtain a sub-health prediction model, which is specifically as follows:
establishing and training a final random forest model according to the optimal characteristic variable combination;
and optimizing and cross-verifying the trained final random forest model until the final random forest model reaches a preset standard, and obtaining a sub-health prediction model.
Preferably, the initial sub-health sample data comprises: personal basic information, health status information, lifestyle information, stress response information, and family history information.
Correspondingly, the invention also provides a terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor realizes the sub-health prediction model construction method when executing the computer program.
Accordingly, the present invention also provides a computer-readable storage medium including a stored computer program; wherein the computer program, when run, controls a device in which the computer readable storage medium resides to perform the sub-health prediction model construction method as described above.
Drawings
Fig. 1: a flow chart of the steps of the sub-health prediction model construction method provided by an embodiment of the invention;
Fig. 2: a detailed flow chart of the sub-health prediction model construction method provided by an embodiment of the invention;
Fig. 3: a schematic structural diagram of the sub-health prediction model construction device provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1 and 2, a sub-health prediction model construction method provided by an embodiment of the invention includes the following steps S101-S105:
S101: Acquire initial sub-health sample data.
As a preferred aspect of this embodiment, the initial sub-health sample data includes: personal basic information, health status information, lifestyle information, stress response information, and family history information.
It should be noted that the initial sub-health sample data may be obtained from a sub-health database, and its influencing features may be grouped into personal basic information, health status information, lifestyle information, stress response information, family medical history information and the like. The personal basic information includes gender, age, height, weight, BMI (Body Mass Index), occupation, education level, job title, marital status, economic status and the like. The health status information includes sub-health, non-sub-health and the presence of other diseases. The lifestyle information includes living and working conditions, physical exercise, smoking, drinking habits, dietary taste, living and working environment, childbirth history and the like. The stress response information includes fatigue stress, economic stress, personal stress, disease stress, stress from changes in the external environment, stress from medical events, no stress event and the like. The family medical history information includes diseases of lineal relatives, allergy history and the like.
S102: Preprocess and label the initial sub-health sample data to obtain training data.
As a preferred scheme, the preprocessing and labeling operations are performed on the initial sub-health sample data to obtain training data, specifically:
removing irrelevant data in the initial sub-health sample data, and performing numerical conversion on text data in the initial sub-health sample data after the removing operation; filling the feature missing value and combining the feature variables of the initial sub-health sample data after the numerical conversion to obtain initial training data; and performing secondary elimination on the initial training data, and labeling the initial training data subjected to secondary elimination, so as to obtain training data.
It should be noted that removing irrelevant data from the initial sub-health sample data mainly means removing samples whose health status information records other diseases. Other diseases may produce features similar to those of sub-health, and leaving such samples in would reduce the accuracy of the model constructed later.
Further, the personal basic information, health status information, lifestyle information, stress response information and family medical history information in the initial sub-health sample data are essentially all text data. To facilitate model training, the state of each feature item is represented by a numerical value and qualitative features are encoded, which reduces prediction errors caused by data redundancy. For example, the work-and-rest routine has three levels, basically regular, frequently irregular and day-night reversed, which are represented by 0, 1 and 2 respectively. Continuous variables are segmented; for example, the values of features such as age, height, weight and BMI are divided into segments.
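The text-to-number conversion and the segmentation of continuous variables described above can be sketched as follows. This is a minimal illustration only: the English level names, the function names and the age cut points are assumptions made for this sketch, since the embodiment gives the codes 0, 1 and 2 but no segment boundaries.

```python
from bisect import bisect_right

# Codes 0/1/2 follow the example in the text; the English level names are
# paraphrased and the dictionary itself is a hypothetical helper.
ROUTINE_CODES = {
    "basically regular": 0,
    "often irregular": 1,
    "day-night reversed": 2,
}

def encode_routine(level):
    """Map a work-and-rest routine level to its integer code."""
    return ROUTINE_CODES[level]

def segment(value, edges):
    """Return the index of the segment that `value` falls into.

    `edges` is an ascending list of cut points. Values below edges[0] fall
    into segment 0, values in [edges[i-1], edges[i]) into segment i, and
    values at or above the last edge into segment len(edges).
    """
    return bisect_right(edges, value)

# Hypothetical age cut points, chosen only to show the mechanics.
AGE_EDGES = [30, 45, 60]
age_segments = [segment(age, AGE_EDGES) for age in (22, 30, 44, 61)]
```

Integer segment indices keep the ordering of the original values, which is usually enough for tree-based models; other encodings are possible and the text does not prescribe one.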
It should be noted that after the text-to-numerical conversion some feature variables may contain missing values, so the feature missing values in the converted initial sub-health sample data need to be filled. In this embodiment, a feature variable whose missing rate is greater than 15% is deleted; for a feature whose missing rate is less than 15%, the missing data is supplemented by a neighboring value filling method to ensure the integrity of the data.
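A minimal sketch of this missing-value rule is given below. It assumes that "neighboring value filling" means copying the nearest preceding record's value (falling back to the nearest following one at the start of a column); the text does not define the method further, so this reading, the `None` marker for missing values and all function names are assumptions.

```python
def missing_rate(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)

def fill_from_neighbors(column):
    """Fill gaps with the previous known value, else the next known value."""
    filled = list(column)
    last = None
    for i, value in enumerate(filled):          # forward pass
        if value is None:
            if last is not None:
                filled[i] = last
        else:
            last = value
    following = None
    for i in range(len(filled) - 1, -1, -1):    # backward pass for leading gaps
        if filled[i] is None:
            if following is not None:
                filled[i] = following
        else:
            following = filled[i]
    return filled

def clean_features(table, max_missing=0.15):
    """Drop features whose missing rate exceeds 15%, fill the others.

    `table` maps a feature name to its list of values.
    """
    cleaned = {}
    for name, column in table.items():
        if missing_rate(column) > max_missing:
            continue                            # feature is deleted
        cleaned[name] = fill_from_neighbors(column)
    return cleaned
```

A feature whose missing rate is exactly 15% is kept and filled here; the embodiment does not say how that boundary case is treated.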
In this embodiment, the features are also classified, and feature variables whose types obviously overlap are marked and merged. For example, four feature variables describe the living and working condition: the living and working condition itself, living and working generally normal, living and working too relaxed, and living and working busy. The first variable already summarizes the other three, so only the living and working condition variable is retained and the other three are merged into it, which improves the accuracy of the training data.
In this embodiment, the secondary elimination of the initial training data consists, for example, of searching for multicollinear feature variables among all variables other than the label item (health status information), because excessively high correlation between feature variables may affect the accuracy of the prediction result; multicollinearity analysis is preferably used. Feature variables that are strongly correlated with one another are selected and a correlation analysis is performed between them and the label feature; the variables with lower correlation to the label feature are deleted and those with higher correlation are retained.
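One way to picture this secondary elimination is a pairwise correlation filter, sketched below. The use of the Pearson coefficient, the 0.8 threshold and the function names are assumptions for illustration; the embodiment only says that strongly inter-correlated variables are compared against the label and the weaker one is deleted.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # constant column: treat as uncorrelated
    return cov / (norm_x * norm_y)

def drop_collinear(features, label, threshold=0.8):
    """For each strongly correlated feature pair, keep the one that is
    more correlated with the label and delete the other.

    `features` maps a feature name to its list of numeric values.
    """
    kept = dict(features)
    names = list(features)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first not in kept or second not in kept:
                continue                # one of the pair was already deleted
            if abs(pearson(features[first], features[second])) > threshold:
                r_first = abs(pearson(features[first], label))
                r_second = abs(pearson(features[second], label))
                del kept[first if r_first < r_second else second]
    return kept
```

Variance inflation factors are another common multicollinearity check; the pairwise filter is used here only because it maps directly onto the wording of the paragraph.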
It can be understood that irrelevant data is removed from the initial sub-health sample data and the text data remaining after the removal is converted to numerical values, after which feature missing values are filled and feature variables are merged to obtain the initial training data. This prevents a large amount of irrelevant data from degrading the accuracy and lengthening the training time of subsequent model training, and avoids wasting computing resources. The secondary elimination and labeling of the initial training data then yield accurate training data, so that the model can be trained accurately.
As a preferred scheme of this embodiment, the labeling is performed on the initial training data after the second culling, so as to obtain training data, which specifically includes:
classifying the initial training data after the secondary rejection into two types of sub-health and non-sub-health, and performing sub-health and non-sub-health labeling operation on the initial training data after the secondary rejection according to the classification result, thereby obtaining the training data.
In this embodiment, the data are illustratively labeled, and are classified into sub-health and non-sub-health, and sub-health and non-sub-health are respectively labeled with '1' and '0' in the data.
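The labeling step amounts to a two-class mapping, which can be sketched as below. The status strings and the function name are hypothetical; only the label values '1' and '0' come from the embodiment.

```python
# Label values follow the embodiment: 1 = sub-health, 0 = non-sub-health.
LABELS = {"sub-health": 1, "non-sub-health": 0}

def label_records(statuses):
    """Convert a list of health-status strings into 0/1 labels."""
    return [LABELS[status] for status in statuses]

def class_counts(labels):
    """Count how many records carry each label, to check class balance."""
    return {value: labels.count(value) for value in (0, 1)}
```

Checking the class counts after labeling is a cheap way to notice an imbalance between the two classes before any model is trained.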
It can be understood that the initial training data after the secondary elimination is classified into sub-health and non-sub-health and each of the two classes is labeled accordingly, which makes it convenient to divide the data into a training set and a test set in subsequent model training and improves the efficiency and accuracy of model training.
S103: According to a random forest algorithm, train and optimize a prediction model using the training data to obtain an initial random forest model.
As a preferred solution of this embodiment, the training data is used to perform predictive model training and optimization according to a random forest algorithm to obtain an initial random forest model, which specifically includes:
splitting the training data into a training set and a test set according to a preset proportion, and constructing a plurality of decision trees from the training set according to a random forest algorithm; verifying the plurality of decision trees on the test set, thereby completing construction and training of a random forest model; and searching for the maximum tree depth and the maximum number of features considered at each split of the random forest model through grid search and five-fold cross-validation, and performing model optimization to obtain an initial random forest model.
In this embodiment, the training data is analyzed with a random forest algorithm. Illustratively, there are 5230 sub-health cases and 2002 non-sub-health cases; the health status (0: non-sub-health, 1: sub-health) is set as the label y, the remaining feature items are set as the feature variables x, and the data is divided at a 7:3 ratio into a training set (xtrain, ytrain) and a test set (xtest, ytest). The data is then resampled and used to build a plurality of decision trees. First, samples are drawn with replacement from the original data set to form a plurality of sub-data sets. Second, a sub-decision tree is constructed from each sub-data set, and each sub-decision tree outputs a result. Finally, when new data is to be classified by the random forest, the outputs of the sub-decision trees are put to a vote, and the voting result is the random forest prediction. For example, if more than 50% of the decision trees classify a sample as non-sub-health and fewer than 50% classify it as sub-health, the random forest classifies it as non-sub-health; otherwise it is classified as sub-health. Further, grid search and five-fold cross-validation are used to search for the maximum tree depth and the maximum number of features considered at each split of the random forest model, and model optimization with the resulting optimal parameters yields the optimized initial random forest model.
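The three mechanical pieces of this paragraph, the 7:3 split, sampling with replacement and the vote over tree outputs, can be sketched as follows. The decision trees themselves are not implemented here; the function names, the seeds and the shuffling before the split are assumptions of this sketch.

```python
import random

def split_train_test(x, y, train_ratio=0.7, seed=0):
    """Shuffle the records and split them 7:3 into training and test sets."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_ratio)
    train, test = indices[:cut], indices[cut:]
    xtrain, ytrain = [x[i] for i in train], [y[i] for i in train]
    xtest, ytest = [x[i] for i in test], [y[i] for i in test]
    return xtrain, ytrain, xtest, ytest

def bootstrap_sample(x, y, rng):
    """Draw len(x) records with replacement to form one sub-data set."""
    picks = [rng.randrange(len(x)) for _ in range(len(x))]
    return [x[i] for i in picks], [y[i] for i in picks]

def majority_vote(tree_outputs):
    """Combine per-tree outputs (0 = non-sub-health, 1 = sub-health).

    Non-sub-health wins only with more than 50% of the votes;
    otherwise the forest predicts sub-health, as in the text.
    """
    zeros = tree_outputs.count(0)
    return 0 if zeros * 2 > len(tree_outputs) else 1
```

Note that under the rule stated in the text an exact tie goes to the sub-health class, which the last function reproduces.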
It can be understood that the training data is split into a training set and a test set, a plurality of decision trees is constructed from the training set according to the random forest algorithm, and the trees are verified on the test set, which yields a constructed and trained random forest model whose accuracy is checked against the test set. Grid search and five-fold cross-validation are then used to search for the maximum tree depth and the maximum number of features considered at each split, giving the optimal parameters and hence a random forest model that meets the standard, which ensures that the feature variables are selected accurately.
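The combination of grid search and five-fold cross-validation can be sketched independently of any particular model. In the sketch below the two searched hyperparameters are read as tree depth and features per split and are named `max_depth` and `max_features`; these names, the candidate values and the `fit_and_score` callback are assumptions, and in practice a library such as scikit-learn (`GridSearchCV` with `cv=5` around a `RandomForestClassifier`) would typically be used instead.

```python
from itertools import product

def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def grid_search(param_grid, n_samples, fit_and_score, k=5):
    """Try every parameter combination and keep the best mean fold score.

    `fit_and_score(params, train_idx, val_idx)` is a caller-supplied
    function that trains a model on the training indices and returns its
    score on the validation indices (higher is better).
    """
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, validation)
                  for train, validation in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Hypothetical candidate values; the embodiment does not list its grid.
PARAM_GRID = {"max_depth": [3, 5, 7], "max_features": [2, 3, 4]}
```

Each candidate is scored on five held-out folds and averaged, so the chosen parameters do not depend on a single lucky split of the training data.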
S104: Perform feature selection on the random forest model and select the optimal feature variable combination.
As a preferred scheme, the feature selection is performed on the random forest model, and an optimal feature variable combination is selected, specifically:
and selecting a new feature set as the optimal feature variable combination through feature importance ranking.
In the present embodiment, the optimal feature variable combination consists of the 12 most important features: industry/occupation, age, job title, number of births, physical exercise, no disease among lineal relatives, height segment, no stress event, weight segment, highest education level segment, gender and BMI segment.
It can be understood that ranking the features by importance and selecting a new feature set gives the optimal feature variable combination. The feature parameters that have the greatest influence on the random forest model can thus be selected accurately, which avoids the loss of efficiency and accuracy in model construction caused by excessive and complicated feature parameters.
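Selecting the top-ranked features can be sketched as a sort over importance scores. The function name and the example scores below are hypothetical and are not values from the embodiment; with scikit-learn, the scores would typically come from the fitted model's `feature_importances_` attribute.

```python
def select_top_features(importances, k=12):
    """Return the names of the k features with the highest importance.

    `importances` maps a feature name to its importance score.
    """
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Made-up scores, used only to show the call.
example_importances = {"age": 0.21, "gender": 0.04, "occupation": 0.30, "BMI segment": 0.02}
top_two = select_top_features(example_importances, k=2)
```

The cut-off of 12 features is the value used in this embodiment; the text does not say how that number was chosen, so `k` is left as a parameter.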
S105: Perform random forest modeling and optimization again on the optimal feature variable combination to obtain the sub-health prediction model.
As a preferred scheme, the optimal characteristic variable combination is subjected to random forest modeling and optimization again to obtain a sub-health prediction model, which is specifically as follows:
establishing and training a final random forest model according to the optimal characteristic variable combination; and optimizing and cross-verifying the trained final random forest model until the final random forest model reaches a preset standard, and obtaining a sub-health prediction model.
It should be noted that, in this embodiment, according to the selected optimal feature variable combination, that is, 12 features, random forest modeling and optimization are performed, a final random forest model is built and trained, and the trained final random forest model is optimized and cross-validated by re-adopting the optimization model method in step S103, so as to obtain a final sub-health prediction model.
Further, deployment of the final sub-health predictive model can provide sub-health analytical assessment for users of the system. Meanwhile, after the final sub-health prediction model is obtained, the accuracy of the sub-health prediction model can be further verified by inputting sample data for sub-health evaluation test.
It can be understood that, because the final random forest model is constructed and trained on the optimal feature variable combination, it has stronger applicability than the previous random forest model. By optimizing and cross-validating the trained final random forest model, a sub-health prediction model that meets the preset standard can be obtained accurately, so that a model for accurately predicting sub-health is obtained without complex equipment or evaluation methods.
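As a rough sketch of how a trained model might be checked against a preset standard using test samples, the following computes accuracy, precision, and recall and compares them with minimum values. The text does not state which metrics or thresholds make up the preset standard, so the metric choice, the 0.7 thresholds, and the sample predictions here are all assumptions.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = sub-health)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def meets_standard(metrics, standard):
    """Return True when every metric reaches its required minimum."""
    return all(metrics[name] >= minimum for name, minimum in standard.items())


# Hypothetical test labels and model predictions.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

metrics = evaluate(y_true, y_pred)
# Hypothetical preset standard: minimum accuracy and recall.
accepted = meets_standard(metrics, {"accuracy": 0.7, "recall": 0.7})
```

If the check fails, the optimization and cross-validation step would be repeated with different settings.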
The implementation of the above embodiment has the following effects:
Compared with the prior art, the embodiment of the invention obtains training data by preprocessing and labeling the acquired initial sub-health sample data, which ensures the accuracy and rationality of the training data. The constructed random forest model is trained with the training data, and the resulting model is then optimized and its parameters are tuned. Feature selection is then performed to obtain an objective optimal feature variable combination, which prevents objective sample data from being influenced by subjective judgment. Finally, random forest modeling and optimization are performed again on the optimal feature variable combination, so that a sub-health prediction model is obtained accurately and the complex evaluation procedures of the prior art, which rely on means such as medical imaging and laboratory examination, are avoided.
Example II
Referring to fig. 3, the present invention provides a sub-health prediction model construction device, which includes: a data acquisition module 201, a data processing module 202, an initial modeling optimization module 203, a feature selection module 204, and a final modeling optimization module 205.
The data acquisition module 201 is configured to acquire initial sub-health sample data.
The data processing module 202 is configured to perform preprocessing and labeling operations on the initial sub-health sample data to obtain training data.
The initial modeling optimization module 203 is configured to perform prediction model training and optimization by using the training data according to a random forest algorithm, so as to obtain an initial random forest model.
The feature selection module 204 is configured to perform feature selection on the initial random forest model, and select an optimal feature variable combination.
The final modeling optimization module 205 is configured to perform random forest modeling and optimization again on the optimal feature variable combination to obtain a sub-health prediction model.
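One way to picture how the five modules cooperate is as a pipeline of callables, one per module. The sketch below is only illustrative: the class name `SubHealthModelBuilder`, its field names, and the string-returning stub functions are hypothetical and stand in for real implementations.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SubHealthModelBuilder:
    acquire: Callable[[], Any]              # data acquisition module 201
    process: Callable[[Any], Any]           # data processing module 202
    initial_model: Callable[[Any], Any]     # initial modeling optimization module 203
    select_features: Callable[[Any], Any]   # feature selection module 204
    final_model: Callable[[Any, Any], Any]  # final modeling optimization module 205

    def build(self) -> Any:
        """Run the modules in order and return the final prediction model."""
        samples = self.acquire()
        training_data = self.process(samples)
        initial = self.initial_model(training_data)
        features = self.select_features(initial)
        return self.final_model(training_data, features)


# Stub modules that only record what they were given.
builder = SubHealthModelBuilder(
    acquire=lambda: "samples",
    process=lambda s: f"processed({s})",
    initial_model=lambda d: f"initial_rf({d})",
    select_features=lambda m: f"top_features({m})",
    final_model=lambda d, f: f"final_rf({d}, {f})",
)
result = builder.build()
```

Keeping each module as a separate callable mirrors the division into modules 201 to 205 and lets any one of them be replaced independently.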
As a preferred scheme, the preprocessing and labeling operations are performed on the initial sub-health sample data to obtain training data, specifically:
removing irrelevant data from the initial sub-health sample data, and converting the text data in the remaining initial sub-health sample data into numerical values; filling missing feature values and combining feature variables in the numerically converted initial sub-health sample data to obtain initial training data; and performing secondary elimination on the initial training data and labeling the initial training data after the secondary elimination, thereby obtaining the training data.
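The preprocessing steps listed above (removal of irrelevant data, numerical conversion of text, filling of missing values, and combination of feature variables) might look like the sketch below. The field names, the text-to-number code tables, the median fill strategy, and the BMI cut-offs are all assumptions chosen for illustration; the secondary elimination and labeling steps are not shown.

```python
from statistics import median

IRRELEVANT_FIELDS = {"name"}  # hypothetical fields with no predictive value
TEXT_CODES = {                # hypothetical text-to-number mappings
    "gender": {"male": 0, "female": 1},
    "exercise": {"never": 0, "sometimes": 1, "often": 2},
}


def preprocess(records):
    # Steps 1 and 2: drop irrelevant fields and convert text values to numbers.
    rows = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in IRRELEVANT_FIELDS}
        for field, codes in TEXT_CODES.items():
            if row.get(field) is not None:
                row[field] = codes[row[field]]
        rows.append(row)
    # Step 3: fill missing values with the median of the known values.
    for field in sorted({k for row in rows for k in row}):
        known = [row[field] for row in rows if row.get(field) is not None]
        if not known:
            continue
        fill = median(known)
        for row in rows:
            if row.get(field) is None:
                row[field] = fill
    # Step 4: combine height and weight into a single BMI segment feature.
    for row in rows:
        bmi = row["weight_kg"] / (row["height_cm"] / 100) ** 2
        row["bmi_segment"] = 0 if bmi < 18.5 else 1 if bmi < 24 else 2
    return rows


records = [
    {"name": "A", "gender": "male", "exercise": "often", "height_cm": 175, "weight_kg": 70, "age": 30},
    {"name": "B", "gender": "female", "exercise": None, "height_cm": 160, "weight_kg": None, "age": 45},
    {"name": "C", "gender": "female", "exercise": "never", "height_cm": 165, "weight_kg": 60, "age": None},
]
initial_training_data = preprocess(records)
```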
As a preferred scheme, labeling the initial training data after the secondary elimination to obtain the training data is specifically:
classifying the initial training data after the secondary elimination into two classes, sub-health and non-sub-health, and labeling each item of that data as sub-health or non-sub-health according to the classification result, thereby obtaining the training data.
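The binary labeling step can be sketched as below. The passage does not state the criterion that separates sub-health from non-sub-health, so the rule used here (a hypothetical `symptom_score` field with a threshold of 3) is purely a placeholder, as are the function names.

```python
def label_records(rows, is_sub_health):
    """Attach a binary label to each sample: 1 = sub-health, 0 = non-sub-health."""
    return [{**row, "label": 1 if is_sub_health(row) else 0} for row in rows]


def symptom_rule(row):
    # Hypothetical classification rule; the real criterion is not given here.
    return row["symptom_score"] >= 3


rows = [
    {"id": 1, "symptom_score": 5},
    {"id": 2, "symptom_score": 1},
    {"id": 3, "symptom_score": 3},
]
training_data = label_records(rows, symptom_rule)
labels = [row["label"] for row in training_data]
```

Passing the rule in as a function keeps the labeling code independent of whichever classification criterion is actually used.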
As a preferred scheme, the training data is utilized to perform prediction model training and optimization according to a random forest algorithm to obtain an initial random forest model, which specifically comprises the following steps:
splitting the training data into a training set and a test set according to a preset proportion, and constructing a plurality of decision trees from the training set according to a random forest algorithm; verifying the plurality of decision trees against the test set, thereby completing construction and training of a random forest model; and searching, through grid search and five-fold cross-validation, for the maximum tree body of the random forest model and the maximum number of features for each branch, and performing model optimization to obtain an initial random forest model.
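The split, forest construction, and grid search with five-fold cross-validation can be illustrated with a deliberately small, pure-Python stand-in. This is not the patented implementation: the "forest" below is a bagged ensemble of one-split decision stumps, the data are synthetic, the 80/20 split is an assumed proportion, and the hyperparameters `n_trees` and `max_features` only loosely correspond to the tree and per-branch feature settings named in the text.

```python
import random
from itertools import product


def split(X, y, test_ratio=0.2, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_test = round(len(order) * test_ratio)
    test, train = order[:n_test], order[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def fit_stump(X, y, feature_ids):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for left in (0, 1):
                errors = sum((left if row[f] <= t else 1 - left) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left)
    return best[1:]


def fit_forest(X, y, n_trees, max_features, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        feats = rng.sample(range(len(X[0])), min(max_features, len(X[0])))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps (ties go to class 1)."""
    votes = sum(left if row[f] <= t else 1 - left for f, t, left in forest)
    return 1 if 2 * votes >= len(forest) else 0


def accuracy(forest, X, y):
    return sum(predict(forest, row) == label for row, label in zip(X, y)) / len(X)


def cv_accuracy(X, y, params, k=5):
    """Mean accuracy over k folds; fold i holds samples i, i+k, i+2k, ..."""
    scores = []
    for i in range(k):
        held = set(range(i, len(X), k))
        train = [j for j in range(len(X)) if j not in held]
        forest = fit_forest([X[j] for j in train], [y[j] for j in train], **params)
        scores.append(accuracy(forest, [X[j] for j in held], [y[j] for j in held]))
    return sum(scores) / k


def grid_search(X, y, grid, k=5):
    """Try every parameter combination and keep the best cross-validated one."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic data: the label depends only on the first of three features.
rng = random.Random(42)
X = [[rng.random() for _ in range(3)] for _ in range(60)]
y = [1 if row[0] > 0.5 else 0 for row in X]

X_train, y_train, X_test, y_test = split(X, y)
best_params, best_score = grid_search(
    X_train, y_train, {"n_trees": [5, 15], "max_features": [1, 3]})
model = fit_forest(X_train, y_train, **best_params)
test_accuracy = accuracy(model, X_test, y_test)
```

In practice a library random forest with full decision trees would replace the stump ensemble; the structure of the search loop stays the same.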
As a preferred scheme, the feature selection is performed on the initial random forest model, and an optimal feature variable combination is selected, specifically:
ranking the features by importance and selecting a new feature set as the optimal feature variable combination.
As a preferred scheme, performing random forest modeling and optimization again on the optimal feature variable combination to obtain a sub-health prediction model is specifically:
establishing and training a final random forest model according to the optimal feature variable combination; and optimizing and cross-validating the trained final random forest model until it reaches a preset standard, thereby obtaining the sub-health prediction model.
Preferably, the initial sub-health sample data comprises: personal basic information, health status information, lifestyle information, stress response information, and family history information.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the apparatus described above, which is not described herein again.
The implementation of the embodiment of the invention has the following effects:
Compared with the prior art, the embodiment of the invention obtains training data by preprocessing and labeling the acquired initial sub-health sample data, which ensures the accuracy and rationality of the training data. The constructed random forest model is trained with the training data, and the resulting model is then optimized and its parameters are tuned. Feature selection is then performed to obtain an objective optimal feature variable combination, which prevents objective sample data from being influenced by subjective judgment. Finally, random forest modeling and optimization are performed again on the optimal feature variable combination, so that a sub-health prediction model is obtained accurately and the complex evaluation procedures of the prior art, which rely on means such as medical imaging and laboratory examination, are avoided.
Example III
Correspondingly, the invention also provides a terminal device, comprising: a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the sub-health prediction model construction method of any one of the embodiments above when the computer program is executed.
The terminal device of this embodiment includes: a processor, a memory, a computer program stored in the memory and executable on the processor, and computer instructions. The processor, when executing the computer program, implements the steps of the first embodiment described above, such as steps S101 to S105 shown in fig. 1. Alternatively, the processor, when executing the computer program, performs the functions of the modules/units of the apparatus embodiments described above, such as the data processing module 202.
For example, the computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specified functions, and the instruction segments describe the execution of the computer program in the terminal device. For example, the data processing module 202 is configured to perform preprocessing and labeling operations on the initial sub-health sample data to obtain training data.
The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a terminal device and does not constitute a limitation on it; the terminal device may include more or fewer components than illustrated, may combine some components, or may use different components. For example, the terminal device may further include an input-output device, a network access device, a bus, and the like.
The processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and connects the various parts of the entire terminal device using various interfaces and lines.
The memory may be used to store the computer program and/or the module, and the processor implements the various functions of the terminal device by running or executing the computer program and/or the module stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the terminal device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.
If the modules/units integrated in the terminal device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the method of the above embodiments by means of a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium, and when executed by a processor, it may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
Example IV
Correspondingly, the invention further provides a computer readable storage medium, which comprises a stored computer program, wherein the computer program is used for controlling equipment where the computer readable storage medium is located to execute the sub-health prediction model construction method according to any embodiment.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. A sub-health prediction model construction method, characterized by comprising the following steps:
acquiring initial sub-health sample data;
preprocessing and labeling the initial sub-health sample data to obtain training data;
according to a random forest algorithm, training and optimizing a prediction model by utilizing the training data to obtain an initial random forest model;
performing feature selection on the initial random forest model, and selecting an optimal feature variable combination;
and carrying out random forest modeling and optimization again on the optimal characteristic variable combination to obtain a sub-health prediction model.
2. The method for constructing a sub-health prediction model according to claim 1, wherein the preprocessing and labeling operations are performed on the initial sub-health sample data to obtain training data, specifically:
removing irrelevant data in the initial sub-health sample data, and performing numerical conversion on text data in the initial sub-health sample data after the removing operation;
filling the feature missing value and combining the feature variables of the initial sub-health sample data after the numerical conversion to obtain initial training data;
and performing secondary elimination on the initial training data, and labeling the initial training data subjected to secondary elimination, so as to obtain training data.
3. The method for constructing a sub-health prediction model according to claim 2, wherein labeling the initial training data after the secondary elimination to obtain the training data is specifically:
classifying the initial training data after the secondary elimination into two classes, sub-health and non-sub-health, and labeling each item of that data as sub-health or non-sub-health according to the classification result, thereby obtaining the training data.
4. The method for constructing a sub-health prediction model according to claim 1, wherein the training data is used for performing prediction model training and optimization according to a random forest algorithm to obtain an initial random forest model, specifically:
splitting the training data into a training set and a testing set according to a preset proportion, and constructing a plurality of decision trees by using the training set according to a random forest algorithm;
verifying a plurality of decision trees according to the test set, thereby completing construction and training of a random forest model;
and searching, through grid search and five-fold cross-validation, for the maximum tree body of the random forest model and the maximum number of features for each branch, and performing model optimization to obtain an initial random forest model.
5. The method for constructing a sub-health prediction model according to claim 1, wherein the feature selection is performed on the initial random forest model, and an optimal feature variable combination is selected, specifically:
and selecting a new feature set as an optimal feature variable combination through feature importance sequencing.
6. The method for constructing a sub-health prediction model according to claim 1, wherein performing random forest modeling and optimization again on the optimal feature variable combination to obtain a sub-health prediction model is specifically:
establishing and training a final random forest model according to the optimal feature variable combination;
and optimizing and cross-validating the trained final random forest model until the final random forest model reaches a preset standard, thereby obtaining the sub-health prediction model.
7. The method for constructing a sub-health prediction model according to any one of claims 1 to 6, wherein the initial sub-health sample data comprises: personal basic information, health status information, lifestyle information, stress response information, and family history information.
8. A sub-health prediction model construction apparatus, comprising: the system comprises a data acquisition module, a data processing module, an initial modeling optimization module, a feature selection module and a final modeling optimization module;
the data acquisition module is used for acquiring initial sub-health sample data;
the data processing module is used for preprocessing and labeling the initial sub-health sample data to obtain training data;
the initial modeling optimization module is used for carrying out prediction model training and optimization by utilizing the training data according to a random forest algorithm to obtain an initial random forest model;
the feature selection module is used for carrying out feature selection on the initial random forest model and selecting an optimal feature variable combination;
and the final modeling optimization module is used for carrying out random forest modeling and optimization on the optimal characteristic variable combination again to obtain a sub-health prediction model.
9. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the sub-health prediction model construction method according to any one of claims 1-7 when the computer program is executed.
10. A computer readable storage medium, wherein the computer readable storage medium comprises a stored computer program; wherein the computer program, when run, controls a device in which the computer readable storage medium is located to perform the sub-health prediction model construction method according to any one of claims 1-7.
CN202310220390.2A 2023-03-08 2023-03-08 Sub-health prediction model construction method, device, equipment and storage medium Pending CN116313086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310220390.2A CN116313086A (en) 2023-03-08 2023-03-08 Sub-health prediction model construction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310220390.2A CN116313086A (en) 2023-03-08 2023-03-08 Sub-health prediction model construction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116313086A true CN116313086A (en) 2023-06-23

Family

ID=86786446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310220390.2A Pending CN116313086A (en) 2023-03-08 2023-03-08 Sub-health prediction model construction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116313086A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117954101A (en) * 2024-03-27 2024-04-30 北方健康医疗大数据科技有限公司 Method and system for building lung cancer survival rate prediction model based on artificial intelligence
CN117954101B (en) * 2024-03-27 2024-07-05 北方健康医疗大数据科技有限公司 Method and system for building lung cancer survival rate prediction model based on artificial intelligence


Similar Documents

Publication Publication Date Title
US20220254493A1 (en) Chronic disease prediction system based on multi-task learning model
AU2012245343B2 (en) Predictive modeling
US9443002B1 (en) Dynamic data analysis and selection for determining outcomes associated with domain specific probabilistic data sets
CN107785057B (en) Medical data processing method, device, storage medium and computer equipment
CN110910982A (en) Self-coding model training method, device, equipment and storage medium
CN112017789B (en) Triage data processing method, triage data processing device, triage data processing equipment and triage data processing medium
CN111710420A (en) Complication morbidity risk prediction method, system, terminal and storage medium based on electronic medical record big data
CN108682009A (en) A kind of Alzheimer's disease prediction technique, device, equipment and medium
CN113159147A (en) Image identification method and device based on neural network and electronic equipment
US20220383661A1 (en) Method and device for retinal image recognition, electronic equipment, and storage medium
CN115050442B (en) Disease category data reporting method and device based on mining clustering algorithm and storage medium
CN113724847A (en) Medical resource allocation method, device, terminal equipment and medium based on artificial intelligence
CN111785366A (en) Method and device for determining patient treatment scheme and computer equipment
CN115730605B (en) Data analysis method based on multidimensional information
Popkes et al. Interpretable outcome prediction with sparse Bayesian neural networks in intensive care
CN112447270A (en) Medication recommendation method, device, equipment and storage medium
CN116864139A (en) Disease risk assessment method, device, computer equipment and readable storage medium
CN115101160A (en) Drug sales data mining and retrieving method and device
CN115438040A (en) Pathological archive information management method and system
CN113744845A (en) Medical image processing method, device, equipment and medium based on artificial intelligence
CN116705310A (en) Data set construction method, device, equipment and medium for perioperative risk assessment
CN116543911A (en) Disease risk prediction model training method and device
CN116350203A (en) Physical testing data processing method and system
CN116313086A (en) Sub-health prediction model construction method, device, equipment and storage medium
CN114822857A (en) Prediction method of repeat admission, computing device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination