CN110111888A

CN110111888A - XGBoost disease probability prediction method, system and storage medium

Info

Publication number: CN110111888A
Application number: CN201910411562.8A
Authority: CN
Inventors: 黄海涛; 郑早明; 肖俊; 许高峰; 王婧
Original assignee: Wenkang Group Co Ltd
Current assignee: Wenkang Group Co Ltd
Priority date: 2019-05-16
Filing date: 2019-05-16
Publication date: 2019-08-09

Abstract

The embodiment of the invention discloses an XGBoost disease probability prediction method, system and storage medium. The method comprises: acquiring original case data; performing 0-1 standardization on the original case data to obtain a sample data set, and splitting the sample data set into a training set and a test set; constructing an XGBoost multi-classification model and setting initial model parameters; training the XGBoost multi-classification model with the training set; testing the trained XGBoost multi-classification model with the test set and outputting the disease probability values corresponding to a plurality of diseases; and performing threshold selection on the disease probability values and outputting a disease probability prediction value. The method supports intelligent analysis and prediction on massive patient data, offers high accuracy and high timeliness, and is simple to operate and low in cost.

Description

XGBoost disease probability prediction method, system and storage medium

Technical Field

The embodiment of the invention relates to the technical field of disease prediction, and in particular to an XGBoost disease probability prediction method, system and storage medium.

Background

Medical resources are scarce for everyone. Patients hope to be treated by the best experts, whether their illness is serious or minor, so that it is diagnosed effectively. However, because the threshold for medical training is high, the cost of training a good doctor is very high, the number of professional medical staff is small, and medical resources are unevenly distributed. At present there is no accurate method for multi-class disease probability prediction. Existing disease prediction methods remain at the stage of predicting a disease from symptoms; they can cope with a small number of symptoms and diseases, but they cannot be used when faced with large data sets comprising hundreds of cases or more. A disease probability prediction method with high timeliness and high accuracy is therefore urgently needed to solve this problem.

Disclosure of Invention

Therefore, the embodiment of the invention provides an XGBoost disease probability prediction method, system and storage medium, so as to solve the problem that existing disease prediction methods cannot perform high-accuracy, high-timeliness prediction on massive case data.

In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:

According to a first aspect of the embodiments of the present invention, an XGBoost disease probability prediction method is provided, where the method includes:

acquiring original case data, wherein the original case data comprises a plurality of types of disease information and the symptom information corresponding to the disease information;

performing 0-1 standardization on the original case data to obtain a sample data set, and splitting the sample data set into a training set and a test set;

constructing an XGBoost multi-classification model, and setting initial model parameters;

training the XGBoost multi-classification model with the training set;

testing the trained XGBoost multi-classification model with the test set, and outputting the disease probability values corresponding to the plurality of diseases;

and performing threshold selection on the disease probability values, and outputting a disease probability prediction value.

Further, testing the trained XGBoost multi-classification model with the test set and outputting the disease probability values corresponding to the plurality of diseases further includes:

optimizing the XGBoost multi-classification model according to user requirements, and determining the optimal model parameters.

Further, performing 0-1 standardization on the original case data to obtain a sample data set further includes:

assigning weights to the symptom information according to the uncertainty of the symptom information.

Further, the method further comprises:

acquiring newly added case data according to a preset period, and performing periodic incremental learning training on the XGBoost multi-classification model with the newly added case data.

Further, the method further comprises:

evaluating the XGBoost multi-classification model with the accuracy, recall and F1-Score metrics.

Furthermore, the XGBoost multi-classification model is based on the XGBoost multi-classification algorithm, is trained by gradient boosting, uses the booster type gbtree, and uses a loss function in Mean Square Error (MSE) form.

Further, splitting the sample data set into a training set and a test set further includes:

setting the random seed to a fixed value during the splitting process.

According to a second aspect of the embodiments of the present invention, an XGBoost disease probability prediction system is provided, the system including: a processor and a memory;

the memory is configured to store one or more program instructions;

the processor is configured to execute the one or more program instructions to perform any method step of the XGBoost disease probability prediction method described above.

According to a third aspect of the embodiments of the present invention, there is provided a computer storage medium containing one or more program instructions which, when executed by an XGBoost disease probability prediction system, perform any method step of the XGBoost disease probability prediction method described above.

The embodiment of the invention has the following advantages:

The XGBoost disease probability prediction method, system and storage medium are based on big data and machine learning technology. An XGBoost multi-classification model is constructed to predict a patient's disease probability from the patient's symptom information, which supports intelligent analysis and prediction on massive patient data. The prediction method has high accuracy and high timeliness, is simple to operate and low in cost, can provide accurate disease diagnosis services for patients and auxiliary diagnosis and treatment for doctors, reduces the misdiagnosis rate and the occurrence of medical accidents, and promotes the healthy development of the medical industry.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

Fig. 1 is a schematic flow diagram of an XGBoost disease probability prediction method provided in embodiment 1 of the present invention.

Detailed Description

The present invention is described below in terms of particular embodiments. Other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Example 1

As shown in Fig. 1, the present embodiment provides an XGBoost disease probability prediction method, including:

S100, acquiring original case data, wherein the original case data comprises a plurality of types of disease information and the symptom information corresponding to the disease information. This example uses 4920 case records, where the diagnosis record of each patient is one record; the data cover 132 symptoms and 41 diseases, with 120 records for each disease.

S200, performing 0-1 standardization on the original case data to obtain a sample data set, and splitting the sample data set into a training set and a test set.

The disease and symptom information in the original case data is stored as text. It therefore needs 0-1 standardization, which maps the text information into the interval [0,1]. For example, if a patient has a certain symptom, the value 1 is assigned; if the patient does not have the symptom, the value 0 is assigned.
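The 0/1 mapping described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the symptom names and the function `encode_symptoms` are hypothetical, and the vocabulary is shortened to four entries (the example data set has 132 symptoms).

```python
# Hypothetical symptom vocabulary; the example data set in the text has 132 symptoms.
SYMPTOMS = ["abdominal pain", "fever", "cough", "headache"]


def encode_symptoms(present, vocabulary=SYMPTOMS):
    """Map a collection of symptom names to a 0/1 feature vector."""
    present = set(present)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    # 1 if the patient has the symptom, 0 otherwise
    return [1 if name in present else 0 for name in vocabulary]


vector = encode_symptoms(["fever", "cough"])
print(vector)  # [0, 1, 1, 0]
```

Keeping the vocabulary in a fixed order means every case record maps to a vector of the same length, which is what a tree model expects as input.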

Further, performing 0-1 standardization on the original case data to obtain a sample data set further comprises: assigning weights to the symptom information according to the uncertainty of the symptom information. The value 1 or 0 can be assigned when a symptom is definitely present or absent. A symptom may also be uncertain, for example in the following simulated scenario:

The doctor asks the patient: "Do you have abdominal pain?"

The patient may respond in one of three ways: A, no pain; B, some pain; C, pain.

In the above dialog example, options A and C are clearly the 0 and 1 cases described earlier (A, no pain, maps to state 0; C, pain, maps to state 1). But option B is "some pain": how should this be judged? In the prior art, a single disease and a single symptom are used for judgment, and such a case cannot be judged. Here, a professional doctor can give the uncertain symptom of option B an initial random weight value according to the uncertainty of the symptom information, for example mapping it to 0.3. Through weight assignment, the uncertain condition of a patient can be reasonably assigned and quantified, so that the finally output disease prediction result is more accurate.
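The three-way answer in the dialog can be mapped to a feature value as in the sketch below. The names `ANSWER_WEIGHTS` and `encode_answer` are hypothetical; 0.3 is the example weight from the text, and in practice a doctor would choose the weight for each symptom.

```python
# Hypothetical mapping from a patient's answer to a feature value in [0, 1].
# 0.3 is the example weight from the text; a doctor would set it per symptom.
ANSWER_WEIGHTS = {"A": 0.0, "B": 0.3, "C": 1.0}


def encode_answer(answer, weights=ANSWER_WEIGHTS):
    """Return the feature value for one symptom answer (A, B or C)."""
    try:
        return weights[answer.upper()]
    except KeyError:
        raise ValueError(f"unexpected answer: {answer!r}") from None


# Made-up answers for three symptoms.
answers = {"abdominal pain": "B", "fever": "C", "cough": "A"}
features = {symptom: encode_answer(a) for symptom, a in answers.items()}
print(features)  # {'abdominal pain': 0.3, 'fever': 1.0, 'cough': 0.0}
```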

After the data processing, the resulting sample data set is split into a training set and a test set in a 7:3 ratio. Further, splitting the sample data set into a training set and a test set further comprises: setting the random seed to a fixed value during the splitting process, so that results obtained with different parameters can be compared.
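A 7:3 split with a fixed random seed might look like the following sketch, which uses only the Python standard library. The function name `split_dataset` and the seed value 42 are assumptions; the text only requires that the seed be fixed.

```python
import random


def split_dataset(samples, train_parts=7, test_parts=3, seed=42):
    """Shuffle a copy of samples with a fixed seed, then split train_parts:test_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a reproducible order
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(4920))  # stand-ins for the 4920 case records
train, test = split_dataset(samples)
print(len(train), len(test))  # 3444 1476
```

Because the seed is fixed, two runs with different model parameters see exactly the same training and test records, so any difference in the metrics comes from the parameters rather than from the split.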

S300, constructing an XGBoost multi-classification model, and setting initial model parameters. First, the input and output of the model are determined: the standardized patient symptom data are used as the input of the model, and a disease probability prediction value is the final output. Because there are more than two final disease types, a multi-classification machine learning model is constructed based on XGBoost. Machine learning is the process of automatically analysing a training sample set to obtain rules capable of predicting samples; supervised learning trains the machine learning model on the sample features and classification results (also called target variables) in the training set, so that the model is able to predict the classification of samples outside the training set.

Furthermore, the XGBoost multi-classification model is based on the XGBoost multi-classification algorithm, is trained by gradient boosting, uses the booster type gbtree, and uses a loss function in Mean Square Error (MSE) form.

The XGBoost classification algorithm is a boosted tree model that integrates a plurality of CART regression tree models into a strong classifier. The idea of the algorithm is to keep adding trees, each grown by repeated feature splitting; each time a tree is added, a new function is in effect learned to fit the residual of the previous prediction.

Assuming that K trees are obtained after training, the score of a sample is predicted as follows: according to its features, the sample falls into one leaf node in each tree, and each leaf node corresponds to a score. The scores from all trees are then simply added up to give the predicted value of the sample, that is, a linear sum of a series of classification and regression trees. An optional example can be written as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \qquad \mathcal{F} = \{\, f(x) = w_{q(x)} \,\}, \quad w \in \mathbb{R}^T$$

wherein $\mathcal{F}$ is the set of classification and regression trees, $f_k$ is the k-th regression tree function in the function space $\mathcal{F}$, $w_{q(x)}$ is the weight of the leaf node $q(x)$ that sample $x$ falls into in a single tree model, $w \in \mathbb{R}^T$ is the vector of leaf weights of the tree, $q$ represents the nodes of the tree (the mapping from a sample to its leaf node), and $T$ represents the number of leaves on the tree.
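The additive prediction described above, in which a sample's score is the sum of the leaf scores it reaches in every tree, can be sketched as follows. The one-split trees and their leaf weights are made up for illustration, and `make_stump` and `predict_score` are hypothetical helper names.

```python
def make_stump(feature, threshold, left_weight, right_weight):
    """Return a one-split tree: a function mapping x to the weight of its leaf."""
    def tree(x):
        return left_weight if x[feature] <= threshold else right_weight
    return tree


def predict_score(trees, x):
    """Sum the leaf weights that x falls into across all K trees."""
    return sum(tree(x) for tree in trees)


# Two hypothetical trees with made-up leaf weights.
trees = [
    make_stump(feature=0, threshold=0.5, left_weight=-0.5, right_weight=0.75),
    make_stump(feature=1, threshold=0.5, left_weight=-0.25, right_weight=0.5),
]
print(predict_score(trees, [1, 0]))  # 0.75 + (-0.25) = 0.5
```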

The objective function formulated in the training process is:

$$\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta)$$

wherein $L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$ represents the training error over the $n$ training samples. The mean square error is adopted, that is, the error between the true value $y_i$ and the predicted value $\hat{y}_i$: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. $\Omega(\theta) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$ represents the regularization term, where $T$ represents the number of leaves on the tree and $\gamma$ is a control parameter for the number of leaf nodes. $w_j^2$ is the squared norm of the weight of leaf node $j$, that is, L2 regularization, and $\lambda$ is the L2 regularization parameter, which prevents overfitting.
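As a small numeric illustration of the objective, the sketch below computes the squared-error training term and the regularization term γT + ½λΣw_j² for given leaf weights. The function names and the sample numbers are hypothetical. Summing the penalty over all trees follows the usual XGBoost formulation and is an assumption here, since the text writes the penalty for a single tree.

```python
def squared_error_loss(y_true, y_pred):
    """Training error: sum of squared differences between true and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))


def regularization(leaf_weights, gamma, lam):
    """Penalty for one tree: gamma * T + 0.5 * lambda * sum of squared leaf weights."""
    num_leaves = len(leaf_weights)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees_leaf_weights, gamma=0.1, lam=1.0):
    """obj = L + Omega, with Omega summed over every tree (assumption, see lead-in)."""
    penalty = sum(regularization(w, gamma, lam) for w in trees_leaf_weights)
    return squared_error_loss(y_true, y_pred) + penalty


# One made-up tree with three leaves: 0.1 * 3 + 0.5 * 1.0 * 1.5, about 1.05
print(regularization([0.5, -0.5, 1.0], gamma=0.1, lam=1.0))
```

A larger γ penalizes trees with many leaves, and a larger λ shrinks the leaf weights, which is how the two parameters limit overfitting.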

As a supplementary description of the CART regression tree: the CART regression tree is a binary tree that repeatedly splits on features. For example, if the current tree node is split on the j-th feature at value s, samples whose feature value is less than or equal to s are assigned to the left subtree, and samples whose feature value is greater than s are assigned to the right subtree:

$$R_1(j, s) = \{x \mid x^{(j)} \le s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x^{(j)} > s\}$$

The CART regression tree essentially partitions the sample space along the feature dimensions. Finding the optimal partition is an NP-hard problem, so the decision tree model uses a heuristic method. The objective function generated by a typical CART regression tree is:

$$\sum_{x_i \in R_m} (y_i - f(x_i))^2$$

Therefore, to solve for the optimal splitting feature j and the optimal split point s, the problem is converted into the objective function:

$$\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right]$$

where $c_1$ and $c_2$ are the constant output values of the two regions. Therefore, the optimal splitting feature and split point can be found simply by traversing all split points of all features, and a regression tree is finally obtained.
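The exhaustive search over all features and split points can be sketched as below. This is a minimal illustration of the CART regression split criterion, not XGBoost's own split finder; the names `sse` and `best_split` and the toy data are hypothetical. For squared error, the best constant for each side is the mean of its targets.

```python
def sse(values):
    """Sum of squared errors around the mean, the best constant prediction."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (feature j, split point s, loss) minimising left SSE + right SSE."""
    best = None
    for j in range(len(X[0])):                      # every feature
        for s in sorted({row[j] for row in X}):     # every candidate split point
            left = [t for row, t in zip(X, y) if row[j] <= s]
            right = [t for row, t in zip(X, y) if row[j] > s]
            if not left or not right:
                continue  # a split must put samples on both sides
            loss = sse(left) + sse(right)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best


# Toy data: the target depends only on feature 0.
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = [1.0, 1.0, 3.0, 3.0]
print(best_split(X, y))  # (0, 0, 0.0)
```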

In this embodiment, the initial model parameters are set as follows: the learning rate is 0.1, the maximum depth of a tree is 6, the minimum loss reduction required to split a node is 0.1, the number of iterations is 100, the L2 regularization parameter λ is 1, and the number of parallel threads is 4. All other parameters keep their default values.
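Expressed with the parameter names of the open-source XGBoost library, the settings above might be written as the following dictionary. This is a sketch under assumptions: the text does not give an objective string or a class-count setting, so `objective` and `num_class` below are illustrative choices, and the training call is shown only as a comment so that the block runs without the library installed.

```python
# Keys follow the parameter names of the open-source XGBoost library.
params = {
    "booster": "gbtree",   # tree-based booster, as stated in the text
    "eta": 0.1,            # learning rate
    "max_depth": 6,        # maximum depth of a tree
    "gamma": 0.1,          # minimum loss reduction required to split a node
    "lambda": 1,           # L2 regularization parameter
    "nthread": 4,          # number of parallel threads
    # Assumption: the text names no objective string. "multi:softprob" is the
    # library setting that returns one probability per class; the text itself
    # describes the loss as mean square error.
    "objective": "multi:softprob",
    "num_class": 41,       # 41 diseases in the example data set
}
num_boost_round = 100      # number of boosting iterations

# With the library installed, training would typically look like this
# (X_train and y_train are hypothetical names for the training features and labels):
#   import xgboost as xgb
#   dtrain = xgb.DMatrix(X_train, label=y_train)
#   booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)
print(params)
```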

S400, training the XGBoost multi-classification model with the training set.

S500, testing the trained XGBoost multi-classification model with the test set, and outputting the disease probability values corresponding to the plurality of diseases.

Further, testing the trained XGBoost multi-classification model with the test set and outputting the disease probability values corresponding to the plurality of diseases further comprises: optimizing the XGBoost multi-classification model according to user requirements, and determining the optimal model parameters. The model parameters can be adjusted according to specific user requirements to obtain a probability distribution that meets those requirements.

S600, performing threshold selection on the disease probability values, and outputting a disease probability prediction value. The output disease probability values are sorted from largest to smallest, a threshold is selected (for example, the maximum value), and the disease probability prediction value is output.
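The sorting and threshold selection of step S600 can be sketched as follows. The function `select_predictions`, the disease names and the probabilities are all hypothetical. Returning only the top entry corresponds to the "maximum value" rule mentioned in the text; the numeric threshold variant is an assumed alternative.

```python
def select_predictions(probabilities, threshold=None, top_k=1):
    """Sort diseases by probability (descending) and apply the selection rule.

    With threshold=None the top_k most probable diseases are returned
    (top_k=1 is the "maximum value" rule); otherwise every disease whose
    probability is at least `threshold` is returned.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is None:
        return ranked[:top_k]
    return [(name, p) for name, p in ranked if p >= threshold]


# Made-up probabilities for three hypothetical diseases.
probs = {"gastritis": 0.62, "appendicitis": 0.30, "influenza": 0.08}
print(select_predictions(probs))                  # [('gastritis', 0.62)]
print(select_predictions(probs, threshold=0.25))  # [('gastritis', 0.62), ('appendicitis', 0.3)]
```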

Further, the method further comprises: acquiring newly added case data according to a preset period, and performing periodic incremental learning training on the XGBoost multi-classification model with the newly added case data. To enable the model to better track currently prevalent diseases, the sample set can be updated at a fixed period (such as one day, three days or one week) based on the newly added case data and the disease verification results of professional doctors, and the supervised learning training of XGBoost is carried out again, yielding a prediction method with more comprehensive data and better timeliness.

Further, the method further comprises: evaluating the XGBoost multi-classification model with the accuracy, recall and F1-Score metrics. After the model evaluation is completed, the model can be packaged and deployed to a server, and an interface is provided for calling.
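The three evaluation metrics can be computed as in the sketch below. The text does not say how recall and F1 are averaged across the disease classes, so macro averaging (the unweighted mean over classes) is an assumption here; the function name `evaluate` and the labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy plus macro-averaged recall and F1 over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1s = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    n = len(labels)
    return {"accuracy": accuracy, "recall": sum(recalls) / n, "f1": sum(f1s) / n}


# Made-up true and predicted labels for four test cases.
y_true = ["flu", "flu", "cold", "cold"]
y_pred = ["flu", "cold", "cold", "cold"]
print(evaluate(y_true, y_pred))
```

In this toy case three of four predictions are correct, so accuracy is 0.75; recall is 1.0 for "cold" and 0.5 for "flu", giving a macro recall of 0.75.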

Based on big data and machine learning technology, an XGBoost multi-classification model is constructed to predict a patient's disease probability from the patient's symptom information, which supports intelligent analysis and prediction on massive patient data. The prediction method has high accuracy and high timeliness, is simple to operate and low in cost, can provide accurate disease diagnosis services for patients and auxiliary diagnosis and treatment for doctors, reduces the misdiagnosis rate and the occurrence of medical accidents, and promotes the healthy development of the medical industry.

Example 2

Corresponding to the foregoing embodiment 1, this embodiment further provides an XGBoost disease probability prediction system, including: a processor and a memory;

the memory is used for storing one or more program instructions;

the processor is configured to execute the one or more program instructions to perform any method step of the XGBoost disease probability prediction method described in the above embodiment.

Optionally, the system may further include a display for displaying the disease probability prediction value obtained according to the XGBoost disease probability prediction method in a visual form.

Example 3

Corresponding to the above embodiment 2, the present embodiment also provides a computer storage medium containing one or more program instructions. The one or more program instructions are executed by an XGBoost disease probability prediction system to perform the XGBoost disease probability prediction method introduced above.

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

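1. An XGBoost disease probability prediction method, characterized by comprising the following steps:

acquiring original case data, wherein the original case data comprises a plurality of types of disease information and the symptom information corresponding to the disease information;

performing 0-1 standardization on the original case data to obtain a sample data set, and splitting the sample data set into a training set and a test set;

constructing an XGBoost multi-classification model, and setting initial model parameters;

training the XGBoost multi-classification model with the training set;

testing the trained XGBoost multi-classification model with the test set, and outputting the disease probability values corresponding to the plurality of diseases;

and performing threshold selection on the disease probability values, and outputting a disease probability prediction value.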

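2. The XGBoost disease probability prediction method of claim 1, wherein testing the trained XGBoost multi-classification model with the test set and outputting the disease probability values corresponding to the plurality of diseases further comprises:

optimizing the XGBoost multi-classification model according to user requirements, and determining the optimal model parameters.

3. The XGBoost disease probability prediction method of claim 1, wherein performing 0-1 standardization on the original case data to obtain a sample data set further comprises:

assigning weights to the symptom information according to the uncertainty of the symptom information.

4. The XGBoost disease probability prediction method of claim 1, further comprising:

acquiring newly added case data according to a preset period, and performing periodic incremental learning training on the XGBoost multi-classification model with the newly added case data.

5. The XGBoost disease probability prediction method of claim 1, further comprising:

evaluating the XGBoost multi-classification model with the accuracy, recall and F1-Score metrics.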

6. The XGBoost disease probability prediction method of claim 1, wherein the XGBoost multi-classification model is based on the XGBoost multi-classification algorithm, is trained by gradient boosting, uses the booster type gbtree, and uses a loss function in Mean Square Error (MSE) form.

7. The XGBoost disease probability prediction method of claim 1, wherein splitting the sample data set into a training set and a test set further comprises:

setting the random seed to a fixed value during the splitting process.

8. An XGBoost disease probability prediction system, the system comprising: a processor and a memory;

the memory is configured to store one or more program instructions;

the processor is configured to execute the one or more program instructions to perform the method of any of claims 1-7.

9. A computer storage medium comprising one or more program instructions which, when executed by an XGBoost disease probability prediction system, perform the method of any one of claims 1-7.