CN111145912A

CN111145912A - Machine learning-based prediction device for personalized ovulation promotion scheme

Info

Publication number: CN111145912A
Application number: CN201911337735.2A
Authority: CN
Inventors: 吴健; 陈晋泰; 陈婷婷; 冯芮苇; 应豪超; 雷璧闻; 刘雪晨; 宋庆宇; 曹燕
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-05-12
Anticipated expiration: 2039-12-23
Also published as: CN111145912B

Abstract

The invention discloses a machine learning-based prediction device for a personalized ovulation induction scheme. The device comprises a computer memory and a computer processor; an ovulation induction scheme prediction model, comprising a trained primary learner and a trained secondary learner, is stored in the computer memory. The primary learner consists of an SVM model, an ExtraTrees model, a RandomForest model, a LightGBM model and an XGBoost model, and the secondary learner adopts a CatBoost model. When executing the computer program, the computer processor performs the steps of: performing feature engineering on the clinical feature data to be predicted; inputting the processed feature data into the primary learner to obtain the predicted values of the five models; and feeding the 5 predicted values to the trained secondary learner to obtain the final prediction result. The device can improve the prediction accuracy of the ovulation induction scheme.

Description

Machine learning-based prediction device for personalized ovulation promotion scheme

Technical Field

The invention belongs to the field of medical artificial intelligence, and particularly relates to a machine learning-based prediction device for a personalized ovulation induction scheme.

Background

Over the last 30 years, the rapid development of reproductive medicine has stabilized the clinical pregnancy rate and embryo implantation rate achieved with the "test-tube baby" technique, known medically as in vitro fertilization-embryo transfer (IVF-ET). Carrying out the controlled ovarian stimulation protocol (COS protocol) is a very important step in a test-tube baby cycle, because it determines the quantity and quality of the eggs obtained later. COS protocols therefore increasingly emphasize personalization: the protocol must be tailored to each patient's physical characteristics and condition.

At present in China, doctors select the COS scheme mainly by observing each patient's physical signs and day-to-day response to medication, and devise different schemes for different people accordingly. This approach requires doctors to have deep and solid knowledge of, and experience in, reproductive endocrinology in order to secure, to some extent, the quantity and quality of the eggs and embryos obtained later. Given the current state of medical resources in China, however, the ratio of doctors to patients is extremely unbalanced and experienced doctors are few, so scheme selection for different patients is inconsistent, which ultimately affects the success rate of test-tube baby treatment.

With the rapid development of machine learning within artificial intelligence, machine learning methods are now widely applied to medical data.

In machine learning, a model captures relevant information from samples. For a given task, each sample provides an input (features) and an output (label). A machine learning algorithm learns from these observations how to map features to labels, producing a generalized model that can perform the task correctly on unseen input (e.g., patients who have never been treated).

Classification is a common and important task in machine learning: predicting the class to which a sample belongs. Since linear discriminant analysis was proposed in the 1930s, many classification algorithms have been developed, including linear models such as logistic regression and Cox models, decision trees, RandomForest, various boosting-based classification tree models, and neural networks (NN).

As the technology has progressed, different algorithms have found use in specific fields according to their own characteristics. When they are applied to concrete scenarios, however, various problems arise, and each application scenario presents its own difficulties to overcome.

To date, no application of machine learning to the selection of ovulation induction schemes has appeared. To improve the accuracy and efficiency of scheme selection, a system for predicting the ovulation induction scheme is urgently needed.

Disclosure of Invention

The invention provides a machine learning-based prediction device for a personalized ovulation induction scheme, which improves the prediction accuracy of the ovulation induction scheme and gives doctors an effective suggestion when selecting one.

A machine learning-based prediction device for a personalized ovulation induction scheme comprises a computer memory, a computer processor and a computer program which is stored in the computer memory and can be executed on the computer processor, wherein an ovulation induction scheme prediction model is stored in the computer memory and comprises a trained primary learner and a trained secondary learner; the primary learner consists of an SVM model, an ExtraTrees model, a RandomForest model, a LightGBM model and an XGBoost model, and the secondary learner adopts a CatBoost model;

the computer processor, when executing the computer program, performs the steps of:

performing feature engineering on the clinical feature data to be predicted, wherein the feature engineering comprises abnormal value processing, missing value processing and feature combination calculation;

inputting the processed clinical feature data into the primary learner to obtain the predicted values of the five models;

and feeding the 5 predicted values to the trained secondary learner to obtain the final prediction result.

By combining different algorithms, the prediction device of the invention exploits the different views that each algorithm takes of the data, in terms of both data space and data structure, so that the models compensate for one another's weaknesses and the result is optimized. This improves the prediction accuracy of the final ovulation induction scheme, while the fusion of multiple models reduces the overfitting of the overall model, thereby assisting doctors in deciding on an ovulation induction scheme.

The training process of the primary learner and the secondary learner is as follows:

collecting all clinical feature data of patients who underwent assisted reproduction with ovulation induction therapy, from admission until the treatment result was obtained; having a professional doctor review all patient records to identify the patients whose egg number and quality met the requirement, and including their clinical feature data and the ovulation induction scheme adopted in the sample data; labeling the samples by ovulation induction scheme, wherein a sample using the long scheme is labeled 0, the short scheme 1, the ultra-long scheme 2, the antagonist scheme 3, the ultra-short scheme 4, and the micro-stimulation scheme 5, to form a training set;

after feature engineering, inputting the clinical feature data in the training set into the SVM model, the ExtraTrees model, the RandomForest model, the LightGBM model and the XGBoost model of the primary learner, each of which yields a predicted value; taking the 5 predicted values as the input of the CatBoost model of the secondary learner, which computes the final predicted value; each model computes a cross-entropy loss from its predicted value and the label value of the sample, and updates its model parameters according to that loss.
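As a worked illustration of the loss mentioned above, the following minimal sketch computes a multi-class cross-entropy over the six scheme labels. The function names and the example probabilities are assumptions made for illustration; the patent gives neither a formula nor an implementation.

```python
import math

def cross_entropy(probabilities, label, eps=1e-12):
    """Cross-entropy of one sample: -log of the probability given to the true class."""
    return -math.log(max(probabilities[label], eps))

def mean_cross_entropy(batch_probabilities, labels):
    """Average cross-entropy over a batch of samples."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probabilities, labels)]
    return sum(losses) / len(losses)

# Hypothetical class probabilities over labels 0..5 for two samples.
probs = [
    [0.70, 0.10, 0.05, 0.05, 0.05, 0.05],  # true label 0 (long scheme)
    [0.10, 0.10, 0.10, 0.50, 0.10, 0.10],  # true label 3 (antagonist scheme)
]
loss = mean_cross_entropy(probs, [0, 3])
```

The loss shrinks toward zero as the probability assigned to the true scheme approaches 1, which is the quantity each model is described as minimizing.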

Furthermore, during model training, oversampling and cross-validation are used to train the ovulation induction scheme prediction model, which improves the balance and stability of training.

When cross-validation is used to train the ovulation induction scheme prediction model, the SVM model, the ExtraTrees model, the RandomForest model, the LightGBM model and the XGBoost model in the primary learner are each trained with 5-fold cross-validation; after training is complete, each of the five model types has produced 5 fitted models (one per fold), so the primary learner comprises 25 models in total.

In the model training process, each model in the primary learner obtains importance ranking of each feature on the superovulation scheme prediction model through calculation, and the feature importance ranking results of each model are averaged to obtain the final feature importance ranking.

The ovulation induction scheme prediction model can be trained offline and then stored in the prediction device;

or it can be trained online, in which case the clinical feature data received for prediction in each application are used, after feature engineering, as training samples to optimize and update the prediction model.

In the feature engineering of the clinical feature data, the abnormal value processing is specifically: feature data outside the medically plausible range are treated as null values.

The missing value processing is specifically: for missing continuous features, mean filling, median filling, mode filling and nearest-neighbor filling are used; for missing discrete features, mode filling and nearest-neighbor filling are used.

The feature combination calculation is specifically: height and weight are combined into one new feature, and basal follicle stimulating hormone and luteinizing hormone are combined into another new feature.
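Of the filling methods listed above, nearest-neighbor filling is the least self-explanatory, so a minimal sketch follows. It assumes Euclidean distance over the features both records have observed and copies the value from the single closest complete record; the patent does not specify the distance or the number of neighbors, and the function and feature names are hypothetical.

```python
import math

def nearest_neighbor_fill(rows, target):
    """Fill missing (None) values in column `target` from the closest complete row.

    Distance is Euclidean over the columns that both rows have observed.
    At least one row must have an observed value for `target`.
    """
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue

        def distance(donor):
            shared = [k for k in row
                      if k != target and row[k] is not None and donor[k] is not None]
            if not shared:
                return math.inf
            return math.sqrt(sum((row[k] - donor[k]) ** 2 for k in shared))

        best = min(donors, key=distance)
        filled.append({**row, target: best[target]})
    return filled

# Hypothetical records; feature names and values are illustrative only.
patients = [
    {"age": 28, "weight_kg": 55.0, "basal_fsh": 6.1},
    {"age": 39, "weight_kg": 70.0, "basal_fsh": 9.8},
    {"age": 29, "weight_kg": 56.0, "basal_fsh": None},
]
completed = nearest_neighbor_fill(patients, "basal_fsh")
```

In practice the features would usually be scaled to comparable ranges before distances are computed, since otherwise the feature with the largest numeric spread dominates.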

Compared with the prior art, the invention has the following beneficial effects:

1. The invention uses machine learning algorithms to integrate many features from many patients, learns useful information from past successful cases, automates the selection of the ovulation induction scheme, and helps doctors choose a more appropriate personalized ovulation induction scheme for patients undergoing IVF-ET treatment.

2. The prediction model of the personalized ovulation induction scheme provided by the invention integrates the advantages of 5 models, improves the prediction accuracy of the ovulation induction scheme, and gives doctors an effective suggestion when selecting one. In addition, the prediction model can output a ranking of feature importance, which gives doctors a more specific reference for designing a treatment scheme better suited to the patient, and fills the gap in the application of machine learning to personalized ovulation induction schemes.

Drawings

Fig. 1 is a schematic flow chart of the implementation of the prediction device of the personalized ovarian hyperstimulation scheme based on machine learning.

Detailed Description

The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.

This embodiment provides a machine learning-based prediction device for a personalized ovulation induction scheme, which comprises a computer memory, a computer processor and a computer program which is stored in the computer memory and can be executed on the computer processor, wherein an ovulation induction scheme prediction model is stored in the computer memory and is obtained, online or offline, in the following three stages:

Stage 1: Reception and preprocessing of data

The feature data come from the clinical records of all patients who received IVF-ET treatment during the eight years from 2010 to 2017 in the reproductive medicine department of a women's health care hospital. They include basic information for the male and female partners (height, weight and age), routine blood indexes (female indexes are not specially marked), biochemical indexes, causes of infertility, sex hormones, other predictors of ovarian function, male smoking history, male alcohol abuse history, family history of diabetes, family history of hypertension, and so on. The data cover several broad categories, each containing multiple independent features. First, based on the quantity and quality of the eggs finally obtained, a professional doctor judges whether each patient's treatment record is a case that meets the standard, and the records that meet the standard are included in the samples to be analyzed. All samples are then classified into 6 categories according to the ovulation induction scheme: samples using the long scheme are labeled 0, the short scheme 1, the ultra-long scheme 2, the antagonist scheme 3, the ultra-short scheme 4, and the micro-stimulation scheme 5.
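The sample selection and labeling just described can be sketched as follows. The label values are those given in the text; the dictionary keys, the record fields (`meets_standard`, `protocol`, `features`) and the function name are hypothetical names chosen for illustration.

```python
# Label values follow the text; the string keys are illustrative names.
PROTOCOL_LABELS = {
    "long": 0,
    "short": 1,
    "ultra_long": 2,
    "antagonist": 3,
    "ultra_short": 4,
    "micro_stimulation": 5,
}

def build_samples(records):
    """Keep records a doctor judged as meeting the standard and attach numeric labels."""
    samples = []
    for record in records:
        if not record["meets_standard"]:
            continue  # excluded: egg number/quality did not meet the standard
        samples.append((record["features"], PROTOCOL_LABELS[record["protocol"]]))
    return samples

# Hypothetical records for illustration.
records = [
    {"features": {"age": 31}, "protocol": "long", "meets_standard": True},
    {"features": {"age": 42}, "protocol": "micro_stimulation", "meets_standard": True},
    {"features": {"age": 36}, "protocol": "antagonist", "meets_standard": False},
]
samples = build_samples(records)
```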

Stage 2: Construction of training samples

For the collected feature data, the textual features are first categorically encoded. Then outliers and missing values are handled for all features.

Specifically, the discrete data are first one-hot encoded, and entries with irregular content or format are set to null. Next, outlier detection is performed on the continuous features, and values outside the medically plausible range are set to null. Missing continuous features are then filled using methods such as mean filling, median filling, mode filling and nearest-neighbor filling, and missing discrete features are filled using mode filling and nearest-neighbor filling.
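A minimal sketch of the outlier and missing-value steps described above, assuming records are held as dictionaries. The plausibility ranges, feature names and function names are hypothetical; real limits would have to come from clinical references, which the patent does not list.

```python
from statistics import median, mode

# Hypothetical plausibility ranges, for illustration only.
VALID_RANGES = {"age": (18, 60), "height_cm": (130, 200)}

def null_out_of_range(rows, ranges):
    """Replace values outside the allowed range with None (treated as missing)."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for name, (low, high) in ranges.items():
            value = new_row.get(name)
            if value is not None and not (low <= value <= high):
                new_row[name] = None
        cleaned.append(new_row)
    return cleaned

def fill_missing(rows, column, strategy):
    """Fill None entries in `column` with the median or mode of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = median(observed) if strategy == "median" else mode(observed)
    return [{**r, column: fill_value if r[column] is None else r[column]} for r in rows]

records = [
    {"age": 31, "height_cm": 162, "smoker": 0},
    {"age": 250, "height_cm": 158, "smoker": None},  # age is an entry error
    {"age": 35, "height_cm": 165, "smoker": 0},
]
records = null_out_of_range(records, VALID_RANGES)
records = fill_missing(records, "age", "median")   # continuous feature
records = fill_missing(records, "smoker", "mode")  # discrete feature
```

Setting implausible values to null first means a single imputation step then handles both entry errors and values that were never recorded.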

From the processed feature data, new features are generated by combining some of the existing ones. For example, height and weight can be combined into the body mass index (BMI), and basal follicle stimulating hormone and luteinizing hormone can be combined into the basal follicle stimulating hormone/luteinizing hormone ratio.
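The two combined features named above can be computed as in this short sketch. BMI uses the standard definition (weight in kilograms divided by the square of height in metres); the field names, units and the function name are assumptions for illustration.

```python
def add_combined_features(record):
    """Return a copy of the record with BMI and the basal FSH/LH ratio added."""
    out = dict(record)
    height_m = record["height_cm"] / 100.0
    out["bmi"] = record["weight_kg"] / height_m ** 2  # kg / m^2
    lh = record["basal_lh"]
    # Leave the ratio missing rather than divide by zero or by a missing value.
    out["fsh_lh_ratio"] = record["basal_fsh"] / lh if lh else None
    return out

# Hypothetical patient values.
example = add_combined_features(
    {"height_cm": 160.0, "weight_kg": 51.2, "basal_fsh": 6.0, "basal_lh": 4.0}
)
```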

Correlation detection is then performed on the processed feature data to remove redundant, highly correlated features. For example, the Pearson correlation coefficient between the white blood cell count and the neutrophil count in the routine blood test can reach 0.9 or more, so only one feature of such a highly correlated pair is retained. In this embodiment the correlation threshold is 0.8: if the coefficient exceeds 0.8, the two features are considered highly correlated and one of them can be removed. The correlation between clinical features can be established statistically or from medical experience.

After the feature data have been processed, the set of clinical feature data belonging to each patient constitutes one training sample.
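The correlation filter can be sketched as follows, using the 0.8 threshold stated in this embodiment. Which feature of a correlated pair is kept is not specified in the text; this sketch keeps the one encountered first, which is an assumption, as are the function names and the sample values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.8):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical values; white blood cell and neutrophil counts move together.
features = {
    "wbc_count":        [5.1, 6.3, 7.8, 9.0, 4.4],
    "neutrophil_count": [3.0, 3.9, 5.1, 6.2, 2.5],
    "age":              [33, 27, 35, 29, 31],
}
selected = drop_correlated(features)
```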

Stage 3: Construction of the personalized ovulation induction scheme prediction model

The prediction model of the personalized ovulation induction scheme adopts a stacking framework. The first layer is a primary learner made up of 5 models: an SVM model, an ExtraTrees model, a RandomForest model, a LightGBM model and an XGBoost model. The SVM model is one of the most classical machine learning classification algorithms of recent decades and performs very well on small-scale, high-dimensional classification problems. The ExtraTrees model selects its split attributes completely at random, while the RandomForest model uses random feature selection and random sample sampling when building each sub-decision tree; both ideas greatly improve generalization capability. XGBoost and LightGBM are different implementations of the gradient boosting decision tree (GBDT): they apply different optimizations toward the same objective and have performed very well in many data mining tasks and competitions.

The CatBoost model used as the secondary learner is likewise an improvement on GBDT; in major competitions it performs no worse than XGBoost and LightGBM, and sometimes slightly better.

Regarding the stacking structure adopted by the prediction model, it should be emphasized that the models in the primary learner must be "accurate yet different": each model should have high prediction accuracy, while the correlation between models must not be too high, so that the respective advantages of the models can be combined without producing redundant information.

The purpose of the secondary learner is to fuse the information learned by each model in the primary learner and to learn further from it. Because the secondary learner trains on the outputs of the primary learner rather than on the original training data, the risk of overfitting is reduced.

In particular, the 5 models adopted by the primary learner in this embodiment all differ in design principle, and accuracy testing confirmed that they meet the "accurate yet different" requirement.

The ExtraTrees model, the RandomForest model and the SVM model are provided by the scikit-learn library, and the XGBoost model, the LightGBM model and the CatBoost model are provided by their respective development kits.

Next, the constructed prediction model is trained using the training samples constructed in Stage 2.

In practice, few patients use the micro-stimulation scheme, so the training samples are unevenly distributed across classes. This embodiment therefore uses oversampling to improve the sample balance, and model training then starts from the balanced data.
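The text does not say which oversampling method is used. As one simple possibility, the sketch below performs random oversampling with replacement until every class is as frequent as the largest one; this choice, the function name and the toy data are assumptions.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Hypothetical toy data: label 5 (micro-stimulation) is under-represented.
X = [[30, 21.5], [34, 23.0], [29, 19.8], [41, 24.1], [38, 22.7]]
y = [0, 0, 0, 3, 5]
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```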

Specifically, each of the 5 models in the primary learner is first trained using 5-fold cross-validation. That is, the training samples are randomly divided into 5 equal parts; 4 parts are used as the training set and the remaining part as the validation set, which yields 5 combinations of training set and validation set.

Each round of cross-validation proceeds as follows: the model is trained on the training set, the fitted model predicts the validation set, and the model from that round is saved. After cross-validation training is complete, each model type has produced a column of predictions with as many rows as there are samples in total (the sum of the lengths of the 5 validation sets). Concatenating the columns produced by the five model types gives a data set with one row per sample and 5 columns, which serves as the training data of the secondary learner. At this point each model type has produced 5 fitted models, so the primary learner comprises 25 models.
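The cross-validation procedure above can be sketched in plain Python as follows. To keep the example self-contained, a small nearest-centroid classifier stands in for the SVM, ExtraTrees, RandomForest, LightGBM and XGBoost models; the class, the function names and the synthetic data are all hypothetical and are not the patent's implementation.

```python
import random

class CentroidClassifier:
    """Stand-in learner: predicts the class whose mean feature vector is closest.

    `columns` selects which features the learner sees, so the five stand-ins differ.
    """
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            vec = [row[c] for c in self.columns]
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {k: [v / counts[k] for v in s] for k, s in sums.items()}
        return self

    def predict(self, X):
        def nearest(row):
            vec = [row[c] for c in self.columns]
            return min(self.centroids, key=lambda k: sum(
                (a - b) ** 2 for a, b in zip(vec, self.centroids[k])))
        return [nearest(row) for row in X]

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds validation folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]

def out_of_fold_predictions(make_learners, X, y, n_folds=5):
    """Return (meta_features, fitted_models).

    meta_features has one row per sample and one column per learner type;
    fitted_models[j] holds the n_folds models trained for learner type j.
    """
    n_types = len(make_learners)
    meta = [[None] * n_types for _ in X]
    fitted = [[] for _ in range(n_types)]
    for valid_idx in k_fold_indices(len(X), n_folds):
        valid = set(valid_idx)
        train_idx = [i for i in range(len(X)) if i not in valid]
        X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_valid = [X[i] for i in valid_idx]
        for j, make in enumerate(make_learners):
            model = make().fit(X_train, y_train)
            fitted[j].append(model)  # keep the model trained in this fold
            for i, pred in zip(valid_idx, model.predict(X_valid)):
                meta[i][j] = pred    # prediction made while sample i was held out
    return meta, fitted

# Five stand-ins in place of SVM, ExtraTrees, RandomForest, LightGBM and XGBoost.
learner_factories = [
    lambda: CentroidClassifier([0]),
    lambda: CentroidClassifier([1]),
    lambda: CentroidClassifier([0, 1]),
    lambda: CentroidClassifier([1, 2]),
    lambda: CentroidClassifier([0, 1, 2]),
]
rng = random.Random(1)
X = [[rng.random() + (i % 2), rng.random(), rng.random()] for i in range(20)]
y = [i % 2 for i in range(20)]
meta_features, fold_models = out_of_fold_predictions(learner_factories, X, y)
```

Each row of `meta_features` holds predictions made by models that never saw that sample during training, which is why the secondary learner can be fitted on it with the original labels.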

Next, the data generated by the primary learner (one row per sample, 5 columns) are used as training samples, with the original sample labels, to train the secondary learner, i.e. the CatBoost model.

Training and optimization thus yield the trained primary learner, consisting of 5 SVM models, 5 ExtraTrees models, 5 RandomForest models, 5 LightGBM models and 5 XGBoost models, and the trained secondary learner, consisting of 1 CatBoost model.

The trained personalized ovulation induction scheme prediction model has high accuracy, and can provide effective suggestions for a doctor to select the personalized ovulation induction scheme to a certain extent.

During training of the primary learner, each model can rank the importance of the features by calculating information entropy. Averaging the ranking results of all the models gives the final importance ranking of each feature. This ranking can prompt doctors to pay closer attention to the top-ranked indicators, helping them design a targeted treatment scheme for the patient.
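The rank-averaging step can be sketched as below. How each model derives its importance scores is not detailed in the text, so the scores here are simply assumed to be available as one dictionary per model; the function name and the numbers are hypothetical.

```python
def average_importance_ranking(importances_per_model):
    """Average each feature's rank (1 = most important) across models.

    `importances_per_model` is a list of {feature: importance score} dicts.
    Returns (features sorted by mean rank, best first; {feature: mean rank}).
    """
    rank_sums = {}
    for scores in importances_per_model:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            rank_sums[feature] = rank_sums.get(feature, 0) + rank
    n_models = len(importances_per_model)
    mean_rank = {f: total / n_models for f, total in rank_sums.items()}
    return sorted(mean_rank, key=mean_rank.get), mean_rank

# Hypothetical importance scores from three models.
scores = [
    {"age": 0.50, "bmi": 0.30, "basal_fsh": 0.20},
    {"age": 0.40, "bmi": 0.15, "basal_fsh": 0.45},
    {"age": 0.45, "bmi": 0.25, "basal_fsh": 0.30},
]
ranking, mean_rank = average_importance_ranking(scores)
```

Averaging ranks rather than raw scores avoids having to reconcile importance values that different model families report on different scales.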

The resulting prediction model of the personalized ovulation induction scheme is stored in the memory of the prediction device, as shown in Fig. 1. In application, the feature data of a patient first undergo feature engineering (abnormal value processing, missing value processing, feature combination and so on) and are then input into the ExtraTrees, RandomForest, SVM, XGBoost and LightGBM models of the primary learner. For each model type, the 5 fold-trained models produce 5 predicted values, which are averaged into 1 predicted value, so that the primary learner finally outputs 5 predicted values. These 5 predicted values are then input into the secondary learner, i.e. the CatBoost model, whose output is the final predicted category for the record.

When the prediction model of the personalized ovulation induction scheme is trained online, the feature data received for prediction in each application are processed and used as training samples to optimize and update the model.
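The application flow can be sketched as follows. The text does not state whether the averaged "predicted values" are class labels or class probabilities, so this sketch treats them as plain numbers; every model is represented by a hypothetical callable, and the stand-in for the CatBoost model is a simple vote rather than a trained learner.

```python
def predict_protocol(features, fold_models, secondary_model):
    """Two-level prediction for one patient.

    fold_models: list of 5 lists, each holding the fold-trained models of one type.
    Every model is any callable mapping a feature vector to a numeric prediction.
    """
    primary_outputs = []
    for models_of_one_type in fold_models:
        predictions = [model(features) for model in models_of_one_type]
        primary_outputs.append(sum(predictions) / len(predictions))  # average over folds
    return secondary_model(primary_outputs), primary_outputs

def constant_model(value):
    """Hypothetical stand-in model that always predicts `value`."""
    return lambda features: value

# Five model types with five fold models each; the fourth type disagrees in two folds.
fold_models = [
    [constant_model(3)] * 5,
    [constant_model(3)] * 5,
    [constant_model(3)] * 5,
    [constant_model(v) for v in (0, 0, 3, 3, 3)],
    [constant_model(3)] * 5,
]

def secondary_model(values):
    """Stand-in for the CatBoost model: picks the most common rounded value."""
    rounded = [round(v) for v in values]
    return max(set(rounded), key=rounded.count)

label, primary = predict_protocol([31, 20.0, 1.5], fold_models, secondary_model)
```

Here `label` corresponds to the antagonist scheme (label 3), and `primary` is the 5-value vector that the secondary learner receives.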

The personalized ovulation induction scheme prediction model integrates the advantages of 5 models, improves the prediction accuracy of the ovulation induction scheme, and provides effective suggestions for a doctor to select the ovulation induction scheme. In addition, the prediction model can output the importance ranking of the features, and provides more specific reference for a doctor to design a treatment scheme more suitable for a patient.

In this embodiment, model training in the primary learner uses 5-fold cross-validation; 3-fold, 10-fold or another number of folds may be used instead, depending on the training effect.

The computer processor in this embodiment may be any type of processor, and the Memory may be a Random Access Memory (RAM), a Read Only Memory (ROM), a Flash Memory (Flash Memory), a first-in first-out Memory (FIFO), a first-in last-out Memory (FILO), and the like.

The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. A prediction apparatus for a machine learning-based personalized ovarian hyperstimulation protocol, comprising a computer memory, a computer processor and a computer program stored in said computer memory and executable on said computer processor, characterized in that:

the computer memory is stored with a prediction model of the ovulation induction scheme, and the prediction model of the ovulation induction scheme comprises a trained primary learner and a trained secondary learner; the primary learner consists of an SVM model, an ExtraTrees model, a RandomForest model, a LightGBM model and an XGboost model, and the secondary learner adopts a Catboost model;

2. The device for predicting the machine learning-based personalized ovarian hyperstimulation protocol according to claim 1, wherein the training process of the primary learner and the secondary learner is as follows:

3. The device for predicting the personalized ovarian hyperstimulation scheme based on the machine learning, which is characterized in that an oversampling method and a cross validation method are adopted to train an ovarian hyperstimulation scheme prediction model.

4. The prediction device of the personalized ovulation induction scheme based on the machine learning as claimed in claim 3, wherein when the ovulation induction scheme prediction model is trained by adopting a cross validation method, 5-fold cross validation is adopted to train an SVM model, an ExtraTrees model, a RandomForest model, a LightGBM model and an XGboost model in a primary learner; after training is complete, 5 models are generated for each model, and 25 models are generated by the primary learner.

5. The prediction device for the personalized ovulation induction scheme based on the machine learning as claimed in claim 1 or 2, wherein in the model training process, each model in the primary learner obtains importance ranking of each feature on the ovulation induction scheme prediction model through calculation, and the feature importance ranking results of each model are averaged to obtain the final feature importance ranking.

6. The prediction device for the personalized ovarian hyperstimulation scheme based on the machine learning as claimed in claim 1 or 2, wherein the ovarian hyperstimulation scheme prediction model is trained offline and then stored in the prediction device;

7. The device for predicting the machine learning-based personalized ovarian hyperstimulation scheme according to claim 1, wherein the abnormal value processing is specifically as follows: feature data outside the medical range is processed as null values.

8. The prediction device for machine learning-based personalized ovarian hyperstimulation protocol according to claim 1, wherein the missing value processing is specifically as follows: for continuous characteristic missing data, adopting an average filling method, a median filling method, a mode filling method and a nearest neighbor filling method; for discrete feature missing data, a mode filling method and a nearest neighbor filling method are adopted.

9. The prediction device for machine learning-based personalized ovarian hyperstimulation protocol according to claim 1, wherein the feature combination calculation is specifically as follows:

the two data of height and weight are combined into a new characteristic index, and the two data of basal follicle stimulating hormone and luteinizing hormone are combined into a new characteristic index.