CN112992377A

CN112992377A - Method, device, terminal and storage medium for generating drug treatment result prediction model

Info

Publication number: CN112992377A
Application number: CN202110234102.XA
Authority: CN
Inventors: 赵霞; 胡湛棋; 廖建湘; 赵彩蕾; 段婧; 袁碧霞; 叶园珍; 操德智; 朱凤军; 姚一; 曾洪武; 李德发; 干芸根; 王海峰; 苏适; 杨俊…
Original assignee: Shenzhen Children's Hospital
Current assignee: Shenzhen Children's Hospital
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-06-18

Abstract

The invention discloses a method, a device, a terminal and a storage medium for generating a drug treatment result prediction model. The method comprises the following steps: acquiring clinical data of a plurality of patients, and generating at least one first training data set according to the clinical data of the plurality of patients, wherein each first training data set comprises a plurality of groups of training data, and each group of training data comprises a sample clinical characteristic and a corresponding drug treatment result; constructing a plurality of initial models according to at least one machine learning algorithm, and training the initial models according to each first training data set to obtain a plurality of candidate models; and determining a drug treatment result prediction model according to the test results of the plurality of candidate models. The invention can generate a machine learning model that predicts the drug treatment result more accurately, so that the drug treatment result of a patient can be predicted by the model to determine whether the patient is drug-resistant, which shortens the time needed to identify drug-resistant patients.

Description

Method, device, terminal and storage medium for generating drug treatment result prediction model

Technical Field

The invention relates to the technical field of medical treatment, in particular to a method, a device, a terminal and a storage medium for generating a drug treatment result prediction model.

Background

Tuberous sclerosis is an autosomal dominant hereditary disease caused by gene mutation. Most patients with tuberous sclerosis have epileptic seizures, and among the many manifestations of the disease, epilepsy is one of the symptoms that most affects quality of life. The main treatment for epilepsy is antiepileptic medication; however, many epilepsy patients are drug-resistant, so early identification of the patients for whom drug treatment is ineffective is very important. At present, a patient's drug resistance can be discovered only after repeated medication over a long period has failed to take effect, and this process takes a long time.

Thus, there is a need for improvements and enhancements in the art.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a method, a device, a terminal and a storage medium for generating a drug treatment result prediction model, to solve the problem that identifying drug-resistant patients takes a long time in the prior art.

In a first aspect of the present invention, a method for generating a model for predicting the outcome of a drug treatment is provided, which includes:

acquiring clinical data of a plurality of patients, and generating at least one first training data set according to the clinical data of the plurality of patients, wherein each first training data set comprises a plurality of groups of training data, and each group of training data comprises a sample clinical characteristic and a corresponding drug treatment result;

constructing a plurality of initial models according to at least one machine learning algorithm, and training the initial models according to each first training data set to obtain a plurality of candidate models;

and determining a drug treatment result prediction model according to the test results of the plurality of candidate models.

The method for generating a model for predicting the outcome of a drug treatment, wherein the classes of the sample clinical features in the training data of each of the training data sets are consistent, and the generating at least one first training data set according to the clinical data of the plurality of patients comprises:

extracting a plurality of feature classes from the clinical data of the plurality of patients;

performing feature selection on the plurality of feature classes by using at least one preset feature selection method to determine classes of sample clinical features in the at least one first training data set;

constructing the first training data set according to the category of the sample clinical features.

The method for generating a model for predicting the outcome of drug treatment, wherein the performing feature selection on the plurality of feature classes by using at least one preset feature selection method to determine the class of the sample clinical features in the at least one first training data set, comprises:

selecting, from the plurality of feature categories by a target preset feature selection method, a preset number of feature categories as the categories of the sample clinical features in a target first training data set.

The method for generating the drug treatment result prediction model, wherein the preset feature selection method comprises at least one of analysis of variance test, chi-square test and mutual information.

The method for generating the drug treatment result prediction model, wherein the machine learning algorithm comprises at least one of a decision tree, a random forest, a support vector machine, naive Bayes, logistic regression and a multilayer perceptron.

The method for generating the drug treatment result prediction model, wherein the determining the drug treatment result prediction model according to the test results of the plurality of candidate models, comprises:

acquiring a receiver operating characteristic curve of each candidate model;

acquiring the candidate model with the highest area under the receiver operating characteristic curve as a target model;

training the target model according to a second training data set to generate the drug treatment result prediction model;

wherein the second training data set comprises a plurality of groups of training data, the sample clinical feature category in each group of training data is consistent with the sample clinical feature category in the first training data set corresponding to the target model, and the number of groups of training data in the second training data set is larger than the number of groups of training data in the first training data set.

The method for generating a prediction model of drug treatment outcome, wherein after determining the prediction model of drug treatment outcome from the plurality of candidate models, the method further comprises:

acquiring clinical data of a target patient, and extracting clinical features of the target patient from the clinical data of the target patient;

inputting the clinical features into the trained drug treatment result prediction model, and determining a drug treatment prediction result of the target patient through the drug treatment result prediction model;

wherein the feature classes of the clinical features of the target patient are consistent with the feature classes of the sample clinical features in the training data set used in training the drug treatment result prediction model.

In a second aspect of the present invention, there is provided a medication result prediction model generation apparatus, including:

the training data generating module is used for acquiring clinical data of a plurality of patients and generating at least one first training data set according to the clinical data of the plurality of patients, each first training data set comprises a plurality of groups of training data, and each group of training data comprises a sample clinical characteristic and a corresponding drug treatment result;

the training module is used for constructing a plurality of initial models according to at least one machine learning algorithm and respectively training the initial models according to each first training data set to obtain a plurality of candidate models;

and the determining module is used for determining a drug treatment result prediction model according to the test results of the plurality of candidate models.

In a third aspect of the present invention, a terminal is provided, which includes a processor and a storage medium in communication with the processor, wherein the storage medium is adapted to store a plurality of instructions, and the processor is adapted to call the instructions in the storage medium to execute the steps of the method for generating a drug treatment result prediction model according to any one of the above.

In a fourth aspect of the present invention, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, which are executable by one or more processors to implement the steps of the method for generating a model for predicting the outcome of a drug therapy according to any one of the above methods.

Beneficial effects: compared with the prior art, the invention provides a method, a device, a terminal and a storage medium for generating a drug treatment result prediction model. The invention extracts different categories of features from existing patient clinical data, constructs different initial models with different machine learning algorithms, and selects, from the models trained with the different categories of sample features, the drug treatment result prediction model finally used to predict the drug treatment result. A machine learning model that predicts the drug treatment result more accurately can thus be generated, so that the drug treatment result of a patient can be predicted by the model from features extracted from the patient's clinical data to determine whether the patient is drug-resistant, which shortens the time needed to identify drug-resistant patients.

Drawings

FIG. 1 is a flow chart of an embodiment of a method for generating a model for predicting the outcome of a drug treatment provided by the present invention;

FIG. 2 is a logic diagram of the process of generating and using the drug treatment result prediction model in an embodiment of the method for generating a drug treatment result prediction model provided by the present invention;

FIG. 3 is a statistical graph of the area under the curve of the receiver operating characteristic curve of each candidate model in an embodiment of the method for generating a model for predicting the outcome of medication provided by the present invention;

FIG. 4 is a schematic diagram of a receiver operating characteristic curve of a target model in an embodiment of a method for generating a model for predicting a medication outcome provided by the present invention;

FIG. 5 is a schematic structural diagram of an embodiment of a model generation apparatus for predicting the outcome of a drug treatment provided by the present invention;

FIG. 6 is a schematic structural diagram of an embodiment of a terminal provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The method for generating the drug treatment result prediction model provided by the invention can be applied to terminals, and the terminals can be but are not limited to various personal computers, notebook computers, mobile phones, tablet computers and the like.

Example one

As shown in fig. 1, the method for generating a model for predicting the outcome of a drug therapy provided by the present invention comprises the steps of:

S100, acquiring clinical data of a plurality of patients, and generating at least one first training data set according to the clinical data of the plurality of patients, wherein each first training data set comprises a plurality of groups of training data, and each group of training data comprises a sample clinical characteristic and a corresponding drug treatment result.

The present invention generates the drug treatment result prediction model with a supervised machine learning approach, which learns a mapping from input to output from existing input-output data pairs. One input-output pair can be represented as a tuple $(x, y)$, referred to as a training example, where $x$ is the input and $y$ is the output. A plurality of training examples constitute a training set. The supervised learning method derives a function $f: x \to y$ from the training set. This function can also be applied to an input $x'$ that is not in the training set. Assume that the correct output for input $x'$ is $y'$. In the ideal case, the output $f(x')$ obtained after inputting $x'$ to the function $f$ is equal to the correct label, i.e. $f(x') = y'$.

In the supervised learning method, the type of features input in the training set, the type and parameters of the machine learning algorithm, and the like directly influence the prediction effect of the generated model. In this embodiment, the feature categories contained in the existing clinical data are selected through different feature selection methods to generate training data sets containing different feature categories, and the categories of the sample clinical features in the training data of each training data set are consistent. Specifically, generating at least one first training data set according to the clinical data of the plurality of patients comprises:

S110, extracting a plurality of feature categories from the clinical data of the plurality of patients;

S120, performing feature selection on the plurality of feature categories by adopting at least one preset feature selection method to determine the categories of the sample clinical features in the at least one first training data set;

S130, constructing the first training data set according to the categories of the sample clinical features.

Specifically, the clinical data of a patient include personal information, medical history data, genetic data, MR image data, CT image data and the like. Each feature is obtained by converting an item of data into a numerical value, that is, each feature category corresponds to a data category; for example, the feature categories may include sex and age of onset. After the clinical data of the plurality of patients are obtained, the data are preprocessed. When drug resistance in epilepsy is to be predicted, the data of patients without epilepsy and of patients with epilepsy who were not treated with drugs alone are removed, and information irrelevant to the task, such as dates, names and birth dates, is removed. In practical applications some data may be missing. Where a default value exists, the missing data can be filled with it; for example, the number of lesions can default to 0 for records that are not detailed or for patients who were not examined. Otherwise, a continuous value (such as age) can be filled with the median, and a discrete value (such as sex) can be filled with the mode. The treatment results in the patient data are stored separately as target values.

After the preprocessed data are converted into numerical values, the feature values of each patient form a feature vector of length $m$, where $m$ is the number of feature categories. For example, in the feature vector $x_i = [v_1, v_2, \dots, v_m]^T$ of the $i$-th patient, the first value $v_1$ represents sex, the second value $v_2$ represents the age of onset, and so on. The feature vectors of all $n$ patients form an $m \times n$ feature matrix $X_{m \times n} = [x_1, x_2, \dots, x_n]$. $X_{m \times n}$ can then be viewed as $m$ column vectors $f_1, f_2, \dots, f_m$, each of length $n$, where $f_i$ holds the values of the $i$-th feature category for all $n$ patients.
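The preprocessing described above (default values, median for continuous values, mode for discrete values, then assembly of the feature matrix) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the invention; the field names `sex`, `onset_age` and `lesions`, the helper names and the toy records are assumptions introduced for the example.

```python
from statistics import median, mode


def impute(records, continuous, discrete, defaults):
    """Return copies of the patient records with missing values (None) filled."""
    filled = [dict(r) for r in records]  # copy so the caller's data is untouched
    # 1) Fields that have a documented default value.
    for name, default in defaults.items():
        for r in filled:
            if r.get(name) is None:
                r[name] = default
    # 2) Continuous fields are filled with the median of the observed values.
    for name in continuous:
        fill = median([r[name] for r in filled if r.get(name) is not None])
        for r in filled:
            if r.get(name) is None:
                r[name] = fill
    # 3) Discrete fields are filled with the most frequent observed value.
    for name in discrete:
        fill = mode([r[name] for r in filled if r.get(name) is not None])
        for r in filled:
            if r.get(name) is None:
                r[name] = fill
    return filled


def to_feature_matrix(records, feature_names):
    """Build X with m rows (feature categories) and n columns (patients)."""
    return [[float(r[name]) for r in records] for name in feature_names]


# Toy records (hypothetical values, not patient data).
patients = [
    {"sex": 1, "onset_age": 2.0, "lesions": 3},
    {"sex": 0, "onset_age": None, "lesions": None},
    {"sex": None, "onset_age": 4.0, "lesions": 1},
    {"sex": 1, "onset_age": 6.0, "lesions": 0},
]
filled = impute(patients, continuous=["onset_age"], discrete=["sex"],
                defaults={"lesions": 0})
X = to_feature_matrix(filled, ["sex", "onset_age", "lesions"])
```

Each row of `X` corresponds to one vector $f_i$ and each column to one patient vector $x_i$.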

In a possible implementation, to facilitate data processing, a normalization operation is further performed on each feature. The normalization formula is:

$$f_i := \frac{f_i - \min(f_i)}{\max(f_i) - \min(f_i)}$$

where $:=$ denotes assignment, $\max(f_i)$ denotes the maximum value in vector $f_i$, and $\min(f_i)$ denotes the minimum value in vector $f_i$.
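As an illustration, min-max normalization of one feature vector using its maximum and minimum can be written as below. The function name and the handling of a constant feature (mapped to 0.0 to avoid division by zero) are assumptions of this sketch; the document does not say how that case is treated.

```python
def min_max_normalize(f):
    """Rescale the values of one feature vector into the range [0, 1]."""
    lo, hi = min(f), max(f)
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to 0.0.
        return [0.0 for _ in f]
    return [(v - lo) / (hi - lo) for v in f]


normalized = min_max_normalize([2.0, 4.0, 4.0, 6.0])
```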

After the feature values are processed, feature selection is performed with at least one preset feature selection method. Specifically, the preset feature selection method includes at least one of analysis of variance test, chi-square test and mutual information, and performing feature selection on the plurality of feature categories by using at least one preset feature selection method to determine the categories of the sample clinical features in the at least one first training data set includes:

The clinical data of the patients are processed to obtain a plurality of feature categories and the feature values under each category. A preset number of feature categories is then selected from them with at least one preset feature selection method; after selection, the size of the feature matrix $X$ changes from $m \times n$ to $k \times n$, where $k$ is the preset number. Several preset numbers may be used, for example 20, 25 and 30. For example, when feature selection is performed with the analysis of variance test, the top 20, top 25 and top 30 feature categories are selected respectively, which yields three feature matrices of sizes $20 \times n$, $25 \times n$ and $30 \times n$ and therefore three first training data sets. Each first training data set includes $n$ groups of training data; the sample clinical features in each group contain 20, 25 or 30 features respectively, and each group also includes the drug treatment result (whether the patient is drug-resistant) corresponding to its sample clinical features.
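A minimal sketch of selecting the top $k$ feature categories with a one-way analysis of variance F statistic is given below. It assumes the feature matrix is stored as $m$ rows (features) by $n$ columns (patients) and that labels are class identifiers such as 0 and 1; the function names and toy values are hypothetical, and a production system would more likely rely on an established statistics library.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature against the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if df_between < 1 or df_within < 1:
        raise ValueError("need at least two classes and more samples than classes")
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def select_top_k(X, labels, k):
    """Return the row indices of the k features with the highest F statistic."""
    scores = [anova_f(row, labels) for row in X]
    ranked = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: 3 features x 4 patients; label 1 stands for drug-resistant.
labels = [0, 0, 1, 1]
X = [
    [0.1, 0.2, 0.8, 0.9],  # separates the two classes well
    [0.5, 0.4, 0.5, 0.6],  # separates them weakly
    [0.0, 1.0, 0.0, 1.0],  # same mean in both classes
]
kept = select_top_k(X, labels, k=2)
X_selected = [X[i] for i in kept]  # size k x n
```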

Clearly, a plurality of first training data sets can be constructed in this way. In this embodiment the first training data sets are used only to make a preliminary assessment of each model's ability to predict the drug treatment result, so as to make a preliminary model selection. The number of groups of training data in a first training data set can therefore be kept small; after a model is selected, it is trained further on a second training data set that contains more groups of training data, as described in detail later.

Referring to fig. 1 again, the method for generating a prediction model of drug treatment outcome further includes the following steps:

S200, constructing a plurality of initial models according to at least one machine learning algorithm, and training the initial models respectively according to each first training data set to obtain a plurality of candidate models.

In this embodiment, model training is performed on each first training data set. Because different machine learning algorithms may perform differently, different initial models are constructed with different machine learning algorithms, each is trained on each first training data set, and a selection is then made, so that the machine learning algorithm best suited to predicting the drug treatment result can be chosen.
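The way candidate configurations multiply when every feature selection setting is paired with every algorithm can be sketched as follows. The identifier strings are hypothetical labels for the three feature selection methods and six algorithm families named in this document, and the values of $k$ are the illustrative 20, 25 and 30 used earlier; no model is trained here.

```python
from itertools import product

# Hypothetical identifiers for the methods named in the document.
SELECTORS = ["anova_f", "chi_square", "mutual_information"]
K_VALUES = [20, 25, 30]
ALGORITHMS = [
    "decision_tree", "random_forest", "support_vector_machine",
    "naive_bayes", "logistic_regression", "multilayer_perceptron",
]

# One first training data set per (selection method, k) pair.
training_sets = list(product(SELECTORS, K_VALUES))

# One candidate model per (training data set, algorithm) pair.
candidates = [
    {"selector": selector, "k": k, "algorithm": algorithm}
    for (selector, k), algorithm in product(training_sets, ALGORITHMS)
]
```

With these settings there are 9 first training data sets and 54 candidate configurations; trying several hyperparameter settings per algorithm would multiply the count further.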

Specifically, the machine learning algorithm includes at least one of a decision tree, a random forest, a support vector machine, naive Bayes, logistic regression and a multilayer perceptron. For each machine learning algorithm, several sets of hyperparameters may be chosen to build initial models, that is, a plurality of initial models may be built for each machine learning algorithm. As shown in FIG. 2, a plurality of different candidate models can be obtained by combining the training data sets produced by the different preset feature selection methods with the different machine learning methods for model training. The candidate models are then screened by their performance in predicting the drug treatment result to determine the drug treatment result prediction model finally used to predict the drug treatment result of a new patient. That is, the method for generating the drug treatment result prediction model provided by this embodiment further includes the step of:

S300, determining a drug treatment result prediction model according to the test results of the plurality of candidate models.

Specifically, the determining a prediction model of the drug treatment result according to the test results of the plurality of candidate models includes:

S310, acquiring the receiver operating characteristic curve of each candidate model.

The receiver operating characteristic curve is drawn by varying the threshold of the classifier model, with the false positive rate on the horizontal axis and the true positive rate on the vertical axis. The area under the curve reflects the classification performance of the classifier model: the closer it is to 1.0, the better the performance; a value close to 0.5 means the classifier is effectively guessing at random and has no predictive value; a value below 0.5 means it performs worse than random guessing. The area under the curve of a normal, effective classifier model lies between 0.5 and 1.0.
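For illustration, the area under the receiver operating characteristic curve can be computed without drawing the curve, using the known equivalence with the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The function name and the toy scores are assumptions of this sketch.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for small cohorts; a rank-based formulation is preferable for large data sets.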

S320, acquiring the candidate model with the highest area under the receiver operating characteristic curve as a target model.

The higher the area under the receiver operating characteristic curve, the better the corresponding candidate model predicts the drug treatment result. In this embodiment, the candidate model with the highest area under the curve is selected as the target model.
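Choosing the target model from repeated evaluations might look like the sketch below. It assumes each candidate has a list of area-under-curve values from repeated runs and uses a normal approximation (mean plus or minus 1.96 standard errors) for the 95% confidence interval; the document does not state how its intervals were computed, and the candidate names and numbers are invented toy values, not results from the document.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(aucs, z=1.96):
    """Mean AUC and an approximate 95% confidence interval of the mean."""
    m = mean(aucs)
    half_width = z * stdev(aucs) / sqrt(len(aucs))
    return m, (m - half_width, m + half_width)


def pick_target(results):
    """Return the name of the candidate with the highest mean AUC."""
    return max(results, key=lambda name: mean(results[name]))


# Toy values for two hypothetical candidates.
results = {
    "candidate_a": [0.80, 0.82, 0.81],
    "candidate_b": [0.74, 0.76, 0.75],
}
best = pick_target(results)
avg, (low, high) = summarize(results[best])
```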

An experiment was conducted with the method provided in this embodiment, using a data set of 103 patients with 155 features and 1 target feature. After data preprocessing, 102 patients and 109 features remained. Three feature selection methods were used in the experiment: the analysis of variance F-test, the chi-square test and mutual information, with the number of selected features $k$ set to 20, 35 and 50; a group without feature selection was added for comparison. Six machine learning methods were used: decision tree, random forest, support vector machine, naive Bayes, logistic regression and multilayer perceptron. The random forest contains 100 trees, the support vector machine uses a radial basis function kernel, and the multilayer perceptron has 1 hidden layer of 100 neurons with the rectified linear unit (ReLU) as its activation function. Stratified ten-fold cross-validation was used to validate the models built from each combination of feature selection method and machine learning method. Each experiment was repeated 50 times, the area under the receiver operating characteristic curve was recorded, and its mean and 95% confidence interval were calculated.

The experimental results are shown in FIG. 3, which gives the area under the curve and its 95% confidence interval for each method. For each machine learning method, the bars in FIG. 3 show, from left to right, the area under the curve for no feature selection, F-test with 20, 35 and 50 features, chi-square test with 20, 35 and 50 features, and mutual information with 20, 35 and 50 features. The best result was obtained by selecting 35 features with the analysis of variance F-test and classifying with a multilayer perceptron. Its receiver operating characteristic curve is shown in FIG. 4; the area under the curve reaches 0.812, with a 95% confidence interval of (0.807, 0.817). This shows that the approach provided by this embodiment is feasible.
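The stratified ten-fold cross-validation used in the experiment can be illustrated with a small fold-assignment routine that keeps the proportion of drug-resistant and non-resistant patients similar in every fold. This is a sketch with hypothetical function names; the 30/70 class split in the toy labels is an assumption, since the document does not report the class distribution.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample a fold index 0..k-1, balancing classes across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, y in enumerate(labels):
        by_class[y].append(index)
    fold_of = [0] * len(labels)
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, index in enumerate(indices):
            fold_of[index] = (offset + j) % k
        offset += len(indices)
    return fold_of


def train_test_splits(fold_of, k):
    """Yield (train indices, test indices) for each of the k folds."""
    for fold in range(k):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test


# Toy labels: 30 positive and 70 negative samples (assumed split).
labels = [1] * 30 + [0] * 70
fold_of = stratified_folds(labels, k=10)
splits = list(train_test_splits(fold_of, k=10))
```

Repeating the procedure with different seeds gives the repeated runs over which a mean and confidence interval can be computed.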

S330, training the target model according to a second training data set to generate the drug treatment result prediction model.

The second training data set comprises a plurality of groups of training data, the sample clinical feature category in each group of training data is consistent with the sample clinical feature category in the first training data set corresponding to the target model, and the number of groups of training data in the second training data set is larger than the number of groups of training data in the first training data set. After the target model is determined, more clinical data of patients who have received drug treatment for epilepsy are collected; features are extracted and preprocessed according to the clinical feature categories corresponding to the target model to generate the training data in the second training data set. Like the first training data set, the second training data set includes a plurality of groups of training data, and each group includes sample clinical features and the corresponding treatment result.

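After the target model is trained with the second training data set, the drug treatment result prediction model for predicting whether a new patient is drug-resistant is obtained. That is, after the drug treatment result prediction model is determined according to the test results of the plurality of candidate models, the method further includes the following steps: acquiring clinical data of a target patient, and extracting clinical features of the target patient from the clinical data; and inputting the clinical features into the trained drug treatment result prediction model, and determining the drug treatment prediction result of the target patient through the model. The feature categories of the clinical features of the target patient are consistent with the feature categories of the sample clinical features in the training data set used to train the drug treatment result prediction model.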
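Applying the trained model to a target patient, with the requirement that the patient's feature categories match those used in training, might be sketched as follows. Everything named here is hypothetical: `model` stands for any callable that maps a feature vector to a score between 0 and 1, and the toy rule inside `toy_model` has no clinical meaning.

```python
def build_input(patient, selected_features):
    """Order the patient's values by the feature categories used in training."""
    missing = [name for name in selected_features if patient.get(name) is None]
    if missing:
        raise ValueError(f"missing feature categories: {missing}")
    return [float(patient[name]) for name in selected_features]


def predict_drug_resistant(model, patient, selected_features, threshold=0.5):
    """Return True when the model's score reaches the decision threshold."""
    score = model(build_input(patient, selected_features))
    return score >= threshold


def toy_model(x):
    """Stand-in for a trained model; the rule is arbitrary and not clinical."""
    return 0.9 if x[1] < 1.0 else 0.2


selected = ["sex", "onset_age"]
target_patient = {"sex": 1, "onset_age": 0.5, "name": "not used as a feature"}
is_resistant = predict_drug_resistant(toy_model, target_patient, selected)
```

Fields that were not selected during training are ignored, and a missing selected field raises an error rather than being silently filled.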

In summary, the embodiment provides a method for generating a medication result prediction model, which performs different types of feature extraction on existing patient clinical data, constructs different initial models by using different machine learning algorithms, and selects a medication result prediction model finally used for predicting a medication result from models obtained by training using different types of sample features, so as to generate a machine learning model capable of more accurately predicting a medication result, thereby achieving that a medication result of a patient can be predicted by the medication result prediction model according to features extracted from the patient clinical data to determine whether the patient is resistant, and shortening the time for identifying a resistant patient.

It should be understood that, although the steps in the flowcharts shown in the figures of the present specification are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps of the present invention are not limited to the exact order disclosed and may be performed in other orders. Moreover, at least a portion of the steps may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Example two

Based on the above embodiment, the present invention also provides a device for generating a model for predicting a medication outcome, as shown in fig. 5, including:

a training data generation module, configured to obtain clinical data of multiple patients, and generate at least one first training data set according to the clinical data of the multiple patients, where each first training data set includes multiple sets of training data, and each set of training data includes a sample clinical characteristic and a corresponding medication result, which is specifically described in embodiment one;

a training module, configured to construct a plurality of initial models according to at least one machine learning algorithm, and train the initial models according to each of the first training data sets to obtain a plurality of candidate models, which is specifically described in embodiment one;

a determining module, configured to determine a drug therapy outcome prediction model according to a test outcome of the multiple candidate models, as described in embodiment one.

Example three

Based on the above embodiments, the present invention further provides a terminal, and a schematic block diagram thereof may be as shown in fig. 6. The terminal comprises a processor 10 and a memory 20, wherein the memory 20 stores a computer program, and the processor 10 executes the computer program to realize at least the following steps:

Wherein the classes of sample clinical features in the training data of each of the training data sets are consistent, the generating at least one first training data set from the clinical data of the plurality of patients comprising:

Wherein the performing of feature selection on the plurality of feature classes using at least one preset feature selection method to determine the class of the sample clinical features in the at least one first training data set comprises:

The preset feature selection method comprises at least one of analysis of variance test, chi-square test and mutual information.

The machine learning algorithm comprises at least one of a decision tree, a random forest, a support vector machine, naive Bayes, logistic regression and a multilayer perceptron.

Wherein, the determining a drug treatment result prediction model according to the test results of the plurality of candidate models comprises:

acquiring a receiver operating characteristic curve of each candidate model;

Wherein, after determining the prediction model of the drug treatment result in the plurality of candidate models, the method further comprises:

Example four

The present invention also provides a computer readable storage medium storing one or more programs, which are executable by one or more processors, to implement the steps of the method for generating a prediction model of drug therapy outcome described in the above embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for generating a model for predicting the outcome of a drug treatment, comprising:

2. The method of generating a medication outcome prediction model according to claim 1, wherein the classes of sample clinical features in the training data of each of the training data sets are consistent, and the generating at least one first training data set from the clinical data of the plurality of patients comprises:

3. The method of generating a medication outcome prediction model according to claim 2, wherein the feature selecting the plurality of feature classes using at least one preset feature selection method to determine the class of the sample clinical features in the at least one first training data set comprises:

4. The method of claim 2, wherein the predetermined feature selection method comprises at least one of analysis of variance test, chi-square test, and mutual information.

5. The method of claim 1, wherein the machine learning algorithm comprises at least one of decision trees, random forests, support vector machines, naive Bayes, logistic regression, and multilayer perceptrons.

6. The method for generating a prediction model of drug treatment outcome according to claim 1, wherein the determining a prediction model of drug treatment outcome from the test outcomes of the plurality of candidate models comprises:

acquiring a receiver operating characteristic curve of each candidate model;

7. The method of generating a prediction model for outcome of drug treatment according to claim 1, wherein after determining the prediction model for outcome of drug treatment in the plurality of candidate models, the method further comprises:

8. A medication outcome prediction model generation apparatus, comprising:

9. A terminal, characterized in that the terminal comprises: a processor, a storage medium communicatively coupled to the processor, the storage medium adapted to store a plurality of instructions, the processor adapted to invoke the instructions in the storage medium to perform the steps of implementing the method of generating a model of a drug therapy outcome prediction according to any of the preceding claims 1-7.

10. A computer readable storage medium, storing one or more programs, which are executable by one or more processors, to implement the steps of the method for generating a model for predicting drug therapy outcome of any of claims 1-7.