CN111613289B

CN111613289B - Personalized medicine dosage prediction method, device, electronic equipment and storage medium

Info

Publication number: CN111613289B
Application number: CN202010378831.8A
Authority: CN
Inventors: 卢晓阳; 楼燕; 杨希; 何玲娟; 洪东升; 高飞; 孙佳星
Original assignee: Beijing Medicinovo Technology Co ltd; First Affiliated Hospital of Zhejiang University School of Medicine
Current assignee: Beijing Medicinovo Technology Co ltd; First Affiliated Hospital of Zhejiang University School of Medicine
Priority date: 2020-05-07
Filing date: 2020-05-07
Publication date: 2023-04-28
Anticipated expiration: 2040-05-07
Also published as: CN111613289A

Abstract

The embodiments of the invention provide a personalized medicine dosage prediction method, device, electronic equipment and storage medium. The method comprises the following steps: acquiring clinical raw data related to a target individual and, after data cleaning, obtaining significant single-factor variables and significant cross variables respectively; merging the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model based on the modeling data set by a preset machine learning algorithm, the prediction model being used for predicting the administration dose of the target individual. The method, device, electronic equipment and storage medium provided by the embodiments of the invention can fit the prediction model well even when a large amount of data is missing, can quickly and effectively identify, among high-dimensional variables, the important variables that strongly influence the personalized medicine dose, and can produce a specific personalized dose prediction according to each patient's situation.

Description

Personalized medicine dosage prediction method, device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of data processing, in particular to a personalized medicine dosage prediction method, a personalized medicine dosage prediction device, electronic equipment and a storage medium.

Background

Many drugs are known to show large differences in individual metabolic capacity and to have a narrow therapeutic safety range, and many factors influence their efficacy in clinical use. As a result, individual doses of these drugs vary widely, serious potential side effects and administration risks exist, and the management of many therapeutic drugs is challenging. Taking warfarin and vancomycin as examples:

Warfarin is an inexpensive anticoagulant widely used in clinical practice for atrial fibrillation, deep vein thrombosis, pulmonary embolism, valve implantation and the like. Current clinical warfarin dosing regimens typically start with a standard dose, after which the clinician increases or decreases the dose according to each patient's International Normalized Ratio (INR) value until the INR reaches the target. However, in such anticoagulation therapy the dose adjustment period is long, and the risk of thrombosis or hemorrhage in the patient is high. "Laboratory Medicine", December 2013, Vol. 28, No. 12: 1157-1161, reviewed the maintenance dose models for warfarin anticoagulation, and their application, built at home and abroad from pharmacogenetic research results combined with clinical data. These models fall mainly into three types: a stable dose prediction model, a starting dose prediction model, and an accurate model of the stable dose.

Vancomycin is currently a first-line drug for treating serious infections caused by gram-positive bacteria such as methicillin-resistant Staphylococcus aureus, and is commonly used for critically ill patients. At present, the traditional initial dosing regimen is formulated in one of the following ways: (1) the pharmacist determines the initial dose according to the vancomycin package insert; (2) the pharmacist determines the initial dose according to a traditional pharmacokinetic equation; (3) the pharmacist determines the initial dose empirically. After treatment according to the initial regimen, the patient's blood concentration is typically measured by therapeutic drug monitoring (TDM) before the fifth administration, and the pharmacist then gives an adjusted regimen based on the measured trough concentration, general guidelines and personal experience.

Existing research shows that, compared with the traditional method, a dosing regimen based on population pharmacokinetics (PPK) can significantly improve the rate at which target concentrations are attained, but it has limitations in adaptability, functionality, usability and the like. For example, the PPK parameters used are derived from European and American populations or are limited to Chinese adults, so applying them directly to individual dosing regimens for all the different populations in China may carry a certain risk. To address these limitations, "Acta Pharmaceutica Sinica" 2018, 53 (1): 104-110 provides a method that combines PPK with Bayesian estimation and calculates the optimal vancomycin dose by maximum a posteriori Bayesian estimation (MAPB). The method collects population pharmacokinetic characteristic parameters of vancomycin in Chinese populations, analyzes populations such as general adults, newborns, the elderly and neurosurgery patients, realizes MAPB estimation of individual parameters for the different populations, and can formulate both an initial dosing regimen for each population and an adjusted regimen based on therapeutic drug monitoring results.

Clinical research on personalized medicine dose prediction suffers from problems such as too few enrolled patients, insufficient treatment courses, a lack of long-term follow-up data and large amounts of missing data, which make it difficult to formulate a dosing regimen based on population pharmacokinetics. Moreover, a dosing regimen based on population pharmacokinetics cannot provide personalized medication tailored to the individual patient.

Disclosure of Invention

In order to solve the problems in the prior art, the embodiments of the invention provide a personalized medicine dosage prediction method, a personalized medicine dosage prediction device, electronic equipment and a storage medium.

In a first aspect, embodiments of the present invention provide a personalized medicine dosage prediction method comprising: acquiring clinical raw data related to a target individual, and obtaining data to be analyzed after performing data cleaning on the clinical raw data; performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables have a significant correlation with the dose of the therapeutic drug used per unit time; performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the categorical independent variables of the significant single-factor variables, and obtaining a binary independent variable data set based on the significant single-factor variables; constructing cross variables based on the binary independent variable data set by a feature engineering method to generate a cross variable data set; taking the dose of the therapeutic drug used per unit time as the target variable, and obtaining significant cross variables through stepwise regression screening based on the cross variable data set, wherein the significant cross variables have a significant correlation with the dose of the therapeutic drug used per unit time; merging the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model based on the modeling data set by a preset machine learning algorithm, the prediction model being used for predicting the administration dose of the target individual.

Further, performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain the significant single-factor variables specifically comprises: performing a Pearson correlation test between each continuous independent variable and the dose of the therapeutic drug used per unit time, and judging whether the relationship between the continuous independent variable and that dose is significant; performing a Mann-Whitney U test between each categorical independent variable and the dose of the therapeutic drug used per unit time, and judging whether the relationship between the categorical independent variable and that dose is significant; and obtaining the significant single-factor variables from the continuous independent variables and the categorical independent variables whose relationship with the dose of the therapeutic drug used per unit time is significant.

Further, taking the dose of the therapeutic drug used per unit time as the target variable and constructing a prediction model based on the modeling data set by a preset machine learning algorithm for predicting the administration dose of the target individual specifically comprises: taking the dose of the therapeutic drug used per unit time as the target variable, and constructing a tree-based prediction model based on the modeling data set by the XGBoost algorithm; and, during model training, tuning the model parameters automatically with auto-ml and optimizing the model by K-fold cross validation, thereby obtaining the prediction model used for predicting the administration dose of the target individual.

Further, the clinical raw data includes at least one of demographic data, therapeutic drug usage data, co-medication data, adjuvant therapy data, gene polymorphism data, test data, diagnostic data.

Further, the data cleaning includes at least one of text information extraction using regular expressions, data grouping, data transposition, data standardization, data deduplication, missing value processing, outlier processing, data dimension reduction, and variable calculation.

Further, the feature engineering method includes the Feature Tools method.

Further, the dose of the therapeutic drug used per unit time includes the daily dose of the therapeutic drug.

In a second aspect, embodiments of the present invention provide a personalized medicine dosage prediction device comprising: a data acquisition module for: acquiring clinical raw data related to a target individual, and obtaining data to be analyzed after performing data cleaning on the clinical raw data; a significant single-factor variable acquisition module for: performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables have a significant correlation with the dose of the therapeutic drug used per unit time; a significant cross variable acquisition module for: performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the categorical independent variables of the significant single-factor variables, and obtaining a binary independent variable data set based on the significant single-factor variables; constructing cross variables based on the binary independent variable data set by a feature engineering method to generate a cross variable data set; taking the dose of the therapeutic drug used per unit time as the target variable, and obtaining significant cross variables through stepwise regression screening based on the cross variable data set, wherein the significant cross variables have a significant correlation with the dose of the therapeutic drug used per unit time; and a prediction model construction module for: merging the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model based on the modeling data set by a preset machine learning algorithm, the prediction model being used for predicting the administration dose of the target individual.

In a third aspect, an embodiment of the invention provides an electronic device for personalized medicine dosage prediction comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, which processor, when executing the computer program, implements the steps of the method as provided in the first aspect.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as provided by the first aspect.

The personalized medicine dosage prediction method, device, electronic equipment and storage medium provided by the embodiments of the invention are based on real-world patient clinical data. Using statistical methods, feature engineering, stepwise regression, machine learning and the like, they progressively construct and screen the important risk factors that influence the personalized dose of a drug, and thereby build a personalized clinical therapeutic dose prediction model for the drug. Real-world big data is characterized by high dimensionality and a high missing rate. The embodiments address both problems: the prediction model can be fitted well even when a large amount of data is missing, and the important variables that strongly influence the personalized dose can be identified quickly and effectively among the high-dimensional variables. Cross variables are constructed through feature engineering so that the interactions among variables are put to use. The prediction model is finally built from all the factors that may influence differences in a patient's drug dose, and a specific personalized dose prediction is obtained according to each patient's situation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a personalized medicine dosage prediction method provided by an embodiment of the invention;

FIG. 2 is a flow chart of a personalized medicine dosage prediction method provided by another embodiment of the invention;

FIG. 3 is a schematic diagram of a personalized medicine dosage prediction device according to an embodiment of the invention;

FIG. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a flowchart of a personalized medicine dosage prediction method according to an embodiment of the invention.

As shown in fig. 1, the method includes:

Step 101, acquiring clinical raw data related to a target individual, and obtaining data to be analyzed after performing data cleaning on the clinical raw data.

The personalized medicine dosage prediction method provided by the embodiment of the invention yields a prediction model for predicting the administration dose of the target individual. The model is derived from clinical raw data related to the target individual. The personalized medicine dosage prediction device therefore first needs to acquire that clinical raw data, which can, for example, be extracted from the hospital data system.

The clinical raw data may be set as desired, and may include, for example, at least one of demographic data, therapeutic drug use data, co-medication data, adjuvant therapy data, gene polymorphism data, test data, and diagnostic data.

Examples of the content of the clinical raw data: demographic data (age, sex, height, weight, smoking history, drinking history, allergy history, past medical history, etc.), therapeutic drug usage data (single dose of the therapeutic drug, frequency of administration, daily dose, total dose during hospitalization, etc.), co-medication data, adjuvant therapy data (mechanical ventilation, shock therapy, and physical therapy such as infrared therapy, laser, acupuncture and massage), gene polymorphism data, test data (routine blood test, routine urine test, liver function, kidney function, electrolytes, cancer markers, cardiac markers, clotting factors, etiological detection, etc.), diagnostic data (mainly based on the discharge diagnosis), and the like.

After the clinical raw data is acquired, all of it is subjected to data cleaning to obtain the data to be analyzed. The data cleaning methods involved can include: text information extraction using regular expressions, data grouping, data transposition, data standardization, data deduplication, missing value processing, outlier processing, data dimension reduction, variable calculation and the like.
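As an illustration of three of the cleaning operations named above (text extraction by regular expression, deduplication, and missing value processing), the following is a minimal sketch using only the Python standard library. The record layout, the field names (`patient_id`, `note`, `weight_kg`), the regular expression and the choice of median imputation are all assumptions made for the example; the embodiment does not specify them.

```python
import re
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"patient_id": "P001", "note": "warfarin 2.5 mg qd", "weight_kg": 62.0},
    {"patient_id": "P001", "note": "warfarin 2.5 mg qd", "weight_kg": 62.0},  # exact duplicate
    {"patient_id": "P002", "note": "warfarin 3 mg qd", "weight_kg": None},    # missing weight
    {"patient_id": "P003", "note": "warfarin 1.25 mg qd", "weight_kg": 70.0},
]

# Assumed pattern: a decimal number followed by the unit "mg".
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*mg")


def extract_dose_mg(note):
    """Return the first dose (in mg) found in free text, or None if absent."""
    match = DOSE_PATTERN.search(note)
    return float(match.group(1)) if match else None


def clean(records):
    # Deduplication: keep the first occurrence of each identical record.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # Missing value processing: impute missing weights with the observed median
    # (median imputation is an assumption, not a requirement of the method).
    observed = [r["weight_kg"] for r in unique if r["weight_kg"] is not None]
    fill_value = median(observed)
    for rec in unique:
        if rec["weight_kg"] is None:
            rec["weight_kg"] = fill_value
        # Text information extraction with a regular expression.
        rec["single_dose_mg"] = extract_dose_mg(rec["note"])
    return unique


cleaned = clean(raw_records)
```

In practice the imputation strategy and the extraction patterns would be chosen per variable; the sketch only shows where each operation sits in the cleaning pass.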

Step 102, performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables have a significant correlation with the dose of the therapeutic drug used per unit time.

The data to be analyzed includes continuous independent variables and categorical independent variables. Examples of continuous independent variables are age, test data and the dose of co-administered drugs. The categorical independent variables comprise binary independent variables and multi-class independent variables. Examples of binary independent variables are sex and whether a co-medication is used; an example of a multi-class independent variable is gene polymorphism information.

Single-factor analysis gives a first look at the relationship between each independent variable and the target variable and, when the sample size is not large, allows independent variables that are unrelated to the target variable to be removed. A single-factor analysis method is used to analyze whether each continuous independent variable and each categorical independent variable in the data to be analyzed has a significant correlation with the dose of the therapeutic drug used per unit time, and, according to the analysis result, the variables that do are taken as the significant single-factor variables. The dose of the therapeutic drug used per unit time is typically the daily dose of the therapeutic drug, although the invention is not limited thereto.

Step 103, performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the significant single-factor variables, and obtaining a binary independent variable data set based on the significant single-factor variables; constructing cross variables based on the binary independent variable data set by a feature engineering method to generate a cross variable data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, obtaining significant cross variables through stepwise regression screening based on the cross variable data set, wherein the significant cross variables have a significant correlation with the dose of the therapeutic drug used per unit time.

The continuous independent variables among the significant single-factor variables are first segmented and then one-hot encoded to obtain binary independent variables. Biochemical test indices can be segmented according to their normal reference intervals; age can be divided into minors, young and middle-aged adults, and the elderly; the remaining indices can be segmented at the upper and lower quartiles of their data distribution. For example, a red blood cell test index has a normal range, so the red blood cell test result can be divided into three segments: below the normal range, within the normal range, and above the normal range. After segmentation, one-hot encoding is performed. From the actual red blood cell test result, one variable records whether the result is below the normal range (for example, 1 for yes and 0 for no), another records whether the result is within the normal range (again 1 for yes and 0 for no), and a third records whether the result is above the normal range (1 for yes and 0 for no). The binary independent variables corresponding to the other continuous independent variables are obtained in the same way.

The categorical independent variables comprise binary independent variables and multi-class independent variables. Binary independent variables can be represented directly by 0 and 1. The categories of a multi-class independent variable already act as segments, so a multi-class independent variable only needs to be one-hot encoded to obtain the corresponding binary independent variables.

The binary independent variables obtained from the continuous independent variables, the binary independent variables obtained from the multi-class independent variables, and the categorical independent variables that were originally represented by 0 and 1 together form the binary independent variable data set.
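The red blood cell example can be sketched as follows. This is a minimal illustration in plain Python; the function names, the output column names and the reference interval bounds are hypothetical and chosen only for the example (the bounds are not clinical guidance).

```python
# Segment labels for an index that has a normal reference interval.
SEGMENTS = ("below_normal", "normal", "above_normal")


def segment(value, low, high):
    """Assign a value to one of three segments relative to [low, high]."""
    if value < low:
        return "below_normal"
    if value > high:
        return "above_normal"
    return "normal"


def one_hot_segment(name, value, low, high):
    """Segment a continuous value, then one-hot encode the segment as 0/1 variables."""
    label = segment(value, low, high)
    return {f"{name}_{s}": int(s == label) for s in SEGMENTS}


# Hypothetical reference interval, for illustration only.
RBC_LOW, RBC_HIGH = 4.0, 5.5

encoded = one_hot_segment("rbc", 3.6, RBC_LOW, RBC_HIGH)
# encoded == {"rbc_below_normal": 1, "rbc_normal": 0, "rbc_above_normal": 0}
```

Exactly one of the three binary variables is 1 for any measured value, which is what allows each segment to be crossed with other binary variables later.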
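A minimal sketch of assembling one row of the binary data set from variables that are already 0/1 and from a multi-class variable. The variable names (`sex_male`, `co_medication`, `genotype`) and the genotype levels are hypothetical placeholders, not values taken from the embodiment.

```python
def one_hot(name, value, categories):
    """One-hot encode a multi-class value into one 0/1 variable per category."""
    return {f"{name}_{c}": int(value == c) for c in categories}


def to_binary_row(row, multi_class_levels):
    """Build one row of the binary independent variable data set.

    `row` maps variable names to values; variables listed in
    `multi_class_levels` are one-hot encoded, all others are assumed
    to be 0/1 already and are copied through.
    """
    binary = {}
    for name, value in row.items():
        if name in multi_class_levels:
            binary.update(one_hot(name, value, multi_class_levels[name]))
        else:
            binary[name] = int(value)
    return binary


# Hypothetical category levels and patient record.
levels = {"genotype": ["AA", "AG", "GG"]}
patient = {"sex_male": 1, "co_medication": 0, "genotype": "AG"}

binary_row = to_binary_row(patient, levels)
```

Binary variables derived from segmented continuous variables would be merged into the same row in the same way, so that every column of the resulting data set takes only the values 0 and 1.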

Cross variables are constructed based on the binary independent variable data set by a feature engineering method (such as the Feature Tools method) to generate a cross variable data set. When feature engineering is used to generate the cross variables, a Cartesian product approach can be adopted, that is, the values of the two variables being crossed are multiplied: if both variable values are 1, the cross variable value is 1; otherwise it is 0. With the dose of the therapeutic drug used per unit time as the target variable, significant cross variables having a significant correlation with that dose are then screened out of the cross variable data set by stepwise regression (stepwise logistic regression).

Step 104, merging the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model based on the modeling data set by a preset machine learning algorithm, the prediction model being used for predicting the administration dose of the target individual.

The significant single-factor variables and the significant cross variables are merged to obtain the modeling data set. During modeling, the dose of the therapeutic drug used per unit time is taken as the target variable, and a prediction model is built on the modeling data set by a preset machine learning algorithm for predicting the administration dose of the target individual. For example, the modeling data set serves as the training data and the dose of the therapeutic drug used per unit time serves as the label. The preset machine learning algorithm can be selected according to actual needs.
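The multiplication rule for cross variables can be sketched in plain Python as below. This is a hand-written illustration of the pairwise product, not the API of the Feature Tools library; the function name, the `_x_` naming scheme and the example variables are assumptions.

```python
from itertools import combinations


def cross_variables(binary_row):
    """Create one cross variable per pair of binary variables.

    The cross value is the product of the two 0/1 values, so it is 1
    only when both crossed variables are 1.
    """
    crossed = {}
    for a, b in combinations(sorted(binary_row), 2):
        crossed[f"{a}_x_{b}"] = binary_row[a] * binary_row[b]
    return crossed


# Hypothetical binary variables for one patient.
row = {"age_elderly": 1, "rbc_below_normal": 1, "co_medication": 0}

crossed = cross_variables(row)
# crossed == {
#     "age_elderly_x_co_medication": 0,
#     "age_elderly_x_rbc_below_normal": 1,
#     "co_medication_x_rbc_below_normal": 0,
# }
```

With n binary variables this produces n(n-1)/2 cross variables, which is why a screening step such as stepwise regression is applied afterwards. A fuller version would likely skip pairs derived from the same original variable, since their product is always 0.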

Based on real-world patient clinical data, the embodiment of the invention uses statistical methods, feature engineering, stepwise regression, machine learning and the like to progressively construct and screen the important risk factors that influence the personalized dose of a drug, and builds a personalized clinical therapeutic dose prediction model for the drug. Real-world big data is characterized by high dimensionality and a high missing rate. The embodiment addresses both problems: the prediction model can be fitted well even when a large amount of data is missing, and the important variables that strongly influence the personalized dose can be identified quickly and effectively among the high-dimensional variables. Cross variables are constructed through feature engineering so that the interactions among variables are put to use. The prediction model is finally built from all the factors that may influence differences in a patient's drug dose, and a specific personalized dose prediction is obtained according to each patient's situation (such as different clinical characteristics and treatment conditions).

Further, based on the above embodiment, performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain the significant single-factor variables specifically comprises: performing a Pearson correlation test between each continuous independent variable and the dose of the therapeutic drug used per unit time, and judging whether the relationship between the continuous independent variable and that dose is significant; performing a Mann-Whitney U test between each categorical independent variable and the dose of the therapeutic drug used per unit time, and judging whether the relationship between the categorical independent variable and that dose is significant; and obtaining the significant single-factor variables from the continuous independent variables and the categorical independent variables whose relationship with the dose of the therapeutic drug used per unit time is significant.

Preliminary variable screening is performed by the single-factor analysis method to obtain the significant single-factor variables.

For a continuous independent variable, a Pearson correlation test against the target variable (the dose of the therapeutic drug used per unit time) is used to judge whether the relationship between the two is significant. The null hypothesis of the Pearson correlation test is that there is no significant correlation between the continuous independent variable and the target variable. If the null hypothesis is rejected, the continuous independent variable is considered to be significantly correlated with the target variable and is retained; otherwise it is deleted. When the correlation is significant, a correlation coefficient greater than 0 indicates a positive correlation between the two, and a correlation coefficient less than 0 indicates a negative correlation.

For a categorical independent variable, a Mann-Whitney U test against the target variable (the dose of the therapeutic drug used per unit time) is used to judge whether the relationship between the two is significant. The null hypothesis of the Mann-Whitney U test is that there is no significant difference between the target variable distributions of the two groups of data formed by grouping on the categorical independent variable. If the null hypothesis is rejected, the target variable distributions of the two groups are considered to differ significantly, the categorical independent variable is considered to have a significant influence on the target variable, and the variable is retained; if the influence is not significant, the variable is deleted.
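A minimal sketch of the quantities behind the Pearson correlation test, using only the standard library. It computes the correlation coefficient r and the test statistic t = r·sqrt((n-2)/(1-r²)), which follows a t distribution with n-2 degrees of freedom under the null hypothesis of no correlation. Converting t to a p-value is left to a statistics library. The sample data and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Test statistic for H0: no correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical data: age versus daily dose.
age = [30, 40, 50, 60, 70]
daily_dose = [5.0, 4.5, 4.0, 3.0, 2.5]

r = pearson_r(age, daily_dose)
t = t_statistic(r, len(age))
```

Here r is negative, so if the test rejects the null hypothesis the variable would be retained as negatively correlated with the dose. The significance level used for the decision is not stated in the embodiment and would have to be chosen by the analyst.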
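The Mann-Whitney U test for a binary grouping variable can be sketched as follows. This standard-library version uses the normal approximation for the two-sided p-value, without tie or continuity correction, which is a simplification: for small samples an exact test from a statistics library would be preferable. The data and function names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n1])
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Hypothetical daily doses grouped by a binary variable (e.g. co-medication yes/no).
dose_with = [2.0, 2.5, 3.0, 3.5]
dose_without = [4.0, 4.5, 5.0, 5.5]

u, p = mann_whitney_u(dose_with, dose_without)
```

A variable would be retained when p falls below the chosen significance level. The test as described compares two groups, so this sketch covers binary grouping variables only.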

The significant single factor variable is derived based on the continuous type independent variable and the categorical independent variable that are significant in relation to the dosage of the therapeutic agent used per unit time.

Based on the embodiment, the embodiment of the invention improves the rapidity and accuracy of obtaining the obvious single-factor variable by carrying out Pearson correlation test on the continuous independent variable and the dosage of the therapeutic drug in unit time and carrying out Mann-Whitney U test on the independent variable of the type and the dosage of the therapeutic drug in unit time.

Further, based on the above embodiment, the using the dose of the therapeutic drug per unit time as the target variable, based on the modeling dataset, constructing a prediction model by using a preset machine learning algorithm for predicting the dose of the target individual, specifically includes: constructing a tree-shaped prediction model by adopting an XGBoost algorithm based on the modeling data set by taking the dosage of the therapeutic drug in unit time as a target variable; in the model training process, auto-ml model automatic parameter adjustment is adopted, and a K-fold cross validation mode is adopted to optimize the model, so that the prediction model is obtained for predicting the administration dosage of the target individual.

Through verification, the prediction model constructed by the XGBoost algorithm has good training and prediction effects. Thus, the preset machine learning algorithm may be an XGBoost algorithm. And constructing a tree-shaped prediction model by adopting an XGBoost algorithm based on the modeling data set by taking the dosage of the therapeutic drug in unit time as a target variable.

During model training, the model parameters are tuned automatically with auto-ml and the model is optimized by K-fold cross-validation, thereby obtaining the prediction model used to predict the dose administered to the target individual. Taking five-fold cross-validation as an example, the model is optimized by five-fold cross-validation, the average of each evaluation index over the five folds is taken as the evaluation index of the model's predictive ability (R²/MSE/RMSE/MAE), and the trained model with the best prediction evaluation indexes is saved for prediction.

The specific procedure for auto-ml is as follows:

The goal of automated machine learning (AutoML) is to make decisions using an automated, data-driven approach. The user only needs to provide the data, and the automated machine learning system decides on the optimal scheme automatically. Under the hood, auto_ml uses tools such as scikit-learn, XGBoost, TensorFlow, Keras and LightGBM to keep execution efficient. Automated machine learning includes algorithm selection, hyperparameter optimization and neural network architecture search, and covers each step of the machine learning workflow:

Automatic data preparation.

Automatic feature selection.

Automatic algorithm selection.

Hyperparameter optimization.

Automatic pipeline/workflow construction.

Neural network architecture search.

Automatic model selection and ensemble learning.

The implementation process of the five-fold cross validation is as follows:

Five-fold cross-validation divides the data set into five equal parts. In each fold, one part is used as the test set and the remaining four parts are used as the training set to construct a prediction model; the test set is then used to evaluate the trained prediction model, and the prediction evaluation indexes on the test set are output for that fold. The final evaluation indexes of the model's predictive ability are the averages of the respective evaluation indexes over the five folds.
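The five-fold procedure and the averaging of R²/MSE/RMSE/MAE can be sketched as below. This is only an illustration under assumptions: the helper names are hypothetical, the shuffling seed is arbitrary, and a one-feature least-squares line stands in for the XGBoost model so that the example stays self-contained.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def regression_metrics(y_true, y_pred):
    """R2, MSE, RMSE and MAE of the predictions on one test fold."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1 - sse / ss_tot if ss_tot > 0 else float("nan"),
        "MSE": sse / n,
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
    }


def cross_validate(X, y, fit, predict, k=5):
    """Train on k-1 folds, evaluate on the held-out fold, average each metric."""
    per_fold = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = [predict(model, X[i]) for i in test_idx]
        per_fold.append(regression_metrics([y[i] for i in test_idx], predictions))
    return {name: sum(fold[name] for fold in per_fold) / k for name in per_fold[0]}


def fit_line(X, y):
    """Stand-in model (not XGBoost): least-squares line through one feature."""
    n = len(y)
    mean_x, mean_y = sum(X) / n, sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(X, y)) / sum(
        (a - mean_x) ** 2 for a in X
    )
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data: the dose is an exact linear function of one variable.
X = [float(v) for v in range(20)]
y = [3 * v + 2 for v in X]
scores = cross_validate(X, y, fit_line, predict_line)
```

Any model exposing a fit step and a predict step could be passed in place of `fit_line` and `predict_line`; the averaged dictionary in `scores` is what would be compared across candidate parameter settings.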

The XGBoost algorithm specifically comprises the following steps:

Input the target variable and the independent-variable data set, respectively.

Define the objective function (loss + regularization term).

■ Where loss = error of the previous tree (gradient); regularization term = complexity of the tree.

■ Optimizing the objective function requires that the prediction error be as small as possible and that the complexity of the trees be as low as possible.

Search for split points with a greedy method and construct a decision tree.

Enumerate the different candidate tree structures and select the scheme whose Gain value is the largest and exceeds a threshold.

Pruning: stop splitting if max(Gain) is less than the threshold.

Calculate the scores of the leaf nodes.

Update the decision tree sequence and store all constructed decision trees and their scores.

Calculate the prediction result of each sample, namely the sum of its scores over all trees, to obtain the probability that the sample belongs to each category.

Calculate the importance score of each variable and select the important variables that significantly affect the model. First, the Gini coefficients of each variable are calculated; the average of the Gini coefficients is the importance score of that variable.

Retain the important variables whose importance score is greater than 0.
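The greedy split search, the Gain threshold and the leaf score in the steps above can be illustrated with the standard XGBoost formulas: for the sums G and H of first- and second-order gradients in a node, the leaf score is -G / (H + lambda), and Gain = 1/2 [G_L²/(H_L + lambda) + G_R²/(H_R + lambda) - G²/(H + lambda)] - gamma. The sketch below is a simplified single-feature version under assumptions: squared-error loss, an initial prediction of 0, and hypothetical function names and toy data; it is not the XGBoost library itself.

```python
def leaf_weight(grad_sum, hess_sum, lam):
    """Optimal leaf score for a node with gradient sum G and hessian sum H."""
    return -grad_sum / (hess_sum + lam)


def best_split(x, grad, hess, lam=1.0, gamma=0.0):
    """Greedy exact search on one feature; returns (gain, threshold) or None.

    gamma plays the role of the Gain threshold: a split is accepted only if
    its raw gain exceeds gamma, otherwise splitting stops (pruning).
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    total_g, total_h = sum(grad), sum(hess)
    parent_score = total_g * total_g / (total_h + lam)
    left_g = left_h = 0.0
    best = None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        left_g += grad[i]
        left_h += hess[i]
        if x[i] == x[nxt]:
            continue  # cannot split between two equal feature values
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = 0.5 * (
            left_g * left_g / (left_h + lam)
            + right_g * right_g / (right_h + lam)
            - parent_score
        ) - gamma
        if best is None or gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best if best is not None and best[0] > 0 else None


# Hypothetical toy data: squared-error loss with initial prediction 0,
# so the gradient is (prediction - target) and the hessian is 1.
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
grad = [0.0 - t for t in target]
hess = [1.0] * len(target)
split = best_split(feature, grad, hess, lam=1.0, gamma=0.0)
```

With these values the best threshold lies between 3 and 10, and raising `gamma` far enough makes `best_split` return `None`, which corresponds to the pruning step.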

On the basis of the above embodiment, this embodiment of the invention constructs the prediction model with the XGBoost algorithm, tunes the model parameters automatically with auto-ml, and optimizes the model by K-fold cross-validation, which improves the accuracy of the prediction model.

FIG. 2 is a flow chart of a personalized medicine dosage prediction method provided by another embodiment of the invention. As shown in fig. 2, the method comprises the steps of:

1. Establishing a database

(1) Setting the target variable: the daily dose of the therapeutic drug. Each single administration order of the therapeutic drug is taken as one event, and valid medication events are screened according to the therapeutic window of the drug. The daily dose of the therapeutic drug is calculated as: daily dose = single dose.

(2) Cleaning the raw data: raw data are extracted from the hospital data system and cleaned. The data broadly comprise real-world patient clinical data such as demographic information, therapeutic drug use information, combined medication information, other adjuvant therapies, genetic polymorphisms, laboratory test information and diagnostic information. All data are cleaned and formatted, and a database for modeling is then constructed. The data cleaning methods involved may include: text information extraction using regular expressions, data grouping, data transposition, data standardization, data deduplication, missing value processing, outlier processing, data dimension reduction, variable calculation and the like.
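Two of the cleaning steps listed above, text information extraction with a regular expression and data deduplication, can be sketched as follows. The order-text format, the field names `patient_id` and `order_text`, and the unit handling are hypothetical assumptions made for illustration only; real hospital records would need their own patterns.

```python
import re

# Hypothetical pattern: a number followed by a unit of "mg" or "g".
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g)\b", re.IGNORECASE)


def extract_dose_mg(order_text):
    """Return the first dose found in the free text, in mg, or None if absent."""
    match = DOSE_PATTERN.search(order_text)
    if match is None:
        return None  # left as a missing value for later handling
    amount, unit = float(match.group(1)), match.group(2).lower()
    return amount * 1000 if unit == "g" else amount


def clean(records):
    """Drop exact duplicate orders and attach the extracted single dose."""
    seen, cleaned = set(), []
    for record in records:
        key = (record["patient_id"], record["order_text"])
        if key in seen:
            continue  # deduplication
        seen.add(key)
        cleaned.append(
            {
                "patient_id": record["patient_id"],
                "single_dose_mg": extract_dose_mg(record["order_text"]),
            }
        )
    return cleaned


# Hypothetical raw records.
raw = [
    {"patient_id": "P1", "order_text": "tacrolimus 0.5 mg po q12h"},
    {"patient_id": "P1", "order_text": "tacrolimus 0.5 mg po q12h"},
    {"patient_id": "P2", "order_text": "vancomycin 1 g ivgtt"},
    {"patient_id": "P3", "order_text": "dose not recorded"},
]
cleaned = clean(raw)
```

The duplicate order for `P1` is dropped, and the record without a parsable dose keeps `None` so that the missing-value step can decide how to treat it.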

2. Preliminary variable screening by single-factor analysis

(1) For continuous independent variables, a Pearson correlation test is performed between the continuous independent variable and the target variable to judge whether the relationship between them is significant. The null hypothesis of the Pearson correlation test is that there is no significant correlation between the continuous independent variable and the target variable. If the null hypothesis is rejected, the continuous independent variable and the target variable are considered significantly correlated and the variable is retained; otherwise the variable is deleted. A correlation coefficient greater than 0 indicates a significant positive correlation between the two, and a correlation coefficient less than 0 indicates a significant negative correlation between the two.

(2) For categorical independent variables, a Mann-Whitney U test is performed on the target variable, grouped by the categorical independent variable, to judge whether the relationship between the categorical independent variable and the target variable is significant. The null hypothesis of the Mann-Whitney U test is that the distribution of the target variable does not differ significantly between the two groups defined by the categorical independent variable. If the null hypothesis is rejected, the distributions of the target variable in the two groups are considered significantly different, the categorical independent variable significantly affects the target variable, and the variable is retained; if the effect is not significant, the variable is deleted.

3. Constructing cross variables by feature engineering, and screening cross variables by stepwise logistic regression (stepwise regression)

(1) Based on the significant single-factor variables screened in part 2, the continuous variables among them are segmented and one-hot encoded: biochemical test indexes are segmented according to their normal reference intervals; age is segmented into minors, young and middle-aged adults, and the elderly; the remaining indexes are segmented at the upper and lower quartiles of their data distribution.

(2) Based on the segmented data set from (1) and the original binary independent variables among the categorical independent variables, feature engineering is performed by the Feature Tools method to construct cross variables and generate a cross-variable data set in which all variables are binary independent variables. When cross variables are generated by feature engineering, a Cartesian product is used, that is, the two crossed variables are multiplied numerically: if both variables have the value 1, the cross variable has the value 1; otherwise the cross variable has the value 0.

(3) Based on the cross-variable data set from (2), with "daily dose of the therapeutic drug" as the target variable, the cross variables that affect the daily dose of the therapeutic drug are screened by stepwise regression.
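Steps (1) to (3) above can be sketched end to end as follows. This is a minimal illustration under assumptions rather than the Feature Tools implementation: the cut points, variable names and toy data are hypothetical, the crossing is the pairwise product of binary variables described in (2), and the screening is a forward-only stepwise variant that admits the variable with the largest reduction in residual sum of squares while its partial F statistic is at least `f_to_enter` (an assumed threshold of 4.0).

```python
from itertools import combinations


def binarize(value, cut_points, name):
    """Segment a continuous value at the cut points and one-hot encode the segment."""
    segment = sum(value > c for c in cut_points)  # index of the segment containing value
    return {f"{name}_seg{i}": int(i == segment) for i in range(len(cut_points) + 1)}


def cross(row):
    """Multiply each pair of binary variables: the product is 1 only if both are 1."""
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(sorted(row), 2)}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_stepwise(columns, y, f_to_enter=4.0):
    """Forward selection by least squares using an orthogonalized basis."""
    n = len(y)
    basis = [[1.0] * n]  # orthogonal basis of the current model (intercept first)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]  # residual of y given the current model
    selected = []
    while True:
        best = None
        for name, col in columns.items():
            if name in selected:
                continue
            x = [float(v) for v in col]
            for q in basis:  # remove what the current model already explains
                coef = dot(x, q) / dot(q, q)
                x = [a - coef * b for a, b in zip(x, q)]
            norm = dot(x, x)
            if norm < 1e-12:
                continue  # constant or collinear column
            drop = dot(x, resid) ** 2 / norm  # decrease in residual sum of squares
            if best is None or drop > best[0]:
                best = (drop, name, x)
        dof = n - len(basis) - 1  # residual degrees of freedom after adding one term
        if best is None or dof <= 0:
            break
        drop, name, x = best
        rss_after = dot(resid, resid) - drop
        if drop * dof / max(rss_after, 1e-12) < f_to_enter:
            break  # no remaining variable is significant enough to enter
        coef = dot(x, resid) / dot(x, x)
        resid = [r - coef * a for r, a in zip(resid, x)]
        basis.append(x)
        selected.append(name)
    return selected


# Hypothetical toy data: the dose is raised only when a and b are both 1.
rows = [{"a": a, "b": b, "c": c} for a in (0, 1) for b in (0, 1) for c in (0, 1)]
crossed = [cross(r) for r in rows]
columns = {name: [r[name] for r in crossed] for name in crossed[0]}
daily_dose = [10 + 5 * r["a"] * r["b"] for r in rows]
selected = forward_stepwise(columns, daily_dose)
```

A classical stepwise procedure would also re-test variables already in the model for removal; that backward step is omitted here to keep the sketch short.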

4. Constructing the personalized medicine dosage prediction model based on the XGBoost algorithm

(1) Combining the significant single-factor variable screened in the 2 nd part with the significant cross variable screened in the 3 rd part to construct a new modeling data set;

(2) A tree-based prediction model is constructed with the XGBoost algorithm, taking the daily dose of the therapeutic drug as the target variable. During model training, the model parameters are first tuned automatically with auto-ml, and the model is then optimized by five-fold cross-validation. The evaluation indexes of the model's predictive ability (R²/MSE/RMSE/MAE) are the averages of the respective indexes over the five folds, and the trained model with the best prediction evaluation indexes is saved for prediction.

Based on real-world patient clinical data such as demographic information, therapeutic drug use information, combined medication information, other adjuvant therapies, genetic polymorphisms, laboratory test information and diagnostic information, this embodiment of the invention constructs and screens, step by step, the important risk factors that affect the personalized dose of a drug, using statistical methods, cross-validation, feature engineering (Feature Tools), stepwise regression and similar methods. An XGBoost tree model is then constructed from the screened important factors to predict the personalized drug dose for different patients. Real-world big data are characterized by high dimensionality and a high missing rate. The embodiment addresses both problems effectively: the prediction model can be fitted well even when the data contain a large number of missing values; important variables with a large effect on the personalized drug dose can be found quickly and effectively among high-dimensional variables; and cross variables can be constructed by feature engineering so that the interactions among variables are put to reasonable use. The prediction model is thus constructed by making comprehensive use of all factors that may affect dose differences between patients, and a specific drug dose prediction is obtained according to each patient's clinical characteristics and treatment conditions. In addition, because the method is based on real-world clinical data, problems of traditional personalized dose prediction studies, such as too few patients and insufficient treatment duration, can be effectively avoided.

The embodiment of the invention introduces machine learning (the XGBoost algorithm), feature engineering (Feature Tools) and stepwise regression, and has the following advantages:

1. XGBoost can process large-scale medical data, uses little memory, trains quickly, is capable of autonomous learning and incremental learning, and yields higher model accuracy;

2. Model construction allows missing values, which effectively overcomes the large number of missing values in real-world data sets and makes the fullest use of real-world data;

3. The method can process data with many variables, screen important variables by machine learning, feature engineering, stepwise regression and similar methods, construct significant and effective cross variables, and handle high-dimensional data sets effectively;

4. A large number of cross variables can be generated and processed at the same time, and the crossing mode can be chosen according to actual needs (for example, a Cartesian addition mode may be adopted).

Fig. 3 is a schematic structural diagram of a personalized medicine dosage prediction device according to an embodiment of the present invention. As shown in Fig. 3, the device includes a data acquisition module 10, a significant single-factor variable acquisition module 20, a significant cross variable acquisition module 30 and a prediction model construction module 40, wherein: the data acquisition module 10 is configured to: acquire raw clinical data related to a target individual, and clean the raw clinical data to obtain data to be analyzed; the significant single-factor variable acquisition module 20 is configured to: perform preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables are significantly correlated with the dose of the therapeutic drug used per unit time; the significant cross variable acquisition module 30 is configured to: perform segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, perform one-hot encoding on the multi-class independent variables among the categorical independent variables in the significant single-factor variables, and obtain a binary independent-variable data set based on the significant single-factor variables; construct cross variables from the binary independent-variable data set by a feature engineering method to generate a cross-variable data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, obtain significant cross variables by stepwise regression screening of the cross-variable data set, wherein the significant cross variables are significantly correlated with the dose of the therapeutic drug used per unit time; the prediction model construction module 40 is configured to: combine the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, construct a prediction model from the modeling data set using a preset machine learning algorithm, the prediction model being used to predict the dose administered to the target individual.

Based on real-world patient clinical data, this embodiment of the invention constructs and screens, step by step, the important risk factors that affect the personalized dose of a drug, using statistical methods, feature engineering, stepwise regression, machine learning and similar methods, and constructs a prediction model for the personalized clinical treatment dose of the drug. Real-world big data are characterized by high dimensionality and a high missing rate. The embodiment addresses both problems effectively: the prediction model can be fitted well even when a large amount of data is missing; important variables with a large effect on the personalized drug dose can be found quickly and effectively among high-dimensional variables; and cross variables can be constructed by feature engineering so that the interactions among variables are put to reasonable use. The prediction model is thus constructed by making comprehensive use of all factors that may affect dose differences between patients, and a specific personalized drug dose prediction is obtained according to each patient's clinical characteristics and treatment conditions.

Further, the significant single-factor variable acquisition module 20 is specifically configured to: perform a Pearson correlation test between the continuous independent variable and the dose of the therapeutic drug used per unit time, and judge whether the relationship between them is significant; perform a Mann-Whitney U test between the categorical independent variable and the dose of the therapeutic drug used per unit time, and judge whether the relationship between them is significant; and obtain the significant single-factor variables from the continuous independent variables and the categorical independent variables that are significantly related to the dose of the therapeutic drug used per unit time, wherein the significant single-factor variables are significantly correlated with the dose of the therapeutic drug used per unit time.

Further, based on the above embodiment, when constructing, with the dose of the therapeutic drug used per unit time as the target variable and based on the modeling data set, a prediction model for predicting the dose administered to the target individual using a preset machine learning algorithm, the prediction model construction module 40 is specifically configured to:

construct a tree-based prediction model from the modeling data set using the XGBoost algorithm, with the dose of the therapeutic drug used per unit time as the target variable; and, during model training, tune the model parameters automatically with auto-ml and optimize the model by K-fold cross-validation, thereby obtaining the prediction model used to predict the dose administered to the target individual.

The device provided by the embodiment of the invention is used for the method, and specific functions can refer to the flow of the method and are not repeated here.

Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the invention. As shown in Fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform the following method: acquiring raw clinical data related to a target individual, and cleaning the raw clinical data to obtain data to be analyzed; performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables are significantly correlated with the dose of the therapeutic drug used per unit time; performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the categorical independent variables in the significant single-factor variables, and obtaining a binary independent-variable data set based on the significant single-factor variables; constructing cross variables from the binary independent-variable data set by a feature engineering method to generate a cross-variable data set; taking the dose of the therapeutic drug used per unit time as the target variable, and obtaining significant cross variables by stepwise regression screening of the cross-variable data set, wherein the significant cross variables are significantly correlated with the dose of the therapeutic drug used per unit time; combining the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model from the modeling data set using a preset machine learning algorithm, the prediction model being used to predict the dose administered to the target individual.

Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided in the above embodiments, for example including: acquiring raw clinical data related to a target individual, and cleaning the raw clinical data to obtain data to be analyzed; performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables are significantly correlated with the dose of the therapeutic drug used per unit time; performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the categorical independent variables in the significant single-factor variables, and obtaining a binary independent-variable data set based on the significant single-factor variables; constructing cross variables from the binary independent-variable data set by a feature engineering method to generate a cross-variable data set; taking the dose of the therapeutic drug used per unit time as the target variable, and obtaining significant cross variables by stepwise regression screening of the cross-variable data set, wherein the significant cross variables are significantly correlated with the dose of the therapeutic drug used per unit time; combining the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model from the modeling data set using a preset machine learning algorithm, the prediction model being used to predict the dose administered to the target individual.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A personalized medicine dosage prediction method, comprising:

acquiring raw clinical data related to a target individual, and cleaning the raw clinical data to obtain data to be analyzed;

performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables are significantly correlated with the dose of the therapeutic drug used per unit time;

performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the categorical independent variables in the significant single-factor variables, and obtaining a binary independent-variable data set based on the significant single-factor variables; constructing cross variables from the binary independent-variable data set by a feature engineering method to generate a cross-variable data set; taking the dose of the therapeutic drug used per unit time as the target variable, and obtaining significant cross variables by stepwise regression screening of the cross-variable data set, wherein the significant cross variables are significantly correlated with the dose of the therapeutic drug used per unit time;

combining the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model from the modeling data set using a preset machine learning algorithm, the prediction model being used to predict the dose administered to the target individual.

2. The personalized medicine dosage prediction method according to claim 1, wherein performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables specifically comprises:

performing a Pearson correlation test between the continuous independent variable and the dose of the therapeutic drug used per unit time, and judging whether the relationship between the continuous independent variable and the dose of the therapeutic drug used per unit time is significant;

performing a Mann-Whitney U test between the categorical independent variable and the dose of the therapeutic drug used per unit time, and judging whether the relationship between the categorical independent variable and the dose of the therapeutic drug used per unit time is significant;

3. The personalized medicine dosage prediction method according to claim 1, wherein constructing, with the dose of the therapeutic drug used per unit time as the target variable and based on the modeling data set, a prediction model for predicting the dose administered to the target individual using a preset machine learning algorithm specifically comprises:

4. The personalized medicine dosage prediction method of claim 1, wherein the raw clinical data comprise at least one of demographic data, therapeutic drug use data, combined medication data, adjuvant therapy data, genetic polymorphism data, laboratory test data and diagnostic data.

5. The personalized medicine dosage prediction method of claim 1, wherein the data cleaning comprises at least one of text information extraction using regular expressions, data grouping, data transposition, data standardization, data deduplication, missing value processing, outlier processing, data dimension reduction and variable calculation.

6. The personalized medicine dosage prediction method of claim 1, wherein the feature engineering method comprises a Feature Tools method.

7. The personalized medicine dosage prediction method of claim 1, wherein the dose of the therapeutic drug used per unit time comprises a daily dose of the therapeutic drug.

8. A personalized medicine dosage prediction device, comprising:

a data acquisition module for: acquiring raw clinical data related to a target individual, and cleaning the raw clinical data to obtain data to be analyzed;

a significant single-factor variable acquisition module for: performing preliminary variable screening on the continuous independent variables and the categorical independent variables in the data to be analyzed by a single-factor analysis method to obtain significant single-factor variables, wherein the significant single-factor variables are significantly correlated with the dose of the therapeutic drug used per unit time;

a significant cross variable acquisition module for: performing segmentation and one-hot encoding on the continuous independent variables among the significant single-factor variables, performing one-hot encoding on the multi-class independent variables among the categorical independent variables in the significant single-factor variables, and obtaining a binary independent-variable data set based on the significant single-factor variables; constructing cross variables from the binary independent-variable data set by a feature engineering method to generate a cross-variable data set; taking the dose of the therapeutic drug used per unit time as the target variable, and obtaining significant cross variables by stepwise regression screening of the cross-variable data set, wherein the significant cross variables are significantly correlated with the dose of the therapeutic drug used per unit time;

a prediction model construction module for: combining the significant single-factor variables and the significant cross variables to obtain a modeling data set; and, taking the dose of the therapeutic drug used per unit time as the target variable, constructing a prediction model from the modeling data set using a preset machine learning algorithm, the prediction model being used to predict the dose administered to the target individual.

9. An electronic device for personalized medicine dosage prediction comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the personalized medicine dosage prediction method of any one of claims 1 to 7.

10. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the personalized medicine dosage prediction method of any of claims 1 to 7.