CN117438090A

CN117438090A - Drug-induced immune thrombocytopenia toxicity prediction model, method and system

Info

Publication number: CN117438090A
Application number: CN202311726212.3A
Authority: CN
Inventors: 聂晓璐; 詹思延; 倪鑫; 孙凤; 彭晓霞
Original assignee: Peking University; Beijing Childrens Hospital
Current assignee: Peking University; Beijing Childrens Hospital
Priority date: 2023-12-15
Filing date: 2023-12-15
Publication date: 2024-01-23
Anticipated expiration: 2043-12-15
Also published as: CN117438090B

Abstract

The invention provides a drug-induced immune thrombocytopenia (DIIT) toxicity prediction model, method and system in the technical field of computational chemistry. The method for constructing the prediction model mainly comprises the following steps: integrating source databases of drug-induced thrombocytopenia to construct a first database; cleaning the data and screening the drugs that meet the inclusion and exclusion criteria to obtain a second database; generating multiple types of molecular descriptors; randomly dividing the compounds in the second database into a training set and an external test set according to a preset ratio; constructing, training and testing QSAR classification models using different machine learning algorithms; and ranking the models by their evaluation parameters and combining the top N QSAR classification models into a consensus model for predicting whether a drug or compound has DIIT toxicity. The method can reliably predict whether a drug or compound has DIIT toxicity, improves the development efficiency of related drugs, and reduces the risk of using them.

Description

Drug-induced immune thrombocytopenia toxicity prediction model, method and system

Technical Field

The invention relates to the technical field of computational chemistry, in particular to a drug-induced immune thrombocytopenia toxicity prediction model, method and system.

Background

Drugs often carry some toxicity, which can cause drug-induced thrombocytopenia (DITP). Drug-induced immune thrombocytopenia (DIIT) is an adverse drug reaction that is easily overlooked: it tends to go unnoticed by the patient in the early stage but can cause great harm later, and if it is not recognized and treated in time it can lead to serious bleeding complications or even death. Therefore, if it could be predicted during drug development or use whether a drug has immune thrombocytopenic toxicity (DIIT toxicity, i.e., high DIIT risk), this would greatly help to avoid an increased risk of complications such as spontaneous bleeding and traumatic bleeding in patients. Moreover, if clinicians or pharmacists could identify drugs that carry a DIIT risk as early as possible and monitor platelets during medication, clinical medication safety would be further improved.

At present, only a few specialized hematology laboratories can detect the drug-dependent platelet antibodies produced when DIIT occurs. In addition, detecting drug-dependent platelet antibodies takes a long time, is costly, and places high demands on experimental conditions, so it is difficult to apply widely. Although various machine learning models (including deep learning models such as neural networks and graph neural networks) have been used to predict drug toxicity with some success, research on DIIT toxicity remains a blank.

Disclosure of Invention

The invention aims to provide a drug-induced immune thrombocytopenia toxicity prediction model, method and system for solving at least one of the technical problems in the prior art.

In order to solve the above technical problems, the present invention provides a drug-induced immune thrombocytopenia toxicity prediction model, and the method for constructing the prediction model includes the following steps:

step 1, integrating source databases of multiple drug-induced thrombocytopenia to construct a first database;

step 2, cleaning the data and screening the drugs that meet the inclusion and exclusion criteria for drug-induced immune thrombocytopenia to obtain a second database; the second database comprises structural alert information (such as alerting groups) and a DIIT toxicity label used to mark the risk of drug-induced immune thrombocytopenia toxicity;

step 3, generating molecular descriptors based on the SMILES code of each compound, so that the two-dimensional structure of the compound is converted into a character string that stores the structure and chemical properties of the compound and can subsequently be computed on, stored in and retrieved from the second database; the types of molecular descriptors comprise one-hot encodings, molecular graph encodings, molecular fingerprints, physicochemical descriptors and the like;

step 4, dividing the drugs or compounds in the second database into a training set and an external test set by stratified random sampling according to a preset ratio, so that DIIT-positive (high-risk) and DIIT-negative (low-risk) drugs or compounds are allocated to the training set and the external test set in the same proportion; calculating the distributions of molecular weight (MW) and octanol-water partition coefficient (XLogP) for the training set and the external test set, and evaluating the chemical space coverage of the two sets;

step 5, for each type of molecular descriptor, constructing and training QSAR classification models for predicting the DIIT toxicity of a drug or compound using the machine learning algorithms suited to that descriptor type, specifically: for physicochemical descriptors and molecular fingerprints, constructing QSAR classification models using algorithms such as support vector machine (SVM), random forest (RF) and extreme gradient boosting (eXtreme Gradient Boosting, XGBoost); for one-hot encodings, constructing a QSAR classification model using a deep learning recurrent neural network (RNN) algorithm; for molecular graph encodings, constructing a QSAR classification model using a deep learning graph convolutional network (GCN) algorithm;

QSAR refers to the quantitative structure-activity relationship, which describes the relationship between the molecular structure of a molecule and a given biological activity;

step 6, calculating the evaluation parameters of each QSAR classification model and evaluating its prediction performance; ranking the models by the evaluation parameters and combining the top N QSAR classification models into a consensus model, where N is an odd number; the consensus model outputs the prediction result for the DIIT toxicity of the drug or compound by majority vote;

by the method, a deep learning model for predicting the drug-induced immune thrombocytopenia toxicity can be comprehensively and effectively constructed and used for subsequent prediction.
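As an illustration of the one-hot encoding named in step 3, the following is a minimal sketch that encodes a SMILES string character by character. The function names, the padding length `max_len` and the character-level tokenization are assumptions made for this example; character-level tokens split two-letter atoms such as Cl and Br, which a production tokenizer would need to handle.

```python
def build_vocab(smiles_list):
    """Map every distinct character seen in the SMILES strings to a column index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len):
    """Return a max_len x len(vocab) matrix; rows past the end of the string stay all-zero."""
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][vocab[ch]] = 1  # exactly one bit is set per character
    return matrix


smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(smiles)
encoded = one_hot_encode("CCO", vocab, max_len=8)
```

Each row of `encoded` has at most one bit set, matching the "only one bit valid at any time" property of one-hot encoding.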

In a possible embodiment, the inclusion and exclusion criteria in step 2 include:

criterion 1, screening out antineoplastic and immunomodulating agents, thereby excluding common drugs recognized as causing non-immune thrombocytopenia (due to myelosuppression), such as drugs whose ATC code is in class L01;

criterion 2, screening out molecules such as proteins, antibodies, polymers, gene preparations and simple molecules;

criterion 3, screening out salt forms and retaining the skeleton fragment of the main active ingredient;

criterion 4, for complexes with multiple molecules, retaining the largest skeleton fragment;

criterion 5, screening out drugs with duplicate molecular formulas.

These criteria effectively reduce noise and redundant information in the data, reduce the scale of subsequent data processing, and improve its efficiency.
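The criteria above can be sketched as a simple record filter. This is only an illustration under stated assumptions: the record fields (`atc`, `kind`, `formula`, `smiles`), the ATC-like codes and the drug names are hypothetical, and the largest fragment is approximated by the longest dot-separated SMILES component, a crude proxy for atom count.

```python
EXCLUDED_KINDS = {"protein", "antibody", "polymer", "gene preparation", "simple molecule"}


def largest_fragment(smiles):
    """Criteria 3 and 4: keep the longest dot-separated component (crude size proxy)."""
    return max(smiles.split("."), key=len)


def apply_criteria(records):
    kept, seen_formulas = [], set()
    for rec in records:
        if rec["atc"].startswith("L01"):      # criterion 1: ATC class L01
            continue
        if rec["kind"] in EXCLUDED_KINDS:     # criterion 2: non-small-molecule entries
            continue
        if rec["formula"] in seen_formulas:   # criterion 5: duplicate molecular formula
            continue
        seen_formulas.add(rec["formula"])
        kept.append({**rec, "smiles": largest_fragment(rec["smiles"])})
    return kept


# Hypothetical toy records, not taken from any source database.
records = [
    {"name": "drug_a", "atc": "L01XX00", "kind": "small molecule", "formula": "C2H6O", "smiles": "CCO"},
    {"name": "drug_b", "atc": "B01XX00", "kind": "polymer", "formula": "n/a", "smiles": "CC"},
    {"name": "drug_c", "atc": "N02XX00", "kind": "small molecule", "formula": "C2H3NaO2", "smiles": "CC(=O)[O-].[Na+]"},
    {"name": "drug_d", "atc": "N02XX00", "kind": "small molecule", "formula": "C2H3NaO2", "smiles": "CC(=O)[O-].[Na+]"},
]
filtered = apply_criteria(records)
```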

In a possible implementation manner, the first database construction process in the step 1 specifically includes:

step 11, selecting existing drug information databases as source databases, comprising: the SIDER 4.1 database, the OnSIDES v2.0.0 database, the marketed drug catalog database (China), the DITP (drug-induced thrombocytopenia) literature library, the DIIT antibody (DDAbs) detection database (American hematology laboratory) and the OffSIDES database (a database of potential safety risk signals compiled by signal mining of the US FDA spontaneous reporting system);

step 12, converting the trade names of the drugs in the source databases into generic names; excluding non-drug categories (e.g., dietary supplements, foods, etc.); removing drugs for external use only; thereby narrowing the data to drugs for oral and/or intravenous use while reducing the influence of drug metabolism;

step 13, merging the source databases: first merging the source databases based on drug labels, such as the SIDER 4.1 database, the OnSIDES v2.0.0 database and the drug catalog database; then merging, in sequence, the DIIT antibody (DDAbs) detection database, the DITP literature library, the OffSIDES database and the like; when merging databases, identifying overlapping drugs and checking whether their thrombocytopenia risk attributes are consistent: if they are consistent, the duplicate entry is deleted; if not, the overlapping entry is retained; this keeps the first database compact while supplementing it as completely as possible with differences in label information and with post-marketing case information, so that the first database contains comprehensive drug information about thrombocytopenia risk;

step 14, screening the active ingredients of the drugs in the first database: if an active ingredient corresponds to different acids or salts, retaining the parent drug of the active ingredient and deleting the corresponding acid salts or metal salts;

step 15, deleting standalone metal drugs and combination drugs, so that the first database is suitable for subsequent high-throughput processing.

Through the above steps, a first database including comprehensive information of drug-induced thrombocytopenia toxicity can be formed.
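The consistency check used when merging source databases in step 13 can be sketched as follows. The record shape `(generic_name, risk_label)` and the toy entries are assumptions for illustration; an overlapping drug with the same label is dropped as a duplicate, while an overlapping drug with a conflicting label is retained.

```python
def merge_sources(sources):
    """Merge source lists in order, keeping one entry per (drug, risk label) pair."""
    merged, seen = [], set()
    for source in sources:
        for name, label in source:
            key = (name.lower(), label)
            if key in seen:
                continue  # overlapping drug with a consistent label: drop the duplicate
            seen.add(key)
            merged.append(key)  # new drug, or overlapping drug with a conflicting label
    return merged


# Hypothetical toy sources; 1 = thrombocytopenia risk reported, 0 = not reported.
label_based = [("drug_a", 1), ("drug_b", 0)]
antibody_tests = [("Drug_A", 1), ("drug_b", 1)]
first_database = merge_sources([label_based, antibody_tests])
```

Keeping conflicting entries side by side preserves the disagreement between sources for later review instead of silently resolving it.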

In a possible embodiment, the preset ratio in step 4 is a DIIT positive drug or compound: DIIT negative drug or compound = 4:1.
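A stratified random split of the kind described in step 4 can be sketched as below. The training fraction of 0.8, the seed and the toy label vector are assumptions for illustration only; the point is that each class is shuffled and cut separately, so both sets keep the same positive-to-negative proportion.

```python
import random


def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices so each class appears in both sets in the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 10 + [0] * 20  # toy labels: 10 DIIT positive, 20 DIIT negative
train, test = stratified_split(labels, train_fraction=0.8)
```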

In a possible embodiment, the molecular descriptor in step 5 is specifically shown in the following table:

wherein, chiral refers to the characteristic that chiral molecules do not coincide with their mirror images.

In a possible implementation, each machine learning algorithm in step 5 is tested with the MACCS, ECFP4, CORINA, RDKit, MACCS+CORINA and ECFP4+CORINA descriptors, in order to select the machine learning algorithm best suited to each molecular descriptor.

In a possible implementation manner, the training of each QSAR classification model in step 5 further includes hyperparameter optimization so as to obtain the optimal parameters, and the specific optimization method and the optimal parameters are shown in the following table:

In a possible embodiment, the step 6 includes:

step 61, constructing a 2×2 contingency table (confusion matrix) from the actual DIIT toxicity of the drugs or compounds in the second database and the prediction results of the QSAR classification model, with the cells defined as follows:

TP (True Positive) denotes true positives, i.e., the number of samples in which the drug or compound is actually DIIT positive and is predicted to be DIIT positive by the QSAR classification model; TN (True Negative) denotes true negatives, i.e., the number of samples in which the drug or compound is actually DIIT negative and is predicted to be DIIT negative by the QSAR classification model; FP (False Positive) denotes false positives, i.e., the number of samples in which the drug or compound is actually DIIT negative and is predicted to be DIIT positive by the QSAR classification model; FN (False Negative) denotes false negatives, i.e., the number of samples in which the drug or compound is actually DIIT positive and is predicted to be DIIT negative by the QSAR classification model;

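step 62, calculating the sensitivity (SE), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC) of each QSAR classification model, where the specific formulas may be:

$$\mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}};$$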

the sensitivity reflects the prediction accuracy of the QSAR classification model on DIIT-positive samples; the specificity reflects its prediction accuracy on DIIT-negative samples; the accuracy and the Matthews correlation coefficient each reflect the overall prediction performance of the QSAR classification model, and the closer the value is to 1, the better the prediction performance;

Step 63, selecting the N QSAR classification models whose Matthews correlation coefficients are closest to 1 and combining them into a consensus model, where N is an odd number; counting the prediction results of the N QSAR classification models and outputting the prediction result with the highest count.
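Steps 62 and 63 can be sketched with the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient. The model names and confusion-matrix counts below are hypothetical toy values, and returning an MCC of 0 when the denominator is zero is a convention assumed for this sketch.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute SE, SP, ACC and MCC from the four confusion-matrix counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention for denom == 0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


def top_n_by_mcc(confusions, n=3):
    """Rank models by MCC (highest first) and return the names of the top n."""
    scored = {name: evaluate(*counts)["MCC"] for name, counts in confusions.items()}
    return sorted(scored, key=scored.get, reverse=True)[:n]


# Hypothetical counts in the order (TP, TN, FP, FN).
confusions = {
    "svm": (40, 45, 5, 10),
    "rf": (45, 45, 5, 5),
    "xgb": (30, 30, 20, 20),
    "rnn": (25, 25, 25, 25),
}
selected = top_n_by_mcc(confusions, n=3)
```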

In one possible embodiment, step 64 is further included: constructing the ROC curves of the N QSAR classification models, calculating the area under the curve (AUC) of each, and selecting the M QSAR classification models whose AUC values are closest to 1 to be combined into a consensus model, where M is an odd number; counting the prediction results of the M QSAR classification models and outputting the prediction result with the highest count.

In a possible implementation manner, the prediction result in the step 6 includes: prediction probability and prediction category: when the prediction probability is greater than 0.5, the prediction category is DIIT positive (high risk); when the prediction probability is less than 0.5, then the prediction class is DIIT negative (low risk).

In a possible embodiment, further comprising step 7: optimizing and screening for different types of molecular descriptors, specifically comprising the following steps:

the molecular fingerprint is subjected to variance screening, and the method specifically comprises the following steps:

step a1, calculating the variance of every molecular fingerprint bit; starting from a variance threshold of 0 and increasing it by 0.01 each time, building a QSAR classification model and calculating its MCC value at each threshold until the preset variance is reached;

step a2, selecting the molecular fingerprint with the largest MCC value.

By this method, the molecular fingerprint descriptor with the best prediction performance can be selected.
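The variance screening of steps a1 and a2 can be sketched as a threshold sweep. Several details here are assumptions for illustration: bits are kept when their variance is strictly greater than the threshold, the sweep stops once no bits remain, and `score_fn` is a hypothetical callable standing in for "train a QSAR classification model on these bits and return its MCC".

```python
from statistics import pvariance


def select_bits(fingerprints, threshold):
    """Return the indices of fingerprint bits whose variance exceeds the threshold."""
    columns = list(zip(*fingerprints))
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def sweep_thresholds(fingerprints, score_fn, max_threshold, step=0.01):
    """Raise the variance threshold from 0 in fixed steps and keep the best-scoring bit set."""
    best = None
    for k in range(int(round(max_threshold / step)) + 1):
        threshold = k * step
        bits = select_bits(fingerprints, threshold)
        if not bits:
            break  # nothing left to model on
        score = score_fn(bits)
        if best is None or score > best[0]:
            best = (score, threshold, bits)
    return best


fps = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 1]]  # toy fingerprints
# Placeholder score that prefers two-bit subsets; a real run would return a model's MCC.
best = sweep_thresholds(fps, score_fn=lambda bits: -abs(len(bits) - 2), max_threshold=0.25)
```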

In a possible implementation manner, the step 7 further includes performing Pearson correlation optimization screening on the physicochemical descriptors, and specifically includes:

step b1, calculating the standard deviation corresponding to the structural alert information in each physicochemical descriptor;

step b2, deleting any physicochemical descriptor whose standard deviation is 0;

step b3, calculating the Pearson correlation coefficient between every pair of descriptors and screening as follows: if the absolute value of the Pearson correlation coefficient of a pair exceeds a preset threshold, retaining only the descriptor of the pair that is more strongly correlated with DIIT toxicity;

step b4, calculating the Pearson correlation coefficient between each physicochemical descriptor and DIIT toxicity, and deleting the descriptor if the absolute value of the Pearson correlation coefficient is smaller than a preset threshold.

By this method, the physicochemical descriptors with the best prediction performance can be selected.
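Steps b1 to b4 can be sketched in plain Python as below. The descriptor names, toy values and both thresholds are hypothetical; descriptors are visited in order of decreasing correlation with the label so that, within a highly correlated pair, the more label-relevant descriptor is the one retained.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def screen_descriptors(columns, labels, pair_threshold, label_threshold):
    # b1-b2: drop descriptors with zero standard deviation
    kept = {name: col for name, col in columns.items() if pstdev(col) > 0}
    relevance = {name: abs(pearson(col, labels)) for name, col in kept.items()}
    # b3: within each highly correlated pair, keep the descriptor more correlated with the label
    for a in sorted(kept, key=relevance.get, reverse=True):
        if a not in kept:
            continue
        for b in list(kept):
            if b != a and abs(pearson(kept[a], kept[b])) > pair_threshold:
                del kept[b]
    # b4: drop descriptors only weakly correlated with the label
    return [name for name in kept if relevance[name] >= label_threshold]


labels = [0, 0, 1, 1, 0, 1]  # toy DIIT labels
columns = {
    "mw": [1, 2, 5, 6, 2, 5],
    "mw_noisy": [1, 2, 5, 6, 3, 4],
    "const": [3, 3, 3, 3, 3, 3],
    "noise": [1, 2, 1, 2, 2, 1],
}
selected = screen_descriptors(columns, labels, pair_threshold=0.9, label_threshold=0.5)
```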

In a possible implementation manner, the step 7 further includes descriptor combination screening, which specifically includes:

Step c1, ranking all descriptors by recursive feature elimination or by information gain (IG);

step c2, training QSAR classification models while increasing the number of physicochemical descriptors step by step in ranked order, and recording the MCC value obtained for each descriptor set;

step c3, selecting the descriptor combination with the highest MCC value.

By this method, the descriptor combination with the best prediction performance can be selected.
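The incremental search of steps c1 to c3 can be sketched as follows, assuming the descriptors have already been ranked. `score_fn` is a hypothetical callable standing in for "train a QSAR classification model on this descriptor set and return its MCC", and the lookup table of scores is made up for the example.

```python
def best_descriptor_prefix(ranked, score_fn):
    """Grow the descriptor set one ranked descriptor at a time and keep the best-scoring prefix."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked) + 1):
        score = score_fn(ranked[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranked[:best_k], best_score


ranked = ["d1", "d2", "d3", "d4"]                    # hypothetical ranked descriptor names
toy_mcc = {1: 0.40, 2: 0.55, 3: 0.61, 4: 0.58}       # placeholder MCC per subset size
combo, score = best_descriptor_prefix(ranked, lambda subset: toy_mcc[len(subset)])
```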

In a possible implementation manner, the method further comprises step 8, and model verification specifically comprises the following steps:

step 81, randomly dividing the training set into 10 parts by ten-fold cross-validation, taking each part in turn as the internal validation set, performing internal cross-validation on each individual QSAR classification model, judging the prediction performance based on the evaluation parameters, and optimizing and determining the hyperparameters of each QSAR classification model;

step 82, feeding the external test set into each individual QSAR classification model and into the consensus model for validation, judging the prediction performance based on the evaluation parameters, and optimizing and determining the hyperparameters of each QSAR classification model and the consensus model;

step 83, randomly shuffling the order of the DIIT toxicity labels in the training set by the Y-randomization (Y-scrambling) test, executing steps 5-6, and verifying the degradation in performance of each QSAR classification model.
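The Y-randomization test of step 83 can be sketched as below. The `fit_and_score` callable stands in for steps 5-6; here it is a toy leave-one-out 1-nearest-neighbour classifier on a single made-up descriptor, chosen only to keep the example self-contained. Function names, the number of rounds and the seed are assumptions.

```python
import random


def y_randomization(labels, fit_and_score, n_rounds=10, seed=0):
    """Score the model on the true labels, then on repeatedly shuffled copies of them."""
    rng = random.Random(seed)
    true_score = fit_and_score(labels)
    shuffled_scores = []
    for _ in range(n_rounds):
        permuted = list(labels)  # copy so the original label order is untouched
        rng.shuffle(permuted)
        shuffled_scores.append(fit_and_score(permuted))
    return true_score, shuffled_scores


X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]  # toy single-descriptor values


def loo_1nn_accuracy(labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on X (toy stand-in model)."""
    hits = 0
    for i, xi in enumerate(X):
        j = min((k for k in range(len(X)) if k != i), key=lambda k: abs(X[k] - xi))
        hits += labels[j] == labels[i]
    return hits / len(X)


labels = [0, 0, 0, 0, 1, 1, 1, 1]
true_score, shuffled_scores = y_randomization(labels, loo_1nn_accuracy)
```

If the scores on shuffled labels stay well below the score on the true labels, the model is unlikely to be fitting chance correlations.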

In a possible embodiment, the method further includes step 84, performing a therapeutic-subgroup confidence analysis on the QSAR classification models, specifically: grouping the drugs by the second-level (therapeutic subgroup) classification of their ATC codes, using the compounds as training sets, performing ten-fold cross-validation multiple times, calculating the subgroup accuracy of each QSAR classification model, and judging it against the accuracy of the consensus model: when the subgroup accuracy is higher than the accuracy of the consensus model, the current QSAR classification model is assigned to the high-confidence subgroup; otherwise, it is assigned to the low-confidence subgroup; step 84 is performed iteratively until all N QSAR classification models have been assigned to the high-confidence or the low-confidence subgroup, so that the prediction performance of the QSAR classification models on the drugs of each therapeutic subgroup can be better understood and any potential differences identified.

In a second aspect, based on the same inventive concept, the present application also provides a method for predicting drug-induced immune thrombocytopenia toxicity, comprising: acquiring the drug name or compound information, inputting it into the prediction model, and outputting a prediction result of whether the drug or compound has drug-induced immune thrombocytopenia toxicity.

In a third aspect, based on the same inventive concept, the present application further provides a drug-induced immune thrombocytopenia toxicity prediction system, including a data receiving module, a data processing module, and a result generating module:

the data receiving module is used for receiving the drug name or the compound information;

the data processing module comprises a model unit and a prediction unit:

the model unit stores the drug-induced immune thrombocytopenia toxicity prediction model;

the prediction unit calls the drug-induced immune thrombocytopenia toxicity prediction model, and inputs drug names or compound information to obtain a prediction result;

the result generation module is used for sending out the prediction result.
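The three modules of the system can be sketched as one small class. This is an illustrative arrangement only: the class and method names are assumptions, the stored model is a placeholder callable rather than a real predictor, and mapping a probability of exactly 0.5 to the negative class is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PredictionSystem:
    model: Callable[[str], float]  # model unit: the stored prediction model

    def receive(self, compound: str) -> str:
        """Data receiving module: accept and normalize the drug name or compound information."""
        return compound.strip()

    def predict(self, compound: str) -> dict:
        """Prediction unit: call the stored model and derive the predicted category."""
        probability = self.model(compound)
        category = "DIIT positive" if probability > 0.5 else "DIIT negative"
        return {"compound": compound, "probability": probability, "category": category}

    def send(self, result: dict) -> str:
        """Result generation module: format the prediction result for output."""
        return f"{result['compound']}: {result['category']} (p={result['probability']:.2f})"


# Placeholder model for demonstration only; it has no predictive meaning.
system = PredictionSystem(model=lambda compound: 0.9 if "N" in compound else 0.1)
message = system.send(system.predict(system.receive("  CCN ")))
```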

By adopting the technical scheme, the invention has the following beneficial effects:

the drug-induced immune thrombocytopenia toxicity prediction model, method and system provided by the invention can reliably predict whether a drug or compound has drug-induced immune thrombocytopenia toxicity, thereby improving the development efficiency of related drugs and reducing the risk of using them.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for constructing a drug-induced immune thrombocytopenia toxicity prediction model according to an embodiment of the present invention;

FIG. 2 is a diagram of types of four molecular descriptors provided in an embodiment of the present invention;

FIG. 3 is a flow chart of step 6 of FIG. 1;

FIG. 4 is a flowchart of variance filtering of molecular fingerprints according to an embodiment of the present invention;

FIG. 5 is a flowchart of Pearson correlation optimization screening for physicochemical descriptors according to an embodiment of the present invention;

FIG. 6 is a flowchart of descriptor combination screening provided in an embodiment of the present invention;

FIG. 7 is a flow chart of a method for model verification provided by an embodiment of the present invention;

FIG. 8 is a diagram of a drug-induced immune thrombocytopenia toxicity prediction system according to an embodiment of the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings; the described embodiments are some, but not all, embodiments of the invention. All other embodiments that can be obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to fall within the scope of the invention.

In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

The invention is further illustrated with reference to specific embodiments.

It should be further noted that the following specific examples or embodiments are a series of preferred arrangements that further explain the specific content of the invention, and these arrangements may be combined or used in association with one another.

Embodiment one:

as shown in fig. 1, the embodiment provides a drug-induced immune thrombocytopenia toxicity prediction model, and the construction method of the prediction model comprises the following steps:

step 3, generating molecular descriptors based on SMILES codes (Simplified Molecular Input Line Entry System, a standard line notation for entering and representing molecular structures; it is an ASCII code and includes isomeric SMILES codes capable of representing the stereochemistry and isotope specifications of drugs), so as to convert the two-dimensional structure of the compound into a character string that stores the structure and chemical properties of the compound (such as atoms, chemical bonds and aromaticity) for subsequent computation, storage in and retrieval from the second database; the types of molecular descriptors include one-hot encodings, molecular graph encodings, molecular fingerprints and physicochemical descriptors, as shown in FIG. 2; one-hot encoding belongs to the prior art and, as shown in the upper-left panel of FIG. 2, uses a multi-bit status register to encode multiple states, each state having its own register bit with only one bit valid at any time; molecular graph encoding belongs to the prior art and, as shown in the upper-right panel of FIG. 2, allows representation learning on the nodes or edges of a molecule in combination with a graph neural network algorithm; molecular fingerprints belong to the prior art and, as shown in the lower-left panel of FIG. 2, mark the presence of a fingerprint fragment with '1' and its absence with '0'; physicochemical descriptors belong to the prior art and, as shown in the lower-right panel of FIG. 2, can be used to represent the topological, geometric, electrostatic and other physicochemical characteristics of the compound;

the support vector machine belongs to the prior art; the key to its learning is the choice of the distance metric model;

the random forest belongs to the prior art and is an ensemble discrimination model whose basic unit is the decision tree; each decision tree classifies the data according to the independent variables, and all decision trees jointly determine the final classification of the data;

extreme gradient boosting belongs to the prior art and is an open-source gradient boosting framework that improves model performance by iterating over multiple weak learners;

the recurrent neural network belongs to the prior art and is one of the deep learning neural network algorithms; its forward computation is similar to that of a fully connected neural network and consists of 3 parts: an input layer, several hidden layers and an output layer;

the graph convolutional network belongs to the prior art; its input is a graph with node and edge features, and it comprises a function that depends on both the features and the graph structure;

step 6, calculating evaluation parameters such as the Matthews correlation coefficient (MCC) of each QSAR classification model and evaluating its prediction performance; ranking the models by their Matthews correlation coefficient values and combining the top N QSAR classification models into a consensus model, where N is an odd number; the consensus model outputs the prediction result for the DIIT toxicity of the drug or compound by majority vote;

Further, the inclusion and exclusion criteria in step 2 include:

criterion 2, screening out molecules such as proteins, antibodies, polymers, gene preparations and simple molecules (e.g., NO);

criterion 5, screening out drugs with duplicate molecular formulas.

Further, the first database construction process in the step 1 specifically includes:

step 13, merging the source databases: first, the source databases based on drug labels are merged, including the SIDER 4.1 database (1400, 604/796), the OnSIDES v2.0.0 database (1091, 431/660) and the Chinese drug catalog (CDE-DITP, 560, 282/278) database; then the DIIT antibody (DDAbs) detection database (112/144), the Geo-DITP literature library (346/37) and the OffSIDES FAERS database (65/0) are merged in sequence; when merging databases, overlapping drugs are identified and their thrombocytopenia risk attributes are checked for consistency: if they are consistent, the duplicate entry is deleted; if not, the overlapping entry is retained; this keeps the first database compact while supplementing it as completely as possible with the differences between the drug labels of different countries and with post-marketing case information, so that the first database contains comprehensive drug information about thrombocytopenia risk;

Through the above steps, a first database (DITPst, 1765, 858/907) including comprehensive information of drug-induced thrombocytopenia toxicity can be formed.

Further, the preset ratio in the step 4 is a DIIT positive drug or compound: DIIT negative drug or compound = 4:1.

Further, the molecular weight may be calculated by the CORINA Symphony program; the octanol-water partition coefficient may also be calculated by the CORINA Symphony program.

Further, the molecular descriptors in the step 5 are specifically shown in the following table:

Further, each machine learning algorithm in step 5 is tested with the MACCS, ECFP4, CORINA, RDKit, MACCS+CORINA and ECFP4+CORINA descriptors, in order to select the machine learning algorithm best suited to each molecular descriptor.

Further, the training of each QSAR classification model in step 5 further includes hyperparameter optimization so as to obtain the optimal parameters, and the specific optimization method and the optimal parameters are shown in the following table:

further, the support vector machine and the random forest can be built with the scikit-learn library for Python; extreme gradient boosting can be built with the XGBoost library; the recurrent neural network and the graph convolutional network can be built with PyTorch.

Further, when CORINA or MACCS descriptors are fed to the input layer of the recurrent neural network, the number of neurons in the input layer equals the number of descriptors, for example 117 for the CORINA descriptors and 166 for the MACCS descriptors; the network has 5 hidden layers with 128, 64, 32, 16 and 8 neurons in sequence; activation functions are placed between the hidden layers, namely the Tanh function and the ReLU function; and the output layer has 2 neurons, corresponding to DIIT positive and DIIT negative.
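To give a sense of the size of the network described above, the following sketch counts weights and biases for the stated layer widths. It assumes the layers are plain fully connected layers with bias terms, which the text suggests but does not state explicitly; the function and variable names are illustrative.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


HIDDEN = [128, 64, 32, 16, 8]  # hidden layer widths given in the description
OUTPUT = 2                     # DIIT positive and DIIT negative

corina_layers = [117] + HIDDEN + [OUTPUT]  # 117 CORINA descriptors as input
maccs_layers = [166] + HIDDEN + [OUTPUT]   # 166 MACCS descriptors as input

corina_params = dense_parameter_count(corina_layers)
maccs_params = dense_parameter_count(maccs_layers)
```

Under these assumptions the two variants differ only in the first layer, so the MACCS input adds (166 - 117) × 128 weights.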

Further, as shown in fig. 3, the step 6 includes:

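step 61, constructing a 2×2 contingency table (confusion matrix) from the actual DIIT toxicity of the compounds in the second database and the prediction results of the QSAR classification model, whose four cells are the numbers of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN);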

Preferably, N is 3.

Further, step 64 is also included: constructing the ROC curves of the N QSAR classification models, calculating the area under each curve (AUC), and selecting the M QSAR classification models whose AUC values are closest to 1 to be combined into a consensus model, where M is an odd number; counting the prediction results of the M QSAR classification models, and outputting the prediction result with the highest count. The ROC curve is plotted with the false positive rate (False Positive Rate, FPR) as the abscissa and the true positive rate (True Positive Rate, TPR) as the ordinate, and is used for further evaluating the prediction capability of the classification model; when positive and negative samples are unbalanced, the ROC curve judges the prediction capability of the classification model better than the accuracy and the Matthews correlation coefficient do. The AUC value is the area under the ROC curve and ranges from 0 to 1; the closer the AUC value is to 1, the better the prediction performance of the classification model.
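As an illustration of step 64, the sketch below computes the AUC from predicted probabilities and picks the M models closest to an AUC of 1. It uses the rank-based (Mann-Whitney) identity for the area under the ROC curve rather than plotting the curve, which is an implementation choice of this sketch and not something the text prescribes. The function names and the model names in the usage lines are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-based (Mann-Whitney) identity.

    labels: 1 for DIIT positive, 0 for DIIT negative.
    scores: predicted probability of the positive class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


def select_top_by_auc(auc_by_model, m):
    """Return the names of the m models whose AUC is closest to 1."""
    if m % 2 == 0:
        raise ValueError("M must be odd so that a majority vote cannot tie")
    ranked = sorted(auc_by_model, key=lambda name: abs(1.0 - auc_by_model[name]))
    return ranked[:m]


# Hypothetical model names and scores, for illustration only.
example_auc = roc_auc([1, 1, 0, 0, 1], [0.9, 0.4, 0.35, 0.8, 0.7])
chosen = select_top_by_auc({"svm": 0.90, "rf": 0.70, "xgb": 0.80, "gcn": 0.95}, m=3)
```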

Further, the prediction result in step 6 includes a prediction probability and a prediction category: when the prediction probability is greater than 0.5, the prediction category is DIIT positive (high risk); when the prediction probability is less than 0.5, the prediction category is DIIT negative (low risk).

Further, when N in step 6 is equal to 3, the majority rule (the minority defers to the majority) means the following: when all 3 QSAR classification models judge the drug or compound to be DIIT positive, the number of DIIT-positive votes is recorded as 3 and the number of DIIT-negative votes as 0, and the prediction result of the consensus model is DIIT positive; when 1 of the 3 QSAR classification models judges the drug or compound to be DIIT positive and the other 2 judge it to be DIIT negative, the number of DIIT-positive votes is recorded as 1 and the number of DIIT-negative votes as 2, and the prediction result of the consensus model is DIIT negative.
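The thresholding and voting described above can be sketched as follows. The function names are hypothetical. The text does not say how a probability of exactly 0.5 is handled, so this sketch arbitrarily assigns it to the negative class; that is an assumption, not part of the described method.

```python
from collections import Counter


def to_category(probability, threshold=0.5):
    """Map a predicted probability to a DIIT category.

    Assumption: a probability exactly equal to the threshold is treated as
    negative, because the text leaves that case undefined.
    """
    return "positive" if probability > threshold else "negative"


def consensus_vote(probabilities):
    """Majority vote over an odd number of per-model probabilities.

    Returns (consensus category, positive votes, negative votes).
    """
    if len(probabilities) % 2 == 0:
        raise ValueError("an odd number of models is required to avoid ties")
    counts = Counter(to_category(p) for p in probabilities)
    label, _ = counts.most_common(1)[0]
    return label, counts["positive"], counts["negative"]


# The two cases spelled out in the text for N = 3.
all_agree = consensus_vote([0.9, 0.8, 0.7])
one_of_three = consensus_vote([0.9, 0.2, 0.1])
```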

Further, the method also comprises a step 7 of optimizing and screening for different types of molecular descriptors, and specifically comprises the following steps:

the molecular fingerprint is subjected to variance screening, as shown in fig. 4, and specifically includes:

step a2, selecting the molecular fingerprint with the largest MCC value.

Further, the preset variance in step a1 is 0.5, so that 50 calculations can be performed, and the molecular fingerprint with the best prediction performance is finally selected from the 50 resulting MCC values.

Further, in the step 7, the Pearson correlation optimization screening is further performed on the physicochemical descriptors, as shown in fig. 5, and specifically includes:

step b3, calculating the Pearson correlation coefficient between every two descriptors by Pearson correlation analysis, and screening as follows: if the absolute value of the Pearson correlation coefficient is greater than 0.95, retaining, of the two, the physicochemical descriptor that is more strongly correlated with DIIT toxicity;

step b4, calculating the Pearson correlation coefficient between each physicochemical descriptor and the DIIT toxicity, and deleting the physicochemical descriptor if the absolute value of the Pearson correlation coefficient is smaller than 0.01.
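Steps b3 and b4 can be sketched in plain Python as below. The function names, the descriptor names and the example values are hypothetical, and treating a constant column as uncorrelated is an assumption of this sketch; the cut-offs 0.95 and 0.01 are the ones stated in the text.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def pearson_screen(descriptors, toxicity, pair_cutoff=0.95, label_cutoff=0.01):
    """descriptors: dict of name -> values; toxicity: list of 0/1 DIIT labels."""
    relevance = {name: abs(pearson(values, toxicity))
                 for name, values in descriptors.items()}
    dropped = set()
    # Step b3: within each highly correlated pair, keep the descriptor that is
    # more strongly correlated with DIIT toxicity.
    for first, second in combinations(descriptors, 2):
        if first in dropped or second in dropped:
            continue
        if abs(pearson(descriptors[first], descriptors[second])) > pair_cutoff:
            weaker = first if relevance[first] < relevance[second] else second
            dropped.add(weaker)
    # Step b4: delete descriptors that are almost uncorrelated with toxicity.
    return [name for name in descriptors
            if name not in dropped and relevance[name] >= label_cutoff]


# Hypothetical descriptor values for six compounds.
toxicity = [1, 1, 0, 0, 1, 0]
descriptors = {
    "logp": [3.1, 2.8, 1.0, 0.7, 3.5, 1.2],
    "logp_scaled": [6.2, 5.6, 2.0, 1.4, 7.0, 2.4],  # redundant with "logp"
    "noise": [1, 2, 2, 1, 3, 3],                    # unrelated to toxicity
}
kept = pearson_screen(descriptors, toxicity)
```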

Further, in the step 7, descriptor combination screening is further included, as shown in fig. 6, which specifically includes:

step c1, ranking all descriptors by recursive feature elimination (RFE, for example the recursive feature elimination implementation in the scikit-learn package for Python) or by information gain (IG);

step c3, selecting the descriptor combination with the highest MCC value.
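The information-gain ranking mentioned in step c1 can be sketched as follows for discrete descriptors such as fingerprint bits. The function and bit names are hypothetical; continuous physicochemical descriptors would first need to be binned, which this sketch does not attempt. It covers only the ranking, not the later selection of a descriptor combination by MCC.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature, labels):
    """Information gain of a discrete feature with respect to the labels."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def rank_by_information_gain(features, labels):
    """Return descriptor names sorted from most to least informative."""
    return sorted(features,
                  key=lambda name: information_gain(features[name], labels),
                  reverse=True)


# Hypothetical fingerprint bits for four compounds.
labels = [1, 1, 0, 0]
bits = {
    "bit_a": [1, 1, 0, 0],  # separates the classes perfectly
    "bit_b": [1, 0, 1, 0],  # carries no information about the label
    "bit_c": [1, 1, 1, 0],  # partially informative
}
ranking = rank_by_information_gain(bits, labels)
```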

Further, the method further includes step 8 of performing model verification on the single QSAR classification model and the consensus model, as shown in fig. 7, specifically including:

step 81, randomly dividing the training set into 5 parts by five-fold cross-validation, taking each part in turn as the internal validation set, carrying out internal cross-validation on each single QSAR classification model, assessing the prediction performance based on the evaluation parameters, and optimizing and determining the hyperparameters of each QSAR classification model;

step 83, randomly shuffling the order of the DIIT toxicity labels in the training set by the Y-scrambling (Y-randomization) method, executing steps 6-7, and verifying the performance degradation of each QSAR classification model; Y-scrambling belongs to the prior art and is a validation method in which the model is rebuilt after the sample labels have been randomly shuffled, so that the model performance is expected to drop sharply; it is used to rule out the possibility that the model performance arose by chance;

Preferably, said step 83 is performed 5 times;

step 84, performing a confidence analysis by therapeutic drug subgroup on the QSAR classification models, specifically including: performing 100 rounds of ten-fold cross-validation (for example, using StratifiedKFold from the scikit-learn library in Python, with the random seed set to 0 through 99 so that the training set is randomly re-partitioned 100 times), calculating the subgroup accuracy of each QSAR classification model, and judging with the accuracy of the consensus model as the reference: when the subgroup accuracy is higher than that of the consensus model, the current QSAR classification model is defined as belonging to the high-confidence subgroup; otherwise, it is defined as belonging to the low-confidence subgroup; step 84 is performed iteratively until the number of QSAR classification models in the high-confidence subgroup and in the low-confidence subgroup is 3, so that the prediction performance of the QSAR classification models within each therapeutic drug subgroup can be further understood and it can be determined whether potential differences exist.
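The repeated ten-fold scheme with seeds 0 to 99 can be sketched as below. This is a dependency-free stand-in that assigns samples to folds class by class; in scikit-learn the equivalent would be `StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)` inside a loop over the seeds. The function names and the example label counts are hypothetical, and the model training and subgroup-accuracy calculation are left out.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign sample indices to n_splits folds, keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def repeated_cv_splits(labels, n_splits=10, n_repeats=100):
    """Yield (seed, train_indices, test_indices) for seeds 0 .. n_repeats - 1."""
    for seed in range(n_repeats):
        folds = stratified_folds(labels, n_splits, seed)
        for held_out in range(n_splits):
            test = sorted(folds[held_out])
            train = sorted(i for k, fold in enumerate(folds)
                           if k != held_out for i in fold)
            yield seed, train, test


# Hypothetical label vector: 40 DIIT-positive and 10 DIIT-negative compounds.
example_labels = [1] * 40 + [0] * 10
splits = list(repeated_cv_splits(example_labels))
```

Each yielded split would then be used to fit a model on the training indices and score it on the held-out indices; accuracies per therapeutic subgroup can be accumulated across the 100 repetitions.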

Embodiment two:

This embodiment provides a drug-induced immune thrombocytopenia toxicity prediction method, which comprises the following steps: acquiring a drug name or compound information, inputting the drug name or compound information into the prediction model, and outputting a prediction result of whether the drug or compound has drug-induced immune thrombocytopenia toxicity.

Embodiment three:

as shown in fig. 8, the present embodiment provides a drug-induced immune thrombocytopenia toxicity prediction system, which includes a data receiving module, a data processing module, and a result generating module:

the data processing module comprises a model unit and a prediction unit:

the result generation module is used for sending out the prediction result.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A drug-induced immune thrombocytopenia toxicity prediction model is characterized in that the construction method of the prediction model comprises the following steps:

step 2, cleaning and screening for the drugs that meet the inclusion and exclusion criteria for drug-induced immune thrombocytopenia to obtain a second database; the second database comprises warning structure information and a DIIT toxicity tag, and is used for marking the toxicity risk of drug-induced immune thrombocytopenia;

step 3, generating molecular descriptors based on the SMILES codes of the compounds; the types of the molecular descriptors comprise one-hot codes, molecular graph codes, molecular fingerprints and physicochemical descriptors;

step 4, dividing the drug names or compound information in the second database into a training set and an external test set by stratified random sampling according to a preset proportion, wherein the DIIT-positive compounds and the DIIT-negative compounds are allocated to the training set and the external test set in the same proportion; and calculating the distributions of the molecular weight and of the lipid-water partition coefficient for the training set and the external test set respectively;

step 5, constructing and training QSAR classification models for predicting the DIIT toxicity of a drug or compound by different machine learning methods for the different types of molecular descriptors, specifically comprising: for the physicochemical descriptors and the molecular fingerprints, constructing a QSAR classification model by a support vector machine, a random forest or extreme gradient boosting; for the one-hot codes, constructing a QSAR classification model by a recurrent neural network; for the molecular graph codes, constructing a QSAR classification model by a graph convolutional neural network;

Step 6, calculating the evaluation parameters of each QSAR classification model; ranking the models according to the evaluation parameters, and selecting the top N QSAR classification models to be combined into a consensus model, wherein N is an odd number; the consensus model outputs the prediction result of the drug or compound with respect to DIIT toxicity based on the majority rule (the minority defers to the majority).

2. The predictive model according to claim 1, wherein the inclusion and exclusion criteria in step 2 include:

criterion 1, excluding antineoplastic agents and immunomodulators;

criterion 2, excluding proteins, antibodies, polymers, genetic preparations and simple molecules;

criterion 5, excluding drugs with duplicate molecular formulas.

3. The predictive model according to claim 1, wherein the preset proportion in step 4 is DIIT-positive drugs or compounds : DIIT-negative drugs or compounds = 4:1.

4. The predictive model as recited in claim 1, wherein said step 6 comprises:

step 61, constructing a four-grid table based on the DIIT toxicity of the drug or the compound in the second database and the prediction result of the QSAR classification model, wherein the four-grid table is used for reflecting the number of samples of true positives, false negatives, true negatives and false positives;

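Step 62, calculating the sensitivity SE, specificity SP, accuracy ACC and Matthews correlation coefficient MCC of each QSAR classification model, wherein the specific formulas are as follows:

$$SE = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}, \qquad ACC = \frac{TP + TN}{TP + FN + TN + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$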

wherein TP represents true positives, i.e. the number of samples for which the drug or compound is actually DIIT positive and is predicted to be DIIT positive by the QSAR classification model; FN represents false negatives, i.e. the number of samples for which the drug or compound is actually DIIT positive and is predicted to be DIIT negative by the QSAR classification model; TN represents true negatives, i.e. the number of samples for which the drug or compound is actually DIIT negative and is predicted to be DIIT negative by the QSAR classification model; FP represents false positives, i.e. the number of samples for which the drug or compound is actually DIIT negative and is predicted to be DIIT positive by the QSAR classification model;
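As an illustration of step 62, the four evaluation parameters can be computed from the counts TP, FN, TN and FP using their standard textbook definitions, which are assumed here to match the formulas intended by the claim. The function name and the example counts are hypothetical, and returning 0 for the MCC when its denominator is zero is a convention chosen for this sketch.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    se = tp / (tp + fn)                    # sensitivity: recall on DIIT positives
    sp = tn / (tn + fp)                    # specificity: recall on DIIT negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0 when a row or column of the table is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts from a four-grid table.
metrics = classification_metrics(tp=40, fn=10, tn=30, fp=20)
```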

5. The predictive model of claim 4, further comprising step 64: constructing the ROC curves of the N QSAR classification models, calculating the area under each curve (AUC), and selecting the M QSAR classification models whose AUC values are closest to 1 to be combined into a consensus model, wherein M is an odd number; counting the prediction results of the M QSAR classification models, and outputting the prediction result with the highest count.

6. The predictive model as recited in claim 1, wherein the prediction result in step 6 includes a prediction probability and a prediction category: when the prediction probability is greater than 0.5, the prediction category is DIIT positive; when the prediction probability is less than 0.5, the prediction category is DIIT negative.

7. The method for constructing a predictive model according to claim 4, further comprising step 7 of optimizing and screening for different types of molecular descriptors, specifically comprising:

step a2, selecting the molecular fingerprint with the largest MCC value;

performing Pearson correlation optimization screening on the physicochemical descriptors, specifically comprising:

step b3, calculating the Pearson correlation coefficient between every two descriptors by Pearson correlation analysis, and screening as follows: if the absolute value of the Pearson correlation coefficient is greater than 0.95, retaining, of the two, the physicochemical descriptor that is more strongly correlated with DIIT toxicity;

step b4, calculating the Pearson correlation coefficient between each physicochemical descriptor and the DIIT toxicity, and deleting the physicochemical descriptor if the absolute value of the Pearson correlation coefficient is smaller than 0.01;

descriptor combination screening, specifically comprising:

step c1, ranking all descriptors by recursive feature elimination or by information gain;

step c3, selecting the descriptor combination with the highest MCC value.

8. The predictive model of claim 7, further comprising step 8, model verification, comprising:

Step 83, randomly shuffling the order of the DIIT toxicity labels in the training set by the Y-scrambling (Y-randomization) method, executing steps 5-6, and verifying the performance degradation of each QSAR classification model;

step 84, performing a confidence analysis by therapeutic drug subgroup on the QSAR classification models, specifically comprising: according to the second-level therapeutic subgroup classification of the drug ATC codes, using a plurality of compounds as training sets, performing ten-fold cross-validation a plurality of times, calculating the subgroup accuracy of each QSAR classification model, and judging with the accuracy of the consensus model as the reference: when the subgroup accuracy is higher than that of the consensus model, the current QSAR classification model is defined as belonging to the high-confidence subgroup; otherwise, it is defined as belonging to the low-confidence subgroup; step 84 is performed iteratively until there are N QSAR classification models in both the high-confidence subgroup and the low-confidence subgroup.
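The label shuffling of step 83 can be sketched as follows. The function names are hypothetical, and `evaluate` stands for whatever routine retrains a QSAR classification model on the given labels and returns a performance score; a toy agreement score is used here only so that the sketch runs. Five repetitions are used, the number the description gives as preferred.

```python
import random


def y_scramble(labels, seed):
    """Return a shuffled copy of the labels; class counts are unchanged."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def y_scrambling_runs(labels, evaluate, n_runs=5):
    """Score the model on the true labels and on n_runs shuffled label sets.

    evaluate: hypothetical callable that rebuilds the model with the given
    labels and returns its score (for example the MCC).
    """
    baseline = evaluate(list(labels))
    scrambled = [evaluate(y_scramble(labels, seed)) for seed in range(n_runs)]
    return baseline, scrambled


# Toy stand-in for model training: the score is the agreement between one
# fixed descriptor and the labels. A real run would retrain the QSAR model.
descriptor = [1] * 10 + [0] * 10
true_labels = [1] * 10 + [0] * 10


def toy_evaluate(labels):
    return sum(d == y for d, y in zip(descriptor, labels)) / len(labels)


baseline_score, scrambled_scores = y_scrambling_runs(true_labels, toy_evaluate)
```

If the scrambled scores stay close to the baseline, the original model's performance is likely to be a chance result; a clear drop supports the model.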

9. A method for predicting drug-induced immune thrombocytopenia toxicity, comprising: collecting a drug name or compound information, inputting the drug name or compound information into the prediction model according to any one of claims 1-8, and outputting a prediction result of the drug or compound with respect to drug-induced immune thrombocytopenia toxicity.

10. A drug-induced immune thrombocytopenia toxicity prediction system, which is characterized by comprising a data receiving module, a data processing module and a result generating module:

the data processing module comprises a model unit and a prediction unit:

the model unit stores the prediction model according to any one of claims 1 to 8;

the prediction unit calls a prediction model, inputs a medicine name or compound information, and obtains a prediction result about drug-induced immune thrombocytopenia toxicity;

the result generation module is used for sending out the prediction result.