CN114187980A - Model training method, model prediction method, molecular screening method and device thereof - Google Patents

Model training method, model prediction method, molecular screening method and device thereof

Info

Publication number
CN114187980A
Authority
CN
China
Prior art keywords
model
training
initial
prediction
molecular
Prior art date
Legal status
Pending
Application number
CN202210139413.2A
Other languages
Chinese (zh)
Inventor
徐鑫
李远鹏
王纵虎
Current Assignee
Beijing Jingtai Technology Co ltd
Original Assignee
Beijing Jingtai Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingtai Technology Co ltd filed Critical Beijing Jingtai Technology Co ltd
Priority to CN202210139413.2A
Publication of CN114187980A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/70: Machine learning, data mining or chemometrics
    • G16C 20/50: Molecular design, e.g. of drugs
    • G16C 20/60: In silico combinatorial chemistry
    • G16C 20/64: Screening of libraries

Abstract

The application provides a model training method, a model prediction method, a molecular screening method and a device thereof. The model training method comprises the following steps: acquiring a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed with different positive-to-negative sample ratios; constructing an initial model, and respectively training the initial model with the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model; obtaining an evaluation index of each training model based on its prediction result; and selecting, according to the evaluation indices of the training models, the training model that meets a preset requirement as the final prediction model. In this way, the terminal device uses the performance of models trained on different data sets to select a model that can meet specific requirements as the final output model, which improves the targeting and efficiency of model training.

Description

Model training method, model prediction method, molecular screening method and device thereof
Technical Field
The application relates to the technical field of computational chemistry, in particular to a model training method, a model prediction method, a molecular screening method and a device thereof.
Background
Drug safety is an important issue in the drug development process. The main reason for the failure of clinical trials in the 21st century is lack of efficacy and safety (about 30%). Cardiotoxicity, hepatotoxicity, genotoxicity and phototoxicity are frequently observed toxicities. In clinical trials and post-approval studies, cardiotoxicity and hepatotoxicity have been reported at rates of 30% to 40%, respectively. hERG is a key factor in drug-induced QT interval prolongation and arrhythmia (also known as torsades de pointes). Owing to its particular molecular structure, hERG can be inhibited by structurally diverse compounds, and several drugs such as terfenadine, cisapride and sertindole have been withdrawn from the market because they inhibit hERG. Therefore, evaluating the inhibitory effect of a pharmaceutical compound on hERG is essential in drug discovery programs.
In biological experiments, there are various methods for evaluating hERG toxicity, mainly classified into in vitro methods (such as the patch clamp test) and in vivo methods (such as electrocardiogram tests in animal models); different evaluation methods can be selected according to the stage of drug research and the cost of the project. Evaluating the hERG toxicity of a compound in the early stage of drug development is very important and can effectively avoid failure of drug development at a later stage. However, experimentally testing hERG toxicity at the early stages of drug discovery (e.g., screening and lead optimization) is impractical and would greatly increase development costs. Therefore, developing QSAR models capable of rapidly predicting hERG toxicity is one of the effective ways to realize high-throughput screening in early drug development.
In recent years, however, the growth of biological activity data on hERG inhibitors has accelerated the iteration of QSAR models. Different machine learning methods have been used to construct hERG inhibitor models, including partial least squares, Bayesian methods, neural networks, random forests and support vector machines, almost all of which are classification models. Meanwhile, different models differ in their training sets and choice of descriptors, so the accuracy of the resulting models also differs. Because differences in data formats and ontologies between data sets from different sources make it difficult to integrate multiple databases, the data sets used to construct hERG prediction models in most studies are derived from a single database and contain fewer than 3000 records, and constructing a prediction model from a single database may result in limited prediction performance and applicability.
Disclosure of Invention
The application provides a model training method, a model prediction method, a molecular screening method and a device thereof.
The application provides a model training method, which comprises the following steps:
acquiring a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed according to different positive and negative sample proportions;
constructing an initial model, and respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model;
obtaining an evaluation index of the training model based on the prediction result;
and selecting the training model meeting the preset requirement as a final prediction model according to the evaluation index of the training model.
Wherein the training the initial models respectively by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model comprises:
dividing each initial molecular data set into an initial molecular training set and an initial molecular test set according to a preset proportion;
training the initial model by using the initial molecular training set to obtain the training model;
and testing the training model by using the initial molecular test set to obtain a prediction result of the training model.
Wherein the training the initial models respectively by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model comprises:
respectively training the initial models by using the plurality of initial molecular data sets to obtain a plurality of training models;
acquiring a target test set;
and predicting each training model by using the target test set to obtain a prediction result of each training model.
Wherein the obtaining of the evaluation index of the training model based on the prediction result includes:
acquiring the total number of negative samples predicted in the prediction result, wherein the total number of negative samples predicted comprises a first number of true negative samples predicted as negative samples and a second number of true positive samples predicted as negative samples;
and acquiring the evaluation index of the training model based on the ratio of the second quantity to the total quantity.
Wherein the evaluation index further comprises at least one of the following: accuracy, confidence, f1 score, correlation coefficient, precision, recall, and confusion matrix.
Before the initial models are respectively trained by using the plurality of initial molecular data sets, the model training method further includes:
vectorizing initial molecules in the initial molecule data set by using multiple preset molecular fingerprints to obtain multiple fingerprint feature vectors of each initial molecule;
and splicing the plurality of characteristic vectors of each initial molecule to obtain the characteristic data of the initial molecule.
Wherein, the splicing the multiple fingerprint feature vectors of each initial molecule to obtain the feature data of the initial molecule comprises:
splicing the multiple fingerprint characteristic vectors of each initial molecule to obtain a characteristic data matrix of the initial molecule;
deleting characteristic columns of the characteristic data matrix in which the proportion of characteristic values equal to a preset value is higher than a preset proportion;
calculating the correlation coefficient of any two characteristic columns in the characteristic data matrix, and deleting one of the two characteristic columns of which the correlation coefficient is higher than a preset coefficient;
and taking the residual characteristic data matrix as the characteristic data of the initial molecules.
The application also provides a model prediction method, which comprises the following steps:
obtaining a target molecule to be predicted;
and predicting the target molecule by using a prediction model obtained by training by using the model training method to obtain a first prediction result.
Wherein the model prediction method further comprises:
traversing and deleting each atom of the target molecule to respectively obtain corresponding pseudo target molecules;
inputting the pseudo target molecules into the prediction model to obtain a second prediction result;
and obtaining the influence of the traversed and deleted atoms on the target molecule based on the difference value of the first prediction result and the second prediction result.
Wherein the effect of the atom on the target molecule comprises a positive effect and a negative effect;
after the influence of the atoms which are deleted in the traversal mode on the target molecule is obtained based on the difference value between the first prediction result and the second prediction result, the model prediction method further comprises the following steps:
visualizing the effect of the traversed deleted atoms on the target molecule to generate a visualized molecular interpretability graph, wherein the molecular interpretability graph comprises a first identification of the positive effect and/or a second identification of the negative effect.
The present application also provides a molecular screening method, comprising:
predicting to obtain prediction results of a plurality of target molecules by using the model prediction method;
and screening candidate molecules from the target molecules based on the prediction result.
The application also provides a model training device, which comprises an acquisition module, a training module, an evaluation module and a selection module; wherein:
the acquisition module is used for acquiring a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed according to different positive and negative sample proportions;
the training module is used for constructing an initial model, and respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model;
the evaluation module is used for acquiring an evaluation index of the training model based on the prediction result;
and the selection module is used for selecting the training model meeting the preset requirement as the final prediction model according to the evaluation index of the training model.
The application also provides a model prediction device, which comprises an acquisition module and a prediction module; wherein:
the acquisition module is used for acquiring target molecules to be predicted;
the prediction module is used for predicting the target molecule by using the prediction model obtained by the training of the model training method to obtain a first prediction result.
The application also provides a molecular screening device, which comprises a prediction module and a screening module; wherein:
the prediction module is used for predicting to obtain prediction results of a plurality of target molecules by using the model prediction method;
the screening module is used for screening candidate molecules from the target molecules based on the prediction result.
The present application further provides a terminal device comprising a processor and a memory, wherein the memory stores program data, and the processor is configured to execute the program data to implement the model training method, the model prediction method and/or the molecular screening method as described above.
The present application also provides a computer-readable storage medium for storing program data which, when executed by a processor, is adapted to implement the model training method, the model prediction method and/or the molecular screening method described above.
The beneficial effect of this application is: the terminal device acquires a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed with different positive-to-negative sample ratios; constructs an initial model, and respectively trains the initial model with the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model; obtains an evaluation index of each training model based on its prediction result; and selects, according to the evaluation indices of the training models, the training model that meets a preset requirement as the final prediction model. In this way, the terminal device uses the performance of models trained on different data sets to select a model that can meet specific requirements as the final output model, which improves the targeting and efficiency of model training.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort. Wherein:
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a model training method provided herein;
FIG. 2 is a detailed flowchart of steps S12 and S13 of the model training method of FIG. 1;
FIG. 3 is a schematic flow chart diagram illustrating an embodiment of a model prediction method provided herein;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of a model prediction method provided herein;
FIG. 5 is a schematic diagram of the structure of one embodiment of the molecular interpretable diagram provided herein;
FIG. 6 is a schematic flow chart diagram of one embodiment of a molecular screening method provided herein;
FIG. 7 is a schematic diagram illustrating an embodiment of a model training apparatus provided herein;
FIG. 8 is a schematic diagram illustrating an embodiment of a model prediction apparatus provided herein;
FIG. 9 is a schematic structural diagram of an embodiment of the molecular sieving apparatus provided in the present application;
fig. 10 is a schematic structural diagram of an embodiment of a terminal device provided in the present application;
FIG. 11 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a model training method according to an embodiment of the present disclosure.
The model training method is applied to a terminal device, wherein the terminal device can be a server, and can also be a system in which the server and a local terminal are matched with each other. Accordingly, each part, such as each unit, sub-unit, module, and sub-module, included in the terminal device may be entirely disposed in the server, or may be disposed in the server and the local terminal, respectively.
Further, the server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules, for example, software or software modules for providing distributed servers, or as a single software or software module, and is not limited herein. In some possible implementations, the model training method of the embodiments of the present application may be implemented by a processor calling computer-readable instructions stored in a memory.
Specifically, as shown in fig. 1, the model training method in the embodiment of the present application specifically includes the following steps:
step S11: and acquiring a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed according to different positive and negative sample proportions.
In the embodiment of the application, the terminal equipment obtains a molecular prediction model with good prediction capability through a model training method. For example, the molecular prediction model may be specifically an hERG-QSAR model, and may be applied to specific service scenarios, such as a company service scenario, a school service scenario, a hospital service scenario, and the like.
Among them, hERG, the human Ether-a-go-go-Related Gene (KCNH2), is a gene encoding a cardiac potassium ion channel. hERG is a key factor in drug-induced QT interval prolongation and arrhythmia (also known as torsades de pointes). The QT interval spans ventricular depolarization and repolarization activation, representing the total time course of the ventricular depolarization and repolarization processes; it is measured as the time taken from the start of the QRS wave to the end of the T wave, and measurements vary with age and gender.
A QSAR (quantitative structure-activity relationship) model uses a mathematical model to describe the relationship between the structure of a molecule and a certain biological activity of that molecule. In the following, the hERG gene is taken as the target, and the toxicity of compound molecules toward hERG is examined.
The terminal device can obtain an integrated data set of the hERG inhibitory activity information through literature research or a set of existing databases, and specific information of the integrated databases is shown in the following table:
Database                   hERG inhibitors (count)    Inactive compounds (count)
ChEMBL (version 22)        4793                       5275
GOSTAR                     3260                       3509
NCGC                       232                        1234
hERGCentral                4321                       274536
hERG integrated dataset    9890                       281329
TABLE 1. Details of the integrated data set
In the embodiment of the present application, a compound having an inhibitory effect on hERG can be regarded as a toxic sample, i.e., a positive sample, and its label value can be recorded as 1; conversely, a compound without an inhibitory effect is a non-toxic sample, i.e., a negative sample, and its label value can be recorded as 0.
In table 1, the hERG inhibitor is a toxic sample with inhibitory effect on hERG, i.e. a positive sample; inactive compounds are non-toxic samples that do not have an inhibitory effect on hERG, i.e., negative samples.
It should be noted that this part of data is already open source, so the terminal device of the present application may use part or all of this integrated data as a training set for model building.
Further, because the positive-to-negative sample ratio in the integrated data set may be extremely unbalanced, the embodiment of the present application explores the performance differences of models built from data sets with different positive-to-negative sample ratios, and for this purpose extracts subsets of the data with preset positive-to-negative sample ratios.
Therefore, based on the integrated data set, the terminal device extracts five data sets (i.e., initial molecular data sets) with positive-to-negative sample ratios of 1:1, 1:2, 1:4, 1:8 and 1:16, respectively. Each data set may be partitioned into a training set and a test set according to a certain ratio (e.g., 4:1 or 3:2). In addition, additional data (3192 records collected here) may be collected as an external test set.
It should be noted that the positive-negative sample ratio and the number of data sets are only one of the embodiments provided in the present application, and in other embodiments, other positive-negative sample ratios and numbers of data sets may be adopted, which are not described herein again.
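For illustration only, the following is a minimal sketch of how such ratio-limited data sets might be constructed; it assumes the integrated data set is held in a pandas DataFrame named `integrated` with a binary `label` column (1 for hERG inhibitors, 0 for inactive compounds), and the use of scikit-learn's train_test_split for the 4:1 split is an illustrative choice rather than part of the patent.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_ratio_dataset(data: pd.DataFrame, neg_per_pos: int, seed: int = 0) -> pd.DataFrame:
    """Keep all positives and sample negatives to reach a 1:neg_per_pos ratio."""
    pos = data[data["label"] == 1]
    neg = data[data["label"] == 0].sample(n=len(pos) * neg_per_pos, random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle rows

# Five initial molecular data sets with ratios 1:1, 1:2, 1:4, 1:8 and 1:16.
datasets = {r: build_ratio_dataset(integrated, r) for r in (1, 2, 4, 8, 16)}

# 4:1 train/test split for each data set (stratified on the label).
splits = {r: train_test_split(d, test_size=0.2, stratify=d["label"], random_state=0)
          for r, d in datasets.items()}
```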
Before inputting the initial molecular data set into the initial model for training, the terminal device further needs to perform molecular vectorization on the initial molecules in the initial molecular data set to extract feature data of the initial molecules.
In the embodiment of the present application, molecular vectorization converts the SMILES format of a molecule into a numerical representation format recognizable by a machine learning model. Specifically, there are various vectorization methods, including fragment-based fingerprints, 2D fingerprints and 3D fingerprints, as well as recently developed neural-network latent-variable representations.
In the embodiment of the present application, the vectorization method applied by the terminal device is a combination of several fragment-based fingerprints and a 2D fingerprint, comprising 6 fingerprints/descriptors: the Morgan (ECFP4) molecular fingerprint (a circular topological fragment fingerprint, 2^32-dimensional, sparse), the RDKit topological molecular fingerprint (a path-based topological fragment fingerprint, 2^32-dimensional, sparse), the topological-torsion molecular fingerprint (based on the atomic features of dihedral angles over 4 consecutive atoms, 2^36-dimensional, sparse), the MACCS molecular fingerprint (a SMARTS-based fragment fingerprint, 167-dimensional), the electrotopological-state (E-state) molecular descriptor (a combination of atomic fingerprint and atomic descriptor, 158-dimensional, sparse), and the atom-pair molecular fingerprint (an atom-pair fingerprint based on atomic features and topological distance, 2^23-dimensional, sparse). All 6 fingerprint types are existing fingerprint types that can be obtained through public channels; the first 5 are fragment-type fingerprints and the last one is a 2D fingerprint. The number and kind of fingerprints used in the vectorization process are not limited and are only examples here. The tools used for vectorization, such as ArcGIS, GIS and the like, can be selected according to the data characteristics, and all of them can be applied to the embodiment of the present application.
Each fingerprint generates a number of feature values, so by vectorizing the initial molecules the terminal device obtains multiple fingerprint feature vectors for each initial molecule, where one fingerprint feature vector can contain multiple feature values.
After the vectorization processing, the SMILES of each data molecule is converted into the 6 fingerprint/descriptor vectors, which are concatenated to obtain the final molecular vector. Because the dimensionality of this final molecular vector is too high, which is not conducive to model construction, a certain degree of dimensionality compression is necessary; the terminal device compresses the concatenated vector according to certain rules to obtain the feature data.
Specific rules may be as follows: concatenate the multiple fingerprint feature vectors of each initial molecule to obtain the feature data matrix of the initial molecules; delete feature columns of the feature data matrix in which the proportion of feature values equal to a preset value is higher than a preset proportion; calculate the correlation coefficient of any two feature columns in the feature data matrix and, for any two feature columns whose correlation coefficient is higher than a preset coefficient, delete one of them; and take the remaining feature data matrix as the feature data of the initial molecules. For example, feature columns in which the proportion of non-zero values is less than 0.01 are deleted and the others are retained; for another example, the correlation coefficient between two feature columns is calculated, and if it is less than 0.8 both columns are retained, otherwise one column is deleted and the other is retained. After dimensionality compression is completed, a structured vector format is obtained, and the model can then be constructed.
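The vectorization and dimension-compression rules above can be sketched as follows; this is a non-authoritative sketch assuming RDKit, NumPy and pandas, where the helper names (`featurize`, `compress`, `smiles_list`), the hashed bit sizes and the use of hashed (rather than unhashed sparse) fingerprints are illustrative assumptions, while the 0.01 and 0.8 thresholds are the example values from the text.

```python
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys, rdMolDescriptors
from rdkit.Chem.EState.Fingerprinter import FingerprintMol

def to_array(bv):
    """Convert an RDKit bit vector into a NumPy array."""
    arr = np.zeros((bv.GetNumBits(),), dtype=float)
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

def featurize(smiles):
    """Concatenate the six fingerprints/descriptors for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    parts = [
        to_array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)),        # Morgan / ECFP4
        to_array(Chem.RDKFingerprint(mol, fpSize=2048)),                            # RDKit topological (path-based)
        to_array(rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=2048)),
        to_array(MACCSkeys.GenMACCSKeys(mol)),                                      # MACCS, 167 bits
        np.concatenate(FingerprintMol(mol)),                                        # E-state, 158 values
        to_array(rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits=2048)),
    ]
    return np.concatenate(parts)

def compress(X: pd.DataFrame, min_nonzero=0.01, max_corr=0.8) -> pd.DataFrame:
    """Drop near-empty columns, then drop one column of every highly correlated pair."""
    X = X.loc[:, (X != 0).mean() >= min_nonzero]
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] >= max_corr).any()]
    return X.drop(columns=drop)

# smiles_list: SMILES strings of the initial molecules (assumed to exist).
features = compress(pd.DataFrame([featurize(s) for s in smiles_list]))
```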
Step S12: and constructing an initial model, and respectively training the initial model by using a plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model.
In the embodiment of the present application, the terminal device constructs an initial model, which may be an initial random forest model, sets the value of n_estimators (the number of trees) to 200, and sets the remaining hyper-parameters to default values.
After the vectorization operation is performed on the five training sets, 5 structured training data sets are obtained. The initial random forest model is fitted to each of the five structured training data sets to obtain five fitted training models, and the five training models are then tested with the test set to obtain their respective prediction results on the test set. It should be noted that the molecules in the test set also need to be vectorized into the structured format before being predicted with a training model.
In other embodiments, the prediction result may also be a prediction result of the training model on the training set, which is not described herein again.
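A minimal sketch of this training step, assuming scikit-learn's RandomForestClassifier stands in for the initial model and `structured_datasets` is a hypothetical list holding the vectorized training and test data for each sample-ratio data set:

```python
from sklearn.ensemble import RandomForestClassifier

training_models, prediction_results = [], []
for X_train, y_train, X_test, y_test in structured_datasets:   # one entry per initial molecular data set
    model = RandomForestClassifier(n_estimators=200)            # remaining hyper-parameters left at defaults
    model.fit(X_train, y_train)                                  # each data set yields one training model
    training_models.append(model)
    prediction_results.append(model.predict(X_test))             # prediction result of this training model
```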
Step S13: and obtaining an evaluation index of the training model based on the prediction result.
In the embodiment of the application, after the terminal device obtains the prediction results of the test set or the training set, the evaluation index of the training model can be further calculated according to the prediction results. Generally, the closer the predicted result is to the true result, the better the evaluation index of the training model is. In the application, a numerical evaluation index can be obtained, so that the staff can evaluate the quality of the training model intuitively.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating steps S12 and S13 of the model training method shown in fig. 1.
Specifically, as shown in fig. 2, step S12 in the model training method according to the embodiment of the present application may specifically include the following steps:
step S131: and dividing each initial molecular data set into an initial molecular training set and an initial molecular test set according to a preset proportion.
In the embodiment of the present application, the terminal device divides each of the five initial molecular data sets into a corresponding initial molecular training set and initial molecular test set according to a preset ratio, for example 4:1.
Step S132: and training the initial model by using the initial molecular training set to obtain a training model.
In the embodiment of the application, the terminal device inputs five initial molecular training sets into five same initial models respectively, and trains the five same initial models respectively to obtain five different training models.
Step S133: and testing the training model by using the initial molecular test set to obtain a prediction result of the training model.
In the embodiment of the application, the terminal device inputs the five initial molecular test sets into the corresponding training models respectively, and tests the training models to obtain corresponding prediction results.
In the above manner, for an initial molecular data set, part of the data can be used as a training set and part as a test set, so that both training and prediction of the model can be realized with a single data set, and no additional test set needs to be acquired from outside to evaluate the model; this reduces the amount of computation, time cost and workload, although it has some influence on accuracy.
In other embodiments, step S12 may specifically include the following steps: respectively training initial models by using a plurality of initial molecular data sets to obtain a plurality of training models; acquiring a target test set; and predicting each training model by using the target test set to obtain a prediction result of each training model.
Wherein the target test set is an external test set additionally collected from the outside. When an initial molecular data set is used for training to obtain a training model, the whole initial molecular data set can be used as the training set to train the initial model to obtain the training model; or training the initial model by using partial data in the initial molecular data set as a training set to obtain a training model; or dividing the initial molecular data set into a training set and a verification set according to a certain proportion (such as 4: 1), performing model training by using the training set, and performing model adjustment by using the verification set to obtain a corresponding training model.
According to the mode, the terminal equipment can test a plurality of different training models by using the same external test set to obtain corresponding prediction results. The standards for predicting the training models are the same (namely the same external test set), so that the fairness and consistency of the training models in the subsequent evaluation can be ensured, and the prediction deviation caused by different test sets is avoided.
Further, as shown in fig. 2, step S13 in the model training method according to the embodiment of the present application may specifically include the following steps:
step S134: and acquiring the total number of the negative samples predicted in the prediction result, wherein the total number of the negative samples predicted comprises a first number of true negative samples predicted as negative samples and a second number of true positive samples predicted as negative samples.
In the embodiment of the present application, the prediction results of the training model include, but are not limited to: the total number of predicted positive samples and the total number of predicted negative samples. Wherein predicting the total number of negative samples comprises predicting a true negative sample as a first number of negative samples TN and predicting a true positive sample as a second number of negative samples FN; the total number of predicted as positive samples comprises a third number TP of true positive samples predicted as positive samples and a fourth number FP of true negative samples predicted as positive samples.
Step S135: and acquiring an evaluation index of the training model based on the ratio of the second quantity to the total quantity.
In the embodiment of the application, the terminal device needs to generate a good prediction model through the model training method, one that judges toxic samples to be non-toxic as rarely as possible; that is, the smaller the evaluation index FN/(FN + TN), the better.
Specifically, the prediction model regards a toxic sample as a positive sample, specifically adopts a label value of "1" for marking, regards a non-toxic sample as a negative sample, and specifically adopts a label value of "0" for marking. In other embodiments, the prediction model may also mark the positive and negative samples with other values or other types of label values, which are not specifically illustrated herein.
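Under this labelling convention (1 = toxic/positive, 0 = non-toxic/negative), the FN/(FN + TN) index can be derived from a confusion matrix; the sketch below assumes scikit-learn is used for that purpose.

```python
from sklearn.metrics import confusion_matrix

def negative_prediction_error(y_true, y_pred):
    """FN / (FN + TN): fraction of molecules predicted non-toxic that are actually toxic."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (fn + tn)
```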
In another embodiment, the terminal device may also calculate other types of evaluation indexes according to the above prediction result, and then evaluate the performance of the prediction model by combining other evaluation indexes in addition to the above evaluation indexes.
Specifically, the evaluation indexes calculated from the prediction results of the training model include, but are not limited to, the evaluation indexes of the following table:
[Table 2: test evaluation indices of the models trained with different positive-to-negative sample ratios; provided as an image in the original document]
It can be seen that the evaluation indices of the embodiment of the present application may further include, but are not limited to, at least one of the following: accuracy, AUC (area under the ROC curve), confidence (which can be used to characterize the model), f1 score, Matthews correlation coefficient (MCC), precision, recall, and the confusion matrix. Based on the test evaluation index results of the models with different positive-to-negative sample ratios in Table 2, after comprehensive consideration, the terminal device selects the model trained with a positive-to-negative sample ratio of 1:2 as the final hERG toxicity prediction model.
Furthermore, the application also compares, on the same external test set, the test performance of pkCSM, a website that predicts small-molecule pharmacokinetic and toxicity characteristics using graph-based signatures; the specific evaluation indices are shown in Table 3:
[Table 3: evaluation indices of the five training models and the pkCSM website on the same external test set; provided as an image in the original document]
From the evaluation indexes of the test results, the performance of the five training models generated by the method on the same external test set is superior to that of a pkCSM website. Therefore, the prediction accuracy of the model obtained by training with the model training method is higher than that of the existing prediction scheme.
Step S14: and selecting the training model meeting the preset requirements as a final prediction model according to the evaluation indexes of the training model.
In this embodiment of the application, the terminal device may select one or more training models with better performance or with an evaluation index reaching a preset threshold as a final prediction model according to one or more evaluation indexes of the training models.
In the embodiment of the application, the terminal device acquires a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed with different positive-to-negative sample ratios; constructs an initial model, and respectively trains the initial model with the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model; obtains an evaluation index of each training model based on its prediction result; and selects, according to the evaluation indices of the training models, the training model that meets a preset requirement as the final prediction model. In this way, the terminal device uses the performance of models trained on different data sets to select a model that can meet specific requirements as the final output model, which improves the targeting and efficiency of model training and thus the accuracy of subsequent predictions.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating a model prediction method according to an embodiment of the present disclosure.
As shown in fig. 3, the model prediction method of the embodiment of the present application includes the following steps:
step S21: and obtaining the target molecules to be predicted.
In the embodiment of the present application, the specific manner for the terminal device to obtain the target molecule to be predicted may refer to step S11, which is not described herein again. It is understood that the target molecule may also be manually input by a user, and is not limited herein.
Step S22: and predicting the target molecule by using a prediction model obtained by training by using a model training method to obtain a first prediction result.
In the embodiment of the application, the terminal equipment inputs the target molecules into the trained prediction model, and obtains the prediction result output by the prediction model. The prediction result can be a probability value and a label value obtained by predicting the target molecule, for example, the value of the probability value is between 0 and 1, and the label value of the target molecule can be obtained based on the probability value of the target molecule by presetting a probability threshold.
For example, the probability threshold is set to 0.5, and when the predicted probability value of the target molecule is greater than or equal to 0.5, the tag value of the target molecule is set to 1, which indicates that the target molecule is predicted as a positive sample, i.e., a toxic sample, by the prediction model. And when the probability value of the target molecule obtained by prediction is less than 0.5, setting the label value of the target molecule to be 0, and indicating that the target molecule is predicted to be a negative sample by the prediction model, namely a non-toxic sample.
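As a small sketch (assuming the prediction model exposes a scikit-learn-style `predict_proba` and `target_features` holds the vectorized target molecules), the probability-to-label conversion with the 0.5 threshold described above might look like:

```python
proba = prediction_model.predict_proba(target_features)[:, 1]  # predicted probability of the positive (toxic) class
labels = (proba >= 0.5).astype(int)                            # 1: predicted toxic sample, 0: predicted non-toxic
```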
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a model prediction method according to another embodiment of the present disclosure.
As shown in fig. 4, after step S22 of the model prediction method shown in fig. 3, the model prediction method of the embodiment of the present application further includes:
step S23: and traversing and deleting each atom of the target molecule to respectively obtain corresponding pseudo target molecules.
In the embodiment of the application, the terminal device deletes one atom of the target molecule each time to obtain the corresponding pseudo target molecule. The pseudo target molecule differs from the target molecule by one atom.
Step S24: and inputting the pseudo target molecules into the prediction model to obtain a second prediction result.
Step S25: and obtaining the influence of the traversed and deleted atoms on the target molecule based on the difference value of the first prediction result and the second prediction result.
Step S26: visualizing the effect of the traversed deleted atoms on the target molecule to generate a visualized molecular interpretability graph, wherein the molecular interpretability graph comprises a first identification of positive effects and/or a second identification of negative effects.
In the embodiment of the present application, the terminal device may further analyze the interpretability of the prediction model and give some result information to the pharmaceutical scientist, such as a fragment of a drug molecule that has a large influence on hERG toxicity, and the change in that fragment's contribution before and after molecular modification.
Specifically, the basic principle regarding model interpretability analysis is as follows: and the terminal equipment traverses each atom in the target molecule, and a new pseudo target molecule is obtained after the traversed atoms are deleted in sequence. And after vectorizing the target molecules and the pseudo target molecules respectively, the terminal equipment predicts by using a prediction model to obtain a first prediction result of the target molecules and a second prediction result of the pseudo target molecules. The terminal device makes a difference between the first prediction and the second prediction, and the difference can be used for representing the contribution or influence of the deleted atom on the toxicity.
After the terminal device traverses all atoms in the target molecule, the influence of each atom in the target molecule can be obtained, and a visualized molecule interpretability graph is generated, as shown in fig. 5. In the molecular interpretability diagram of FIG. 5, the atom labeled A indicates that this atom has a positive effect on hERG toxicity, and the atom labeled B indicates that this atom has a negative effect on hERG toxicity. Wherein darker color indicates a greater positive/negative effect of the atom on hERG toxicity and vice versa.
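The atom-traversal principle can be sketched as follows, assuming RDKit is used to edit the molecule and reusing the hypothetical `featurize` helper and trained `prediction_model` from the earlier sketches; deletions that leave a chemically invalid structure are simply skipped here.

```python
from rdkit import Chem

def atom_contributions(smiles, prediction_model):
    """Per-atom effect on predicted toxicity: first prediction minus second prediction."""
    mol = Chem.MolFromSmiles(smiles)
    first = prediction_model.predict_proba([featurize(smiles)])[0, 1]   # first prediction result
    contributions = []
    for idx in range(mol.GetNumAtoms()):                                # traverse every atom
        pseudo = Chem.RWMol(mol)
        pseudo.RemoveAtom(idx)                                          # pseudo target molecule
        try:
            Chem.SanitizeMol(pseudo)
            second = prediction_model.predict_proba([featurize(Chem.MolToSmiles(pseudo))])[0, 1]
        except Exception:
            contributions.append(0.0)                                   # invalid structure after deletion; skip
            continue
        contributions.append(first - second)                            # > 0: positive effect, < 0: negative effect
    return contributions
```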
Based on the visualized molecular interpretability graph, the pharmaceutical scientist can roughly know the fragments or atoms of a drug molecule which have a large influence on the toxicity of hERG, and the fragments or atoms can be used as guiding information to modify or optimize the drug molecule.
In the embodiment of the application, the prediction model generated by the terminal device through the model training method is used to predict the hERG toxicity of drug molecules, which effectively safeguards safety in the early research and development stage of a drug and, at the same time, helps pharmaceutical scientists screen drugs and optimize the structures of drug molecules. The terminal device integrates various fragment-type fingerprints/descriptors and 2D fingerprints/descriptors in the vectorization process and coarsely screens the converted feature dimensions, which effectively improves the performance of the constructed model. Compared with existing hERG models in the industry, the model generated by this method has an advantage in the size of its training set, so its performance exceeds that of other models. Meanwhile, the method also performs a preliminary model interpretability analysis, giving structural information to the model user that is helpful for the modification and optimization of drug molecules.
Referring to fig. 6, fig. 6 is a schematic flow chart of an embodiment of the molecular screening method provided in the present application.
As shown in fig. 6, the molecular screening method of the embodiment of the present application includes the steps of:
step S31: and predicting by using a model prediction method to obtain prediction results of a plurality of target molecules.
In the embodiment of the present application, the terminal device predicts the prediction results of a plurality of target molecules by using the model prediction method of the above embodiment. Wherein the prediction result comprises a probability value and a label value predicted by each target molecule.
Step S32: and screening candidate molecules from the plurality of target molecules based on the prediction result.
In the embodiment of the present application, the target molecule with the tag value of 1 is predicted to be a positive sample, i.e., a toxic sample, and the higher the probability value is, the higher the probability that the target molecule is a positive sample is. The target molecule with the label value of 0 is predicted to be a negative sample, namely a non-toxic sample, and the smaller the probability value is, the larger the probability that the target molecule is the negative sample is. The terminal device can screen out suitable positive sample candidate molecules and/or negative sample candidate molecules from the target molecules according to the prediction result.
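For illustration only (assuming the prediction results have been gathered into a pandas DataFrame named `results` with `proba` and `label` columns), the screening step might be as simple as:

```python
# Rank non-toxic candidates by how confidently they are predicted non-toxic,
# and toxic candidates by how confidently they are predicted toxic.
non_toxic_candidates = results[results["label"] == 0].sort_values("proba")
toxic_candidates = results[results["label"] == 1].sort_values("proba", ascending=False)
```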
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
To implement the model training method of the above embodiment, the present application further provides a model training device, and specifically refer to fig. 7, where fig. 7 is a schematic structural diagram of an embodiment of the model training device provided in the present application.
The model training apparatus 400 of the embodiment of the present application includes an obtaining module 41, a training module 42, an evaluating module 43, and a selecting module 44.
The obtaining module 41 is configured to obtain a plurality of initial molecular data sets, where the plurality of initial molecular data sets are respectively constructed according to different positive and negative sample ratios.
The training module 42 is configured to construct an initial model, and train the initial model with the multiple initial molecular data sets respectively to obtain multiple training models and a prediction result of each training model, where each initial molecular data set corresponds to one training model.
The evaluation module 43 is configured to obtain an evaluation index of the training model based on the prediction result.
And the selection module 44 is configured to select, according to the evaluation index of the training model, the training model meeting the preset requirement as a final prediction model.
The training module 42 may be specifically configured to divide each initial molecular data set into an initial molecular training set and an initial molecular test set according to a preset ratio; training the initial model by using the initial molecular training set to obtain the training model; and testing the training model by using the initial molecular test set to obtain a prediction result of the training model.
The training module 42 may be specifically configured to train the initial models respectively by using the plurality of initial molecular data sets to obtain a plurality of training models; acquiring a target test set; and predicting each training model by using the target test set to obtain a prediction result of each training model.
The evaluation module 43 may be specifically configured to obtain a total number of negative samples predicted in the prediction result, where the total number of negative samples predicted includes a first number of true negative samples predicted as negative samples and a second number of true positive samples predicted as negative samples; and acquiring the evaluation index of the training model based on the ratio of the second quantity to the total quantity.
The evaluation index may further include, but is not limited to, at least one of the following: accuracy, confidence, f1 score, correlation coefficient, precision, recall, and confusion matrix.
Optionally, the model training apparatus 400 shown in fig. 7 may further include a processing module and a splicing module (not shown in the figure), wherein:
a processing module, configured to perform vectorization processing on initial molecules in the initial molecular data sets by using multiple preset molecular fingerprints before the initial model is trained by using the initial molecular data sets by the training model 42, so as to obtain multiple fingerprint feature vectors of each initial molecule;
and the splicing module is used for splicing the multiple fingerprint characteristic vectors of each initial molecule to obtain the characteristic data of the initial molecules.
The splicing module is specifically used for splicing the multiple fingerprint feature vectors of each initial molecule to obtain a characteristic data matrix of the initial molecule; deleting characteristic columns of the characteristic data matrix in which the proportion of characteristic values equal to a preset value is higher than a preset proportion; calculating the correlation coefficient of any two characteristic columns in the characteristic data matrix, and deleting one of any two characteristic columns whose correlation coefficient is higher than a preset coefficient; and taking the remaining characteristic data matrix as the characteristic data of the initial molecules.
To implement the model prediction method of the foregoing embodiment, the present application further provides a model prediction apparatus, and specifically refer to fig. 8, where fig. 8 is a schematic structural diagram of an embodiment of the model prediction apparatus provided in the present application.
The model prediction apparatus 500 of the embodiment of the present application includes an obtaining module 51 and a prediction module 52.
The obtaining module 51 is configured to obtain a target molecule to be predicted.
The prediction module 52 is configured to predict the target molecule by using the prediction model obtained by the training of the model training method, so as to obtain a first prediction result.
The prediction module 52 is further configured to traverse and delete each atom of the target molecule to obtain corresponding pseudo target molecules respectively; inputting the pseudo target molecules into the prediction model to obtain a second prediction result; and obtaining the influence of the traversed and deleted atoms on the target molecule based on the difference value of the first prediction result and the second prediction result.
Wherein the influence of an atom on the target molecule comprises a positive influence and a negative influence.
Optionally, the model prediction apparatus 500 shown in fig. 8 may further include a generation module (not shown in the figure), wherein:
a generating module for visualizing the effect of the traversed deleted atoms on the target molecule to generate a visualized molecule interpretability graph, wherein the molecule interpretability graph comprises the first identification of the positive effect and/or the second identification of the negative effect.
To implement the molecular screening method of the above embodiment, the present application further provides a molecular screening apparatus, and specifically refer to fig. 9, where fig. 9 is a schematic structural diagram of an embodiment of the molecular screening apparatus provided in the present application.
The molecular screening apparatus 600 of the embodiment of the present application includes a prediction module 61 and a screening module 62.
The prediction module 61 is configured to predict a prediction result of a plurality of target molecules by using the model prediction method.
The screening module 62 is configured to screen candidate molecules from the plurality of target molecules based on the prediction result.
To implement the model training method, the model prediction method, and/or the molecular screening method of the foregoing embodiments, the present application further provides a terminal device, and specifically please refer to fig. 10, where fig. 10 is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
The terminal device 300 of the embodiment of the present application includes a memory 31 and a processor 32, wherein the memory 31 and the processor 32 are coupled.
The memory 31 is used for storing program data, and the processor 32 is used for executing the program data to implement the model training method, the model prediction method and/or the molecular screening method described in the above embodiments.
In the present embodiment, the processor 32 may also be referred to as a CPU (Central Processing Unit). The processor 32 may be an integrated circuit chip having signal processing capabilities. The processor 32 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 32 may be any conventional processor or the like.
To implement the model training method, the model prediction method and/or the molecular screening method of the above embodiments, the present application further provides a computer-readable storage medium, as shown in fig. 11, the computer-readable storage medium 700 is used for storing program data 71, and when the program data 71 is executed by a processor, the program data is used for implementing the model training method, the model prediction method and/or the molecular screening method of the above embodiments.
The present application further provides a computer program product, wherein the computer program product comprises a computer program operable to cause a computer to perform a model training method, a model prediction method and/or a molecular screening method as described in embodiments of the present application. The computer program product may be a software installation package.
If the model training method, the model prediction method and/or the molecular screening method of the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in a device, for example a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (16)

1. A model training method, characterized in that the model training method comprises:
acquiring a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed according to different positive and negative sample proportions;
constructing an initial model, and respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model;
obtaining an evaluation index of the training model based on the prediction result;
and selecting the training model meeting the preset requirement as a final prediction model according to the evaluation index of the training model.
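Purely as an editorial illustration of the flow recited in claim 1 above (not part of the claims), the following Python sketch builds one data set per positive/negative ratio, trains one model per data set, and selects the model whose evaluation index satisfies a preset requirement. The random-forest model, the 80/20 split, the ratio values and the "lower is better" metric convention are assumptions, not details taken from the application.

```python
# Illustrative sketch of claim 1: one training model per positive/negative ratio,
# then selection of a final prediction model by an evaluation index.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def build_datasets(X_pos, X_neg, ratios=(1.0, 2.0, 5.0), seed=0):
    """Build one initial molecular data set per negative:positive ratio by subsampling negatives."""
    rng = np.random.default_rng(seed)
    datasets = []
    for r in ratios:
        n_neg = min(len(X_neg), int(r * len(X_pos)))
        idx = rng.choice(len(X_neg), size=n_neg, replace=False)
        X = np.vstack([X_pos, X_neg[idx]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(n_neg)])
        datasets.append((X, y))
    return datasets


def train_and_select(datasets, metric, requirement=0.1):
    """Train one model per data set and pick the one whose metric best meets the requirement."""
    candidates = []
    for X, y in datasets:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        candidates.append((metric(y_te, model.predict(X_te)), model))
    feasible = [c for c in candidates if c[0] <= requirement]  # "meets the preset requirement"
    return min(feasible or candidates, key=lambda c: c[0])[1]
```

With the evaluation index sketched after claim 4 below used as `metric`, lower values mean fewer positives lost among the predicted negatives, so the selection above prefers the data-set ratio that minimizes that loss.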
2. The model training method according to claim 1, wherein
the respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model comprises:
dividing each initial molecular data set into an initial molecular training set and an initial molecular test set according to a preset proportion;
training the initial model by using the initial molecular training set to obtain the training model;
and testing the training model by using the initial molecular test set to obtain a prediction result of the training model.
3. The model training method according to claim 1, wherein
the respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model comprises:
respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models;
acquiring a target test set;
and predicting each training model by using the target test set to obtain a prediction result of each training model.
4. The model training method according to any one of claims 1 to 3, wherein
the obtaining the evaluation index of the training model based on the prediction result comprises:
acquiring a total number of samples predicted as negative in the prediction result, wherein the total number comprises a first number of samples that are truly negative and predicted as negative, and a second number of samples that are truly positive but predicted as negative;
and acquiring the evaluation index of the training model based on a ratio of the second number to the total number.
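Read concretely, the index in claim 4 is the fraction of samples predicted as negative that are in fact positive (often called the false omission rate). A minimal sketch, assuming label 1 denotes a positive sample and 0 a negative one:

```python
def negative_omission_index(y_true, y_pred):
    """Second number / total number from claim 4: share of predicted negatives that are truly positive."""
    predicted_negatives = [t for t, p in zip(y_true, y_pred) if p == 0]
    if not predicted_negatives:
        return 0.0
    truly_positive = sum(1 for t in predicted_negatives if t == 1)
    return truly_positive / len(predicted_negatives)


# Example: two samples predicted negative, one of them truly positive -> index 0.5
print(negative_omission_index([1, 0, 0, 1], [0, 0, 1, 1]))
```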
5. The model training method according to claim 4, wherein
the evaluation index further comprises at least one of the following: accuracy, confidence, F1 score, correlation coefficient, precision, recall, and a confusion matrix.
6. The model training method according to claim 1, wherein
before the respectively training the initial model by using the plurality of initial molecular data sets, the model training method further comprises:
vectorizing initial molecules in the initial molecule data set by using multiple preset molecular fingerprints to obtain multiple fingerprint feature vectors of each initial molecule;
and splicing the multiple fingerprint feature vectors of each initial molecule to obtain the feature data of the initial molecules.
7. The model training method according to claim 6, wherein
the splicing the multiple fingerprint feature vectors of each initial molecule to obtain the feature data of the initial molecules comprises:
splicing the multiple fingerprint feature vectors of each initial molecule to obtain a feature data matrix of the initial molecules;
deleting, from the feature data matrix, feature columns in which the proportion of feature values equal to a preset value is higher than a preset proportion;
calculating a correlation coefficient for any two feature columns in the feature data matrix, and deleting one of any two feature columns whose correlation coefficient is higher than a preset coefficient;
and taking the remaining feature data matrix as the feature data of the initial molecules.
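As a sketch of claims 6 and 7 (the fingerprint types and the pruning thresholds below are assumptions; the application does not name specific fingerprints or values), each molecule could be encoded with several RDKit fingerprints, the resulting vectors concatenated, and near-constant and highly correlated feature columns then dropped:

```python
# Illustrative sketch of claims 6-7 using RDKit fingerprints (assumed, not specified in the application).
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys


def to_array(fp):
    """Convert an RDKit bit vector to a numpy array."""
    arr = np.zeros((fp.GetNumBits(),), dtype=np.int32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


def featurize(smiles_list):
    """Concatenate multiple fingerprint feature vectors per molecule into one feature row."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        morgan = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
        maccs = MACCSkeys.GenMACCSKeys(mol)
        rows.append(np.concatenate([to_array(morgan), to_array(maccs)]))
    return np.vstack(rows)


def prune_columns(X, constant_ratio=0.95, corr_limit=0.9):
    """Drop near-constant columns, then one column of each highly correlated pair."""
    keep = [j for j in range(X.shape[1])
            if np.bincount(X[:, j].astype(int)).max() / len(X) <= constant_ratio]
    X = X[:, keep]
    corr = np.corrcoef(X, rowvar=False)
    drop = set()
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if j not in drop and abs(corr[i, j]) > corr_limit:
                drop.add(j)
    return X[:, [j for j in range(X.shape[1]) if j not in drop]]
```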
8. A model prediction method, characterized in that the model prediction method comprises:
obtaining a target molecule to be predicted;
predicting the target molecule by using a prediction model obtained by training with the model training method of any one of claims 1-7 to obtain a first prediction result.
9. The model prediction method of claim 8,
the model prediction method further comprises:
traversing and deleting each atom of the target molecule to respectively obtain corresponding pseudo target molecules;
inputting the pseudo target molecules into the prediction model to obtain a second prediction result;
and obtaining the influence of the traversed and deleted atoms on the target molecule based on the difference value of the first prediction result and the second prediction result.
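A minimal sketch of the traversal in claim 9, assuming the `featurize` helper above and a trained scikit-learn-style `model`; handling of atom deletions that leave an invalid fragment is reduced to a bare try/except:

```python
from rdkit import Chem


def atom_attributions(smiles, model, featurize):
    """Delete one atom at a time and attribute the change in the predicted score to that atom."""
    mol = Chem.MolFromSmiles(smiles)
    first = model.predict_proba(featurize([smiles]))[0, 1]      # first prediction result
    influences = []
    for idx in range(mol.GetNumAtoms()):
        editable = Chem.RWMol(mol)
        editable.RemoveAtom(idx)
        try:
            pseudo = Chem.MolToSmiles(editable)                  # pseudo target molecule
            second = model.predict_proba(featurize([pseudo]))[0, 1]
            influences.append(first - second)   # > 0: positive influence, < 0: negative influence
        except Exception:
            influences.append(0.0)               # deletion produced an invalid fragment; skipped
    return influences
```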
10. The model prediction method of claim 9,
the influence of the atoms on the target molecule includes a positive influence and a negative influence;
after the obtaining the influence of the traversed and deleted atoms on the target molecule based on the difference between the first prediction result and the second prediction result, the model prediction method further comprises:
visualizing the influence of the traversed and deleted atoms on the target molecule to generate a visualized molecular interpretability graph, wherein the molecular interpretability graph comprises a first identifier of the positive influence and/or a second identifier of the negative influence.
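One way to render the molecular interpretability graph of claim 10 (an assumption; the application does not prescribe a drawing tool) is RDKit's similarity-map drawing, using the per-atom influences from the sketch after claim 9 as weights, with warm and cold colors acting as the first and second identifiers:

```python
from rdkit import Chem
from rdkit.Chem.Draw import SimilarityMaps


def draw_interpretability_map(smiles, influences, path="interpretability.png"):
    """Color atoms by their attributed influence (positive vs. negative) and save the figure."""
    mol = Chem.MolFromSmiles(smiles)
    fig = SimilarityMaps.GetSimilarityMapFromWeights(mol, influences, colorMap="bwr")
    fig.savefig(path, bbox_inches="tight")
```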
11. A molecular screening method, comprising:
obtaining prediction results of a plurality of target molecules by using the model prediction method of any one of claims 8 to 10;
and screening candidate molecules from the target molecules based on the prediction result.
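For completeness, a minimal sketch of the screening step in claim 11, again assuming the `featurize` helper and a trained `model` from the earlier sketches; the 0.5 threshold is an assumption:

```python
def screen(smiles_list, model, featurize, threshold=0.5, top_k=None):
    """Rank target molecules by predicted probability and keep the candidates above a threshold."""
    probabilities = model.predict_proba(featurize(smiles_list))[:, 1]
    ranked = sorted(zip(smiles_list, probabilities), key=lambda t: t[1], reverse=True)
    candidates = [(smi, p) for smi, p in ranked if p >= threshold]
    return candidates[:top_k] if top_k is not None else candidates
```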
12. A model training device, characterized by comprising an acquisition module, a training module, an evaluation module and a selection module; wherein:
the acquisition module is used for acquiring a plurality of initial molecular data sets, wherein the initial molecular data sets are respectively constructed according to different positive and negative sample proportions;
the training module is used for constructing an initial model, and respectively training the initial model by using the plurality of initial molecular data sets to obtain a plurality of training models and a prediction result of each training model, wherein each initial molecular data set corresponds to one training model;
the evaluation module is used for acquiring an evaluation index of the training model based on the prediction result;
and the selection module is used for selecting the training model meeting the preset requirement as the final prediction model according to the evaluation index of the training model.
13. A model prediction device, characterized by comprising an acquisition module and a prediction module; wherein:
the acquisition module is used for acquiring target molecules to be predicted;
the prediction module is used for predicting the target molecule by using the prediction model obtained by training the model training method according to any one of claims 1 to 7 to obtain a first prediction result.
14. A molecular screening device, characterized by comprising a prediction module and a screening module; wherein:
the prediction module is used for obtaining prediction results of a plurality of target molecules by using the model prediction method of any one of claims 8-10;
the screening module is used for screening candidate molecules from the target molecules based on the prediction result.
15. A terminal device, characterized in that the terminal device comprises a processor and a memory, wherein the memory stores program data, and the processor is configured to execute the program data to implement the model training method of any one of claims 1-7, the model prediction method of any one of claims 8-10 and/or the molecular screening method of claim 11.
16. A computer-readable storage medium for storing program data which, when executed by a processor, is adapted to carry out the model training method of any one of claims 1 to 7, the model prediction method of any one of claims 8 to 10 and/or the molecular screening method of claim 11.
CN202210139413.2A 2022-02-15 2022-02-15 Model training method, model prediction method, molecular screening method and device thereof Pending CN114187980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210139413.2A CN114187980A (en) 2022-02-15 2022-02-15 Model training method, model prediction method, molecular screening method and device thereof

Publications (1)

Publication Number Publication Date
CN114187980A true CN114187980A (en) 2022-03-15

Family

ID=80545996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210139413.2A Pending CN114187980A (en) 2022-02-15 2022-02-15 Model training method, model prediction method, molecular screening method and device thereof

Country Status (1)

Country Link
CN (1) CN114187980A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060738A1 (en) * 2014-05-23 2018-03-01 DataRobot, Inc. Systems and techniques for determining the predictive value of a feature
CN108280462A (en) * 2017-12-11 2018-07-13 北京三快在线科技有限公司 A kind of model training method and device, electronic equipment
CN109800890A (en) * 2019-01-31 2019-05-24 网宿科技股份有限公司 A kind of model prediction method and device
CN112102899A (en) * 2020-09-15 2020-12-18 北京晶派科技有限公司 Construction method of molecular prediction model and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination