Scoring method for three-dimensional similarity of molecules for virtual screening of drugs
Technical Field
The invention relates to the technical field of computer-aided drug research and development, in particular to a scoring method for three-dimensional similarity of molecules for drug virtual screening.
Background
The drug research and development has the characteristics of large investment, high risk and long period, and generally one drug research and development period is more than 10 years, the research and development investment is hundreds of millions of dollars, and the trend is rising year by year. Drug screening is a key link of drug discovery, and high-flux drug virtual screening can greatly reduce screening time and cost, and has important significance for accelerating drug research and development.
In the virtual screening of drugs, a scoring method based on three-dimensional similarity of molecules is commonly used at present to carry out screening and sorting of molecules. Such scoring methods typically involve similarity of molecular shape and similarity of pharmacophores (typically scoring of similarity of database molecules to template molecules or pharmacophore models) and constructing therefrom a comprehensive score of similarity by a simple weighting function, the effectiveness of which determines the effectiveness of the screening and the speed of calculation. However, a significant problem with scoring methods of the type described above is the lack of accuracy, which makes virtual screening of drugs highly false positive or false negative.
In recent years, the deep learning technology based on the artificial neural network has made a significant breakthrough in the fields of unmanned driving, voice recognition, image recognition and the like, and has also gradually made a significant progress in the biomedical field, and diagnostic applications of diseases such as skin cancer, congenital cataract, childhood autism and the like based on deep learning have been developed. With the development of technology, the pharmaceutical field is also focusing on accelerating the development of drugs by using deep learning technology to reduce the development cost. Research shows that the deep learning technology has more advantages in optimizing synthetic routes, predicting drug targets and virtual screening compared with the traditional machine learning method.
Therefore, the application of the deep learning technology in the aspect of drug virtual screening is urgently needed to be researched, so that the problem of high false positive or false negative in the existing drug virtual screening process is solved.
Disclosure of Invention
The invention aims to provide a scoring method for three-dimensional similarity of molecules for virtual screening of medicines, aiming at the defects in the prior art, so as to solve the problem of poor accuracy of the existing virtual screening of medicines, and simultaneously maintain high flux computing speed of screening.
The aim of the invention is achieved by the following technical scheme:
the method for scoring the three-dimensional similarity of molecules for virtual screening of drugs comprises the following steps: the method comprises the following steps:
step one, obtaining various characteristic parameters of two molecules for similarity comparison
Respectively reading topological structure and three-dimensional structure information of two molecules for similarity comparison, and calculating to obtain various characteristic parameters, wherein the characteristic parameters comprise: the atomic number difference (F1) between the two molecules; number of rotatable chemical bonds of two molecules (F2); volume difference of two molecules (F3); shape similarity of two molecules (F4); similarity of two molecular hydrogen bond acceptors (F5); similarity of two molecular hydrogen bond donors (F6); similarity of two aromatic rings (F7); hydrophobic center similarity of two molecules (F8); positively charged group similarity of two molecules (F9); and a negative group similarity (F10) of two molecules; wherein:
f1 is calculated by reading in the respective topological structure information of two molecules, and then taking the absolute value of the difference value of the total number of atoms of the two molecules;
the calculation mode of F2 is based on the calculation mode of F1, judging whether each chemical bond is a rotatable bond, obtaining the total number of the rotatable bonds of two molecules, and then taking the absolute value of the difference value of the total number of the rotatable bonds of two molecules;
f3 is calculated by obtaining Van der Waals radius of the atoms according to the types of the atoms in the two molecules on the basis of F1 calculation, wherein each atom is represented by a Gaussian sphere, the radius of the Gaussian sphere is the same as the Van der Waals radius of the atom, the position coordinate of the Gaussian sphere is the same as the coordinate of the atom, and the atomic coordinate is from the input molecular three-dimensional structure; calculating the superposition volume of Gaussian balls in each molecule, wherein the ij-th Gaussian ball group comprises Gaussian balls corresponding to the ith atom and Gaussian balls corresponding to the jth atom, and the superposition volume of the ij-th Gaussian ball group is v ij The method comprises the steps of carrying out a first treatment on the surface of the Calculating the superposition volume of each of the two molecules asN is the total number of atoms in the molecule; then taking the absolute value of the difference value of the superposition volumes of the two molecules;
the calculation mode of F4 is based on the calculation mode of F3, and the intermolecular superposition volume of two molecules under the condition of multiple superposition is calculatedWherein v is ij For the overlapping volume of the ith atom in the first molecule and the jth atom in the second molecule, N is the total number of atoms in the first molecule, M is the total number of atoms in the second molecule, the maximum of which is selectedAs the maximum intermolecular volume; calculate the shape similarity of two molecules +.>Wherein V is A Is the self-overlapping volume of the first molecule, V B Is the self-overlapping volume of the second molecule;
the calculation mode of F5 is based on the calculation mode of F1, and the positions of hydrogen bond acceptors in two molecules are found out; calculation of the overlap volume of the respective Hydrogen bond acceptors in the two moleculesWherein F is ij Is the superposition volume between the ith hydrogen bond acceptor and the jth hydrogen bond acceptor; calculating the superposition volume of intermolecular hydrogen bond acceptor under multiple superposition conditions>Wherein F is ij For the overlapping volume of the ith hydrogen bond acceptor in the first molecule and the jth hydrogen bond acceptor in the second molecule, N is the total number of hydrogen bond acceptors in the first molecule, M is the total number of hydrogen bond acceptors in the second molecule, the maximum value of which is selected->Superimposed volumes as the largest intermolecular hydrogen bond acceptors; calculation of Hydrogen bond acceptor similarity of two molecules +.>Wherein P is A Is the self-overlapping volume of the hydrogen bond acceptor in the first molecule, P B Is the self-overlapping volume of the hydrogen bond acceptor in the second molecule;
the calculation mode of F6 is the same as that of F5, and only the hydrogen bond acceptors in two molecules are replaced by hydrogen bond donors;
the calculation mode of F7 is the same as that of F5, and only the hydrogen bond acceptor is replaced by an aromatic ring;
the calculation mode of F8 is the same as that of F5, and only the hydrogen bond acceptor is replaced by a hydrophobic center;
the calculation mode of F9 is the same as that of F5, and only the hydrogen bond acceptor is replaced by a positive group;
the calculation mode of F10 is the same as that of F5, and only the hydrogen bond acceptor is replaced by a negative group;
training deep learning model
The DUD-E data set is adopted, 102 biological target point information is arranged in the data set, each target point is provided with a corresponding active component set and a corresponding Decoy component set, and the data of each target point is processed as follows:
selecting crystal structure molecules in the active molecule set of each target point as template molecules, and calculating F1-F10 characteristic parameters of every two molecules in the template molecules and the other molecules according to the calculation mode in the step one, so that each target point is calculated to obtain a set of characteristic parameter data;
modeling by using a deep learning method, taking the characteristic parameter data of each target obtained by calculation as input data, and taking whether the activity of molecules is taken as an objective function of two classifications, wherein the direction of model optimization is to minimize the error of the prediction of the activity of all the targets in the molecules, so that the average value of AUC values is maximized; after training is completed, a final deep learning model is obtained;
step three, external verification of deep learning model
Verifying generalization capability of a deep learning model by adopting a MUV data set, selecting 10 biological target point information in the MUV data set, wherein each target point has a corresponding active component set and a corresponding Decoy component set; selecting crystal structure molecules in the active molecule set of each target point as template molecules, and calculating F1-F10 characteristic parameters of every two molecules in the template molecules and other molecules according to the calculation mode in the first step with other molecules in the active molecule set of the target point and all molecules in the Decoy molecule set respectively; and inputting the characteristic parameters into a trained deep learning model, and calculating to obtain the AUC value of virtual screening of each target point.
In the above technical solution, in the first step, the three-dimensional structure information includes a total number of atoms in the molecule, a total number of chemical bonds, a type of each atom, and a coordinate value thereof.
In the above technical scheme, in the second step, an AUC value of virtual screening is calculated for each target point by adopting a 5-time cross validation mode.
The invention has the beneficial effects that:
the invention relates to a scoring method for three-dimensional similarity of molecules for virtual screening of drugs, which comprises the following steps of firstly, obtaining characteristic parameters of two molecules for similarity comparison, wherein the characteristic parameters mainly comprise atomic number difference of the two molecules, rotatable chemical bond number, volume difference, shape similarity, similarity of hydrogen bond acceptors, similarity of hydrogen bond donors, similarity of aromatic rings, hydrophobic center similarity, positive group similarity, negative group similarity and the like; training a deep learning model; and step three, external verification of the deep learning model, wherein the generalization capability of the model is verified by adopting the MUV data set. The multi-dimensional similarity selected in the step one is used as a characteristic parameter through training the deep neural network, and final comprehensive scoring, namely an AUC value, is given through transformation of the deep neural network. The evaluation index of the AUC value of the drug virtual screening is a common standard for evaluating the accuracy of the screening method, the AUC value ranges from 0 to 1, and the closer the value is 1, the more accurate the screening method is. Experiments prove that compared with the prior art, the invention has remarkable improvement on the drug screening accuracy, and simultaneously maintains the high flux calculation speed of screening. Therefore, the scoring method can screen as many molecules with potential biological activity as possible from a database containing a large number of molecules, and the more accurate the screening method is, the easier the molecules with potential activity can be found, so that any potential active molecules can be omitted as much as possible, and the problem of false positive or false negative is solved. Therefore, the invention has very wide application prospect in the aspect of drug virtual screening.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but is not limited thereto.
In a specific implementation manner of the scoring method for three-dimensional similarity of molecules for virtual drug screening in this embodiment, a target ADA17 in a DUD-E dataset is taken as an example, there are 102 biological target information in the DUD-E dataset, and each target has a corresponding active fraction set and a Decoy fraction set. Wherein the dataset of ADA17 targets contained 1,341 active fraction and 35,900 Decoy fraction. The active molecules in the crystal structure are respectively selected as template molecules (hereinafter referred to as "molecule A") and the first molecule in the active molecule set (hereinafter referred to as "molecule B"), and the following operations are performed on the data:
step one, obtaining various characteristic parameters of two molecules for similarity comparison:
the topological structure and three-dimensional structure information (including the total number of atoms, the total number of chemical bonds, the type of each atom, the coordinate value thereof and the like) of the molecule A and the molecule B are read, and then the characteristic parameters for similarity comparison are respectively calculated according to the following steps:
and step 1, taking the absolute value of the difference value of the total number of atoms of the molecules A and B to obtain a first characteristic parameter F1.
And 2, judging whether each chemical bond in the molecules A and B is a rotatable bond, respectively obtaining the total number of the rotatable bonds of the molecules A and B, and obtaining a second characteristic parameter F2 by taking the absolute value of the difference value of the total number of the two.
Step 3, obtaining the van der Waals radius of the atoms according to the types of the atoms in the molecules A and B, wherein each atom is represented by a Gaussian sphere, the radius of the Gaussian sphere is the same as the van der Waals radius of the atom, the position coordinate of the Gaussian sphere is the same as the coordinate of the atom, and the atomic coordinate is from the input three-dimensional structure of the molecule; calculating the superposition volume of a group of Gaussian balls (hereinafter referred to as Gaussian ball group) corresponding to any two atoms in the molecule A, wherein the ij-th Gaussian ball group comprises the Gaussian balls corresponding to the ith atom and the Gaussian balls corresponding to the j-th atom in the molecule A, and the superposition volume of the ij-th Gaussian ball group is v ij The method comprises the steps of carrying out a first treatment on the surface of the Calculate the superimposed volume of molecule A itself asN is the total number of atoms in molecule A; the same method was used to calculate the superimposed volume of molecule B itself asM is the total number of atoms in molecule B; and then taking the absolute value of the difference value of the self-overlapped volumes of the molecules A and B to obtain F3.
Step 4, calculating intermolecular overlapping volume of the molecules A and B under various overlapping conditionsWherein v is ij For the volume of superposition of the ith atom in molecule A and the jth atom in molecule B, N is the total number of atoms in molecule A, M is the total number of atoms in molecule B, the maximum of which is selected>As the maximum intermolecular volume; calculating the shape similarity of two moleculesWherein V is A Is the self-overlapping volume of molecule A, V B Is the self-folded volume of molecule B (i.e., calculated in step 4).
Step 5, finding out the positions of all hydrogen bond acceptors in the molecules A and B; calculation of the overlap volume of Hydrogen bond acceptors in molecule AWherein F is ij N is the total number of hydrogen bond acceptors in the molecule A, which is the superposition volume between the ith hydrogen bond acceptor and the jth hydrogen bond acceptor in the molecule A; the overlapping volume of the hydrogen bond acceptors in molecule B was calculated in the same way +.>Wherein F is ij The volume of superposition between the ith hydrogen bond acceptor and the jth hydrogen bond acceptor in the molecule B is M is the total number of the hydrogen bond acceptors in the molecule B; calculating the superposition of molecules A and B in various waysThe overlapping volume of the intermolecular hydrogen bond acceptors in the case +.>Wherein F is ij For the overlapping volume of the ith hydrogen bond acceptor in molecule A and the jth hydrogen bond acceptor in molecule B, N is the total number of hydrogen bond acceptors in molecule A, M is the total number of hydrogen bond acceptors in molecule B, the maximum of which is selected->A volume of overlap that acts as a hydrogen bond acceptor between molecules a and B; calculating the hydrogen bond acceptor similarity of molecules A and B +.>Wherein P is A Is the self-overlapping volume of the hydrogen bond acceptor in the molecule A, P B Is the self-overlapping volume of the hydrogen bond acceptors in molecule B.
And step 6, the calculation mode is consistent with that of the step 5, and the hydrogen bond acceptor similarity F6 of the molecules A and B can be obtained only by replacing the hydrogen bond donor with the hydrogen bond acceptor.
And 7, the calculation mode is consistent with that of the step 5, and the aromatic ring similarity F7 of the molecules A and B can be obtained only by replacing the hydrogen bond donor with an aromatic ring.
And 8, the calculation mode is consistent with that of the step 5, and the hydrophobic center similarity F8 of the molecules A and B can be obtained only by replacing the hydrogen bond donor with the hydrophobic center.
And 9, the calculation mode is consistent with that of the step 5, and the similarity F9 of the positive groups of the molecules A and B can be obtained only by replacing the hydrogen bond donor with the positive group.
And step 10, the calculation mode is consistent with that of the step 5, and the similarity F10 of the negative groups of the molecules A and B can be obtained only by replacing the hydrogen bond donor with the negative group.
Thus, the characteristic parameters F1 to F10 of the first molecule in the set of the molecule A and the active molecule, namely the molecule B, are obtained, and 10 total.
Training a deep learning model:
taking the DUD-E data set target ADA17 as an example, the same calculation method as in the first step is adopted to calculate the characteristic parameters of the second molecule (hereinafter referred to as "molecule C") in the molecule a and the active molecule set, and 10 corresponding characteristic parameters are obtained.
And similarly, calculating the third, fourth and fifth of the set of molecules respectively, the nth molecule (N is a natural number) until all the active molecules in the set of active molecules have been calculated, to obtain a number of characteristic parameters of 10×1341= 13,410.
Next, the characteristic parameters of each of the molecules a and Decoy in the set were calculated in the same manner as in step one, resulting in 10×35900= 359,000 characteristic parameters. So far, all the characteristic parameters of the target ADA17 are calculated.
Then, the characteristic parameter sets of the other 101 targets in the DUD-E data set are calculated by adopting the same calculation mode. To this end, there are 102 targets in the DUD-E dataset, each with a set of feature parameter data.
Finally, the 102 sets of characteristic parameter data are used as input characteristic data of a deep learning model, whether the activity of molecules is used as an objective function of two classifications, and the model optimization direction is that the error of the prediction of the activity of all targets in the molecules is minimized, so that the average value of AUC values is maximized. In the training process, a 5-time cross validation mode is adopted, the AUC value of virtual screening is calculated for each target point (see table 1), the average AUCaver of all 102 target point AUC values is taken, and the model optimization direction is that the AUCaver value is the largest. After training is completed, a final deep learning model can be obtained.
TABLE 1 AUC values calculated for the DUD-E dataset using the conventional method and the method of this example
Step three, external verification of the deep learning model:
the generalization ability of the deep learning model was verified using the MUV dataset. Taking the example of the target 466 in the MUV dataset, the target 466 dataset contains 31 active fractions and 15000 Decoy fractions. Active molecules in the crystal structure are selected as template molecules (hereinafter referred to as molecule a). And (3) obtaining characteristic parameters of the molecule A and each molecule in the active ingredient set and the Decoy molecule set respectively by adopting a calculation mode in the step two, and obtaining a data set of 10 (31+15000) = 150310 characteristic parameters in total. The AUC value of the target virtual screening can be calculated by inputting the characteristic parameters into a trained deep learning model (see Table 2).
Table 2. AUC values calculated for muv data sets using conventional methods and the method of this example
The evaluation index of the AUC value of the drug virtual screening is a common standard for evaluating the accuracy of the screening method, the AUC value ranges from 0 to 1, and the closer the value is 1, the more accurate the screening method is. As shown in the experimental results in tables 1 and 2, compared with the prior art, the invention has obvious improvement on the drug screening accuracy, and simultaneously, the calculation speed of high flux of screening is maintained. The scoring method can screen out as many molecules with potential biological activity as possible from a database containing a large number of molecules, and the more accurate the screening method is, the easier the molecules with potential activity can be found, so that any potential active molecules can be omitted as much as possible, and the problem of false positive or false negative is solved.
The above examples are provided for convenience of description of the present invention and are not to be construed as limiting the invention in any way, and any person skilled in the art will make partial changes or modifications to the invention by using the disclosed technical content without departing from the technical features of the invention.