Protein small molecule docking scoring scheme based on deep learning
Technical Field
The invention belongs to the field of protein micromolecules, and particularly relates to a protein micromolecule docking and scoring scheme based on deep learning.
Background
At present, the treatment of diseases is currently an important content of research in various countries, constituting a great risk for the health of humans. The research and development of drugs are the most important steps for disease treatment, and are the key and difficult points of academic, enterprise and even national institutes. The research and development of new drugs face the difficulties of long development period, high cost, uncertainty of side effects and the like, and the development difficulty is huge. With the widespread use of computers, it has become possible to simulate various steps in drug development using computers. At present, the use of computer simulation to assist drug development is indispensable in research and development, and can save time and money greatly.
As drug development, screening of small molecules, i.e. discovery of lead compounds, is a key step. The traditional method is usually carried out in a laboratory, and the docking of the small molecules and target protein is continuously carried out, and finally candidate small molecules are screened out. This approach is time consuming and laborious and those compounds which are not native are not easily found. And the small molecule library is utilized, and the computer is utilized for virtual screening, so that the screening time can be obviously reduced.
The interaction of small molecules and proteins is the essence and theoretical basis of molecular docking, and the interaction force comprises van der waals force, hydrogen bond, hydrophobic interaction, charge force and the like. Currently, most molecular docking software uses an experience-based molecular docking force formula, such as AutoDock, Dock, and the like. The force formulas used by various docking software are different in form, but essentially adopt results based on a priori knowledge. For the calculation of intermolecular acting force, the different algorithms adopt the position parameters including various forms, and most of the algorithms are continuously tested and improved to finally enable the formula to be closer to the actual acting force result.
With the rise of machine learning, scoring of molecular docking by using a deep learning model also becomes a hot point of research. The algorithm is characterized in that molecular descriptor vectors are respectively extracted from proteins and small molecules, and molecular docking activity values are predicted for different protein small molecule vector pairs.
However, this kind of method does not need to know the concept of molecular docking in advance, and uses the respective properties of protein and molecule to predict ligand, because the spatial structure cannot be described, the accuracy is still to be improved. In addition, deep learning models which use the grid values of the compound to predict exist, but the deep learning models often cannot comprehensively consider the compound and still have high limitation. The molecular docking software utilizes an empirical formula to calculate the molecular docking fraction, and although the formula is continuously close to the final result through fitting, the calculation process is too complex and the calculation time is long, so that the docking performance is seriously reduced. Meanwhile, the position parameters in the formula need to be fitted continuously to approach the actual condition, and the final application of the formula usually has a great degree of uncertainty.
Disclosure of Invention
The invention provides a protein small molecule docking scoring scheme based on deep learning, which can adopt a multi-channel grid, fully consider various information of a compound, is more suitable for model input of the deep learning, and accurately and quickly predict a docking score by scoring the multi-channel compound grid.
The technical scheme of the invention is realized as follows: a protein small molecule docking scoring scheme based on deep learning is characterized by comprising the following steps:
step 1, constructing a compound database, and selecting a plurality of protein ligand pairs as a training set, a plurality of protein ligand pairs as a verification set, and a plurality of protein ligand pairs as a test set;
step 2, establishing a surrounding grid for the compound to obtain a compound grid, and processing to obtain a multi-channel compound descriptor;
step 3, training the multichannel compound descriptor in the step 2 by using the training set in the step 1 to obtain a deep learning model;
step 4, verifying the deep learning model in the step 3 by using the verification set in the step 1;
and 5, obtaining a compound descriptor by using the test set in the step 1 and adopting the processing mode in the step 2, inputting the compound descriptor into the deep learning model in the step 3 for prediction, and obtaining a scoring value of the compound.
As a preferred embodiment, the database of complexes in step 1 comprises the PDBb i nd database, wherein 10000 pairs of protein ligand pairs are selected as training set, 2000 pairs of protein ligand pairs are selected as validation set, and the remaining 4151 pairs of protein ligand pairs are selected as test set.
In a preferred embodiment, the processing method in step 2 comprises the steps of performing grid discretization on a composite grid at intervals of 1 angstrom, and simultaneously performing 3 angstrom filling on the whole periphery of the grid.
As a preferred embodiment, the multi-channel complex descriptors in step 2 are divided into four complex descriptors according to the complex atom type, van der waals forces, hydrogen bonding forces, and charge.
As a preferred embodiment, the multi-channel compound descriptor comprises 6 channels divided according to compound atom types, wherein the atom types are C, H, N, O, P and other types which are totally 6, and 6 channels are correspondingly arranged, wherein only the single atom is considered in each channel.
As a preferred embodiment, the multi-channel composite descriptor comprises 1 channel arranged by van der waals forces.
As a preferred embodiment, the multichannel composite descriptor comprises 1 channel arranged according to hydrogen bonding forces.
As a preferred embodiment, the multi-channel complex descriptor comprises 1 channel set by charge.
As a preferred embodiment, the van der waals force of any one grid in a channel is calculated as follows:
where i is all atoms of the compound, is a two-atom potential energy well, σ is the distance between the two-atom potential energy wells at 0, and r is the actual distance of the atom pair.
As a preferred embodiment, the hydrogen bonding forces for any one of the grids in the channel are calculated as follows:
wherein, i is only H, N, O three yards and is two hydrogen bond potential energy traps forming hydrogen bond atom pairs, sigma is the distance of potential energy acting force of 0, and r is the distance of donor and acceptor atoms forming the hydrogen bonds.
After the technical scheme is adopted, the invention has the beneficial effects that:
1. and the compound descriptor of the protein small molecules based on the voxel grid is adopted, and the information such as the space geometric structure of the compound is fully considered.
2. By adopting the multi-channel grid, various information of the compound can be fully compounded, and the method is more suitable for model input of deep learning.
3. And (4) adopting deep learning intelligent scoring to score the multi-channel compound grid, thereby accurately and quickly predicting the docking score.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is an overall flow chart of molecular docking scoring;
FIG. 2 is a diagram of a lattice creation for an atom type channel.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 and 2, the PDBbind database contains detailed attribute information such as binding conformation, binding site, protein conformation, small molecule conformation, and corresponding binding activity of a protein and a ligand. The binding activity it provides can be used as the output of the final model. The latest version of the PDBbind database is version 2018, which contains 16151 proteins bound to ligands. The present invention uses 10000 pairs of protein-ligand pairs as training set, 2000 pairs as validation set, and the rest 4151 pairs as test.
The composite was first subjected to bounding box mesh creation and mesh discretization at 1 angstrom intervals, with 3 angstrom fill around the entire mesh. The resulting composite lattice is processed differently to obtain a multi-channel composite descriptor, including 6 channels divided by the type of composite atoms, 1 channel divided by van der waals forces, and 1 channel divided by hydrogen bonding forces.
For proteins and small molecules, the atom type is the most basic manifestation. The present invention divides all the types of the atoms into C, H, N, O, P types and other 6 types, and correspondingly sets 6 channels.
In each channel, only this single atom is considered. For example, C channel extracts coordinate information on all C in the compound, maps the coordinate information to the bounding box mesh of the compound according to the coordinate position of C, and fills all the meshes within the van der waals radius centered on the C atom to 1. If different C atoms cover the same grid, the grid values are superposed. This means that in this C-atom channel grid, the larger the grid value, the more C-atom van der waals radii it is within. The same treatment is carried out for other atom types, and finally a 6-channel composite grid is obtained.
In order to ensure smooth and coherent transition among different grids, the obtained 6-channel grid is subjected to Gaussian smoothing processing with sigma being 1, so that the obtained grid is smoother.
After the complex bounding box is obtained, the meaning of the median value in each grid needs to be clarified, traditional van der Waals calculations typically use the 6-12L ennard-Jones potential energy formula:
wherein i and j are atoms of molecule 1 and molecule 2 respectively and are potential energy traps of two atoms, sigma is the distance between the potential energy traps of two atoms and is 0, and r is the actual distance of an atom pair.
However, in the quantification of the actual grid values, two host molecules 1 and 2 calculated by van der waals force need to be defined, while for the grid, it is only one grid, which cannot generate acting force on the complex simply, and at the same time, the range span is large due to the calculation of van der waals force, which is not beneficial to the final training and learning. Thus, the present invention modifies the 6-12 formula of van der Waals forces. The van der waals forces for any one grid in a channel are calculated as follows:
where i is all atoms of the compound, is a two-atom potential energy well, σ is the distance between the two-atom potential energy wells at 0, and r is the actual distance of the atom pair. At EvdwThe atomic pair is not considered any more in the calculation of (2), and only the van der waals contribution of a single compound atom to the lattice is considered here. Mapping the sum of the contribution rates of all atoms of the final compound to the grid by adopting a sigmoid function to ensure that the final range of the grid value is positionedBetween 0 and 1. Calculating the van der waals force contribution of the compound atoms to all the grids to obtain the final van der waals force channel grid.
For the calculation of hydrogen bonds, it is necessary to consider only H, or N or O, which may have a strong lone pair, that forms the hydrogen bond, and therefore, only these three atoms are considered in calculating the hydrogen bond.A conventional hydrogen bond calculation formula is the 10-12L ennard-Jones potential energy formula:
wherein, i and j are respectively an acceptor atom and a donor atom of a hydrogen bond, two hydrogen bond potential energy traps forming a hydrogen bond atom pair, sigma is the distance of potential energy acting force of 0, and r is the distance of the donor atom and the acceptor atom forming the hydrogen bond.
For hydrogen bonding channels, we use a similar approach to van der waals force channels:
except that only H, N, O atoms are taken as i, and finally the grid values of all grids are calculated, so that the hydrogen bonding force channel grids are obtained.
The charge also has important contribution to the binding between molecules and is one of the important indexes of scoring. Here, we map the partial charges of all atoms into a grid according to their coordinate positions on the basis of a complex grid, so as to obtain a corresponding charge channel grid.
A3D convolutional neural network similar to VGG is adopted to construct a model, and the specific model is shown in the following table:
the final output of the model is a predicted value of activity, i.e., a score value.
10000 proteins of PDBbind (v2018) were randomly chosen for training, 2000 proteins for validation, the purpose of the validation set was to prevent overfitting and when the model stopped training.
During testing and prediction, the complex grid creation and 9-channel processing are also adopted for a new complex conformation, and finally the 9-channel grid is obtained and input into a model for prediction to obtain a scoring value of the complex.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.