CN112466410A - Method and device for predicting protein and ligand molecule binding free energy - Google Patents
Method and device for predicting protein and ligand molecule binding free energy
- Publication number
- CN112466410A (application number CN202011332960.XA)
- Authority
- CN
- China
- Prior art keywords
- protein
- binding
- machine learning
- ligand
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Crystallography & Structural Chemistry (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a method and a device for predicting the binding free energy of a protein and a ligand molecule. The method comprises the following steps: S1, constructing a local database; S2, analyzing the data to obtain the ligand molecule and the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule; S3, calculating the MACCS keys of the ligand molecule and of the adjacent amino acids; S4, calculating the ECFP fingerprints of the ligand molecule and of the adjacent amino acids; S5, converting the MACCS key and ECFP fingerprint information of the ligand molecule and of the protein into one-dimensional tensors to form a training set and a test set; S6, establishing a machine learning model and training it; S7, calculating and comparing the Pearson correlation coefficient and the absolute error to verify the prediction results of the machine learning model; S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule and calculating the binding free energy. The prediction results obtained with the combined fingerprint are markedly better than those obtained with a single fingerprint, which improves the accuracy of free energy prediction.
Description
Technical Field
The invention relates to the technical field of computer-aided drug screening, in particular to a method and a device for predicting the binding free energy of protein and ligand molecules.
Background
Screening tens of millions of molecules to arrive at a final lead molecule is a key problem in drug screening, and free energy calculation is at the core of high-throughput drug screening. With the rapid development of computer technology and computational theory, new drug development methods based on computer-aided drug design have emerged.
Computer-aided drug design is the science of designing drugs with the help of high-performance computers on the basis of physical laws. The application of computers to drug design marks the transition of drug development from blind screening to rational design. As computer performance has improved, the demands placed on predicting the properties of drug molecules have grown steadily. Traditional calculation methods are limited in both accuracy and computation time and cannot meet the current need to screen massive molecular libraries. Compared with approaches based on empirical formulas, the combination of artificial intelligence methods with big data has greatly changed the field of computer-aided drug design and has become a new drug design approach that surpasses traditional calculation.
In recent years, artificial intelligence techniques have gradually been applied to predicting the binding strength of proteins and drug molecules, including convolutional neural networks (CNN), support vector machines (SVM), random forests (RF), deep neural networks (DNN), and the like. These methods can extract the underlying regularities from the interaction structures of protein-small molecule complexes and thereby predict the binding free energy of a protein and a ligand molecule.
Artificial intelligence-assisted drug screening relies on molecular descriptors. Encoding the information in biomolecules is the bridge that connects high-performance computing with new drug development, and it occupies an important place in research on applying artificial intelligence to drug design. A molecular descriptor is a digitized representation of a molecule that encodes the molecule as a series of bit strings. For example, the substructure-based MACCS keys encode a molecule according to the presence or absence of certain substructures or features from a given structural list. The number of bits in such a molecular fingerprint is determined by the number of defined substructures, and each bit records the presence or absence of one given feature in the molecule, so the molecular structure is converted into a binary number consisting of 0s and 1s. Another type is the topology-based or path-based fingerprint, such as the extended connectivity fingerprint (ECFP), which is generated by analyzing all molecular fragments on the paths that start from an atom and extend up to a specified number of bonds, and then hashing the identifiers that correspond to the substructures on each path. The ECFP fingerprint of radius 2, known as ECFP4, is commonly used. This fingerprint can be applied to any molecule, its length can be adjusted, and it can be used for fast substructure search and molecule filtering.
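The difference between the two fingerprint families can be illustrated with a toy sketch in plain Python. It is not the MACCS or ECFP definition and not the code of any cheminformatics library: the four "keys", the hand-written ethanol graph, the CRC32 hashing, and the 64-bit length are all assumptions chosen only to show the idea of "one bit per predefined question" versus "hash each atom's growing neighbourhood".

```python
import zlib

# Toy molecule (hypothetical hand-written input): heavy atoms of ethanol.
ATOMS = ["C", "C", "O"]      # atom index -> element symbol
BONDS = [(0, 1), (1, 2)]     # undirected bonds between atom indices


def neighbors(n_atoms, bonds):
    """Build an adjacency list from a bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def substructure_keys(atoms, bonds):
    """Substructure-key fingerprint: one bit per predefined yes/no question."""
    adj = neighbors(len(atoms), bonds)
    keys = [
        "O" in atoms,                                               # any oxygen?
        "N" in atoms,                                               # any nitrogen?
        any({atoms[a], atoms[b]} == {"C", "O"} for a, b in bonds),  # a C-O bond?
        any(len(adj[i]) >= 3 for i in adj),                         # a branched atom?
    ]
    return [int(k) for k in keys]


def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Circular fingerprint: hash each atom's neighbourhood, then fold into bits."""
    adj = neighbors(len(atoms), bonds)
    ids = [zlib.crc32(a.encode()) for a in atoms]   # radius-0 identifiers
    seen = set(ids)
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            # Combine an atom's identifier with its neighbours' identifiers.
            env = (ids[i], tuple(sorted(ids[j] for j in adj[i])))
            new_ids.append(zlib.crc32(repr(env).encode()))
        ids = new_ids
        seen.update(ids)
    bits = [0] * n_bits
    for ident in seen:
        bits[ident % n_bits] = 1                    # fold into a fixed length
    return bits


keys = substructure_keys(ATOMS, BONDS)
fp = circular_fingerprint(ATOMS, BONDS)
print(keys, sum(fp), "bits set out of", len(fp))
```

The key-based vector has a fixed meaning per position, while the circular vector has an adjustable length and no predefined meaning per bit, which is why the two carry complementary information.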
Although molecular fingerprints have been used in many areas, such as quantitative structure-activity relationships and the prediction of the physicochemical properties of drug molecules, calculations are usually based on only one molecular fingerprint. Because a single molecular fingerprint contains only specific structural information, it has certain limitations. In particular, when predicting complex structures such as protein-ligand complexes (where the protein and the ligand molecule are encoded as one structure), the following problems may occur: 1) fingerprints are processed differently for different purposes, and some information is likely to be lost when the structure is converted into a molecular fingerprint; 2) the protein and the ligand molecule are encoded only as a whole, and the spatial structural specificity of each is ignored. Both problems mean that the accuracy of predicting molecular properties still needs to be improved.
Disclosure of Invention
To address these problems, the invention provides a method and a device for predicting the binding free energy of a protein and a ligand molecule, which effectively solve the technical problem of the low prediction accuracy of existing methods.
The technical scheme provided by the invention is as follows:
a method for predicting the binding free energy of a protein to a ligand molecule, comprising:
S1, acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them in a local server; based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database in a local database server through data preprocessing so as to store the PDB files of the proteins and ligand molecules;
S2, analyzing the data in the PDB files to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule;
S3, calculating the MACCS keys of the ligand molecules and of the amino acids in the binding pockets respectively;
S4, calculating the ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets respectively;
S5, converting the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pocket into one-dimensional tensors respectively, and encoding all molecules in the local database in the same way to form a training set and a test set;
S6, establishing a machine learning model in the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained;
S7, taking the binding constants predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error against the experimentally measured results, and thereby verifying the prediction results of the machine learning model;
S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
The invention also provides a device for predicting the binding free energy of a protein and a ligand molecule, which comprises:
a local database construction module, used for acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them in the local server, and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database in a local database server through data preprocessing so as to store the PDB files of the proteins and ligand molecules;
a data analysis module, used for analyzing the data in the PDB files to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule;
a fingerprint calculation module, used for calculating the MACCS keys of the ligand molecules and of the amino acids in the binding pockets respectively, and for calculating the ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets respectively;
an information conversion module, used for converting the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pocket into one-dimensional tensors respectively, and encoding all molecules in the local database in the same way to form a training set and a test set;
a model training module, used for establishing a machine learning model in the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained;
a prediction result verification module, used for taking the binding free energy predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error against the experimental results, and thereby verifying the prediction results of the machine learning model; and
a binding energy prediction module, used for predicting the binding constant of the protein and the ligand molecule with the trained machine learning model and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
The method and the device for predicting the binding free energy of a protein and a ligand molecule provided by the invention offer at least the following beneficial effects:
1) Drug molecules usually bind to cavities on the surface of or inside proteins; these sites are commonly referred to as substrate binding pockets. The type and position of the amino acids near these sites determine the physicochemical properties of the protein-small molecule interaction. Within the ligand binding pocket in particular, the parts of the protein structure closest to the ligand determine the specificity and strength of binding. The invention therefore encodes only the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule, which makes the calculation more targeted and reduces the amount of computation.
2) Unlike methods that merge the protein and ligand molecule structures into one structure before converting it into a molecular fingerprint, the invention processes the protein structure and the ligand molecule structure separately and then combines their fingerprints. This preserves more of the specificity of the protein structure and of the ligand molecule and, especially when the protein binding pocket or the compound molecule is structurally similar to the structure to be predicted, helps improve the generalization ability of the machine learning model.
3) Encoding the protein and the ligand molecule separately, each with two encoding schemes, exploits the strengths of the different molecular fingerprints and lets the information in the two fingerprints complement each other (for example, the MACCS keys capture atom types, bond types and the like but no topological information, whereas the ECFP fingerprint is a topological fingerprint that follows the chemical bonds and captures the specific structure around each atom). The different molecular fingerprints are verified with machine learning methods such as a neural network or a random forest.
Drawings
The foregoing features, technical features, advantages and embodiments are further described in the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic flow chart of a method for predicting the binding free energy of a protein and a ligand molecule according to the present invention;
FIG. 2 is a schematic diagram of an application of the combined fingerprint in an example of a deep neural network model according to the present invention;
FIG. 3 is a schematic diagram of an application of the combined fingerprint in an example of a random forest model according to the present invention;
FIG. 4 shows the Pearson coefficients predicted by the neural network model example using the combined fingerprint according to the present invention;
FIG. 5 shows the Pearson coefficients predicted by the random forest model example using the combined fingerprint according to the present invention;
FIG. 6 is a distribution comparison diagram of predicted values and true values of combined fingerprints according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Fig. 1 is a schematic flow chart of a method for predicting the binding free energy of a protein and a ligand molecule provided by the present invention, and as can be seen from the figure, the method comprises:
S1, acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them in a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established in a local database server through data preprocessing so as to store the PDB files of the proteins and ligand molecules.
The data used are derived from the PDBbind data set, which comprises two sub-data sets of different accuracy, namely a refined set and a general set. The refined set (comprising 4852 molecular structures of protein-small molecule complexes) contains PDB files of protein-small molecule complexes with high structural resolution and accurate, experimentally determined inhibition constants, while the general set contains PDB files of protein-small molecule complexes of varying resolution. Either the refined set or the general set can be used for predicting the binding energy of proteins and small molecules, and the choice can be made as required.
S2, the data in the PDB files are analyzed to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule. Here, the protein microenvironment within that distance of the ligand molecule is defined as the ligand binding pocket.
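Step S2 can be sketched as a distance filter over the fixed-width ATOM and HETATM records of a PDB file. The sketch below is an illustration under stated assumptions, not the patent's implementation: it assumes the ligand appears as HETATM records in the same file, the helper names (`pdb_line`, `parse_atoms`, `pocket_residues`) and the residue name `LIG` are hypothetical, and the 5.0 Å cutoff is a placeholder because the distance value is not reproduced in this text.

```python
import math


def pdb_line(record, serial, name, res_name, chain, res_num, x, y, z):
    """Format one fixed-width PDB coordinate record (helper for the demo)."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}"
            f"{res_num:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Return (record, residue name, chain, residue number, xyz) per atom."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((record, line[17:20].strip(), line[21], int(line[22:26]), xyz))
    return atoms


def pocket_residues(pdb_text, ligand_name, cutoff):
    """Residues with at least one atom within `cutoff` of any ligand atom."""
    atoms = parse_atoms(pdb_text)
    ligand = [a[4] for a in atoms if a[0] == "HETATM" and a[1] == ligand_name]
    pocket = set()
    for record, res_name, chain, res_num, xyz in atoms:
        if record != "ATOM":
            continue  # only protein atoms can belong to the pocket
        if any(math.dist(xyz, lig_xyz) <= cutoff for lig_xyz in ligand):
            pocket.add((chain, res_num, res_name))
    return sorted(pocket)


# Made-up three-atom example: ALA is 3 units from the ligand, GLY is 17 away.
EXAMPLE_PDB = "\n".join([
    pdb_line("ATOM", 1, "CA", "ALA", "A", 10, 0.0, 0.0, 0.0),
    pdb_line("ATOM", 2, "CA", "GLY", "A", 11, 20.0, 0.0, 0.0),
    pdb_line("HETATM", 3, "C1", "LIG", "A", 100, 3.0, 0.0, 0.0),
])

print(pocket_residues(EXAMPLE_PDB, "LIG", cutoff=5.0))  # placeholder cutoff
```

Selecting whole residues rather than individual atoms keeps each pocket amino acid intact, so a fingerprint can then be computed per amino acid as in steps S3 and S4.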
S3 calculates MACCS keys for the ligand molecules and the amino acids within the binding pocket, respectively.
S4 ECFP fingerprints for the ligand molecules and the amino acids in the binding pocket are calculated, respectively.
S5, converting the MACCS key and ECFP fingerprint information of ligand molecules and the MACCS key and ECFP fingerprint information of protein binding pocket amino acids into one-dimensional tensors respectively, and correspondingly encoding all molecules in a local database to form a training set and a test set.
The ratio of the training set to the test set can be chosen according to actual needs; for example, the refined set can be divided into a training set and a test set at a ratio of 8:2. So that the data can be fed into a machine learning model, the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the protein binding pocket amino acids are hashed into fixed-length data, for example of length 512 or 1024. In addition, in this process the combined fingerprint converts the PDB-format files of the protein-small molecule complexes into numerically encoded information that a computer can process, so that the structural information of the complexes can be handled more effectively.
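A minimal sketch of this encoding step is given below. The text only states that the fingerprints are hashed to a fixed length and split into training and test sets, so the specific choices here are assumptions: folding by `index % length`, concatenating the four fixed-length parts in a fixed order, and a seeded random shuffle for the 8:2 split. All function names are hypothetical, and the sample bit lists are invented placeholders.

```python
import random


def fold(bits, length):
    """Fold a bit list to a fixed length: bit i is OR-ed into slot i % length."""
    out = [0] * length
    for i, bit in enumerate(bits):
        if bit:
            out[i % length] = 1
    return out


def combined_fingerprint(lig_maccs, lig_ecfp, pocket_maccs, pocket_ecfp, length=512):
    """Concatenate the four fixed-length parts into one one-dimensional vector."""
    vector = []
    for part in (lig_maccs, lig_ecfp, pocket_maccs, pocket_ecfp):
        vector.extend(fold(part, length))
    return vector


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle reproducibly and cut at the requested fraction (8:2 by default)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten invented complexes; each gets a placeholder set of four bit lists.
samples = [combined_fingerprint([i % 2], [1, 0, 1], [0, 1], [1]) for i in range(10)]
train, test = train_test_split(samples)
print(len(samples[0]), len(train), len(test))
```

Keeping the ligand and pocket parts in separate, fixed slots of the vector is one simple way to preserve the separate encoding of protein and ligand that the method emphasizes.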
In this step, the MACCS keys and the ECFP fingerprints are encoded separately for the ligand molecule and for the amino acids in the protein binding pocket, so that the strengths of the two encoding schemes and the complementarity of their information are fully exploited in the subsequent binding energy prediction. MACCS predefines 166 keys covering element types and specified bonding types. ECFP, based on the connectivity around an atom, draws a circle around a selected atom and assigns a code to the substructure inside the circle, which captures how the atom is connected to its surroundings. Together they retain the atom-type and substructure information as well as the topological information around each atom.
S6, establishing a machine learning model in the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained. The machine learning model can be chosen according to the actual situation, for example a random forest model, a deep neural network model, an XGBoost model, a support vector machine model, and the like. During training, the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the protein binding pocket amino acids are used as the inputs, and the binding free energy is used as the output. Activation functions include, but are not limited to, ReLU. The deep neural network model can effectively learn the encoded information, extract latent regularities from the molecular fingerprints, and establish a nonlinear relation between the combined fingerprint and the binding free energy by nonlinear fitting.
S7, the binding constants predicted by the machine learning model are compared with the experimental results by calculating the Pearson correlation coefficient and the absolute error, which verifies the prediction results of the machine learning model. Specifically, the goodness of fit $r^2$ between the predicted values and the true values is calculated as the squared Pearson correlation coefficient in its standard form:

$$r^2 = \frac{\left[\sum_i \left(pK_{exp,i} - \overline{pK_{exp}}\right)\left(pK_{cal,i} - \overline{pK_{cal}}\right)\right]^2}{\sum_i \left(pK_{exp,i} - \overline{pK_{exp}}\right)^2 \sum_i \left(pK_{cal,i} - \overline{pK_{cal}}\right)^2}$$

where $pK_{exp}$ is the experimentally determined binding constant, $\overline{pK_{exp}}$ is the mean of the experimentally determined binding constants, $pK_{cal}$ is the binding constant predicted by the present method, and $\overline{pK_{cal}}$ is the mean of the binding constants predicted by the present method.
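The two verification metrics named in S7 can be computed as in the following sketch, which uses the textbook definitions of the Pearson correlation coefficient and the mean absolute error. The function names are hypothetical, and the three pK values are invented solely to exercise the code; they are not results from the patent.

```python
import math


def pearson_r(pk_exp, pk_cal):
    """Pearson correlation between experimental and predicted pK values."""
    n = len(pk_exp)
    mean_exp = sum(pk_exp) / n
    mean_cal = sum(pk_cal) / n
    cov = sum((e - mean_exp) * (c - mean_cal) for e, c in zip(pk_exp, pk_cal))
    var_exp = sum((e - mean_exp) ** 2 for e in pk_exp)
    var_cal = sum((c - mean_cal) ** 2 for c in pk_cal)
    return cov / math.sqrt(var_exp * var_cal)


def mean_absolute_error(pk_exp, pk_cal):
    """Average absolute difference between experimental and predicted pK."""
    return sum(abs(e - c) for e, c in zip(pk_exp, pk_cal)) / len(pk_exp)


# Invented illustrative values, not data from the source.
pk_exp = [4.0, 6.0, 8.0]
pk_cal = [4.5, 5.5, 8.5]

r = pearson_r(pk_exp, pk_cal)
print("r =", r, "r^2 =", r * r, "MAE =", mean_absolute_error(pk_exp, pk_cal))
```

Reporting both numbers is useful because the correlation measures how well the ranking and trend are reproduced, while the absolute error measures how far individual predictions are from the measured values.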
S8, the trained machine learning model is used to predict the binding constant of the protein and the ligand molecule, and the binding constant is converted into the binding free energy by the Arrhenius formula, in which $pK_{cal}$ is the binding constant predicted by the present method, $\Delta G$ is the binding free energy, $k_B$ is the Boltzmann constant, and $T$ is the absolute temperature.
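The conversion formula itself is not reproduced in this text, so the following sketch rests on an explicit assumption: that $pK_{cal}$ is the negative base-10 logarithm of a dissociation-type constant, in which case the standard thermodynamic relation gives $\Delta G = -\ln(10)\,k_B T\,pK_{cal}$. The sketch works per mole, replacing $k_B$ with the gas constant $R = N_A k_B$; the function name and the default temperature of 298.15 K are likewise assumptions.

```python
import math

# Gas constant in kcal/(mol*K); R = N_A * k_B, so this is the per-mole
# counterpart of the Boltzmann constant named in the text.
R_KCAL_PER_MOL_K = 1.987204259e-3


def binding_free_energy(pk_cal, temperature=298.15):
    """Convert a predicted pK into a binding free energy in kcal/mol.

    Assumes pK = -log10(K_d), so that delta_G = -ln(10) * R * T * pK.
    A larger pK (tighter binding) gives a more negative free energy.
    """
    return -math.log(10.0) * R_KCAL_PER_MOL_K * temperature * pk_cal


for pk in (3.0, 6.0, 9.0):
    print(pk, round(binding_free_energy(pk), 2))
```

Under this assumption each pK unit corresponds to roughly 1.36 kcal/mol at room temperature, which gives a quick sanity check on predicted values.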
Correspondingly, the invention also provides a device for predicting the binding free energy of a protein and a ligand molecule, which applies the above method for predicting the binding free energy of a protein and a ligand molecule. The device comprises: a local database construction module, used for acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them in the local server, and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database in a local database server through data preprocessing so as to store the PDB files of the proteins and ligand molecules; a data analysis module, used for analyzing the data in the PDB files to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule; a fingerprint calculation module, used for calculating the MACCS keys of the ligand molecules and of the amino acids in the binding pockets respectively, and for calculating the ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets respectively; an information conversion module, used for converting the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pocket into one-dimensional tensors respectively to form a training set and a test set; a model training module, used for establishing a machine learning model in the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained; a prediction result verification module, used for taking the binding free energy predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error against the experimental results, and thereby verifying the prediction results of the machine learning model; and a binding energy prediction module, used for predicting the binding free energy of the protein and the ligand molecule with the trained machine learning model.
In one example, the flow of the prediction method is described using a deep neural network model. As shown in FIG. 2, the method specifically includes the following steps:
S1, acquiring the three-dimensional structures of protein-ligand complexes (corresponding to the protein and ligand molecule complex in FIG. 2) from the PDBbind data set and storing them in a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established in a local database server through data preprocessing so as to store the PDB files of the proteins and ligand molecules.
S2, the data in the PDB files are analyzed to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a set distance of the ligand molecule.
S3 calculates MACCS keys for the ligand molecules and binding pocket amino acids, respectively (corresponding to the protein structure, ligand molecule structure codes in fig. 2).
S4 ECFP fingerprints (corresponding to the protein structure, ligand molecule structure codes in FIG. 2) were calculated for the ligand molecules and the amino acids in the binding pocket, respectively.
S5, converting the MACCS key and ECFP fingerprint information of ligand molecules and the MACCS key and ECFP fingerprint information of protein binding pocket amino acids into one-dimensional tensors, correspondingly encoding all molecules in a local database to form a combined fingerprint of the protein and the ligand molecules, and splitting the combined fingerprint into a training set and a testing set according to the ratio of 8:2.
S6, establishing a deep neural network model in the server, training the deep neural network model with the training set, and verifying the trained deep neural network model with the test set until a preset prediction performance is obtained. Specifically, the constructed deep neural network model comprises two hidden layers: the first hidden layer (fully connected) has 2048 neurons, the second hidden layer (fully connected) has 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
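The layer structure described in S6 can be sketched as a forward pass in plain Python. Only the hidden layer sizes (2048 and 100) and the dropout rate (0.1) come from the text; the ReLU activation on both hidden layers, the single linear output neuron, the random weight initialization, the class name `BindingNet`, and the small demo input size of 32 are assumptions. The weights are untrained, so the printed value is meaningless as a prediction, and a real implementation would normally use a deep learning framework.

```python
import random


def dense(n_in, n_out, rng):
    """Create a fully connected layer as (weights, biases), randomly initialized."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layer, x):
    """Apply one fully connected layer to the input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: active only during training, identity at inference."""
    if not training:
        return x
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


class BindingNet:
    """Input -> 2048 (ReLU, dropout 0.1) -> 100 (ReLU, dropout 0.1) -> 1 output."""

    def __init__(self, n_inputs, seed=0):
        self.rng = random.Random(seed)
        self.h1 = dense(n_inputs, 2048, self.rng)
        self.h2 = dense(2048, 100, self.rng)
        self.out = dense(100, 1, self.rng)   # assumed single regression output

    def predict(self, x, training=False):
        h = dropout(relu(forward(self.h1, x)), 0.1, self.rng, training)
        h = dropout(relu(forward(self.h2, h)), 0.1, self.rng, training)
        return forward(self.out, h)[0]


# Demo with a short invented bit vector; a real combined fingerprint is longer.
x = [1.0 if i % 3 == 0 else 0.0 for i in range(32)]
net = BindingNet(n_inputs=len(x))
y = net.predict(x)
print(y)
```

Dropout is disabled at inference in this sketch, so repeated calls on the same input give the same value, while passing `training=True` randomly zeroes about 10% of each hidden layer's activations.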
S7, calculating a combination constant obtained by predicting the machine learning model, comparing the Pearson correlation coefficient and the absolute error with the experimental result, and verifying the prediction result of the machine learning model.
S8, the machine learning model after training is used for predicting the binding constant of the protein and the ligand molecule, and the binding constant is converted into the binding freedom through an Alanikus formula, so that the prediction of the binding self-organizing ability is completed.
In this example, the hyperparameters are not optimized, and the same hyperparameters are used for the MACCS fingerprint alone, the ECFP fingerprint alone, and the combined fingerprint. In other examples, the prediction results may be further improved by hyperparameter optimization.
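To make the fixed hyperparameters of step S6 concrete, the sketch below counts the trainable parameters of the two-hidden-layer network (2048 and 100 neurons). The input length is not stated in the source: the 167-bit MACCS length, the 1024-bit ECFP length and the single regression output are assumptions made only for this illustration. Dropout layers add no trainable parameters.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


MACCS_BITS = 167   # assumption: a common MACCS key length
ECFP_BITS = 1024   # assumption: the ECFP length is not given in the source

# Combined fingerprint: MACCS + ECFP for the ligand, MACCS + ECFP for the pocket.
input_dim = 2 * (MACCS_BITS + ECFP_BITS)

# Hidden sizes 2048 and 100 come from step S6; the single output is an assumption.
layers = [input_dim, 2048, 100, 1]
total_params = dense_param_count(layers)
```

Nearly all parameters sit in the first layer, so the assumed fingerprint length dominates the model size.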
In another example, the flow of the prediction method is described using a random forest model. As shown in Fig. 3, the method specifically includes the following steps:
S1, acquiring the three-dimensional structure of the protein bound to the ligand molecule (corresponding to the protein and ligand molecule complex in Fig. 3) from the PDBbind data set and storing it in a local server; and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database in a local database server through data preprocessing to store the PDB files of the protein and ligand molecules.
S2, analyzing the data in the PDB file to obtain the ligand molecule and, centered on the ligand molecule, the amino acid molecules of the protein binding pocket within a given distance.
S3, calculating the MACCS keys of the ligand molecule and of the binding-pocket amino acids, respectively (corresponding to the protein structure and ligand molecule structure encodings in Fig. 3).
S4, calculating the ECFP fingerprints of the ligand molecule and of the binding-pocket amino acids, respectively (corresponding to the protein structure and ligand molecule structure encodings in Fig. 3).
S5, converting the MACCS key and ECFP fingerprint information of the ligand molecule and of the protein binding-pocket amino acids into one-dimensional tensors, respectively, and splitting them into a training set and a test set at a ratio of 7:3.
S6, establishing a random forest model in the server, training the random forest model by using the training set, and verifying the trained random forest model by using the testing set until a preset prediction effect is obtained.
S7, taking the binding constant predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error against the experimental result, and verifying the prediction result of the machine learning model.
S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
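Step S2, shared by both examples, keeps only the pocket amino acids that lie near the ligand. A minimal sketch of such a selection follows. The cutoff value is a caller-supplied parameter because the distance is not recoverable from the source text, and the "any atom within the cutoff of any ligand atom" criterion, the function name and the data layout are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def pocket_residues(
    ligand_atoms: List[Point],
    residues: Dict[str, List[Point]],
    cutoff: float,
) -> List[str]:
    """Return names of residues with at least one atom within `cutoff`
    of at least one ligand atom (assumed selection criterion)."""
    selected = []
    for name, atoms in residues.items():
        if any(
            math.dist(atom, lig) <= cutoff
            for atom in atoms
            for lig in ligand_atoms
        ):
            selected.append(name)
    return selected


# Toy coordinates, made up for the example.
toy_ligand = [(0.0, 0.0, 0.0)]
toy_residues = {
    "ALA1": [(1.0, 0.0, 0.0)],
    "GLY2": [(10.0, 0.0, 0.0)],
    "SER3": [(3.0, 4.0, 0.0), (20.0, 0.0, 0.0)],
}
toy_pocket = pocket_residues(toy_ligand, toy_residues, cutoff=5.0)
```

In practice the atom coordinates would be read from the PDB files stored in step S1; only the selected residues are passed on to the fingerprint calculations of steps S3 and S4.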
In this example, the hyperparameters are not optimized, and the same hyperparameters are used for the MACCS fingerprint alone, the ECFP fingerprint alone, and the combined fingerprint. In other examples, the prediction results may be further improved by hyperparameter optimization.
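Step S5 of both examples forms a combined fingerprint and then splits the samples (8:2 for the deep neural network, 7:3 for the random forest). The sketch below shows one way this could look with plain Python lists. The concatenation order, the random shuffling with a fixed seed and the function names are assumptions, since the source does not specify them.

```python
import random


def combine_fingerprints(lig_maccs, lig_ecfp, pocket_maccs, pocket_ecfp):
    """Concatenate the four bit vectors into one one-dimensional vector.
    The ordering is an assumption made for this sketch."""
    return list(lig_maccs) + list(lig_ecfp) + list(pocket_maccs) + list(pocket_ecfp)


def train_test_split(samples, train_fraction, seed=0):
    """Shuffle a copy of `samples` and split it at `train_fraction`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Tiny made-up bit vectors, far shorter than real fingerprints.
combined = combine_fingerprints([1, 0], [0, 1, 1], [1], [0, 0])

train_dnn, test_dnn = train_test_split(range(10), 0.8)  # 8:2 split
train_rf, test_rf = train_test_split(range(10), 0.7)    # 7:3 split
```

Using a fixed seed keeps the split reproducible, which matters when the individual and combined fingerprints are compared on the same samples.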
Fig. 4 shows the Pearson coefficients predicted in the example that uses the deep neural network model with the combined fingerprint, and Fig. 5 shows the Pearson coefficients predicted in the example that uses the random forest model with the combined fingerprint. In addition, with the deep neural network model, the mean square deviations between the predicted binding energy and the true value when the MACCS, ECFP and combined fingerprints are used are 1.55, 1.57 and 1.47, respectively; with the random forest model, the corresponding mean square deviations are 1.36, 1.34 and 0.85. In both embodiments, using the combined fingerprint reduces the mean square deviation between the predicted value and the true value and thus improves the calculation precision. Fig. 6 shows the distribution of the predicted values and the true values with the combined fingerprint, where (a) is obtained with the deep neural network model and (b) with the random forest model.
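The evaluation quantities mentioned above (Pearson correlation coefficient, absolute error, mean square deviation) can be sketched in a few lines. The helper names are illustrative, and the functions use their textbook definitions, which the source does not spell out. The last part applies a simple relative-reduction calculation to the mean square deviations reported in the text; attributing the first triple of values to the deep neural network follows from context.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_absolute_error(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def mean_squared_deviation(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def relative_reduction(baseline, combined):
    """Fractional decrease of `combined` relative to `baseline`."""
    return (baseline - combined) / baseline


# Mean square deviations as reported in the text.
reported = {
    "deep neural network": {"MACCS": 1.55, "ECFP": 1.57, "combined": 1.47},
    "random forest": {"MACCS": 1.36, "ECFP": 1.34, "combined": 0.85},
}
reduction_vs_maccs = {
    model: relative_reduction(v["MACCS"], v["combined"])
    for model, v in reported.items()
}
```

With the reported values, the combined fingerprint lowers the mean square deviation relative to MACCS alone by 37.5% for the random forest (from 1.36 to 0.85) and by roughly 5% for the deep neural network (from 1.55 to 1.47).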
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for persons skilled in the art, numerous modifications and adaptations can be made without departing from the principle of the present invention, and such modifications and adaptations should be considered as within the scope of the present invention.
Claims (8)
1. A method for predicting the binding free energy of a protein to a ligand molecule, comprising:
S1, acquiring the three-dimensional structure of the protein bound to the ligand molecule from the PDBbind data set and storing it in a local server; and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database in a local database server through data preprocessing to store the PDB files of the protein and ligand molecules;
S2, analyzing the data in the PDB file to obtain the ligand molecule and, centered on the ligand molecule, the amino acid molecules of the protein binding pocket within a given distance;
S3, calculating the MACCS keys of the ligand molecule and of the amino acids in the binding pocket, respectively;
S4, calculating the ECFP fingerprints of the ligand molecule and of the amino acids in the binding pocket, respectively;
S5, converting the MACCS key and ECFP fingerprint information of the ligand molecule and of the amino acids in the protein binding pocket into one-dimensional tensors, respectively, and encoding all molecules in the local database in this way to form a training set and a test set;
S6, establishing a machine learning model in the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction effect is obtained;
S7, taking the binding constant predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error between the predicted value and the true value, and verifying the prediction result of the machine learning model;
S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
3. A prediction method as claimed in claim 1 or 2, characterized in that in step S6, the machine learning model constructed is a random forest model or a deep neural network model.
4. The prediction method according to claim 3, wherein, when the machine learning model constructed in step S6 is a deep neural network model, the deep neural network model includes two hidden layers, wherein a first hidden layer includes 2048 neurons, a second hidden layer includes 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
5. An apparatus for predicting the binding free energy of a protein to a ligand molecule, comprising:
a local database construction module for acquiring the three-dimensional structure of the protein bound to the ligand molecule from the PDBbind data set and storing it in a local server; and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database in a local database server through data preprocessing to store the PDB files of the protein and ligand molecules;
a data analysis module for analyzing the data in the PDB file to obtain the ligand molecule and, centered on the ligand molecule, the amino acid molecules of the protein binding pocket within a given distance;
a fingerprint calculation module for calculating the MACCS keys of the ligand molecule and of the amino acids in the binding pocket, respectively, and for calculating the ECFP fingerprints of the ligand molecule and of the amino acids in the binding pocket, respectively;
an information conversion module for converting the MACCS key and ECFP fingerprint information of the ligand molecule and of the amino acids in the protein binding pocket into one-dimensional tensors to form a training set and a test set;
a model training module for establishing a machine learning model in the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction effect is obtained;
a prediction result verification module for taking the binding free energy predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error against the experimental result, and verifying the prediction result of the machine learning model; and
a binding energy prediction module for predicting the binding constant of the protein and the ligand molecule with the trained machine learning model, and converting the binding constant into the binding free energy through the Arrhenius formula to complete the prediction of the binding free energy.
7. A prediction apparatus as claimed in claim 5 or 6, characterized in that in the model training module, the machine learning model constructed is a random forest model or a deep neural network model.
8. The prediction apparatus according to claim 7, wherein, when the machine learning model constructed by the model training module is a deep neural network model, the deep neural network model includes two hidden layers, wherein a first hidden layer includes 2048 neurons, a second hidden layer includes 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332960.XA CN112466410B (en) | 2020-11-24 | 2020-11-24 | Method and device for predicting binding free energy of protein and ligand molecule |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112466410A (en) | 2021-03-09 |
CN112466410B (en) | 2024-02-20 |
Family
ID=74798301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332960.XA (Active), CN112466410B (en) | Method and device for predicting binding free energy of protein and ligand molecule | 2020-11-24 | 2020-11-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112466410B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117037946A (en) * | 2022-11-14 | 2023-11-10 | 上海微观纪元数字科技有限公司 | Method for optimizing structure of compound based on protein binding pocket |
CN117037946B (en) * | 2022-11-14 | 2024-05-10 | 合肥微观纪元数字科技有限公司 | Method for optimizing structure of compound based on protein binding pocket |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020055536A1 (en) * | 1996-09-26 | 2002-05-09 | Dewitte Robert S. | System and method for structure-based drug design that includes accurate prediction of binding free energy |
US20050053999A1 (en) * | 2000-11-14 | 2005-03-10 | Gough David A. | Method for predicting G-protein coupled receptor-ligand interactions |
CN102282560A (en) * | 2008-12-05 | 2011-12-14 | 狄克雷佩特公司 | Method for creating virtual compound libraries within markush structure patent claims |
US20120053067A1 (en) * | 2009-05-04 | 2012-03-01 | University Of Maryland, Baltimore | Method for binding site identification by molecular dynamics simulation (silcs: site identification by ligand competitive saturation) |
US20130230705A1 (en) * | 2012-03-02 | 2013-09-05 | Wisconsin Alumni Research Foundation | Patterning in the directed assembly of block copolymers using triblock or multiblock copolymers |
CN103869003A (en) * | 2012-12-13 | 2014-06-18 | 滇虹药业集团股份有限公司 | Establishing method of double-solvent fused HPLC fingerprint of medicinal phellodendron and standard fingerprint of medicinal phellodendron |
CN106777986A (en) * | 2016-12-19 | 2017-05-31 | 南京邮电大学 | Ligand molecular fingerprint generation method based on depth Hash in drug screening |
CN108399316A (en) * | 2018-03-02 | 2018-08-14 | 南京邮电大学 | Ligand molecular Feature Selection device and screening technique in drug design |
US20180285731A1 (en) * | 2017-03-30 | 2018-10-04 | Atomwise Inc. | Systems and methods for correcting error in a first classifier by evaluating classifier output in parallel |
US20190050538A1 (en) * | 2017-08-08 | 2019-02-14 | International Business Machines Corporation | Prediction and generation of hypotheses on relevant drug targets and mechanisms for adverse drug reactions |
US20190065667A1 (en) * | 2017-07-25 | 2019-02-28 | University Of Massachusetts Medical School | Method for Probing at Least One Binding Site of a Protein |
US20190272468A1 (en) * | 2018-03-05 | 2019-09-05 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and Methods for Spatial Graph Convolutions with Applications to Drug Discovery and Molecular Simulation |
US20190304568A1 (en) * | 2018-03-30 | 2019-10-03 | Board Of Trustees Of Michigan State University | System and methods for machine learning for drug design and discovery |
CN110910951A (en) * | 2019-11-19 | 2020-03-24 | 江苏理工学院 | Method for predicting protein and ligand binding free energy based on progressive neural network |
CN111126554A (en) * | 2018-10-31 | 2020-05-08 | 深圳市云网拜特科技有限公司 | Drug lead compound screening method and system based on generation of confrontation network |
CN111292800A (en) * | 2020-01-21 | 2020-06-16 | 中南大学 | Molecular characterization based on predicted protein affinity and application thereof |
Non-Patent Citations (5)
Title |
---|
DRUGAI: "Nat. Methods | 基于几何深度学习解密蛋白分子表面的相互作用指纹", pages 1 - 14, Retrieved from the Internet <URL: https://blog.csdn.net/u012325865/article/details/105683776> * |
H. M. ASHTAWY AND N. R. MAHAPATRA: "A Comparative Assessment of Predictive Accuracies of Conventional and Machine Learning Scoring Functions for Protein-Ligand Binding Affinity Prediction", IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, vol. 12, no. 2, pages 335 - 347, XP011578059, DOI: 10.1109/TCBB.2014.2351824 * |
XIE, LIANGXU et al.: "Improvement of prediction performance with conjoint molecular fingerprint in deep learning", FRONTIERS IN PHARMACOLOGY, vol. 11, pages 1 - 15 * |
肖高铿: "分子指纹及其在虚拟筛选中的应用", pages 1 - 8, Retrieved from the Internet <URL: http://blog.molcalx.com.cn/2019/01/29/fingerprint.html> * |
谢良旭, 薛亮亮, 李峰: "神经网络的深度与宽度对药物分子pKa预测性能影响的研究", 《江苏理工学院学报》, vol. 27, no. 2, pages 1 - 8 * |
Also Published As
Publication number | Publication date |
---|---|
CN112466410B (en) | 2024-02-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB03 | Change of inventor or designer information | Inventor after: Hu Shuai, Xie Liangxu, Xu Xiaojun, Xu Lei. Inventor before: Xie Liangxu, Xu Xiaojun, Xu Lei |
| | GR01 | Patent grant | |