CN112466410A - Method and device for predicting protein and ligand molecule binding free energy - Google Patents

Info

Publication number
CN112466410A
Authority
CN
China
Prior art keywords
protein
binding
machine learning
ligand
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011332960.XA
Other languages
Chinese (zh)
Other versions
CN112466410B (en)
Inventor
谢良旭
许晓军
许磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu China Israel Industrial Technology Research Institute
Jiangsu University of Technology
Original Assignee
Jiangsu China Israel Industrial Technology Research Institute
Jiangsu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu China Israel Industrial Technology Research Institute, Jiangsu University of Technology filed Critical Jiangsu China Israel Industrial Technology Research Institute
Priority to CN202011332960.XA priority Critical patent/CN112466410B/en
Publication of CN112466410A publication Critical patent/CN112466410A/en
Application granted granted Critical
Publication of CN112466410B publication Critical patent/CN112466410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30Prediction of properties of chemical compounds, compositions or mixtures
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a device for predicting the binding free energy of a protein and a ligand molecule. The method comprises the following steps: S1, constructing a local database; S2, analyzing the data to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original); S3, calculating MACCS keys of the ligand molecules and the adjacent amino acids; S4, calculating ECFP fingerprints of the ligand molecules and the adjacent amino acids; S5, converting the MACCS key and ECFP fingerprint information of the ligand molecules and proteins into one-dimensional tensors to form a training set and a test set; S6, establishing a machine learning model and training it; S7, calculating and comparing the Pearson correlation coefficient and the absolute error to verify the prediction result of the machine learning model; S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and calculating the binding free energy. Compared with the prediction obtained from a single fingerprint, the prediction obtained from the combined fingerprint is significantly better, which improves the accuracy of free energy prediction.

Description

Method and device for predicting protein and ligand molecule binding free energy
Technical Field
The invention relates to the technical field of computer-aided drug screening, in particular to a method and a device for predicting the binding free energy of protein and ligand molecules.
Background
Screening a final lead molecule out of tens of millions of candidate molecules is a key problem in drug screening, and free energy calculation is at the core of high-throughput drug screening. With the rapid development of computer technology and computational theory, computer-aided drug design has emerged as a new approach to drug development.
Computer-aided drug design uses high-performance computers and physical laws to design drugs. The application of computers to drug design marks the transition of drug development from blind screening to rational design. As computer performance has grown, so has the demand for accurate prediction of the properties of drug molecules. Traditional calculation methods are limited in both accuracy and computation time, and cannot meet the current need to screen massive molecular libraries. Compared with approaches based on empirical formulas, the combination of artificial intelligence methods with big data has substantially changed computer-aided drug design and has become a new drug design approach that goes beyond the traditional calculation mode.
In recent years, artificial intelligence techniques have gradually been applied to predicting the binding strength of proteins and drug molecules, including convolutional neural networks (CNN), support vector machines (SVM), random forests (RF), deep neural networks (DNN), and the like. These methods can learn the underlying regularities from the interaction structures of protein-small molecule complexes and thereby predict the binding free energy of a protein and a ligand molecule.
Artificial-intelligence-assisted drug screening relies on molecular descriptors: the encoding of biomolecular information is the bridge connecting high-performance computing and new drug development, and it has become an active research topic in the application of artificial intelligence to drug design. A molecular descriptor is a digitized representation of a molecule that encodes the molecule as a bit string. For example, substructure-based MACCS keys encode a molecule according to the presence or absence of certain substructures or features from a given structural list. The number of bits of such a molecular fingerprint is determined by the number of defined substructures, and each bit indicates the presence or absence of a given feature in the molecule, so that the molecular structure is converted into a binary number consisting of 0s and 1s. Another type is the topology-based or path-based fingerprint, such as the extended connectivity fingerprint (ECFP), which is generated by analyzing all molecular fragments on paths that start from an atom and extend up to a specified number of bonds, and then hashing the identifiers of the substructures found on each path. An ECFP fingerprint of radius 2, known as the ECFP4 fingerprint, is often used. This kind of fingerprint is applicable to any molecule, its length can be adjusted, and it can be used for fast substructure search and molecule filtering.
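The two encoding ideas described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the MACCS or ECFP algorithm: the key list `TOY_KEYS`, the feature labels, and the fragment strings are hypothetical, and no substructure perception is performed (a real implementation would use a cheminformatics toolkit). It only shows how a key-based fingerprint assigns one bit per predefined feature, and how a hashed fingerprint maps fragment identifiers to bit positions of an adjustable length.

```python
import hashlib

# Hypothetical, tiny key list for illustration; the real MACCS set has 166 predefined keys.
TOY_KEYS = ["has_N", "has_O", "has_ring", "has_C=O"]


def key_fingerprint(features, keys=TOY_KEYS):
    """Key-based fingerprint: one bit per predefined key, 1 if the feature is present."""
    return [1 if key in features else 0 for key in keys]


def hashed_fingerprint(fragments, n_bits=16):
    """Hashed fingerprint: each fragment identifier is hashed to one bit position.

    The fragment identifiers are assumed to be given as strings; enumerating
    them from a molecular graph is outside the scope of this sketch.
    """
    bits = [0] * n_bits
    for fragment in fragments:
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits[position] = 1
    return bits


# Toy usage with made-up inputs.
print(key_fingerprint({"has_N", "has_C=O"}))
print(hashed_fingerprint(["C(=O)N", "N"], n_bits=16))
```

Changing `n_bits` changes the fingerprint length, which corresponds to the adjustable length mentioned for hashed fingerprints; shorter lengths increase the chance that two fragments collide on the same bit.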
Although molecular fingerprints have been used in many areas, such as quantitative structure-activity relationships and the prediction of physicochemical properties of drug molecules, calculations are usually based on only one molecular fingerprint. Because a single molecular fingerprint contains only specific structural information, it has certain limitations. In particular, when predicting properties of complex structures such as a protein with its ligand molecule (where the protein and the ligand molecule are encoded as one structure), the following problems may occur: 1) fingerprints are processed differently for different purposes, and some information is likely to be lost when the structure is converted into a molecular fingerprint; 2) the protein and the ligand molecule are encoded only as a whole, so the specificity of their respective spatial structures is ignored. As a result, the accuracy of predicting molecular properties still needs to be improved.
Disclosure of Invention
In view of the above problems, the invention provides a method and a device for predicting the binding free energy of a protein and a ligand molecule, which effectively address the low prediction accuracy of existing methods.
The technical scheme provided by the invention is as follows:
A method for predicting the binding free energy of a protein and a ligand molecule, comprising:
S1, acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server; based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules;
S2, analyzing the data in the PDB files to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original);
S3, calculating MACCS keys of the ligand molecules and of the amino acids in the binding pockets, respectively;
S4, calculating ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets, respectively;
S5, converting the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pocket into one-dimensional tensors, respectively, and encoding all molecules in the local database in the same way to form a training set and a test set;
S6, establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained;
S7, obtaining the binding constants predicted by the machine learning model and comparing them with experimentally measured values by means of the Pearson correlation coefficient and the absolute error, thereby verifying the prediction result of the machine learning model;
S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
The invention also provides a device for predicting the free energy of protein and ligand molecule combination, which comprises:
a local database construction module, used for acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server, and for establishing, based on the crystal structure data in the PDB database and the PDBbind database, a local database on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules;
a data analysis module, used for analyzing the data in the PDB files to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original);
a fingerprint calculation module, used for calculating MACCS keys of the ligand molecules and of the amino acids in the binding pockets, respectively, and for calculating ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets, respectively;
an information conversion module, used for converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors, respectively, and for encoding all molecules in the local database in the same way to form a training set and a test set;
a model training module, used for establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained;
a prediction result verification module, used for obtaining the binding free energy predicted by the machine learning model and comparing it with the experimental results by means of the Pearson correlation coefficient and the absolute error, thereby verifying the prediction result of the machine learning model; and
a binding energy prediction module, which predicts the binding constant of the protein and the ligand molecule using the trained machine learning model and converts the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
The method and the device for predicting the free energy of the combination of the protein and the ligand molecule, provided by the invention, can at least bring the following beneficial effects:
1) Drug molecules usually bind to cavities on the surface of or within proteins; these sites are commonly referred to as substrate binding pockets. The type and position of the amino acids near these sites determine the physicochemical properties of the protein-small molecule interaction. Within the ligand binding pocket in particular, the protein structure closest to the ligand determines the specificity and strength of binding. The invention therefore encodes the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original), which makes the calculation more targeted and reduces the amount of computation.
2) Unlike methods that merge the protein and ligand structures and convert them into a single molecular fingerprint, the invention processes the protein structure and the ligand structure separately and then combines their fingerprints. This preserves more of the specificity of the protein structure and of the ligand molecule, and it helps improve the generalization ability of the machine learning model, especially when a protein binding pocket or compound molecule is structurally similar to the structure to be predicted.
3) By encoding the protein and the ligand molecule separately and using two encoding modes, the advantages of the different molecular fingerprints are exploited and the information in the two fingerprints is complementary. For example, the MACCS key contains atom types, bond types, and the like, but no topological information, whereas the ECFP fingerprint is a topological fingerprint along chemical bonds and contains specific structural information around each atom. The different molecular fingerprints are verified using machine learning methods such as a neural network or a random forest.
Drawings
The foregoing features, technical features, advantages and embodiments are further described in the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic flow chart of a method for predicting the binding free energy of a protein and a ligand molecule according to the present invention;
FIG. 2 is a schematic diagram of an application of the combined fingerprint in an example of a deep neural network model according to the present invention;
FIG. 3 is a schematic diagram of an application of the combined fingerprint in an example of a random forest model according to the present invention;
FIG. 4 shows the Pearson coefficients predicted by the neural network model example using the combined fingerprint according to the present invention;
FIG. 5 shows the Pearson coefficients predicted by the random forest model example using the combined fingerprint according to the present invention;
FIG. 6 is a distribution comparison diagram of predicted values and true values of combined fingerprints according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Fig. 1 is a schematic flow chart of a method for predicting the binding free energy of a protein and a ligand molecule provided by the present invention, and as can be seen from the figure, the method comprises:
S1, acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules.
The data are taken from the PDBbind data set, which comprises two subsets of different accuracy: a refined set and a general set. The refined set (4852 protein-small molecule complex structures) contains PDB files of protein-small molecule complexes that have high structural resolution and accurate, experimentally determined inhibition constants. The general set contains PDB files of protein-small molecule complexes of varying structural resolution. For predicting the binding energy of proteins and small molecules, either the refined set or the general set can be used, as required.
S2, the data in the PDB files are analyzed to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original). Here, the microenvironment formed by the protein molecules within this distance of the ligand molecule is defined as the ligand binding pocket.
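The pocket selection in S2 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the original: the ligand is taken to be stored as `HETATM` records with a known residue name in the same PDB file as the protein, the fixed-column PDB layout is used for parsing, and the cutoff is left as a parameter because its value is not recoverable from the text. The function names `parse_atoms` and `pocket_residues` are hypothetical.

```python
import math


def parse_atoms(pdb_text):
    """Parse ATOM/HETATM records using the fixed-column PDB layout.

    Returns tuples (record, residue_name, chain_id, residue_number, (x, y, z)).
    """
    atoms = []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((record, line[17:20].strip(), line[21], int(line[22:26]), xyz))
    return atoms


def pocket_residues(atoms, ligand_name, cutoff):
    """Return residues having at least one atom within `cutoff` of any ligand atom.

    `cutoff` is in the same unit as the coordinates (angstroms in PDB files).
    """
    ligand_xyz = [a[4] for a in atoms if a[0] == "HETATM" and a[1] == ligand_name]
    pocket = set()
    for record, res_name, chain, res_seq, xyz in atoms:
        if record != "ATOM":
            continue  # only protein atoms can belong to the pocket
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            pocket.add((chain, res_seq, res_name))
    return sorted(pocket)
```

Selecting whole residues as soon as one of their atoms falls inside the cutoff is one possible reading of "amino acid molecules within the distance"; a centroid-based criterion would be an equally plausible alternative.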
S3, MACCS keys are calculated for the ligand molecules and for the amino acids within the binding pocket, respectively.
S4, ECFP fingerprints are calculated for the ligand molecules and for the amino acids within the binding pocket, respectively.
S5, the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the protein binding pocket amino acids are converted into one-dimensional tensors, respectively, and all molecules in the local database are encoded in the same way to form a training set and a test set.
The ratio of the training set to the test set can be set as needed; for example, the refined set may be split into a training set and a test set at a ratio of 8:2. To feed the data into a machine learning model, the MACCS key and ECFP fingerprint information of the ligand molecules and of the protein binding pocket amino acids is hashed into fixed-length data, for example of length 512 or 1024. In this process, the combined fingerprint converts the PDB-format files of the protein-small molecule complexes into numerically encoded information that a computer can process, so that the structural information of the complexes can be handled more effectively.
In this step, MACCS keys and ECFP fingerprints are computed separately for the ligand molecule and for the amino acids in the protein binding pocket, so that the advantages of the two encodings and the complementarity of the encoded information are fully exploited in the subsequent binding energy prediction. MACCS predefines 166 keys covering element types and specified bonding types. ECFP draws a circle around a selected atom based on the connectivity around that atom and assigns a code to the substructure inside the circle, which captures the connectivity between the atom and its surroundings. Together, the two encodings retain atom type and substructure information as well as the topological information around each atom.
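A minimal Python sketch of step S5 follows. The text states that the fingerprints are brought to a fixed length and turned into one-dimensional tensors, and that the data may be split 8:2, but it does not specify how the four parts are laid out. The sketch therefore assumes, purely for illustration, that each part is folded to the same length by a modulo rule and that the parts are concatenated in a fixed order; the function names and the fixed random seed are likewise hypothetical.

```python
import random


def fold(bits, length):
    """Fold a bit list to a fixed length by OR-ing positions modulo `length`.

    Inputs shorter than `length` are effectively zero-padded.
    """
    folded = [0] * length
    for index, bit in enumerate(bits):
        if bit:
            folded[index % length] = 1
    return folded


def combined_fingerprint(lig_maccs, lig_ecfp, pocket_maccs, pocket_ecfp, length=512):
    """Concatenate the four fixed-length parts into one 1-D vector (assumed layout)."""
    vector = []
    for part in (lig_maccs, lig_ecfp, pocket_maccs, pocket_ecfp):
        vector.extend(fold(part, length))
    return vector


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle reproducibly and split into training and test sets (8:2 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]
```

Keeping the ligand and pocket parts in separate, fixed segments of the vector reflects the stated idea of encoding the protein and the ligand separately before combining them.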
S6, a machine learning model is established on the server, trained with the training set, and tested with the test set until a preset prediction performance is obtained. The machine learning model can be chosen according to the actual situation, for example a random forest model, a deep neural network model, an XGBoost model, or a support vector machine model. During training, the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the protein binding pocket amino acids are used as inputs, and the binding free energy is used as the output. Activation functions include, but are not limited to, ReLU. The deep neural network model can effectively learn the encoded information, extract underlying regularities from the molecular fingerprints, and establish a nonlinear relationship between the combined fingerprint and the binding free energy by nonlinear fitting.
S7, the binding constants predicted by the machine learning model are obtained and compared with the experimental results by means of the Pearson correlation coefficient and the absolute error, so as to verify the prediction result of the machine learning model. Specifically, the goodness of fit $r^2$ between the predicted values and the true values is calculated by the following formula:
[Formula: image BDA0002796321140000061 in the original]
where $pK_{exp}$ is the experimentally determined binding constant, $\overline{pK}_{exp}$ is the mean of the experimentally determined binding constants, $pK_{cal}$ is the binding constant predicted by the present method, and $\overline{pK}_{cal}$ is the mean of the binding constants predicted by the present method.
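The formula itself is an image in the original and is not reproduced here. One standard form that is consistent with the four quantities defined in S7 is the squared Pearson correlation coefficient, $r^2 = \frac{\left[\sum_i (pK_{exp,i} - \overline{pK}_{exp})(pK_{cal,i} - \overline{pK}_{cal})\right]^2}{\sum_i (pK_{exp,i} - \overline{pK}_{exp})^2 \, \sum_i (pK_{cal,i} - \overline{pK}_{cal})^2}$. Whether the original uses exactly this form is an assumption. The Python sketch below computes this quantity together with a mean absolute error; the function names are hypothetical.

```python
def pearson_r2(pk_exp, pk_cal):
    """Squared Pearson correlation between experimental and predicted values.

    Assumed form of the verification formula; both inputs must have the same
    length and non-zero variance.
    """
    n = len(pk_exp)
    mean_exp = sum(pk_exp) / n
    mean_cal = sum(pk_cal) / n
    covariance = sum((e - mean_exp) * (c - mean_cal) for e, c in zip(pk_exp, pk_cal))
    spread_exp = sum((e - mean_exp) ** 2 for e in pk_exp)
    spread_cal = sum((c - mean_cal) ** 2 for c in pk_cal)
    return covariance * covariance / (spread_exp * spread_cal)


def mean_absolute_error(pk_exp, pk_cal):
    """Average absolute difference between experimental and predicted values."""
    return sum(abs(e - c) for e, c in zip(pk_exp, pk_cal)) / len(pk_exp)
```

Note that a high $r^2$ only indicates a linear relationship between predictions and measurements; the absolute error is needed as well, because predictions can be perfectly correlated with the measurements and still be systematically offset.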
S8, the trained machine learning model is used to predict the binding constant of the protein and the ligand molecule, and the binding constant is converted into the binding free energy by the Arrhenius formula as follows:
[Formula: image BDA0002796321140000064 in the original]
where $pK_{cal}$ is the binding constant predicted by the present method, $\Delta G$ is the binding free energy, $k_B$ is the Boltzmann constant, and $T$ is the absolute temperature.
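The conversion formula is an image in the original and is not reproduced here. A commonly used relation that involves exactly the symbols defined in S8 is $\Delta G = -k_B T \ln(10)\, pK_{cal}$, which follows from $\Delta G = k_B T \ln K$ if $pK_{cal}$ is taken to be the negative base-10 logarithm of a dissociation-type constant $K$. Both this sign convention and the exact form are assumptions. The Python sketch below applies the relation per mole, replacing $k_B$ by the molar gas constant $R = k_B N_A$ so that the result is in kcal/mol; the function name and the default temperature are hypothetical.

```python
import math

# Molar gas constant in kcal/(mol*K); equals the Boltzmann constant times Avogadro's number.
R_KCAL_PER_MOL_K = 1.987204e-3


def pk_to_delta_g(pk, temperature=298.15):
    """Convert a pK value to a binding free energy in kcal/mol.

    Assumes pK = -log10(K) for a dissociation-type constant K, so that
    delta_G = -R * T * ln(10) * pK (more negative means stronger binding).
    """
    return -R_KCAL_PER_MOL_K * temperature * math.log(10.0) * pk


# Under these assumptions, pK = 6 at 298.15 K gives roughly -8.2 kcal/mol.
print(round(pk_to_delta_g(6.0), 2))
```

At room temperature each pK unit corresponds to about 1.36 kcal/mol under these assumptions, which gives a quick way to relate errors in predicted pK to errors in the binding free energy.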
Correspondingly, the invention also provides a device for predicting the binding free energy of a protein and a ligand molecule, which applies the above method and comprises: a local database construction module, used for acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server, and for establishing, based on the crystal structure data in the PDB database and the PDBbind database, a local database on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules; a data analysis module, used for analyzing the data in the PDB files to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original); a fingerprint calculation module, used for calculating MACCS keys of the ligand molecules and of the amino acids in the binding pockets, respectively, and for calculating ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets, respectively; an information conversion module, used for converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors, respectively, to form a training set and a test set; a model training module, used for establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained; a prediction result verification module, used for obtaining the binding free energy predicted by the machine learning model and comparing it with the experimental results by means of the Pearson correlation coefficient and the absolute error, thereby verifying the prediction result of the machine learning model; and a binding energy prediction module, used for predicting the binding free energy of the protein and the ligand molecule using the trained machine learning model.
In an example, a flow of the prediction method is described by taking a deep neural network model as an example, as shown in fig. 2, the method specifically includes the following steps:
S1, acquiring the three-dimensional structures of protein-ligand complexes (corresponding to the protein and ligand molecule complex in FIG. 2) from the PDBbind data set and storing them on a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules.
S2, the data in the PDB files are analyzed to obtain the ligand molecules and the amino acid molecules of the protein binding pocket that lie within a cutoff distance of the ligand molecule (the distance value appears only as an image in the original).
S3, MACCS keys are calculated for the ligand molecules and for the binding pocket amino acids, respectively (corresponding to the protein structure and ligand molecule structure codes in FIG. 2).
S4, ECFP fingerprints are calculated for the ligand molecules and for the amino acids in the binding pocket, respectively (corresponding to the protein structure and ligand molecule structure codes in FIG. 2).
S5, the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the protein binding pocket amino acids are converted into one-dimensional tensors, all molecules in the local database are encoded in the same way to form combined fingerprints of the proteins and ligand molecules, and the combined fingerprints are split into a training set and a test set at a ratio of 8:2.
S6, a deep neural network model is established on the server, trained with the training set, and verified with the test set until a preset prediction performance is obtained. Specifically, the deep neural network model comprises two hidden layers: the first hidden layer (fully connected) has 2048 neurons, the second hidden layer (fully connected) has 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
S7, the binding constants predicted by the machine learning model are obtained and compared with the experimental results by means of the Pearson correlation coefficient and the absolute error, so as to verify the prediction result of the machine learning model.
S8, the trained machine learning model is used to predict the binding constant of the protein and the ligand molecule, and the binding constant is converted into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
In this example, no hyperparameter optimization is performed, and the same hyperparameters are used for the MACCS-only, ECFP-only, and combined fingerprints. In other examples, the prediction results may be further improved by hyperparameter optimization.
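The network described in this example (two fully connected hidden layers of 2048 and 100 neurons, each followed by dropout with rate 0.1) can be sketched as a plain-Python forward pass. This is only an illustration of the layer structure, not the trained model: the weights are random, no training loop is included, and the ReLU activation, the single linear output unit, the weight initialization, and all function names are assumptions made for the sketch.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights (scaled by fan-in) and zero biases for one dense layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Fully connected layer: weighted sum of the inputs plus bias per neuron."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: active only in training, identity at prediction time."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def build_model(n_inputs, hidden=(2048, 100), seed=0):
    """Hidden layer sizes follow the example; one linear output unit is assumed."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(model, x, rate=0.1, training=False, seed=0):
    """Forward pass: (dense -> ReLU -> dropout) per hidden layer, then linear output."""
    rng = random.Random(seed)
    for layer in model[:-1]:
        x = dropout(relu(dense(x, layer)), rate, rng, training)
    return dense(x, model[-1])[0]
```

In practice such a model would be built and trained with a deep learning framework; the pure-Python version is slow for the full layer sizes and is meant only to make the data flow from the combined fingerprint to a single predicted value explicit.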
In another example, a flow of the prediction method is described by taking a random forest model as an example, as shown in fig. 3, the method specifically includes the following steps:
s1, acquiring a three-dimensional structure (corresponding to the protein and ligand molecule compound in figure 3) of the combination of the protein and the ligand molecule from the PDBbind data set, and storing the three-dimensional structure in a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established in a local database server through data preprocessing so as to store PDB files of protein and ligand molecules.
S2 data in pdb document are analyzed to obtain ligand molecules and protein binding pockets centered around the ligand molecules
Figure BDA0002796321140000081
Amino acid molecules within distance.
S3 calculates MACCS keys for the ligand molecules and binding pocket amino acids, respectively (corresponding to the protein structure, ligand molecule structure codes in fig. 3).
S4 ECFP fingerprints (corresponding to the protein structure, ligand molecule structure codes in FIG. 3) were calculated for the ligand molecules and the amino acids in the binding pocket, respectively.
S5, converting the MACCS key and ECFP fingerprint information of ligand molecules and the MACCS key and ECFP fingerprint information of protein binding pocket amino acids into one-dimensional tensors respectively, and splitting the tensors into a training set and a testing set according to the ratio of 7: 3.
S6, establishing a random forest model in the server, training the random forest model by using the training set, and verifying the trained random forest model by using the testing set until a preset prediction effect is obtained.
S7, calculating a combination constant obtained by predicting the machine learning model, comparing the Pearson correlation coefficient and the absolute error with the experimental result, and verifying the prediction result of the machine learning model.
S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
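The conversion in step S8 can be illustrated with a short sketch. The text does not write out the formula, so this sketch assumes the standard thermodynamic relation ΔG = RT ln K_d (equivalently ΔG = -ln(10)·RT·pK, with pK = -log10 K_d and K_d relative to a 1 M standard state) at an assumed temperature of 298.15 K. The function names and the choice of kcal/mol are illustrative and are not taken from the patent.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
R_KCAL = 1.987204e-3


def binding_free_energy_from_kd(kd_molar, temperature=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L.

    Assumes dG = R*T*ln(K_d), with K_d taken relative to a 1 M standard state.
    """
    if kd_molar <= 0:
        raise ValueError("K_d must be positive")
    return R_KCAL * temperature * math.log(kd_molar)


def binding_free_energy_from_pk(pk, temperature=298.15):
    """Binding free energy in kcal/mol from pK = -log10(K_d)."""
    return -math.log(10.0) * R_KCAL * temperature * pk


# Hypothetical model output: a predicted pK of 6.0 (K_d = 1 micromolar).
print(round(binding_free_energy_from_pk(6.0), 2))
```

Under these assumptions each pK unit corresponds to roughly 1.36 kcal/mol at room temperature, so a predicted pK of 6 maps to about -8.2 kcal/mol.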
In this example, no hyperparameter optimization is performed, and the same hyperparameters are used for the individual MACCS and ECFP fingerprints and for the combined fingerprint. In other examples, the prediction results may be further improved by systematic hyperparameter optimization.
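The combined fingerprint input and the 7:3 split of step S5 can be sketched as follows. This is a minimal illustration only: the concatenation order, the function names, the fixed random seed and the toy bit vectors are assumptions of the sketch, and real MACCS keys and ECFP fingerprints would come from a cheminformatics toolkit rather than being written by hand.

```python
import random


def combine_fingerprints(ligand_maccs, ligand_ecfp, pocket_maccs, pocket_ecfp):
    """Concatenate four bit vectors into one flat (one-dimensional) feature vector.

    The concatenation order is an assumption made for this sketch.
    """
    return [*ligand_maccs, *ligand_ecfp, *pocket_maccs, *pocket_ecfp]


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples reproducibly and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Toy data: short made-up bit vectors stand in for real fingerprints, and the
# labels are placeholder numbers, not measured affinities.
toy_samples = [
    (combine_fingerprints([1, 0], [0, 1, 1], [1, 1], [0, 0, 1]), float(i))
    for i in range(10)
]
train_set, test_set = split_train_test(toy_samples)
print(len(train_set), len(test_set))
```

Training on a single fingerprint type instead of the combined one would only change which vectors are passed to the concatenation step; the split and the model settings stay the same, which matches the like-for-like comparison described above.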
FIG. 4 shows the Pearson coefficient obtained in the example using a deep neural network model with the combined fingerprint, and FIG. 5 shows the Pearson coefficient obtained in the example using a random forest model with the combined fingerprint. With the MACCS, ECFP and combined fingerprints, the mean square deviations between the predicted binding energy and the true value are 1.55, 1.57 and 1.47, respectively, for the deep neural network model, and 1.36, 1.34 and 0.85, respectively, for the random forest model. In both embodiments, the combined fingerprint gives the lowest mean square deviation between the predicted and true values, which effectively improves the calculation accuracy. FIG. 6 shows the distribution of the predicted and true values obtained with the combined fingerprint, where (a) is the deep neural network model and (b) is the random forest model.
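The evaluation quantities mentioned in step S7 and in the results above can be computed as in the following sketch. The text does not give the exact definitions it uses, so the sketch assumes the textbook forms of the Pearson correlation coefficient, the mean absolute error and the mean squared error; the function names and the example numbers are illustrative, not values from the patent.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_m)


def mean_absolute_error(predicted, measured):
    """Average of |predicted - measured|."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def mean_squared_error(predicted, measured):
    """Average of (predicted - measured)^2."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(predicted)


# Made-up predictions and reference values, for illustration only.
y_pred = [5.1, 6.8, 7.9, 4.2]
y_true = [5.0, 7.0, 8.5, 4.0]
print(pearson_r(y_pred, y_true))
print(mean_absolute_error(y_pred, y_true), mean_squared_error(y_pred, y_true))
```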
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for persons skilled in the art, numerous modifications and adaptations can be made without departing from the principle of the present invention, and such modifications and adaptations should be considered as within the scope of the present invention.

Claims (8)

1. A method for predicting the binding free energy of a protein to a ligand molecule, comprising:
S1, acquiring three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established on a local database server through data preprocessing to store the PDB files of the proteins and ligand molecules;
S2, parsing the data in the PDB files to obtain the ligand molecule and, with the ligand molecule as the center, the amino acid molecules of the protein binding pocket within the distance [Figure FDA0002796321130000011];
S3, calculating the MACCS keys of the ligand molecule and of the amino acids in the binding pocket, respectively;
S4, calculating the ECFP fingerprints of the ligand molecule and of the amino acids in the binding pocket, respectively;
S5, converting the MACCS key and ECFP fingerprint information of the ligand molecule and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pocket into one-dimensional tensors, respectively, and encoding all molecules in the local database accordingly to form a training set and a test set;
S6, establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction effect is obtained;
S7, calculating the binding constant predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error between the predicted values and the true values, and verifying the prediction result of the machine learning model;
S8, using the trained machine learning model to predict the binding constant of the protein and the ligand molecule, and converting the binding constant into the binding free energy through the Arrhenius formula, thereby completing the prediction of the binding free energy.
2. The prediction method of claim 1, wherein in step S2, the microenvironment of protein molecules within the distance [Figure FDA0002796321130000012] of the ligand molecule is defined as the ligand binding pocket.
3. A prediction method as claimed in claim 1 or 2, characterized in that in step S6, the machine learning model constructed is a random forest model or a deep neural network model.
4. The prediction method according to claim 3, wherein, when the machine learning model constructed in step S6 is a deep neural network model, the deep neural network model includes two hidden layers, wherein the first hidden layer includes 2048 neurons, the second hidden layer includes 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
5. An apparatus for predicting the binding free energy of a protein to a ligand molecule, comprising:
the local database construction module is used for acquiring a three-dimensional structure combined by protein and ligand molecules from the PDBbind data set and storing the three-dimensional structure in the local server; based on crystal structure data in a PDB database and a PDBbind database, a local database is established in a local database server through data preprocessing so as to store PDB files of protein and ligand molecules;
a data analysis module for parsing the data in the PDB files to obtain the ligand molecule and, with the ligand molecule as the center, the amino acid molecules of the protein binding pocket within the distance [Figure FDA0002796321130000021];
a fingerprint calculation module for calculating MACCS keys of the ligand molecules and the amino acids in the binding pockets respectively; and calculating the ECFP fingerprints of the ligand molecules and the amino acids in the binding pockets respectively;
the information conversion module is used for converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors to form a training set and a test set;
the model training module is used for establishing a machine learning model in the server, training the machine learning model by using a training set and testing the trained machine learning model by using a test set until a preset prediction effect is obtained;
the prediction result verification module is used for calculating the binding free energy predicted by the machine learning model, computing the Pearson correlation coefficient and the absolute error between the predicted values and the experimental results, and verifying the prediction result of the machine learning model; and
the binding energy prediction module is used for predicting the binding constant of the protein and the ligand molecule with the trained machine learning model, and converting the binding constant into the binding free energy through the Arrhenius formula to complete the prediction of the binding free energy.
6. The prediction apparatus of claim 5, wherein the data analysis module is configured to define the microenvironment of protein molecules within the distance [Figure FDA0002796321130000022] of the ligand molecule as the ligand binding pocket.
7. A prediction apparatus as claimed in claim 5 or 6, characterized in that in the model training module, the machine learning model constructed is a random forest model or a deep neural network model.
8. The prediction apparatus according to claim 7, wherein when the machine learning model constructed by the model training module is a deep neural network model, the deep neural network model includes two hidden layers, wherein the first hidden layer includes 2048 neurons, the second hidden layer includes 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
CN202011332960.XA 2020-11-24 2020-11-24 Method and device for predicting binding free energy of protein and ligand molecule Active CN112466410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011332960.XA CN112466410B (en) 2020-11-24 2020-11-24 Method and device for predicting binding free energy of protein and ligand molecule

Publications (2)

Publication Number Publication Date
CN112466410A true CN112466410A (en) 2021-03-09
CN112466410B CN112466410B (en) 2024-02-20

Family

ID=74798301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011332960.XA Active CN112466410B (en) 2020-11-24 2020-11-24 Method and device for predicting binding free energy of protein and ligand molecule

Country Status (1)

Country Link
CN (1) CN112466410B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117037946A (en) * 2022-11-14 2023-11-10 上海微观纪元数字科技有限公司 Method for optimizing structure of compound based on protein binding pocket
CN117037946B (en) * 2022-11-14 2024-05-10 合肥微观纪元数字科技有限公司 Method for optimizing structure of compound based on protein binding pocket

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020055536A1 (en) * 1996-09-26 2002-05-09 Dewitte Robert S. System and method for structure-based drug design that includes accurate prediction of binding free energy
US20050053999A1 (en) * 2000-11-14 2005-03-10 Gough David A. Method for predicting G-protein coupled receptor-ligand interactions
CN102282560A (en) * 2008-12-05 2011-12-14 狄克雷佩特公司 Method for creating virtual compound libraries within markush structure patent claims
US20120053067A1 (en) * 2009-05-04 2012-03-01 University Of Maryland, Baltimore Method for binding site identification by molecular dynamics simulation (silcs: site identification by ligand competitive saturation)
US20130230705A1 (en) * 2012-03-02 2013-09-05 Wisconsin Alumni Research Foundation Patterning in the directed assembly of block copolymers using triblock or multiblock copolymers
CN103869003A (en) * 2012-12-13 2014-06-18 滇虹药业集团股份有限公司 Establishing method of double-solvent fused HPLC fingerprint of medicinal phellodendron and standard fingerprint of medicinal phellodendron
CN106777986A (en) * 2016-12-19 2017-05-31 南京邮电大学 Ligand molecular fingerprint generation method based on depth Hash in drug screening
CN108399316A (en) * 2018-03-02 2018-08-14 南京邮电大学 Ligand molecular Feature Selection device and screening technique in drug design
US20180285731A1 (en) * 2017-03-30 2018-10-04 Atomwise Inc. Systems and methods for correcting error in a first classifier by evaluating classifier output in parallel
US20190050538A1 (en) * 2017-08-08 2019-02-14 International Business Machines Corporation Prediction and generation of hypotheses on relevant drug targets and mechanisms for adverse drug reactions
US20190065667A1 (en) * 2017-07-25 2019-02-28 University Of Massachusetts Medical School Method for Probing at Least One Binding Site of a Protein
US20190272468A1 (en) * 2018-03-05 2019-09-05 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Spatial Graph Convolutions with Applications to Drug Discovery and Molecular Simulation
US20190304568A1 (en) * 2018-03-30 2019-10-03 Board Of Trustees Of Michigan State University System and methods for machine learning for drug design and discovery
CN110910951A (en) * 2019-11-19 2020-03-24 江苏理工学院 Method for predicting protein and ligand binding free energy based on progressive neural network
CN111126554A (en) * 2018-10-31 2020-05-08 深圳市云网拜特科技有限公司 Drug lead compound screening method and system based on generation of confrontation network
CN111292800A (en) * 2020-01-21 2020-06-16 中南大学 Molecular characterization based on predicted protein affinity and application thereof

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DRUGAI: "Nat. Methods | 基于几何深度学习解密蛋白分子表面的相互作用指纹", pages 1 - 14, Retrieved from the Internet <URL:https://blog.csdn.net/u012325865/article/details/105683776> *
H. M. ASHTAWY AND N. R. MAHAPATRA: "A Comparative Assessment of Predictive Accuracies of Conventional and Machine Learning Scoring Functions for Protein-Ligand Binding Affinity Prediction", IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, vol. 12, no. 2, pages 335 - 347, XP011578059, DOI: 10.1109/TCBB.2014.2351824 *
XIE, LIANGXU et al.: "Improvement of prediction performance with conjoint molecular fingerprint in deep learning", FRONTIERS IN PHARMACOLOGY, vol. 11, pages 1 - 15 *
肖高铿: "分子指纹及其在虚拟筛选中的应用", pages 1 - 8, Retrieved from the Internet <URL:http://blog.molcalx.com.cn/2019/01/29/fingerprint.html> *
谢良旭, 薛亮亮, 李峰: "神经网络的深度与宽度对药物分子pKa预测性能影响的研究", 《江苏理工学院学报》, vol. 27, no. 2, pages 1 - 8 *

Also Published As

Publication number Publication date
CN112466410B (en) 2024-02-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Hu Shuai; Xie Liangxu; Xu Xiaojun; Xu Lei
Inventor before: Xie Liangxu; Xu Xiaojun; Xu Lei
GR01 Patent grant