CN112466410A

CN112466410A - Method and device for predicting protein and ligand molecule binding free energy

Info

Publication number: CN112466410A
Application number: CN202011332960.XA
Authority: CN
Inventors: 谢良旭; 许晓军; 许磊
Original assignee: Jiangsu China Israel Industrial Technology Research Institute; Jiangsu University of Technology
Current assignee: Jiangsu China Israel Industrial Technology Research Institute; Jiangsu University of Technology
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2021-03-09
Anticipated expiration: 2040-11-24
Also published as: CN112466410B

Abstract

The invention provides a method and a device for predicting the binding free energy of protein and ligand molecules. The method comprises the following steps: S1, constructing a local database; S2, parsing the data to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule; S3, calculating MACCS keys of the ligand molecules and the adjacent amino acids; S4, calculating ECFP fingerprints of the ligand molecules and the adjacent amino acids; S5, converting the MACCS key and ECFP fingerprint information of the ligand molecules and proteins into one-dimensional tensors to form a training set and a test set; S6, establishing a machine learning model and training it; S7, calculating the Pearson correlation coefficient and the absolute error to verify the prediction result of the machine learning model; S8, using the trained machine learning model to predict the binding constant of the protein and ligand molecule, and calculating the binding free energy. Compared with predictions from a single fingerprint, predictions from the combined fingerprint are markedly better, which improves the accuracy of free energy prediction.

Description

Method and device for predicting protein and ligand molecule binding free energy

Technical Field

The invention relates to the technical field of computer-aided drug screening, in particular to a method and a device for predicting the binding free energy of protein and ligand molecules.

Background

Screening tens of millions of molecules to obtain a final lead molecule is the key problem in drug screening, and free energy calculation is at the core of high-throughput drug screening. With the rapid development of computer technology and computational theory, new drug development methods based on computer-aided drug design have emerged.

Computer-aided drug design is the science of designing drugs with the help of high-performance computers on the basis of physical laws. The application of computers to drug design marks the transition of drug development from blind screening to rational design. As computer performance improves, the demands on the predictive performance for drug molecules keep rising. Traditional calculation methods are limited by their accuracy and computation time and cannot meet the current need to screen massive molecular libraries. Compared with approaches based on empirical formulas, the combination of artificial intelligence methods with big data has greatly changed the field of computer-aided drug design and has become a new drug design approach that surpasses traditional calculation.

In recent years, artificial intelligence techniques have gradually been applied to predicting the binding strength of proteins and drug molecules, including the convolutional neural network (CNN), the support vector machine (SVM), the random forest (RF), and the deep neural network (DNN). With these methods, underlying regularities can be learned from the interaction structures of protein-small molecule complexes, so that the binding free energy of the protein and the ligand molecule can be predicted.

Artificial intelligence assisted drug screening relies on molecular descriptors: the information encoding of biomolecules is the bridge connecting high-performance computers and new drug development, and it occupies an important place in research on AI-assisted drug design. A molecular descriptor is a digitized representation of a molecule that encodes the molecule as a bit string. For example, substructure-based MACCS keys encode a molecule according to the presence or absence of certain substructures or features from a given structural list. The number of bits in such a molecular fingerprint is determined by the number of defined substructures, and each bit records the presence or absence of a given feature in the molecule, so that the molecular structure is converted into a binary number consisting of 0s and 1s. Another type is the topology-based or path-based fingerprint, such as the extended connectivity fingerprint (ECFP), which is generated by analyzing all molecular fragments on paths starting from an atom until a specified number of bonds is reached, and then hashing the identifiers of the substructures on each path. An ECFP fingerprint of radius 2, known as the ECFP4 fingerprint, is often used. This fingerprint is applicable to any molecule, its length can be adjusted, and it can be used for fast substructure search and molecule filtering.
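To make the two descriptor families concrete, the following minimal Python sketch builds a structural-key bit vector and a hashed, fixed-length bit vector. It is a toy illustration only: the feature names in `KEY_LIST`, the helpers `structural_keys` and `hashed_fingerprint`, the fragment strings, and the use of SHA-1 are all assumptions made for demonstration, and they are not the actual MACCS or ECFP definitions.

```python
import hashlib

# Hypothetical predefined feature list (the real MACCS list is much longer).
KEY_LIST = ["carbonyl", "hydroxyl", "amine", "aromatic_ring", "halogen"]


def structural_keys(features_present):
    """One bit per predefined feature: 1 if present in the molecule, else 0."""
    present = set(features_present)
    return [1 if key in present else 0 for key in KEY_LIST]


def hashed_fingerprint(fragment_ids, n_bits=1024):
    """Hash each fragment identifier to a bit position in a fixed-length vector."""
    bits = [0] * n_bits
    for frag in fragment_ids:
        digest = hashlib.sha1(frag.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


keys = structural_keys(["hydroxyl", "aromatic_ring"])
fp = hashed_fingerprint(["C-C", "C-O", "C-C-O", "c1ccccc1"], n_bits=64)
print(keys)     # bit pattern over KEY_LIST
print(sum(fp))  # number of bits set; hash collisions can merge fragments
```

The structural-key vector has a fixed meaning per position, while the hashed vector trades interpretability for an adjustable length, which is the contrast drawn in the paragraph above.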

Although molecular fingerprints have been used in many areas, such as quantitative structure-activity relationships and the prediction of physicochemical properties of drug molecules, calculations are usually based on only one molecular fingerprint. Because a single molecular fingerprint contains only specific structural information, it has certain limitations. In particular, when predicting complex structures such as protein-ligand complexes (where the protein and ligand molecule are encoded as one structure), the following problems may occur: 1) fingerprints are processed differently for different purposes, and some information is likely to be lost when the structure is converted into a molecular fingerprint; 2) the protein and the ligand molecule are encoded only as a whole, and the specificity of their respective spatial structures is ignored. As a result, the accuracy of molecular property prediction still needs to be improved.

Disclosure of Invention

In view of the above problems, the invention provides a method and a device for predicting the binding free energy of protein and ligand molecules, which effectively solve the technical problem of the low prediction accuracy of existing methods.

The technical scheme provided by the invention is as follows:

A method for predicting the binding free energy of a protein and a ligand molecule, comprising:

S1, acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server; based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules;

S2, parsing the data in the PDB files to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule;

S3, calculating MACCS keys of the ligand molecules and of the amino acids in the binding pockets, respectively;

S4, calculating ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets, respectively;

S5, converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors, respectively, and encoding all molecules in the local database accordingly to form a training set and a test set;

S6, establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained;

S7, calculating the binding constants predicted by the machine learning model, computing the Pearson correlation coefficient and the absolute error against the experimentally measured results, and verifying the prediction result of the machine learning model;

S8, using the trained machine learning model to predict the binding constant of the protein and ligand molecule, and converting the binding constant into the binding free energy by the Arrhenius formula, thereby completing the prediction of the binding free energy.

The invention also provides a device for predicting the binding free energy of protein and ligand molecules, which comprises:

a local database construction module, used for acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server, and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules;

a data analysis module, used for parsing the data in the PDB files to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule;

a fingerprint calculation module, used for calculating MACCS keys of the ligand molecules and of the amino acids in the binding pockets, respectively, and for calculating ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets, respectively;

an information conversion module, used for converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors, respectively, and encoding all molecules in the local database accordingly to form a training set and a test set;

a model training module, used for establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained;

a prediction result verification module, used for calculating the binding free energy predicted by the machine learning model, computing the Pearson correlation coefficient and the absolute error against the experimental results, and verifying the prediction result of the machine learning model; and

a binding energy prediction module, used for predicting the binding constant of the protein and ligand molecule with the trained machine learning model, and converting the binding constant into the binding free energy by the Arrhenius formula, thereby completing the prediction of the binding free energy.

The method and the device for predicting the free energy of the combination of the protein and the ligand molecule, provided by the invention, can at least bring the following beneficial effects:

1) Drug molecules usually bind to cavities on the surface of or inside proteins, sites commonly referred to as substrate binding pockets. The type and position of the amino acids near these sites determine the physicochemical properties of the protein-small molecule binding interaction; in particular, within the ligand binding pocket, the protein structure closest to the ligand determines the specificity and strength of binding. The invention therefore encodes only the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule, which makes the calculation more targeted and reduces the amount of computation.

2) Unlike methods that merge the protein and ligand molecule structures and convert the merged structure into a single molecular fingerprint, the invention processes the protein structure and the ligand molecule structure separately and then combines their fingerprints. This preserves more of the specificity of the protein structure and of the ligand molecule and, especially when a protein binding pocket or compound molecule in the training data is structurally similar to the structure to be predicted, helps improve the generalization ability of the machine learning model.

3) By encoding the protein and the ligand molecule separately with two encoding schemes, the advantages of the different molecular fingerprints are exploited and their information is complementary. For example, the MACCS key contains atom types, bond types and the like, but no topological structure information, whereas the ECFP fingerprint is a topological fingerprint along chemical bonds and contains specific structural information around each atom. The different molecular fingerprints are verified with machine learning methods such as a neural network or a random forest.

Drawings

The foregoing features, technical features, advantages and embodiments are further described in the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.

FIG. 1 is a schematic flow chart of a method for predicting the binding free energy of a protein and a ligand molecule according to the present invention;

FIG. 2 is a schematic diagram of an application of the combined fingerprint in an example of a deep neural network model according to the present invention;

FIG. 3 is a schematic diagram of an application of the combined fingerprint in an example of a random forest model according to the present invention;

FIG. 4 illustrates Pearson coefficients predicted using deep neural network model instances with combined fingerprints according to the present invention;

FIG. 5 illustrates Pearson coefficients predicted using random forest model instances with combined fingerprints;

FIG. 6 is a distribution comparison diagram of predicted values and true values of combined fingerprints according to an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

Fig. 1 is a schematic flow chart of a method for predicting the binding free energy of a protein and a ligand molecule provided by the present invention, and as can be seen from the figure, the method comprises:

s1, acquiring a three-dimensional structure of the combination of the protein and the ligand molecule from the PDBbind data set, and storing the three-dimensional structure in a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established in a local database server through data preprocessing so as to store PDB files of protein and ligand molecules.

The data are taken from the PDBbind data set, which comprises two subsets of differing accuracy, the refined set and the general set. The refined set (4852 protein-small molecule complex structures) contains PDB files of complexes with high structural resolution and accurate experimentally measured inhibition constants; the general set contains PDB files of complexes of varying resolution. For predicting the binding energy of proteins and small molecules, either the refined set or the general set can be used, selected as required.

S2: the data in the PDB files are parsed to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule. Here, the protein microenvironment lying within that distance of the ligand molecule is defined as the ligand binding pocket.
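The pocket selection in step S2 can be sketched as a simple distance filter. The Python example below is a minimal sketch under stated assumptions: the function `select_pocket_residues`, the residue labels, the coordinates, and the cutoff value of 5.0 are all hypothetical placeholders, and the coordinates are assumed to have already been read from a PDB file.

```python
import math


def select_pocket_residues(protein_atoms, ligand_coords, cutoff):
    """Return IDs of residues with at least one atom within `cutoff` of a ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples.
    ligand_coords: list of (x, y, z) tuples.
    cutoff: distance threshold in the units of the coordinates (assumed value).
    """
    pocket = []
    seen = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in seen:
            continue  # residue already selected through another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            seen.add(residue_id)
            pocket.append(residue_id)
    return pocket


# Hypothetical coordinates in angstroms, for illustration only.
protein = [
    ("ASP25", (1.0, 0.0, 0.0)),
    ("ASP25", (2.0, 0.0, 0.0)),
    ("GLY27", (6.0, 0.0, 0.0)),
    ("ILE50", (0.0, 30.0, 0.0)),
]
ligand = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0)]
pocket = select_pocket_residues(protein, ligand, cutoff=5.0)
print(pocket)
```

A residue is kept as soon as any one of its atoms falls within the cutoff of any ligand atom, so distant residues such as the third one in the toy data never reach the fingerprint step.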

S3 calculates MACCS keys for the ligand molecules and the amino acids within the binding pocket, respectively.

S4 ECFP fingerprints for the ligand molecules and the amino acids in the binding pocket are calculated, respectively.

S5, converting the MACCS key and ECFP fingerprint information of ligand molecules and the MACCS key and ECFP fingerprint information of protein binding pocket amino acids into one-dimensional tensors respectively, and correspondingly encoding all molecules in a local database to form a training set and a test set.

The ratio of the training set to the test set can be chosen according to actual needs; for example, the refined data set may be split into a training set and a test set in a ratio of 8:2. To feed the data into a machine learning model, the MACCS key and ECFP fingerprint information of the ligand molecules and of the amino acids in the protein binding pockets is hashed and processed into fixed-length data, for example of length 512 or 1024. In this process, the combined fingerprint converts the PDB files of the protein-small molecule complexes into numerical codes that a computer can process, so that the structural information of the complexes can be handled more effectively.

In this step, MACCS keys and ECFP fingerprints are computed separately for the ligand molecule and for the amino acids in the protein binding pocket, so that the subsequent binding energy prediction draws fully on the strengths of both encodings and on the complementarity of their information. MACCS predefines 166 keys covering element types and specified bonding patterns; ECFP draws a circle around a selected atom based on its connectivity and assigns a code to the substructure inside the circle, capturing the connections between the atom and its surroundings. Together they retain atom-type and substructure information as well as the topological information around each atom.
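As a rough illustration of step S5, the sketch below concatenates the four bit vectors into one flat feature list and splits a data set in an 8:2 ratio. It is a minimal sketch, not the patented implementation: the helper names `combine_fingerprints` and `split_dataset`, the tiny bit vectors, the label value, and the fixed random seed are all assumptions made for demonstration.

```python
import random


def combine_fingerprints(lig_maccs, lig_ecfp, pocket_maccs, pocket_ecfp):
    """Concatenate the four bit vectors into one flat, one-dimensional list."""
    return list(lig_maccs) + list(lig_ecfp) + list(pocket_maccs) + list(pocket_ecfp)


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle samples reproducibly and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


# Tiny made-up bit vectors; real MACCS and ECFP vectors are far longer.
x = combine_fingerprints([1, 0], [0, 1, 1], [1, 1], [0, 0, 1])
dataset = [(x, 6.2)] * 10  # (features, binding constant) pairs, hypothetical label
train, test = split_dataset(dataset, train_fraction=0.8)
print(len(x), len(train), len(test))
```

Keeping the ligand and pocket blocks at fixed positions in the concatenated vector means each input position always refers to the same fingerprint bit of the same partner, which is what lets a model treat the two structures separately.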

S6: a machine learning model is established on the server, trained with the training set, and tested with the test set until a preset prediction performance is obtained. The type of machine learning model can be chosen according to the actual situation, for example a random forest model, a deep neural network model, an XGBoost model, or a support vector machine model. During training, the MACCS key and ECFP fingerprint information of the ligand molecules and of the amino acids in the protein binding pockets is used as input, and the binding free energy is used as output. Activation functions include, but are not limited to, ReLU. The deep neural network model can learn the encoded information effectively, extract latent regularities from the molecular fingerprints, and establish a nonlinear relation between the combined fingerprint and the binding free energy by nonlinear fitting.

S7: the binding constants predicted by the machine learning model are calculated and compared with the experimental results by means of the Pearson correlation coefficient and the absolute error, verifying the prediction result of the machine learning model. Specifically, the fit $r^2$ between the predicted values and the true values is calculated by the following formula:

$$r^2 = \frac{\left[\sum_i \left(pK_i^{exp} - \overline{pK^{exp}}\right)\left(pK_i^{cal} - \overline{pK^{cal}}\right)\right]^2}{\sum_i \left(pK_i^{exp} - \overline{pK^{exp}}\right)^2 \, \sum_i \left(pK_i^{cal} - \overline{pK^{cal}}\right)^2}$$

where $pK^{exp}$ is the experimentally determined binding constant, $\overline{pK^{exp}}$ is the mean of the experimentally determined binding constants, $pK^{cal}$ is the binding constant predicted by the present method, and $\overline{pK^{cal}}$ is the mean of the binding constants predicted by the present method.
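The verification in step S7 can be sketched in a few lines of Python. The example below is a minimal sketch using the standard definitions of the Pearson correlation coefficient and the mean absolute error; the function names and the five pairs of binding constants are made-up illustrative values, not data from the invention.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predicted and true values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical experimental and predicted binding constants.
pk_exp = [4.0, 5.5, 6.0, 7.5, 9.0]
pk_cal = [4.5, 5.0, 6.5, 7.0, 8.5]
r = pearson_r(pk_exp, pk_cal)
print(round(r ** 2, 3), mean_absolute_error(pk_exp, pk_cal))
```

The two metrics answer different questions: the correlation measures whether the predictions rank the complexes correctly, while the absolute error measures how far each prediction is from the measured value.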

S8: the trained machine learning model is used to predict the binding constant of the protein and ligand molecule, and the binding constant is converted into the binding free energy by the Arrhenius formula, in which pK^cal is the binding constant predicted by the present method, ΔG is the binding free energy, k_B is the Boltzmann constant, and T is the absolute temperature.
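The conversion formula itself is not reproduced in this text, so the following Python sketch rests on an explicit assumption: that the predicted quantity is pK = -log10(K_d) with a 1 M standard state, and that the standard thermodynamic relation ΔG = RT ln(K_d) = -RT ln(10) pK applies. The molar gas constant R (equal to Avogadro's number times k_B) is used so that the result is per mole; the function name `pk_to_delta_g` and the default temperature of 298.15 K are illustrative choices.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol*K); R = N_A * k_B


def pk_to_delta_g(pk, temperature=298.15):
    """Convert pK = -log10(K_d) into a binding free energy in kJ/mol.

    Assumes dG = R*T*ln(K_d) = -R*T*ln(10)*pK with a 1 M standard state.
    """
    return -R_KJ_PER_MOL_K * temperature * math.log(10.0) * pk


dg = pk_to_delta_g(6.0)  # hypothetical predicted pK
print(f"{dg:.2f} kJ/mol, {dg / 4.184:.2f} kcal/mol")
```

Under this assumption the relation is linear, so each additional pK unit makes the binding free energy more negative by the same amount at a fixed temperature.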

Correspondingly, the invention also provides a device for predicting the binding free energy of protein and ligand molecules, which applies the above prediction method and comprises: a local database construction module, used for acquiring the three-dimensional structures of protein-ligand complexes from the PDBbind data set and storing them on a local server, and, based on the crystal structure data in the PDB database and the PDBbind database, establishing a local database on a local database server through data preprocessing, so as to store the PDB files of the proteins and ligand molecules; a data analysis module, used for parsing the data in the PDB files to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule; a fingerprint calculation module, used for calculating MACCS keys of the ligand molecules and of the amino acids in the binding pockets, respectively, and for calculating ECFP fingerprints of the ligand molecules and of the amino acids in the binding pockets, respectively; an information conversion module, used for converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors, respectively, to form a training set and a test set; a model training module, used for establishing a machine learning model on the server, training the machine learning model with the training set, and testing the trained machine learning model with the test set until a preset prediction performance is obtained; a prediction result verification module, used for calculating the binding free energy predicted by the machine learning model, computing the Pearson correlation coefficient and the absolute error against the experimental results, and verifying the prediction result of the machine learning model; and a binding energy prediction module, used for predicting the binding free energy of the protein and ligand molecules with the trained machine learning model.

In an example, a flow of the prediction method is described by taking a deep neural network model as an example, as shown in fig. 2, the method specifically includes the following steps:

s1, acquiring a three-dimensional structure (corresponding to the protein and ligand molecule compound in figure 2) of the combination of the protein and the ligand molecule from the PDBbind data set, and storing the three-dimensional structure in a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established in a local database server through data preprocessing so as to store PDB files of protein and ligand molecules.

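S2: the data in the PDB files are parsed to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule.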

S3 calculates MACCS keys for the ligand molecules and binding pocket amino acids, respectively (corresponding to the protein structure, ligand molecule structure codes in fig. 2).

S4 ECFP fingerprints (corresponding to the protein structure, ligand molecule structure codes in FIG. 2) were calculated for the ligand molecules and the amino acids in the binding pocket, respectively.

S5: the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets are converted into one-dimensional tensors, all molecules in the local database are encoded accordingly to form the combined fingerprint of the protein and ligand molecules, and the combined fingerprints are split into a training set and a test set in a ratio of 8:2.

S6: a deep neural network model is established on the server, trained with the training set, and verified with the test set until a preset prediction performance is obtained. Specifically, the constructed deep neural network model comprises two hidden layers: the first hidden layer (a fully connected layer) has 2048 neurons, the second hidden layer (a fully connected layer) has 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.
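The layer structure described above (two fully connected hidden layers, each followed by dropout, and one regression output) can be sketched in plain Python. This is a minimal forward-pass sketch only, with no training loop: the ReLU activation, the He-style weight initialization, the inverted-dropout formulation, the helper names, and the input length of 16 are assumptions, and the demo uses small hidden sizes for speed where the text specifies 2048 and 100 neurons.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dense(v, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, biases)]


def dropout(v, rate, rng, training):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(v)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in v]


def init_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5  # He-style scaling, an assumed choice
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_in, hidden=(2048, 100), seed=0):
    """Create weights for input -> hidden layers -> single output."""
    rng = random.Random(seed)
    sizes = [n_in, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(model, x, rate=0.1, training=False, seed=0):
    rng = random.Random(seed)
    h = list(x)
    for weights, biases in model[:-1]:
        h = dropout(relu(dense(h, weights, biases)), rate, rng, training)
    weights, biases = model[-1]
    return dense(h, weights, biases)[0]  # single regression output


# Small hidden sizes keep this demo fast; the text specifies 2048 and 100.
model = build_model(n_in=16, hidden=(32, 8))
x = [1.0 if i % 3 == 0 else 0.0 for i in range(16)]
print(forward(model, x))
```

Dropout is active only when `training=True`; at prediction time every unit is kept, so repeated calls on the same input return the same value.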

S7: the binding constants predicted by the machine learning model are calculated and compared with the experimental results by means of the Pearson correlation coefficient and the absolute error, verifying the prediction result of the machine learning model.

In this example, no hyperparameter optimization is performed, and the same hyperparameters are used for the MACCS key alone, the ECFP fingerprint alone, and the combined fingerprint. In other examples, the prediction results may be further improved by systematic hyperparameter optimization.

In another example, a flow of the prediction method is described by taking a random forest model as an example, as shown in fig. 3, the method specifically includes the following steps:

s1, acquiring a three-dimensional structure (corresponding to the protein and ligand molecule compound in figure 3) of the combination of the protein and the ligand molecule from the PDBbind data set, and storing the three-dimensional structure in a local server; based on the crystal structure data in the PDB database and the PDBbind database, a local database is established in a local database server through data preprocessing so as to store PDB files of protein and ligand molecules.

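S2: the data in the PDB files are parsed to obtain the ligand molecules and the amino acid molecules within a set distance in the protein binding pocket centered on the ligand molecule.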

S3 calculates MACCS keys for the ligand molecules and binding pocket amino acids, respectively (corresponding to the protein structure, ligand molecule structure codes in fig. 3).

S4 ECFP fingerprints (corresponding to the protein structure, ligand molecule structure codes in FIG. 3) were calculated for the ligand molecules and the amino acids in the binding pocket, respectively.

S5: the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets are converted into one-dimensional tensors, respectively, and split into a training set and a test set in a ratio of 7:3.

S6, establishing a random forest model in the server, training the random forest model by using the training set, and verifying the trained random forest model by using the testing set until a preset prediction effect is obtained.

Fig. 4 shows the Pearson coefficients predicted in the example using a deep neural network model with the combined fingerprint, and Fig. 5 shows the Pearson coefficients predicted in the example using a random forest model with the combined fingerprint. In addition, with the deep neural network model, the mean square deviations between the predicted binding energy and the true value when using the MACCS key, the ECFP fingerprint and the combined fingerprint are 1.55, 1.57 and 1.47, respectively; with the random forest model, the corresponding mean square deviations are 1.36, 1.34 and 0.85. In both embodiments, the combined fingerprint reduces the mean square deviation between the predicted value and the true value and effectively improves the calculation accuracy. Fig. 6 shows the distributions of the predicted values and the true values for the combined fingerprint, where (a) is the distribution for the deep neural network model and (b) is the distribution for the random forest model.

It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for persons skilled in the art, numerous modifications and adaptations can be made without departing from the principle of the present invention, and such modifications and adaptations should be considered as within the scope of the present invention.

Claims

1. A method for predicting the binding free energy of a protein to a ligand molecule, comprising:

Amino acid molecules within distance;

S7, calculating the binding constants predicted by the machine learning model, calculating the Pearson correlation coefficient and the absolute error between the predicted values and the true values, and verifying the prediction result of the machine learning model;

2. The prediction method of claim 1, wherein in step S2, the distance from the protein to the ligand molecule is determined

3. A prediction method as claimed in claim 1 or 2, characterized in that in step S6, the machine learning model constructed is a random forest model or a deep neural network model.

4. The prediction method according to claim 3, wherein, when the machine learning model constructed in step S6 is a deep neural network model, the deep neural network model includes two hidden layers, wherein a first hidden layer includes 2048 neurons, a second hidden layer includes 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.

5. An apparatus for predicting the binding free energy of a protein to a ligand molecule, comprising:

Amino acid molecules within distance;

the information conversion module is used for converting the MACCS key and ECFP fingerprint information of the ligand molecules and the MACCS key and ECFP fingerprint information of the amino acids in the protein binding pockets into one-dimensional tensors to form a training set and a test set;

and the binding energy prediction module is used for predicting the binding constant of the protein and the ligand molecule by using the trained machine learning model, and converting the binding constant into the binding free energy through the Arrhenius formula to complete the prediction of the binding free energy.

6. The prediction apparatus of claim 5, wherein the data analysis module is configured to determine the distance from the protein to the ligand molecule

7. A prediction apparatus as claimed in claim 5 or 6, characterized in that in the model training module, the machine learning model constructed is a random forest model or a deep neural network model.

8. The prediction apparatus according to claim 7, wherein when the machine learning model constructed by the model training module is a deep neural network model, the deep neural network model includes two hidden layers, wherein a first hidden layer includes 2048 neurons, a second hidden layer includes 100 neurons, and each hidden layer is followed by a dropout layer with a dropout rate of 0.1.