
CN110910951B - Method for predicting free energy of protein and ligand binding based on progressive neural network

Info

Publication number: CN110910951B
Application number: CN201911135246.9A
Authority: CN
Inventors: 谢良旭; 孟黎
Original assignee: Jiangsu University of Technology
Current assignee: Jiangsu University of Technology
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2023-07-07
Anticipated expiration: 2039-11-19
Also published as: CN110910951A

Abstract

The invention discloses a method for predicting the binding free energy of a protein and a ligand based on a progressive neural network, belonging to the technical field of computer-aided drug design. The method comprises: obtaining pdb files from the PDBbind database and establishing a local database; taking the ligand molecule as the center, obtaining the amino acid molecules within a distance of 4.5 angstroms in the protein binding pocket; performing extended connectivity fingerprint calculation; performing SPLIF fingerprint calculation; searching for the number of salt bridges and hydrogen bonds between the protein and the ligand molecule; converting the structural information of the protein and the ligand into one-dimensional tensors; establishing a training set, a validation set and a test set; training the progressive neural network with the training set; and optimizing and searching for hyperparameters for prediction. Compared with molecular docking results, the method obtains a higher Pearson correlation coefficient. It solves the technical problem of how to convert the three-dimensional structures of protein and ligand molecules into tensors that are easy for a computer to process and to input them into a progressive neural network for training and optimization, and it greatly improves calculation speed and prediction accuracy.

Description

Method for predicting free energy of protein and ligand binding based on progressive neural network

Technical Field

The invention belongs to the technical field of computer-aided drug design, and relates to a method for predicting the binding free energy of a protein and a ligand based on a progressive neural network.

Background

Free energy calculation is a key area of drug design and a core technology for high-throughput drug screening. Traditional experimental methods based on random screening suffer from long development cycles, high costs, and the blindness of selecting among massive numbers of drug molecules. How to screen tens of millions of molecules to arrive at a final drug molecule is a problem that drug screening must face. With the rapid development of computer technology and computational theory, a new approach to drug development, computer-aided drug design, has emerged. It exploits the speed of computers and combines physical models of the interaction between biological receptor macromolecules and small-molecule drug ligands; drawing on the structural and energetic complementarity between receptor and ligand, it guides and assists the design of drug molecules through theoretical models and numerical calculation. The most common of these methods include molecular docking, force-field-based free energy calculation, machine learning, and deep learning.

Computer-aided drug design has attracted attention in new drug development. Molecular docking is an empirical method that achieves rapid binding free energy calculation by fitting the interaction energy between the protein receptor and the ligand, and it plays an important role in early high-throughput drug screening. However, because molecular docking uses an empirical function, it cannot treat the fine structures of the protein and the ligand specifically. In particular, since the induced-fit model was proposed, it has been recognized that proteins and ligands undergo large conformational changes during binding, so purely static docking has inherent shortcomings and cannot provide accurate free energy calculations.

Free energy calculation methods based on physical models use a force field to describe the interaction energy between protein and ligand and can be combined with molecular dynamics simulation to compute the dynamics of each protein and ligand molecule. They include free energy perturbation, thermodynamic integration, and calculation of the free energy change during protein-ligand binding based on the Molecular Mechanics Poisson-Boltzmann Surface Area (MM-PBSA) model. Free energy perturbation and thermodynamic integration establish a thermodynamic cycle, which allows the binding free energy in different configurations to be studied. These methods are more accurate than molecular docking, but their high demand for computing resources limits their application in high-throughput screening.

Machine learning is a popular computational approach that is widely applied to calculating the binding free energy of receptors and ligands, using methods such as random forests, support vector machines, and neural networks. The free energies predicted by these methods correlate well with experimentally obtained free energies. However, all of these methods require high-quality databases and manually designed feature extraction. When the training data contain unusual features, or the training set does not cover the features of the test set, a machine learning method cannot generalize well. Generalization refers to the extent to which the concepts learned by a model apply to data the model did not see during training. High-quality databases and efficient feature extraction methods still require further development.

Deep learning, as an emerging approach, has moved from largely theoretical research to real-world applications. New deep learning algorithms build powerful models from data and can compute features automatically, which favors wider use in pharmaceutical companies. While many industries have already adopted deep learning, adoption in the pharmaceutical industry has lagged. The Pande group at Stanford University extended multitask deep networks to the drug design process, achieving better generalization than traditional single-task neural networks. Deep learning therefore has important application prospects in drug research and development. The progressive neural architecture search technique (Progressive Neural Architecture Search) was proposed jointly by several researchers, including Chenxi Liu, a doctoral student at Johns Hopkins University, Alan Yuille, and Fei-Fei Li and Jia Li of Google AI. The method has mainly been applied to image recognition; the architecture is reported to compute 8 times faster than an ordinary neural network with a 5-fold improvement in efficiency, and the model obtained by automatic search achieved the highest accuracy at the time on the large-scale ImageNet dataset. The progressive neural network architecture has considerable development potential and is expected to be usable for predicting the free energy of protein and ligand molecules.

In summary, when calculating the binding free energy of proteins and ligand molecules, the existing general molecular docking methods and the free energy calculation methods based on physical models suffer from problems such as low accuracy, long computation times, and poor correlation of the calculated results.

Disclosure of Invention

The invention aims to provide a method for predicting the binding free energy of a protein and a ligand based on a progressive neural network, which solves the technical problem of how to convert the three-dimensional structures of protein and ligand molecules into tensors that are easy for a computer to process and to input them into a progressive neural network for training and optimization.

In order to achieve the above purpose, the invention adopts the following technical scheme:

a method for predicting free energy of protein binding to ligand based on a progressive neural network, comprising the steps of:

step 1: establishing a database server, a structure information processing server and a neural network training server, wherein the database server, the structure information processing server and the neural network training server are communicated with each other through the Internet;

based on crystal structure data in the PDB database and the PDBbind database, establishing a PDBLig database in a local database server through data preprocessing, and storing PDB files of proteins and ligand molecules;

step 2: a fingerprint calculation module and a one-dimensional tensor module are established in a structure information processing server;

the structural information processing server acquires the pdb file of the protein and ligand molecules from the PDBLig database;

step 3: the fingerprint calculation module analyzes the data in the pdb file to obtain amino acid molecules in a distance of 4.5 angstroms in a protein binding pocket by taking a ligand molecule as a center;

step 4: the fingerprint calculation module is used for respectively carrying out extended connectivity fingerprint calculation on the ligand molecules and the amino acids in the binding pocket;

step 5: the fingerprint calculation module carries out SPLIF fingerprint calculation on the ligand molecules and the amino acids in the binding pocket;

step 6: the one-dimensional tensor module searches the number of salt bridges and hydrogen bonds between the protein and the ligand molecules;

step 7: according to the results obtained in the steps 4 to 5, the structural information of the protein and the ligand is converted into a one-dimensional tensor;

step 8: the neural network training server divides the Refined and General subsets in the PDBLig database into a training set, a validation set and a test set according to the ratio of 8:1:1;

step 9: establishing a progressive neural network in the neural network training server, and training the progressive neural network by using the training set;

step 10: validating the trained progressive neural network by using the validation set until a preset prediction effect is obtained;

step 11: setting and optimizing a group of hyperparameters, and predicting the test set by using this group of hyperparameters;

step 12: calculating the Pearson correlation coefficient of the predicted binding free energy and comparing it with that of the molecular docking result;

step 13: the neural network training server constructs receiver operating characteristic curves for the free energy results obtained by molecular docking and by the trained progressive neural network on the protein and ligand molecular structures in the test subset.
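The 8:1:1 division in step 8 can be sketched as follows. This is a minimal illustration only: the source does not say whether the structures are shuffled before splitting, so the seeded random shuffle and the function name split_8_1_1 are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_8_1_1(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Split items into training, validation and test sets in an 8:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # assumed: random assignment
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # remainder goes to the test set
    return train, valid, test


# Integer indices stand in for the 3568 Refined structures.
train, valid, test = split_8_1_1(range(3568))
print(len(train), len(valid), len(test))
```

Integer arithmetic is used for the set sizes so that rounding never drops a structure; any remainder falls into the test set.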

Preferably, in performing step 3, the pdb file includes data of a protein-ligand complex structure.

Preferably, in executing step 11, a grid search method is used to determine the range of the hyperparameters of the progressive neural network.

Preferably, in executing step 9, the neural network is trained using grid search to optimize the learning rate, the number of iterations, the batch size, the activation function, and the hidden-layer parameters.

Preferably, in performing step 12, the binding free energy is converted to a binding constant by the Arrhenius equation:

$$\log K_a = -\frac{\Delta G}{2.303\, k_B T}$$

wherein logKa is the binding constant, ΔG is the binding free energy, k_B is the Boltzmann constant, and T is the absolute temperature.
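As a numerical illustration of converting between binding free energy and binding constant, the sketch below uses the standard thermodynamic relation ΔG = -RT ln Ka. It assumes ΔG is given per mole in kcal/mol, so the Boltzmann constant is replaced by the molar gas constant R = k_B·N_A; the function names and the default temperature of 298.15 K are illustrative assumptions, not values from the source.

```python
import math

R_KCAL = 1.987204e-3   # molar gas constant in kcal/(mol*K), i.e. k_B * N_A
T_DEFAULT = 298.15     # kelvin; assumed room temperature


def dg_to_log_ka(delta_g: float, temperature: float = T_DEFAULT) -> float:
    """Return log10(Ka) for a binding free energy in kcal/mol."""
    return -delta_g / (math.log(10.0) * R_KCAL * temperature)


def log_ka_to_dg(log_ka: float, temperature: float = T_DEFAULT) -> float:
    """Inverse conversion: binding free energy in kcal/mol from log10(Ka)."""
    return -math.log(10.0) * R_KCAL * temperature * log_ka


log_ka = dg_to_log_ka(-8.0)   # a more negative dG gives a larger binding constant
print(round(log_ka, 2))
```

Under these assumptions, roughly 1.36 kcal/mol of binding free energy corresponds to one unit of logKa at room temperature.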

The method for predicting the binding free energy of the protein and the ligand based on the progressive neural network solves the technical problems that the three-dimensional structure of the protein and the ligand molecule is converted into tensor which is easy to calculate by a computer and is input into the progressive neural network for training and optimizing. Because the progressive neural network is only nonlinear numerical calculation, compared with a free energy calculation method based on a physical model, the calculation speed is greatly increased.

Drawings

FIG. 1 is a schematic diagram of three neural networks according to the present invention;

FIG. 2 is a schematic convergence diagram of the loss function during the hyperparameter optimization process of the present invention;

FIG. 3 is a schematic diagram showing the comparison of the predicted free energy of binding and experimental free energy of the progressive neural network and molecular docking method of the present invention;

FIG. 4 is a graph of receiver operating characteristics of the progressive neural network and molecular docking method of the present invention.

Detailed Description

A method for predicting the binding free energy of a protein and a ligand based on a progressive neural network, as shown in FIGS. 1-4, comprises the following steps:

based on crystal structure data in the PDB database and the PDBbind database, a PDBLig database is established in a local database server through data preprocessing and is used for storing the PDB file of the protein and ligand molecules.

The data preprocessing process includes downloading data from the PDB and PDBbind databases and deleting files that have incomplete structures, lack ligand molecules, or lack experimental binding constants.
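The three deletion criteria can be sketched as a simple filter. The record type Entry and its field names are hypothetical, introduced only for illustration; the source does not describe how completeness or ligand presence is detected in the files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """One downloaded complex; all field names are hypothetical."""
    pdb_id: str
    structure_complete: bool            # structure has no missing parts
    has_ligand: bool                    # a ligand molecule is present
    binding_constant: Optional[float]   # experimental value, None if absent


def keep(entry: Entry) -> bool:
    """An entry survives only if none of the three deletion criteria applies."""
    return (entry.structure_complete
            and entry.has_ligand
            and entry.binding_constant is not None)


def preprocess(entries: List[Entry]) -> List[Entry]:
    return [e for e in entries if keep(e)]


raw = [
    Entry("aaaa", True, True, 6.2),
    Entry("bbbb", False, True, 5.1),   # incomplete structure: dropped
    Entry("cccc", True, False, 7.0),   # no ligand: dropped
    Entry("dddd", True, True, None),   # no binding constant: dropped
]
pdblig = preprocess(raw)
print([e.pdb_id for e in pdblig])
```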

In this embodiment, the PDBbind database classifies pdb files by resolution, and this embodiment adopts the same classification, dividing the database into two subsets, Refined and General, to obtain pdb files of different resolution. After data preprocessing, the Refined subset contains 3568 structures and the General subset contains 11303 structures.

In step 3, the fingerprint calculation module uses Python to calculate the distances between the protein and the ligand molecule, obtaining the ligand molecule and the amino acid molecules within a distance of 4.5 angstroms around it.
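A minimal sketch of this distance-based selection is given below. It assumes that a residue belongs to the pocket when at least one of its atoms lies within 4.5 angstroms of any ligand atom (the source does not say which atoms are compared), and the coordinates and residue names are made up for illustration.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
CUTOFF = 4.5  # angstroms, from the text


def pocket_residues(ligand_atoms: List[Coord],
                    residue_atoms: Dict[str, List[Coord]],
                    cutoff: float = CUTOFF) -> List[str]:
    """Return residues with at least one atom within cutoff of any ligand atom."""
    selected = []
    for name, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            selected.append(name)
    return selected


ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residues = {
    "ASP25": [(4.0, 0.0, 0.0)],      # 2.5 A from the second ligand atom
    "GLY48": [(0.0, 4.0, 0.0)],      # 4.0 A from the first ligand atom
    "LEU90": [(10.0, 10.0, 10.0)],   # far from the ligand
}
pocket = pocket_residues(ligand, residues)
print(pocket)
```

For real structures the pairwise loop would normally be replaced by a spatial index, but the selection rule stays the same.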

In step 4, the extended connectivity fingerprint (ECFP) calculation is carried out by a Python script that calls a command in RDKit or DeepChem; this embodiment takes an RDKit script as an example and inputs the command allchem.
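To illustrate the idea behind an extended connectivity fingerprint without depending on a chemistry toolkit, the following toy sketch iteratively hashes each atom's neighbourhood and folds the resulting identifiers into a fixed-length bit vector. It is not RDKit's implementation: the atom invariant is reduced to element and degree, duplicate-substructure removal is omitted, and the function toy_ecfp and its parameters are assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Set, Tuple


def _h(text: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() is salted per run)."""
    return zlib.crc32(text.encode("utf-8"))


def toy_ecfp(elements: List[str],
             bonds: List[Tuple[int, int, int]],
             radius: int = 2,
             n_bits: int = 1024) -> List[int]:
    """Fold circular atom-environment identifiers into a bit vector."""
    neighbours: Dict[int, List[Tuple[int, int]]] = {
        i: [] for i in range(len(elements))}
    for a, b, order in bonds:
        neighbours[a].append((order, b))
        neighbours[b].append((order, a))

    # Iteration 0: one identifier per atom from a reduced invariant.
    ids = [_h(f"{el}|{len(neighbours[i])}") for i, el in enumerate(elements)]
    features: Set[int] = set(ids)

    # Each iteration widens the environment by one bond.
    for it in range(1, radius + 1):
        new_ids = []
        for i in range(len(elements)):
            env = sorted((order, ids[j]) for order, j in neighbours[i])
            new_ids.append(_h(f"{it}|{ids[i]}|{env}"))
        ids = new_ids
        features.update(ids)

    bits = [0] * n_bits
    for f in features:
        bits[f % n_bits] = 1
    return bits


# Heavy atoms of ethanol, C-C-O, with single bonds (atom a, atom b, bond order).
fp = toy_ecfp(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
print(sum(fp), len(fp))
```

Because identifiers depend only on each atom's environment, renumbering the atoms leaves the fingerprint unchanged, which is the property that makes such fingerprints usable as network input.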

In step 5, the fingerprint calculation module in this embodiment invokes DeepChem through a Python script and inputs grid_featurizer(mol) in the script, where mol refers to an input chemical molecule; this converts the ligand molecule and the amino acid molecular structures in the binding pocket into a SPLIF fingerprint. In this embodiment, the charges applied to the ligand molecule are derived from the atomic charges assigned by X-TOOL.

Step 6: the one-dimensional tensor module searches for the number of salt bridges and hydrogen bonds between the protein and the ligand molecule; this embodiment counts the salt bridges and hydrogen bonds formed between the ligand and the amino acid molecules within a distance of 4.5 angstroms.
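The counting in step 6 might be sketched with purely geometric criteria, as below. The source gives no detection rules, so everything here is an assumption: a salt bridge is taken to be an oppositely charged atom pair within 4.0 angstroms, a hydrogen bond an N/O pair within 3.5 angstroms, and angles are ignored. The Atom record is hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class Atom(NamedTuple):
    element: str
    charge: int                        # formal charge; hypothetical field
    xyz: Tuple[float, float, float]


HBOND_CUTOFF = 3.5         # angstroms; assumed, not from the source
SALT_BRIDGE_CUTOFF = 4.0   # angstroms; assumed, not from the source
POLAR = {"N", "O"}


def count_contacts(ligand: List[Atom], pocket: List[Atom]) -> Tuple[int, int]:
    """Return (hydrogen-bond count, salt-bridge count) from distances only."""
    hbonds = salt_bridges = 0
    for l in ligand:
        for p in pocket:
            d = math.dist(l.xyz, p.xyz)
            if l.charge * p.charge < 0 and d <= SALT_BRIDGE_CUTOFF:
                salt_bridges += 1      # oppositely charged pair
            elif l.element in POLAR and p.element in POLAR and d <= HBOND_CUTOFF:
                hbonds += 1            # polar N/O pair close enough
    return hbonds, salt_bridges


ligand_atoms = [Atom("N", 1, (0.0, 0.0, 0.0)), Atom("O", 0, (5.0, 0.0, 0.0))]
pocket_atoms = [
    Atom("O", -1, (3.0, 0.0, 0.0)),
    Atom("N", 0, (5.0, 3.0, 0.0)),
    Atom("C", 0, (0.0, 1.0, 0.0)),
]
n_hbonds, n_salt_bridges = count_contacts(ligand_atoms, pocket_atoms)
print(n_hbonds, n_salt_bridges)
```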

Step 7: according to the results obtained in the steps 4 to 5, the structural information of the protein and the ligand is converted into a one-dimensional tensor; that is, three-dimensional coordinates of the crystal structure are converted into one-dimensional vectors.
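One way to picture the one-dimensional tensor is as a plain concatenation of the per-complex features. The layout below (ligand fingerprint, pocket fingerprint, SPLIF bits, then the two interaction counts) is an assumed ordering for illustration; the source does not specify the order or the vector length.

```python
from typing import List, Sequence


def to_one_dimensional(ligand_ecfp: Sequence[int],
                       pocket_ecfp: Sequence[int],
                       splif: Sequence[int],
                       n_hbonds: int,
                       n_salt_bridges: int) -> List[float]:
    """Concatenate all features of one complex into a single flat vector."""
    vector: List[float] = []
    for block in (ligand_ecfp, pocket_ecfp, splif):
        vector.extend(float(bit) for bit in block)
    vector.extend([float(n_hbonds), float(n_salt_bridges)])
    return vector


# Tiny made-up fingerprints; real ones would have hundreds or thousands of bits.
x = to_one_dimensional([1, 0, 1], [0, 1], [1, 1, 0, 0], n_hbonds=2, n_salt_bridges=1)
print(len(x), x[-2:])
```

Every complex must produce a vector of the same length so that the vectors can be stacked into one input matrix for training.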

In step 9, the progressive neural network is constructed as follows. First, a multi-layer neural network is constructed and trained on the first task.

Then a second multi-layer neural network is constructed and the neural network of the previous task is fixed. The output of each layer of the previous column is processed by a nonlinear adaptation function and fed to the corresponding layer of the second column as an additional input. That is, in addition to its original input, each layer of the second neural network receives the output of the corresponding layer of the previous neural network after processing by the nonlinear adaptation function.

Then a third multi-layer neural network is constructed and trained on the third task; the neural networks of the first two columns are fixed and connected to the third neural network in the same way. This step is repeated until all tasks have been trained.
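The column wiring described above can be sketched as a forward pass through two small columns. This is an illustration of the lateral connection only, with no training loop; the layer sizes, the weights, the choice of tanh as the nonlinear adaptation function, and all function names are assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def adapter(v: Vector) -> Vector:
    """Nonlinear adaptation of the frozen column's activations (tanh assumed)."""
    return [math.tanh(x) for x in v]


def column1_forward(x: Vector, w1: Matrix, w2: Matrix) -> List[Vector]:
    """First column; its weights stay fixed once the first task is trained."""
    h1 = relu(matvec(w1, x))
    h2 = matvec(w2, h1)
    return [h1, h2]


def column2_forward(x: Vector, col1_acts: List[Vector],
                    w1: Matrix, w2: Matrix, u2: Matrix) -> Vector:
    """Second column: its output layer also receives column 1's hidden layer."""
    h1 = relu(matvec(w1, x))
    own = matvec(w2, h1)                            # original input path
    lateral = matvec(u2, adapter(col1_acts[0]))     # additional lateral input
    return [a + b for a, b in zip(own, lateral)]


x = [1.0, 2.0]
acts = column1_forward(x, [[1.0, 0.0], [0.0, -1.0]], [[1.0, 1.0]])
out = column2_forward(x, acts,
                      w1=[[0.5, 0.5], [1.0, -1.0]],
                      w2=[[2.0, 0.0]],
                      u2=[[1.0, 1.0]])
print(out)
```

When the second column is trained, only w1, w2 and u2 would be updated; the first column's weights are left untouched, which is what allows earlier tasks to be retained.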

In this embodiment, a grid search algorithm is used in training the progressive neural network to narrow the range of its hyperparameters step by step. Grid search determines the best-performing configuration by exhaustively enumerating different parameter combinations. It still requires a strategy: it is preferable to determine the approximate range of each hyperparameter at the initial stage, for example by trying a large-stride grid search with a smaller number of iterations or a smaller training set. In the next stage, a larger number of iterations or the whole training set is used with a small stride to locate the optimum precisely.

The performance of each parameter combination is measured by the loss function. The learning rate lies in the interval 10⁻⁶ to 10⁻⁴, the number of iterations is less than 50, the batch size is 100, the activation function may be ReLU, tanh, or the like, and the Adam algorithm is used as the neural network optimizer. The parameter combination obtained by this step-by-step optimization performs well on the validation set. The result is shown in FIG. 2.
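The coarse-then-fine strategy can be sketched as two passes of an exhaustive grid search. The function toy_validation_loss is a made-up stand-in for training the network and measuring its validation loss; its optimum at a learning rate of 3e-5 with tanh is arbitrary and not a result from the source. Only the learning-rate range, the batch size of 100 and the two activation functions come from the text.

```python
import itertools
import math
from typing import Callable, Dict, List, Tuple


def grid_search(loss_fn: Callable[[dict], float],
                grid: Dict[str, List]) -> Tuple[dict, float]:
    """Evaluate every parameter combination and return the best one."""
    names = list(grid)
    best, best_loss = None, math.inf
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        loss = loss_fn(params)
        if loss < best_loss:
            best, best_loss = params, loss
    return best, best_loss


def toy_validation_loss(p: dict) -> float:
    """Stand-in for train-then-validate; not the real loss surface."""
    penalty = 0.0 if p["activation"] == "tanh" else 0.1
    return (math.log10(p["learning_rate"]) - math.log10(3e-5)) ** 2 + penalty


# Stage 1: large stride over the whole learning-rate range.
coarse = {"learning_rate": [1e-6, 1e-5, 1e-4],
          "activation": ["relu", "tanh"],
          "batch_size": [100]}
best_coarse, _ = grid_search(toy_validation_loss, coarse)

# Stage 2: small stride around the coarse optimum.
centre = best_coarse["learning_rate"]
fine = {"learning_rate": [centre * f for f in (0.5, 1.0, 2.0, 3.0, 5.0)],
        "activation": [best_coarse["activation"]],
        "batch_size": [100]}
best_fine, best_loss = grid_search(toy_validation_loss, fine)
print(best_coarse, best_fine)
```

In practice the first stage would run with fewer iterations or a smaller training subset, as the text suggests, so that the expensive full-data runs are spent only on the narrowed range.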

The progressive neural network is then applied to the test subset using the hyperparameter combination that also performed well on the validation subset.

In step 13, the protein molecules in the test subset are docked with the ligand molecules and scored with AutoDock Vina to obtain the binding free energy of each protein-ligand complex structure; this result serves as the control group of the invention. The binding free energy of each docked protein-ligand complex structure is then predicted by the progressive neural network. The binding free energies obtained by the two methods are compared using receiver operating characteristic curves to assess their ability to distinguish correct from incorrect docking. The areas under the receiver operating characteristic curves of the progressive neural network and AutoDock Vina were found to be 0.87 and 0.52, respectively, which shows that the progressive neural network has the better predictive ability.
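The area under a receiver operating characteristic curve can be computed directly as the probability that a correctly docked pose is ranked above an incorrectly docked one. The sketch below assumes binary labels (1 for correct docking, 0 for incorrect) and uses the negated binding free energy as the score, since a more negative free energy means stronger predicted binding; the labels and energies are made-up illustrative values, not data from the patent.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]                     # hypothetical correct/incorrect poses
delta_g = [-9.0, -6.5, -7.0, -5.0]        # hypothetical predicted dG, kcal/mol
auc = roc_auc(labels, [-g for g in delta_g])
print(auc)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.87 and 0.52 should be read.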

In the method for predicting the binding free energy of protein-ligand complex structures based on a progressive neural network provided by this embodiment, the structures of protein and ligand molecules are converted into vectors, so that the progressive neural network is carried over from image recognition to predicting the binding free energy of protein-ligand complex structures. The invention extends the application range of the progressive neural network, and the deep learning method is expected to be applied to the screening of drug molecules.

Regression analysis against experimentally measured binding constants is used to assess the reliability of the progressive-neural-network-based method for calculating protein-ligand binding free energy. The Pearson correlation coefficients between the binding free energy predicted by the progressive neural network and the experimental free energy were 0.75 and 0.71 in the Refined and General subsets, respectively. Both are higher than the coefficients obtained with AutoDock Vina (0.51 and 0.43).

As shown in FIG. 3, the binding free energy obtained by the progressive-neural-network-based calculation method of this embodiment correlates well with the binding free energy derived indirectly from experiment, with a Pearson correlation coefficient of 0.75. The protein-ligand binding free energy calculation method based on the progressive neural network model of this embodiment therefore has high reliability and can be used for virtual screening of targeted drugs.
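For reference, the Pearson correlation coefficient used in this comparison can be computed as below. The sample values are made up to exercise the function and are not the patent's data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


predicted = [-9.1, -7.4, -6.0, -8.2]      # hypothetical predicted dG values
experimental = [-9.5, -7.0, -6.5, -7.9]   # hypothetical experimental dG values
r = pearson(predicted, experimental)
print(round(r, 3))
```

A coefficient close to 1 means the predicted values rise and fall with the experimental ones, even if their absolute scale differs.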

The method for predicting the binding free energy of the protein and the ligand based on the progressive neural network solves the technical problem that the three-dimensional structure of the protein and the ligand molecule is converted into tensor which is easy to calculate by a computer and is input into the progressive neural network for training and optimizing. Because the progressive neural network is only nonlinear numerical calculation, compared with a free energy calculation method based on a physical model, the calculation speed is greatly increased.

In the invention, the binding free energy of protein-ligand complex structures predicted by the progressive neural network is compared with the binding free energy calculated by the conventional software AutoDock Vina, and the binding free energy predicted by the progressive neural network is found to correlate better with the experimentally obtained binding free energy. The method enables high-throughput calculation of the binding free energy of protein-ligand complex structures, thereby improving both calculation efficiency and accuracy.

It is well known that accurate prediction of the interaction between proteins and ligand molecules determines the success or failure of drug screening. In the embodiment of the invention, the structures of the protein molecules and the ligand molecules are converted into vectors that are easy for a computer to read. In encoding the protein and ligand molecules, the structures of the ligand molecule and the protein binding pocket and the electrostatic energy between the ligand molecule and the protein are fully considered, so that the structural and chemical information of the protein and the ligand is effectively retained. The invention extends the progressive neural network method of the deep learning field from image recognition to drug design, broadening the application range of the progressive neural network. In training the progressive neural network, hyperparameters suitable for protein-ligand binding free energy calculation are obtained by optimizing the loss function. The method for predicting the binding free energy of a protein and a ligand based on a progressive neural network improves the correlation between the predicted binding free energy and the experimental free energy and yields more reliable binding free energies. The invention not only broadens the application range of the progressive neural network method, but also shows that predicting the interaction between protein and ligand molecules with a progressive neural network is feasible under existing conditions, and the method is expected to become a more effective approach to targeted drug screening.

Claims

1. A method for predicting the free energy of protein and ligand binding based on a progressive neural network, which is characterized in that: the method comprises the following steps:

2. The method for predicting the free energy of protein binding to a ligand based on a progressive neural network according to claim 1, wherein: in performing step 3, the pdb file includes data for the protein-ligand complex structure.

3. The method for predicting the free energy of protein binding to a ligand based on a progressive neural network according to claim 1, wherein: when step 11 is executed, a grid search method is adopted to determine the range of the hyperparameters of the progressive neural network.

4. The method for predicting the free energy of protein binding to a ligand based on a progressive neural network according to claim 1, wherein: when step 9 is executed, the neural network is trained using grid search to optimize the learning rate, the number of iterations, the batch size, the activation function, and the hidden-layer parameters.

5. The method for predicting the free energy of protein binding to a ligand based on a progressive neural network according to claim 1, wherein: in performing step 12, the binding free energy is converted to a binding constant by the Arrhenius equation:

wherein logKa is the binding constant, ΔG is the binding free energy, k_B is the Boltzmann constant, and T is the absolute temperature.