CN110689920A

CN110689920A - Protein-ligand binding site prediction algorithm based on deep learning

Info

Publication number: CN110689920A
Application number: CN201910879922.7A
Authority: CN
Inventors: 夏春秋; 杨旸; 沈红斌
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2020-01-14
Anticipated expiration: 2039-09-18
Also published as: CN110689920B

Abstract

The invention discloses a deep-learning-based algorithm for predicting protein-ligand binding sites. For a protein to be predicted, the sequence features and the distance matrix of the protein are first extracted, and the features are then assigned to each residue by a sliding-window method. The features corresponding to each residue are input, one residue at a time, into residual neural networks and hybrid neural networks, and the outputs of these networks are input into a logistic regression classifier; the final result is the binding probability of each residue in the protein. The invention fuses a classical bidirectional long short-term memory network with a residual neural network, so that the fused network can process heterogeneous protein sequence and structure data simultaneously and exploit the complementarity of sequence features and structural features. Compared with existing methods, the method achieves higher prediction accuracy and generalizes well across data sets of different ligands.

Description

Protein-ligand binding site prediction algorithm based on deep learning

Technical Field

The invention relates to the field of protein biology and pattern recognition, in particular to a protein-ligand binding site prediction algorithm based on deep learning.

Background

Interactions between proteins and ligands play important roles in biological processes such as signal transduction, post-translational modification, and antigen-antibody interaction. Drug discovery and design also rely heavily on analysis of the mechanism of protein-ligand interaction. Identifying the binding site is a critical step in exploring that mechanism further. As protein design techniques have matured, more and more new proteins have appeared whose properties and functions have not yet been explored, and the need for fast, accurate binding-site identification tools has become more urgent. Identifying protein binding sites by wet-lab experiments is time-consuming and costly.

Protein-ligand interactions can be classified into protein-protein interactions, protein-DNA/RNA interactions, and protein-small molecule interactions, depending on the type of ligand. At this stage, there are many computational methods based on sequence information (protein primary structure) or structural information (protein tertiary structure) that can predict protein-ligand binding sites.

Sequence-based methods can make site predictions for proteins with unknown three-dimensional structures using some purely sequence-based features such as evolutionary information and predicted secondary structures. However, since the position of the binding site is mainly determined by the tertiary structure of the protein, the prediction accuracy of the sequence-based method is relatively low.

Structure-based methods all require the three-dimensional coordinates of every atom in the protein as input, but they rely on different criteria: POCKET assumes that the binding site is more likely to lie in a concave region of the protein surface, SITEHOUND uses an energy function to calculate the force field between the protein and the ligand, and TM-SITE is a template-matching method.

Disclosure of Invention

In view of the low accuracy of prediction algorithms in the prior art, the invention aims to provide a protein-ligand binding site prediction algorithm based on deep learning.

For the application scenario of protein-ligand binding site recognition, the invention provides a more accurate prediction method by fusing deep learning with domain knowledge of protein structure, and also provides effective solutions to some associated problems, such as data imbalance and the difficulty of registering three-dimensional structures against one another.

The technical problem solved by the invention can be realized by adopting the following technical scheme:

a deep learning based protein-ligand binding site prediction algorithm comprising the steps of:

step 1) first extracting sequence features for a protein structure data set, then calculating the Euclidean distance between each pair of residues from the three-dimensional coordinates of each residue of the protein and constructing a distance matrix; finally, extracting a feature tensor for each residue by a sliding-window method;

step 2) taking each binding site as a positive sample and each non-binding site as a negative sample, drawing a subset of the negative samples by random down-sampling and combining it with all the positive samples into a training subset, and repeating this multiple times to obtain multiple training subsets; randomly up-sampling the positive samples when constructing each mini-batch;

step 3) constructing a residual neural network from residual modules, and training it on the distance matrix;

step 4) integrating the constructed residual neural network with a bidirectional long short-term memory network (BiLSTM) through a fully connected layer to build a hybrid neural network, and training it on the sequence features and the distance matrix;

step 5) training a logistic regression classifier on the outputs of the residual neural network and the hybrid neural network;

step 6) for the protein to be predicted, first extracting its sequence features and distance matrix, then assigning the features to each residue by the sliding-window method, then inputting the residues one by one into the residual neural network and the hybrid neural network, and inputting the outputs of the two networks into the logistic regression classifier, the final result being the binding probability of each residue in the protein.

Further, the method for extracting the sequence feature and the distance matrix in the step 1) is as follows:

step 1.1) for a protein of length L, obtaining its position-specific scoring matrix (PSSM) with the PSI-BLAST algorithm; the PSSM has size L × 20, where the element p_ij in row i and column j indicates the likelihood that the i-th residue mutates into the j-th of the 20 amino acid types;

each p_ij is then normalized as follows:

step 1.2) for a protein of length L, obtaining a scoring matrix HHM with the HHblits algorithm, the HHM characterizing the evolutionary information of the protein sequence; the HHM has size L × 30, where columns 1-20 are the emission probabilities of the 20 amino acids, columns 21-27 are transition probabilities, and columns 28-30 are local diversities;

each element h_ij of the HHM is normalized as follows:

step 1.3) for a protein of length L, predicting its secondary structure and relative solvent accessibility with the SCRATCH algorithm; the secondary structure is represented as an L × 3 matrix, where each row s_i is a one-hot vector indicating whether the secondary structure of the i-th residue is helix, strand, or other; solvent accessibility is represented as an L × 2 matrix, where each row r_i is a one-hot vector indicating whether the i-th residue is exposed or buried;

step 1.4) for a protein of length L, predicting the binding propensity of each residue with the S-SITE algorithm, and expressing the result as an L × 2 matrix, where the elements q_i0 and q_i1 of row i are the probabilities that the i-th residue binds and does not bind, respectively, and q_i0 + q_i1 = 1;

step 1.5) for a protein of length L whose atomic coordinates are known, calculating the Euclidean distance between the C_α atoms of the i-th and j-th residues, denoted d_ij;

constructing the distance matrix D = {d_ij}^(L×L) in sequence order, and then scaling it, as an image, to size L × 400 by interpolation;

step 1.6) concatenating the sequence feature matrices obtained in steps 1.1) to 1.4) row by row into an L × 57 sequence feature matrix (20 + 30 + 3 + 2 + 2 = 57 columns), and applying a sliding window of size W at each residue to obtain a W × 57 feature matrix per residue; likewise applying a sliding window of size W to the distance matrix to obtain a W × 400 distance matrix for each residue.
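By way of illustration only, the sliding-window extraction of step 1.6) may be sketched in Python as follows. The function name `window_features`, the centring of the window on the residue, and the zero padding at the chain termini are assumptions made for this sketch; the text does not state how windows that extend past the ends of the sequence are filled.

```python
def window_features(features, center, window):
    """Return `window` rows of `features` centred on residue `center`.

    Rows that fall outside the sequence are filled with zeros
    (zero padding is an assumption, not stated in the source).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    width = len(features[0])
    rows = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(features):
            rows.append(list(features[pos]))
        else:
            rows.append([0.0] * width)
    return rows


# Toy protein: L = 5 residues, 57 sequence feature columns per residue.
L, N_FEATURES, W = 5, 57, 3
seq_features = [[float(i + 1)] * N_FEATURES for i in range(L)]

first = window_features(seq_features, center=0, window=W)
middle = window_features(seq_features, center=2, window=W)
print(len(first), len(first[0]))  # 3 57
```

The same function can be applied unchanged to the L × 400 scaled distance matrix, giving a W × 400 matrix per residue.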

Further, the random down-sampling in the step 2) and the up-sampling in the mini-batch need to satisfy the following conditions:

1) in random down-sampling, each negative sample is selected from the original data set at random with a probability of 20%, and the selected negative samples are combined with all the positive samples into one training subset; N_set training subsets are obtained in the same manner;

2) in up-sampling within the mini-batch, N_p positive samples and N_n negative samples are drawn cyclically from the set of all positive samples and the set of all negative samples, respectively, where N_p is given by the following formula:

N_p = [0.3 × N_b]

where N_b is the size of the mini-batch, [·] denotes rounding, and N_n = N_b − N_p.
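The two sampling conditions above may be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, [·] is read as rounding to the nearest integer, and "cyclically" is read as walking through each sample set in order and wrapping around at the end.

```python
import random


def downsample_subsets(positives, negatives, n_set, rate=0.2, seed=0):
    """Build n_set subsets: all positives plus negatives kept with probability `rate`."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_set):
        kept = [neg for neg in negatives if rng.random() < rate]
        subsets.append((list(positives), kept))
    return subsets


def batch_composition(n_b, positive_fraction=0.3):
    """Return (N_p, N_n) with N_p = [0.3 x N_b]; [.] taken as round-to-nearest (assumption)."""
    n_p = int(positive_fraction * n_b + 0.5)
    return n_p, n_b - n_p


def minibatches(positives, negatives, n_b, n_batches):
    """Yield mini-batches, cycling through the positive and negative sets independently."""
    n_p, n_n = batch_composition(n_b)
    p_idx = n_idx = 0
    for _ in range(n_batches):
        batch_pos = [positives[(p_idx + k) % len(positives)] for k in range(n_p)]
        batch_neg = [negatives[(n_idx + k) % len(negatives)] for k in range(n_n)]
        p_idx = (p_idx + n_p) % len(positives)
        n_idx = (n_idx + n_n) % len(negatives)
        yield batch_pos, batch_neg


# Toy data: 5 positives and 200 negatives (hypothetical labels).
positives = [f"pos{i}" for i in range(5)]
negatives = [f"neg{i}" for i in range(200)]
subsets = downsample_subsets(positives, negatives, n_set=3)
pos, neg = subsets[0]
batches = list(minibatches(pos, neg, n_b=32, n_batches=2))
print(batch_composition(32))  # (10, 22)
```

With only 5 positives and N_p = 10, each positive sample appears twice per mini-batch, which is the up-sampling effect described in condition 2).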

Further, the definition of the residual block and the construction process of the residual neural network are as follows:

in a neural network, the convolutional layer can be represented as Conv (X, W, H, D), where X is the input variable, W and H are the width and height of the convolutional kernels, respectively, and D is the number of convolutional kernels; the residual block is formed by stacking three convolution layers as shown in the following formula:

Res(X)＝σ(Conv(σ(Conv(σ(Conv(X，1，1，D))，3，3，D))，1，1，4×D)+X)

where σ is an activation function; the residual neural network is formed by stacking a plurality of residual blocks and is optimized with the Adam algorithm, and its input is the distance matrix of each residue;

on the N_set subsets, N_res separate residual neural networks can be trained, where N_res ≤ N_set.
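To make the residual-block formula concrete, the following pure-Python sketch evaluates Res(X) on a tiny feature map. It is illustrative only: ReLU is assumed for σ (the text does not name the activation), biases and normalization layers are omitted, stride 1 with zero "same" padding is assumed, and all sizes and weights are toy values. Note that the identity shortcut requires X to already have 4 × D channels.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def conv2d(fmap, kernels):
    """Stride-1 convolution with zero 'same' padding.

    fmap:    H x W x C_in nested lists
    kernels: C_out x k x k x C_in nested lists, k odd
    """
    height, width = len(fmap), len(fmap[0])
    k = len(kernels[0])
    half = k // 2
    out = [[[0.0] * len(kernels) for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for c, kern in enumerate(kernels):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - half, j + dj - half
                        if 0 <= y < height and 0 <= x < width:
                            acc += sum(w * v for w, v in zip(kern[di][dj], fmap[y][x]))
                out[i][j][c] = acc
    return out


def activate(fmap):
    return [[[relu(v) for v in pixel] for pixel in row] for row in fmap]


def bottleneck_block(x, w1, w2, w3):
    """Res(X) = s(Conv(s(Conv(s(Conv(X,1,1,D)),3,3,D)),1,1,4D) + X), with s = ReLU."""
    h = activate(conv2d(x, w1))  # 1 x 1 kernels, D of them
    h = activate(conv2d(h, w2))  # 3 x 3 kernels, D of them
    h = conv2d(h, w3)            # 1 x 1 kernels, 4 x D of them
    # Identity shortcut: add X channel by channel, then activate.
    return [[[relu(a + b) for a, b in zip(ph, px)]
             for ph, px in zip(rh, rx)]
            for rh, rx in zip(h, x)]


def random_kernels(c_out, k, c_in, rng):
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
              for _ in range(k)] for _ in range(k)] for _ in range(c_out)]


rng = random.Random(0)
D, H, W = 2, 3, 4  # toy sizes, not the settings of the invention
x = [[[rng.uniform(-1.0, 1.0) for _ in range(4 * D)] for _ in range(W)] for _ in range(H)]
w1 = random_kernels(D, 1, 4 * D, rng)
w2 = random_kernels(D, 3, D, rng)
w3 = random_kernels(4 * D, 1, D, rng)
y = bottleneck_block(x, w1, w2, w3)
print(len(y), len(y[0]), len(y[0][0]))  # 3 4 8
```

The block preserves the spatial size and the channel count (4 × D in, 4 × D out), which is what allows several such blocks to be stacked.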

Further, the hybrid neural network in step 4) integrates the residual neural network and the BiLSTM, and is optimized with the Adam algorithm; the input to the BiLSTM is the sequence features of each residue;

on the N_set subsets, N_hybrid separate hybrid networks can be trained, where N_hybrid = N_set − N_res.

Further, in step 5), the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; with this vector as input, a logistic regression classifier is trained by cross-validation; an l₁ regularization term is added to the loss function of the logistic regression classifier to prevent overfitting.
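The stacking described in step 5) may be sketched as below. This is an assumption-laden illustration: the optimizer (proximal gradient descent with soft-thresholding), the regularization weight `lam`, the learning rate, and the toy network outputs are all hypothetical, and the cross-validation procedure mentioned in the text is not reproduced.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, outputs):
    """Binding probability from the concatenated outputs of the N_set networks."""
    return sigmoid(bias + sum(w * o for w, o in zip(weights, outputs)))


def l1_logistic_loss(weights, bias, samples, labels, lam):
    """Mean cross-entropy plus lam * sum(|w_k|)."""
    eps = 1e-12
    total = 0.0
    for outputs, y in zip(samples, labels):
        p = predict_proba(weights, bias, outputs)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(samples) + lam * sum(abs(w) for w in weights)


def fit(samples, labels, lam=0.01, lr=0.5, epochs=500):
    """Gradient step on the cross-entropy, then soft-thresholding for the l1 term."""
    n_set = len(samples[0])
    weights, bias = [0.0] * n_set, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_set, 0.0
        for outputs, y in zip(samples, labels):
            err = predict_proba(weights, bias, outputs) - y
            grad_b += err
            for k, o in enumerate(outputs):
                grad_w[k] += err * o
        bias -= lr * grad_b / len(samples)
        for k in range(n_set):
            w = weights[k] - lr * grad_w[k] / len(samples)
            weights[k] = math.copysign(max(abs(w) - lr * lam, 0.0), w)
    return weights, bias


# Toy outputs for 4 residues: N_res = 1 residual network, N_hybrid = 2 hybrid networks.
res_out = [[0.9], [0.8], [0.2], [0.1]]
hyb_out = [[0.8, 0.7], [0.9, 0.6], [0.1, 0.3], [0.2, 0.2]]
stacked = [r + h for r, h in zip(res_out, hyb_out)]  # length N_set = 3 per residue
labels = [1, 1, 0, 0]

lam = 0.01
initial_loss = l1_logistic_loss([0.0] * 3, 0.0, stacked, labels, lam)
weights, bias = fit(stacked, labels, lam=lam)
final_loss = l1_logistic_loss(weights, bias, stacked, labels, lam)
```

The l₁ penalty tends to drive the weights of redundant networks toward zero, so the classifier also acts as a soft selector over the ensemble members.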

Further, in step 6), for a protein to be predicted of length L whose C_α spatial coordinates are known, the sequence features and the distance matrix are first extracted, and the features are then assigned to each residue by a sliding window of size W; the features of each residue are then input, one residue at a time, into the plurality of residual neural networks and hybrid neural networks, the outputs of these networks are input into the logistic regression classifier, and the final result is the binding probability of each residue in the protein.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention provides a novel hybrid neural network that fuses a classical bidirectional long short-term memory network with a residual neural network; the fused network can process heterogeneous protein sequence and structure data simultaneously and exploits the complementarity of sequence features and structural features.

2. The invention uses random down-sampling together with ensembling to address the imbalance between positive and negative samples, and up-samples the positive samples batch by batch to further reduce the effect of that imbalance when data are fed to the neural network in mini-batches.

3. Compared with existing methods, the method achieves higher prediction accuracy and generalizes well across data sets of different ligands.

Drawings

FIG. 1 is a flow chart of the deep learning-based protein-ligand binding site prediction algorithm of the present invention.

FIG. 2 is an architecture diagram of the hybrid neural network of the present invention,

comprising (a) the sequence feature and distance matrix extraction module, (b) the residual network module, and (c) the bidirectional long short-term memory network module.

FIG. 3 is a schematic diagram of a random sampling and integration method according to the present invention.

FIG. 4 is a schematic diagram of an implementation of a residual block in the residual neural network of the present invention.

Detailed Description

In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.

Referring to fig. 1, the present invention provides a deep learning-based protein-ligand binding site prediction algorithm, which comprises the following steps:

step 1) for a given protein structure data set, first extracting evolutionary information, secondary structure information, relative solvent accessibility, and binding probability with the PSI-BLAST, HHblits, SCRATCH, and S-SITE algorithms, respectively, and normalizing the evolutionary information; next, calculating the Euclidean distance between each pair of residues from the three-dimensional coordinates of each residue of the protein and constructing a distance matrix; finally, extracting a feature tensor for each residue with a sliding-window strategy;

step 2) taking each binding site as a positive sample and each non-binding site as a negative sample, drawing a subset of the negative samples by random down-sampling and combining it with all the positive samples into a training subset, and repeating this multiple times to obtain multiple training subsets; then randomly up-sampling the positive samples when constructing each mini-batch;

step 3) constructing a residual neural network (ResNet) from residual modules, and training it on the distance matrix obtained in step 1);

step 4) integrating the residual network of step 3) with a bidirectional long short-term memory network (BiLSTM) through a fully connected layer to build a hybrid neural network, and training it on the sequence features and the distance matrix obtained in step 1);

step 5) training a logistic regression classifier on the outputs of the residual neural network of step 3) and the hybrid network of step 4);

step 6) for a protein to be predicted, first extracting its sequence features and distance matrix, then assigning the features to each residue by the sliding-window method, then inputting them one by one into the residual network and the hybrid neural network, and then inputting the outputs into the logistic regression classifier, the final result being the binding probability of each residue in the protein.

Wherein the specific process of the step 1) is as follows:

each p_ij is then normalized as follows:

each element h_ij of the HHM is normalized as follows:

step 1.6) concatenating the sequence feature matrices obtained in steps 1.1) to 1.4) row by row into an L × 57 sequence feature matrix, and applying a sliding window of size W at each residue to obtain a W × 57 feature matrix per residue; as shown in part (a) of FIG. 2, the features are divided into two groups that are input to two BiLSTMs: one group contains only PSSM, SS (secondary structure predicted by SCRATCH), RSA (relative solvent accessibility predicted by SCRATCH), and SST (binding propensity predicted by S-SITE), and the other contains only HHM, SS, RSA, and SST. A sliding window of the same size W is applied to the distance matrix to obtain a W × 400 distance matrix for each residue.
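The split of the W × 57 window into the two BiLSTM inputs may be illustrated as follows. The column order (PSSM, HHM, SS, RSA, SST, following the order of steps 1.1) to 1.4)) and the helper names are assumptions of this sketch; the column widths 20, 30, 3, 2, and 2 are those implied by the feature definitions and sum to 57.

```python
# Assumed column layout of the W x 57 window: PSSM, HHM, SS, RSA, SST.
WIDTHS = [("PSSM", 20), ("HHM", 30), ("SS", 3), ("RSA", 2), ("SST", 2)]


def column_slices(widths):
    """Map each feature name to the slice of columns it occupies."""
    slices, start = {}, 0
    for name, width in widths:
        slices[name] = slice(start, start + width)
        start += width
    return slices


def select(window, names, slices):
    """Keep only the columns of the named features, in the given order."""
    return [[v for name in names for v in row[slices[name]]] for row in window]


def split_for_bilstms(window):
    """Return the (PSSM group, HHM group) inputs for the two BiLSTMs."""
    slices = column_slices(WIDTHS)
    shared = ["SS", "RSA", "SST"]
    return (select(window, ["PSSM"] + shared, slices),
            select(window, ["HHM"] + shared, slices))


W = 37
window = [[float(c) for c in range(57)] for _ in range(W)]  # column index as toy value
pssm_group, hhm_group = split_for_bilstms(window)
print(len(pssm_group[0]), len(hhm_group[0]))  # 27 37
```

Under this layout the first BiLSTM receives a W × 27 input and the second a W × 37 input, with the SS, RSA, and SST columns shared by both groups.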

The random down-sampling and the up-sampling in the mini-batch in the step 2) are shown in fig. 3, and the following conditions need to be satisfied:

N_p＝[0.3×N_b]

Further, in step 3), the residual block and the residual network are defined and constructed as follows:

as shown in FIG. 4, a residual block generally consists of several convolutional layers and an identity mapping, with nonlinear mapping between convolutional layers implemented by an activation function. FIG. 4 shows an ordinary residual block on the left and a bottleneck residual block on the right; the advantage of the bottleneck form is that it reduces the number of parameters while preserving performance. The invention uses the bottleneck residual block, which is described as follows:

Res(X) = σ(Conv(σ(Conv(σ(Conv(X, 1, 1, D)), 3, 3, D)), 1, 1, 4×D) + X)

where σ is an activation function, Conv(X, W, H, D) is a convolution function, X is the input variable, W and H are the width and height of the convolution kernel, respectively, and D is the number of convolution kernels;

in the invention, the residual network is formed by stacking a plurality of residual blocks, as shown in FIG. 2(b), and is optimized with the Adam algorithm; the input of the network is the distance matrix of each residue. The specific network architecture is summarized in Table 1.

TABLE 1 Residual neural network module architecture

a) The settings of a convolution layer denote, in order, the kernel size, the number of kernels, and the stride;

b) The stride of the bottleneck residual blocks is 1.

Further, in step 4), the hybrid neural network integrates the residual network of step 3) with the BiLSTMs through a fully connected layer and is optimized with the Adam algorithm; its overall architecture is shown in FIG. 2. As described in step 1.6), the inputs of the two BiLSTMs are the two groups of sequence features, respectively.

In step 5), the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; with this vector as input, a logistic regression classifier is trained by cross-validation, as illustrated in FIG. 3; an l₁ regularization term is added to the loss function of the logistic regression classifier to prevent overfitting.

In step 6), for a protein to be predicted of length L whose C_α spatial coordinates are known, the sequence features and the distance matrix are first extracted, and the features are then assigned to each residue by a sliding window of size W; the residues are then input one by one into the plurality of residual neural networks and hybrid neural networks, the outputs of these networks are input into the logistic regression classifier, and the final result is the binding probability of each residue in the protein.

The binding probability is then compared with an optimal threshold T ∈ (0, 1) learned on the training set: if the binding probability is greater than T, the residue is considered a binding site; otherwise, it is considered a non-binding site.
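The final decision rule can be written in a few lines. The function name is hypothetical, the probabilities are toy values, and the threshold 0.345 is the value reported in the example of this description; a probability exactly equal to T is treated as non-binding, following the "greater than T" wording.

```python
def classify_residues(binding_probabilities, threshold):
    """Label a residue as a binding site when its probability exceeds T."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold T must lie in the open interval (0, 1)")
    return [p > threshold for p in binding_probabilities]


probs = [0.10, 0.345, 0.40, 0.90]   # toy per-residue binding probabilities
is_binding = classify_residues(probs, threshold=0.345)
print(is_binding)  # [False, False, True, True]
```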

Examples

A data set of protein-Mn²⁺ binding sites is used as the training set and the test set. The training set contains 440 proteins in total, with 1931 binding residues and 150229 non-binding residues; the test set contains 144 proteins in total, with 612 binding residues and 50838 non-binding residues.

First, the evolutionary information, secondary structure information, relative solvent accessibility, and binding probability of all proteins in the training and test sets are extracted with the PSI-BLAST, HHblits, SCRATCH, and S-SITE algorithms, respectively, and the evolutionary information (PSSM and HHM) is normalized; next, the Euclidean distance between each pair of residues is calculated from the three-dimensional coordinates of all residues of the proteins in the training and test sets, a distance matrix is constructed, and the number of columns of the matrix is scaled to 400; finally, a feature tensor is extracted for each residue with a sliding window of size 37, so that every residue corresponds to a 37 × 57 sequence feature matrix and a 37 × 400 distance matrix.
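The construction of the distance matrix and the scaling of its columns to 400 may be sketched as follows. The interpolation method is not specified in the text; linear interpolation along each row is an assumption of this sketch, as are the function names and the toy coordinates (three collinear C_α atoms spaced 3.8 Å apart).

```python
import math


def distance_matrix(ca_coords):
    """L x L matrix of Euclidean distances between C-alpha atoms."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def resize_row(row, new_len):
    """Linearly interpolate a row to new_len samples, preserving the endpoints."""
    if len(row) == 1:
        return [row[0]] * new_len
    out = []
    for k in range(new_len):
        pos = k * (len(row) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(row) - 1)
        frac = pos - lo
        out.append(row[lo] * (1.0 - frac) + row[hi] * frac)
    return out


def scaled_distance_matrix(ca_coords, n_cols=400):
    """L x n_cols matrix: each row of the distance matrix resized to n_cols columns."""
    return [resize_row(row, n_cols) for row in distance_matrix(ca_coords)]


# Toy chain of three residues (hypothetical coordinates, in angstroms).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
dist = distance_matrix(coords)
scaled = scaled_distance_matrix(coords)
print(len(scaled), len(scaled[0]))  # 3 400
```

Scaling only the column dimension gives every protein a fixed-width L × 400 matrix regardless of its length, so that the per-residue windows fed to the residual network all have the same shape.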

Due to the extremely unbalanced state of the data, i.e. binding sites (positive samples) are much less than non-binding sites (negative samples), the negative samples are randomly down-sampled with a sampling rate of 20%. The sampled negative samples are then combined with all positive samples to form a training subset, and the process is repeated until 13 training subsets are obtained. For each training subset, a hybrid neural network or a residual neural network may be trained. In this example, 10 hybrid networks and 3 residual networks are trained in total.
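The effect of the 20% down-sampling on class balance can be checked with a short calculation on the training-set counts given above. The last figure assumes that the 13 subsets are drawn independently of one another, which the text does not state explicitly.

```python
n_pos, n_neg = 1931, 150229     # binding / non-binding residues in the training set
rate, n_subsets = 0.20, 13      # sampling rate and number of training subsets

expected_neg = n_neg * rate                       # expected negatives per subset
original_pos_fraction = n_pos / (n_pos + n_neg)   # before down-sampling
subset_pos_fraction = n_pos / (n_pos + expected_neg)
never_sampled = (1.0 - rate) ** n_subsets         # negatives used by no subset

print(f"{original_pos_fraction:.4f}")  # 0.0127
print(round(expected_neg))             # 30046
print(f"{subset_pos_fraction:.4f}")    # 0.0604
print(f"{never_sampled:.3f}")          # 0.055
```

Down-sampling therefore raises the share of positive samples from about 1.3% to about 6% in each subset, while the ensemble of 13 subsets still covers roughly 94.5% of the negative samples in expectation.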

And inputting the data in each training subset into an independent mixed neural network or residual neural network according to the mini-batch with the size of 32, and ensuring that the ratio of positive samples to negative samples is controlled to be 3: 7. The parameters of the network are then optimized by the Adam algorithm until the effect of the neural network on the validation set is no longer improved.

The results of all 13 networks are concatenated into a length 13 vector, and a Logistic regression classifier is trained by means of cross-validation. Thus, the models included in the present invention are trained.

The test-set data are then featurized in the same way and input into the networks; the difference is that the binding sites of the test data are unknown, so the network weights are not updated by the optimization algorithm. Finally, the outputs of the networks are input into the trained logistic regression classifier to obtain the binding probability of each residue, and the binding probabilities are then thresholded with a predetermined threshold, which is 0.345 in this example.

The evaluation indexes adopted by the invention are as follows:

REC＝TP/(TP+FN)

PRE＝TP/(TP+FP)

wherein, TP, FP, TN and FN are true positive, false positive, true negative and false negative results respectively.
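The metrics can be computed from the four counts as sketched below. REC and PRE follow the formulas given above; the MCC formula does not survive in this text, so the standard definition of the Matthews correlation coefficient, MCC = (TP×TN − FP×FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), is assumed here. Function names and the toy labels are hypothetical.

```python
import math


def confusion_counts(labels, predictions):
    """Return (TP, FP, TN, FN) for binary labels and predictions (1 = binding)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient (standard definition, assumed here)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example: 3 binding and 5 non-binding residues.
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(labels, predictions)
print(tp, fp, tn, fn)                   # 2 1 4 1
print(f"{recall(tp, fn):.3f}")          # 0.667
print(f"{precision(tp, fp):.3f}")       # 0.667
print(f"{mcc(tp, fp, tn, fn):.3f}")     # 0.467
```

Unlike REC and PRE, MCC uses all four counts, which is why it is the more informative summary on data as imbalanced as binding-site labels.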

The predicted results of the experiment are as follows:

In the experimental phase, the invention was compared with other representative protein-ligand binding site prediction methods; the results are shown in the following table. The invention achieves the best result on the overall metric MCC, an improvement of 4.9 percent over the second-best method, IonCom. Although the method performs slightly worse on REC because a higher threshold was selected to maximize MCC, it is overall clearly better than the other existing methods.

Method	REC	PRE	MCC
COACH	0.562	0.272	0.381
IonCom	0.531	0.495	0.506
TargetS	0.395	0.499	0.438
The method of the present invention	0.513	0.632	0.565

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A deep learning based protein-ligand binding site prediction algorithm comprising the steps of:

step 4) integrating the residual neural network with a bidirectional long short-term memory network (BiLSTM) through a fully connected layer to build a hybrid neural network, and training it on the sequence features and the distance matrix;

step 6) for the protein to be predicted, first extracting its sequence features and distance matrix, then assigning the features to each residue by a sliding-window method, then inputting the features corresponding to the residues one by one into the residual neural network and the hybrid neural network, and inputting the outputs of the two networks into the logistic regression classifier, the final result being the binding probability of each residue in the protein.

2. The deep learning-based protein-ligand binding site prediction algorithm according to claim 1, wherein the extraction method of the sequence feature and distance matrix in step 1) is as follows:

each p_ij is then normalized as follows:

each element h_ij of the HHM is normalized as follows:

step 1.4) for a protein of length L, predicting the binding propensity of each residue with the S-SITE algorithm, and expressing the result as an L × 2 matrix, where the elements q_i0 and q_i1 of row i are the probabilities that the i-th residue binds and does not bind, respectively, and q_i0 + q_i1 = 1;

3. The deep learning based protein-ligand binding site prediction algorithm according to claim 1 or 2, wherein the random down-sampling in step 2) and the up-sampling in the mini-batch satisfy the following condition:

N_p＝[0.3×N_b]

4. The deep learning based protein-ligand binding site prediction algorithm of claim 3, wherein the definition of the residual block and the construction of the residual neural network are as follows:

5. The deep learning based protein-ligand binding site prediction algorithm of claim 4, wherein the hybrid neural network in step 4) integrates the residual neural network and the BiLSTM and is optimized with the Adam algorithm; the input to the BiLSTM is the sequence features of each residue;

on the N_set subsets, N_hybrid separate hybrid networks can be trained, where N_hybrid = N_set − N_res.

6. The deep learning-based protein-ligand binding site prediction algorithm of claim 5, wherein in step 5) the outputs of the N_res residual networks and the N_hybrid hybrid networks for each residue are concatenated into a vector of length N_set; with this vector as input, a logistic regression classifier is trained by cross-validation; an l₁ regularization term is added to the loss function of the logistic regression classifier to prevent overfitting.

7. The deep learning-based protein-ligand binding site prediction algorithm of claim 6, wherein in step 6), for a protein to be predicted of length L whose C_α spatial coordinates are known, the sequence features and the distance matrix are first extracted, and the features are then assigned to each residue by a sliding window of size W; the features corresponding to the residues are then input one by one into the plurality of residual neural networks and hybrid neural networks, the outputs of these networks are input into the logistic regression classifier, and the final result is the binding probability of each residue in the protein.