CN114708903A - Method for predicting distance between protein residues based on self-attention mechanism - Google Patents

Method for predicting distance between protein residues based on self-attention mechanism

Info

Publication number: CN114708903A
Application number: CN202210245505.9A
Authority: CN (China)
Prior art keywords: sequence, MSA, layer, attention, features
Legal status: Withdrawn
Other languages: Chinese (zh)
Inventors: Zhang Guijun (张贵军), Zhang Fujin (张福金), Zhao Kailong (赵凯龙), Li Zhangwei (李章维)
Current Assignee: Zhejiang University of Technology ZJUT
Original Assignee: Zhejiang University of Technology ZJUT
Filed by Zhejiang University of Technology ZJUT, priority to CN202210245505.9A

Classifications

    • G16B15/20: ICT specially adapted for analysing two- or three-dimensional molecular structures; protein or domain folding
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G16B30/10: Sequence alignment; homology search
    • G16B40/00: ICT specially adapted for biostatistics; bioinformatics-related machine learning or data mining
    • G16B50/30: Data warehousing; computing architectures

Abstract

A method for predicting the distance between protein residues based on a self-attention mechanism. The HHblits tool is first used to search the Uniclust30 and BFD sequence databases, with a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment (MSA) composed of homologous sequences of the query sequence; the HHsearch tool then searches the PDB70 database with the MSA to obtain templates for the query sequence. A series of features is extracted from the MSA and the templates and input into a network model for training. The network model comprises a triangular multiplication update module, an axial attention module and a convolutional residual module; the training process iterates for 50 generations to obtain a trained model, and the extracted features of a test protein are input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the distance intervals. The method effectively improves the accuracy of inter-residue distance prediction.

Description

Method for predicting distance between protein residues based on self-attention mechanism
Technical Field
The invention relates to the fields of bioinformatics and computer applications, and in particular to a method for predicting the distance between protein residues based on a self-attention mechanism.
Background
Proteins are biological macromolecules and the material basis of all vital activities. The function of a protein depends on its three-dimensional structure, so accurate prediction of protein structure plays a very important role in protein function annotation, drug design and the like. Traditional experimental methods for determining the tertiary structure of a protein, such as X-ray crystallography, nuclear magnetic resonance and cryo-electron microscopy, are time-consuming, labor-intensive and expensive, and for some special proteins the structure is difficult to determine experimentally at all. With the rapid development of sequencing technology over recent decades, great progress has been made in using computational algorithms to predict the tertiary structure of proteins de novo, starting from the amino acid sequence. However, the accuracy of de novo structure prediction is still limited by the inaccuracy of physical force-field energy functions and by insufficient conformational-space sampling capability.
With the development of deep learning in recent years, deep learning techniques have been used to predict the distance distribution between protein residues. Using this distribution as prior knowledge and as a scoring model to guide conformational search can reduce the errors caused by inaccurate force-field energy functions and effectively shrink the conformational search space.
Deep learning can perform representation learning on data. The residual convolutional neural network, a special deep neural network mainly applied to image processing, does not process all of the previous layer's data at once; it only processes the small region covered by a convolution kernel, so its receptive field is small and only local information is available for computing a target pixel. The attention mechanism developed in recent years can relate a target pixel to any other pixel and compute the target pixel as a weighted sum of all pixels, i.e. the receptive field becomes global. However, for multi-dimensional feature maps full attention is computationally heavy and inefficient. The axial attention mechanism solves this problem well: it fuses global information and preserves long-range dependencies while reducing computation. Axial attention applies the attention mechanism along the row direction or the column direction of the feature map, so the receptive field consists of the pixels in the same row or the same column as the target pixel, and combining row attention with column attention fuses global information well, as sketched below.
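To make the trade-off concrete, the following is a minimal sketch of axial attention over an L × L × C feature map, attending first within each row and then within each column; for an L × L map this reduces the cost from O(L⁴) for full 2-D attention to O(L³). The module layout and head count here are illustrative assumptions, not the patent's implementation.

```python
# Sketch: axial (row-then-column) self-attention on an (L, L, C) feature map.
# Illustrative only; layer sizes and head count are assumptions.
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (L, L, C)
        rows, _ = self.row_attn(x, x, x)       # each of the L rows attends within itself
        cols = rows.transpose(0, 1)            # view columns as sequences
        cols, _ = self.col_attn(cols, cols, cols)  # each column attends within itself
        return cols.transpose(0, 1)            # back to (L, L, C)

# Usage: y = AxialAttention(32)(torch.randn(64, 64, 32))
```

After the two passes, every position has indirectly seen every other position through its shared row and column, which is how global information is fused at reduced cost.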
Residue co-evolution is the main principle for predicting inter-residue distances. The currently widespread approach infers co-evolution indirectly, i.e. hand-crafted features such as covariance features are extracted from the multiple sequence alignment (MSA); this indirect strategy, however, often loses a large amount of information, making the predicted inter-residue distances inaccurate. Therefore, using the MSA Transformer and an outer-product aggregation algorithm, co-evolution information can be learned directly from the MSA to estimate inter-residue distances.
Disclosure of Invention
In order to overcome these deficiencies of the prior art, the invention provides a method for predicting the distance between protein residues based on a self-attention mechanism. The method first searches for a multiple sequence alignment and templates of the query sequence, then extracts a series of features from them and inputs the features into a network model for training; the network model mainly comprises a triangular multiplication update module, an axial attention module and a convolutional residual module, and a trained model is obtained after the training process iterates for 50 generations. The extracted features of a test protein are input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the distance intervals. The method effectively improves the accuracy of inter-residue distance prediction.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A method for predicting the inter-residue distances of a protein based on a self-attention mechanism, the method comprising the following steps:
1) constructing a data set: selecting M proteins with sequence lengths L of 30 to 300 from the SCOPe protein database, using a set sequence similarity threshold; the data set is divided into a training set and a validation set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair (Cα atoms for GLY); the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals; a sketch of this binning follows;
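As a concrete illustration of step 2), the sketch below bins pairwise Cβ distances into 37 classes. The exact range values in the patent appeared as equation images and did not survive extraction, so the 2 to 20 Å range and 0.5 Å bin width used here are assumptions following common practice; only the 36 + 1 bin count is fixed by the text.

```python
# Sketch: bin Cbeta-Cbeta distances into 37 classes (36 equal-width bins plus
# one bin for non-contacting pairs). D_MIN/D_MAX are assumed values.
import numpy as np

D_MIN, D_MAX, N_BINS = 2.0, 20.0, 36
edges = np.linspace(D_MIN, D_MAX, N_BINS + 1)        # 37 edges define 36 bins

def distance_labels(coords_cb: np.ndarray) -> np.ndarray:
    """coords_cb: (L, 3) Cbeta coordinates (Calpha for GLY)."""
    diff = coords_cb[:, None, :] - coords_cb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # (L, L) pairwise distances
    labels = np.digitize(dist, edges) - 1            # classes 0..35 inside the range
    labels[dist >= D_MAX] = N_BINS                   # class 36: non-contacting pair
    labels[dist < D_MIN] = 0                         # clamp anything below the range
    return labels                                    # (L, L) ints in [0, 36]
```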
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database with the HHsearch tool, using the MSA obtained in step 3), to obtain templates for the query sequence; a command-line sketch of steps 3) and 4) follows;
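Steps 3) and 4) can be driven from a script. The sketch below uses the HH-suite command-line tools; the database paths are placeholders for local installations, and the -id 90 / -cov 75 filters mirror the 90% identity and 75% coverage settings named above.

```python
# Sketch of steps 3)-4): build the MSA with HHblits, then rank templates with
# HHsearch. Database paths are placeholders, not the patent's actual paths.
import subprocess

def build_msa_and_templates(query_fasta: str) -> None:
    subprocess.run(["hhblits", "-i", query_fasta,
                    "-d", "uniclust30/uniclust30",   # Uniclust30 database (path assumed)
                    "-d", "bfd/bfd",                 # BFD database (path assumed)
                    "-oa3m", "query.a3m",            # MSA of homologous sequences
                    "-id", "90", "-cov", "75"],      # 90% max identity, 75% coverage
                   check=True)
    subprocess.run(["hhsearch", "-i", "query.a3m",
                    "-d", "pdb70/pdb70",             # PDB70 template database
                    "-o", "query.hhr"],              # hits ranked by homology probability
                   check=True)
```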
5) reading the length L of the query sequence;
6) the input features are extracted by the following process:
6.1) extracting MSA-based features: sequence profile, feature dimension L × 42; positional entropy, feature dimension L × 2; secondary structure, feature dimension L × 6; solvent accessibility, feature dimension L × 2; contact potential, feature dimension L × L × 1; MI (mutual information), feature dimension L × L × 1; MIP (mutual information product), feature dimension L × L × 1. Connecting these 7 features together gives a total feature dimension of L × L × 55; a sketch of the MI/MIP computation follows;
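Among the 6.1) features, MI and MIP can be computed directly from the MSA. The sketch below is a minimal unweighted version (no sequence weighting or pseudocounts, which a production pipeline would normally add), and it takes MIP as the average-product-corrected MI, the usual reading of that feature.

```python
# Sketch: mutual information (MI) and its average-product-corrected variant (MIP)
# between MSA columns. Minimal version; weighting/pseudocounts are omitted.
import numpy as np

def mi_mip(msa: np.ndarray):
    """msa: (N, L) integer array, residues encoded 0..20 (20 = gap)."""
    N, L = msa.shape
    A = 21
    f1 = np.stack([np.bincount(msa[:, i], minlength=A) / N for i in range(L)])
    mi = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            f2 = np.zeros((A, A))
            np.add.at(f2, (msa[:, i], msa[:, j]), 1.0 / N)   # joint frequencies
            nz = f2 > 0
            mi[i, j] = mi[j, i] = np.sum(
                f2[nz] * np.log(f2[nz] / (np.outer(f1[i], f1[j])[nz] + 1e-9)))
    col_mean = mi.sum(axis=1) / (L - 1)                      # mean MI per column
    mean_mi = mi.sum() / (L * (L - 1))                       # overall mean MI
    mip = mi - np.outer(col_mean, col_mean) / mean_mi        # APC correction
    return mi, mip
```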
6.2) randomly selecting N homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension N × L × 768 and row attention map features with feature dimension L × L × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension L × L × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension L × L × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension L × L × 1. Connecting the three kinds of features gives a total feature dimension of 10 × L × L × 10 (see the tiling sketch below);
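A sketch of the 6.3) tiling for a single template follows; the shapes match the dimensions above, while the variable names and the concatenation order are illustrative assumptions.

```python
# Sketch: tile per-template features into pairwise (L, L, 10) maps.
import numpy as np

def template_pair_features(scalars, one_d, dist_map, L):
    """scalars: (3,) probability/similarity/identity; one_d: (L, 3) per-residue
    scores; dist_map: (L, L) aligned template distances."""
    s = np.tile(scalars, (L, L, 1))                  # (L, L, 3) scalar channels
    row = np.repeat(one_d[:, None, :], L, axis=1)    # tile 1-D features down columns
    col = np.repeat(one_d[None, :, :], L, axis=0)    # tile 1-D features across rows
    o = np.concatenate([row, col], axis=-1)          # (L, L, 6) one-dimensional channels
    d = dist_map[..., None]                          # (L, L, 1) two-dimensional channel
    return np.concatenate([s, o, d], axis=-1)        # (L, L, 10) per template
```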
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; the weight w_k of each homologous sequence is then calculated by formula (1), and all homologous sequences in the weighted MSA are aggregated together; simultaneously, the outer product g_ij of the dimension-reduced MSA features and the weighted MSA features is calculated by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension L × L × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3), as sketched together with 7.1) below;

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
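The sketch below strings together formulas (1) to (3) for the 7.1)/7.2) preprocessing. Treating each sequence weight as a single scalar, flattening the outer product, and deferring the projection down to 256 channels are assumptions about details the text leaves open.

```python
# Sketch of 7.1)-7.2): sequence weights (1), weighted outer product (2),
# row-attention symmetrization (3). Weighting details are assumptions.
import torch

def pair_features(msa_feat, row_att, W_Q, W_K):
    """msa_feat: (N, L, 32) dimension-reduced MSA features; row_att: (L, L, 144)
    row attention maps; W_Q, W_K: (32, 32) weight matrices."""
    q = msa_feat[0]                                        # query sequence (L, 32)
    scores = torch.einsum("ld,nld->n", q @ W_Q, msa_feat @ W_K)
    w = torch.softmax(scores, dim=0)                       # formula (1): one weight per sequence
    agg = torch.einsum("n,nld->ld", w, msa_feat)           # weighted aggregation over the MSA
    g = torch.einsum("nid,nje,n->ijde", msa_feat, msa_feat, w)
    g = g.reshape(g.shape[0], g.shape[1], -1)              # formula (2): (L, L, 32*32) outer products
    sym = 0.5 * (row_att + row_att.transpose(0, 1))        # formula (3): symmetrized maps
    return g, agg, sym      # later concatenated/projected into the (L, L, 256) pair feature
```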
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is L × L × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model mainly consists of three parts. The first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer; a sketch of this part follows;
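A sketch of the 8.3) convolutional residual part is given below. The normalization type and dropout rate are assumptions, since the patent does not name them, and Softmax is left to the loss/inference stage because CrossEntropyLoss expects raw logits.

```python
# Sketch of the third-part convolutional residual network in 8.3).
# InstanceNorm and the 0.1 dropout rate are assumed, not specified by the patent.
import torch.nn as nn

def conv_block(c_in, c_out, k):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                         nn.InstanceNorm2d(c_out), nn.ELU())

class ResidualBlock(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        # two 3x3 convs, two norms, two ELUs, one Dropout, as in 8.3)
        self.body = nn.Sequential(conv_block(c, c, 3), nn.Dropout2d(0.1),
                                  conv_block(c, c, 3))
    def forward(self, x):
        return x + self.body(x)                          # skip connection

class DistanceHead(nn.Module):
    def __init__(self, c_in=519, c=64, n_bins=37, n_blocks=13):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(c_in, c, 1),                      # input layer: 64 1x1 kernels
            *[ResidualBlock(c) for _ in range(n_blocks)],  # 13 residual blocks
            nn.Conv2d(c, n_bins, 1))                     # output layer: 37 1x1 kernels
    def forward(self, x):                                # x: (B, 519, L, L)
        return self.net(x)                               # logits over the 37 bins
```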
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension L × L × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with an appropriate learning rate; m epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters; a sketch of this loop follows;
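A minimal sketch of the 9) training loop with 5-epoch early stopping follows. The learning rate and epoch budget are taken from the worked example below (0.0001 and 100 epochs); the data-loader and device handling are assumed.

```python
# Sketch of the training loop in 9): Adam + CrossEntropyLoss, early stopping
# after 5 epochs without validation improvement. Loader details are assumed.
import copy
import torch

def train(model, train_loader, val_loader, lr=1e-4, max_epochs=100, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best, best_state, bad = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for feats, labels in train_loader:       # feats: (B, 519, L, L), labels: (B, L, L)
            opt.zero_grad()
            loss_fn(model(feats), labels).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(f), y).item() for f, y in val_loader)
        if val < best:                           # keep the lowest-validation-loss weights
            best, best_state, bad = val, copy.deepcopy(model.state_dict()), 0
        else:
            bad += 1
            if bad >= patience:                  # 5 epochs without improvement
                break
    model.load_state_dict(best_state)
    return model
```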
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
Further, in step 1), 6890 proteins with sequence lengths L of 30 to 300 are selected as the data set using 30% sequence similarity as the threshold, with 90% of the proteins as the training set and 10% as the validation set.
The invention has the following beneficial effects: the MSA features and row attention maps produced by the pre-trained MSA Transformer model are used as input features, which better capture co-evolution information and the relations between residues in a sequence; in addition, template features are added, so that the inter-residue distance information of homologous templates makes the input richer and the prediction better. In the network, the triangular multiplication update keeps the predicted inter-residue distances mutually consistent, and the attention mechanism captures long-range interactions. Together, these improvements in features and network substantially raise the accuracy of inter-residue distance prediction.
Drawings
FIG. 1 is an overall flow chart of a method for predicting the distance between protein residues based on the self-attention mechanism.
FIG. 2 is a diagram of the process for preprocessing the MSA features obtained in 6.2).
FIG. 3 is a diagram of the template feature preprocessing process.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, 2 and 3, a method for predicting the distance between protein residues based on the self-attention mechanism includes the following steps:
1) constructing a data set: in an SCOPE protein database, with 30% sequence similarity as a threshold value, 6890 proteins with sequence length L of 30-300 are selected as a data set; wherein 90% of the protein is used as training set and 10% of the protein is used as validation set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair (Cα atoms for GLY); the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals;
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database by using the MSA obtained in the step 3) by using an HHsearch tool to obtain a template of a query sequence;
5) reading the length L of the query sequence;
6) extracting input features, the process is as follows:
6.1) extracting MSA-based features: sequence profile, feature dimension L × 42; positional entropy, feature dimension L × 2; secondary structure, feature dimension L × 6; solvent accessibility, feature dimension L × 2; contact potential, feature dimension L × L × 1; MI, feature dimension L × L × 1; MIP, feature dimension L × L × 1. Connecting these 7 features together gives a total feature dimension of L × L × 55;
6.2) randomly selecting N homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension N × L × 768 and row attention map features with feature dimension L × L × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension L × L × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension L × L × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension L × L × 1. Connecting the three kinds of features gives a total feature dimension of 10 × L × L × 10;
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; then calculating the weight of each homologous sequence by formula (1) and aggregating all homologous sequences in the weighted MSA, while calculating the outer product of the dimension-reduced and weighted MSA features by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension L × L × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3);

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is L × L × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model consists of three parts; the first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer;
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension L × L × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with an appropriate learning rate; m epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters;
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
This example takes the protein 2APN_A, with a sequence length of 114; the method for predicting the distance between protein residues based on self-attention and a residual network comprises the following steps:
1) constructing a data set: in an SCOPE protein database, with 30% sequence similarity as a threshold value, 6890 proteins with sequence length L of 30-300 are selected as a data set; wherein 90% of the protein is used as training set and 10% of the protein is used as validation set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair (Cα atoms for GLY); the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals;
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database by using the MSA obtained in the step 3) by using an HHsearch tool to obtain a template of a query sequence;
5) read query sequence length 114;
6) the input features are extracted by the following process:
6.1) extracting MSA-based features: sequence profile, feature dimension 114 × 42; positional entropy, feature dimension 114 × 2; secondary structure, feature dimension 114 × 6; solvent accessibility, feature dimension 114 × 2; contact potential, feature dimension 114 × 114 × 1; MI, feature dimension 114 × 114 × 1; MIP, feature dimension 114 × 114 × 1. Connecting these 7 features together gives a total feature dimension of 114 × 114 × 55;
6.2) randomly selecting 64 homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension 64 × 114 × 768 and row attention map features with feature dimension 114 × 114 × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension 114 × 114 × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension 114 × 114 × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension 114 × 114 × 1. Connecting the three kinds of features gives a total feature dimension of 10 × 114 × 114 × 10;
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; then calculating the weight of each homologous sequence by formula (1) and aggregating all homologous sequences in the weighted MSA, while calculating the outer product of the dimension-reduced and weighted MSA features by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension 114 × 114 × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3);

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is 114 × 114 × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model consists of three parts; the first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer;
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension 114 × 114 × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with the learning rate set to 0.0001; 100 epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters;
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
While the foregoing describes the preferred embodiments of the invention, it will be apparent that the invention is not limited to the described embodiments and can be practiced with modifications without departing from its basic spirit.

Claims (2)

1. A method for predicting the distance between protein residues based on the self-attention mechanism, comprising the steps of:
1) constructing a data set: selecting M proteins with the sequence length L of 30-300 as a data set by taking the set sequence similarity as a threshold value in an SCOPE protein database; dividing the data set into a training set and a verification set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair; the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals;
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database by using the MSA obtained in the step 3) by using a HHsearch tool to obtain a template of a query sequence;
5) reading the length L of a query sequence;
6) the input features are extracted by the following process:
6.1) extracting MSA-based features: sequence profile, feature dimension L × 42; positional entropy, feature dimension L × 2; secondary structure, feature dimension L × 6; solvent accessibility, feature dimension L × 2; contact potential, feature dimension L × L × 1; MI, feature dimension L × L × 1; MIP, feature dimension L × L × 1; concatenating the above 7 features together results in a total feature dimension of L × L × 55;
6.2) randomly selecting N homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension N × L × 768 and row attention map features with feature dimension L × L × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension L × L × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension L × L × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension L × L × 1. Connecting the three kinds of features gives a total feature dimension of 10 × L × L × 10;
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; then calculating the weight of each homologous sequence by formula (1) and aggregating all homologous sequences in the weighted MSA, while calculating the outer product of the dimension-reduced and weighted MSA features by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension L × L × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3);

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is L × L × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model consists of three parts; the first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer;
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension L × L × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with an appropriate learning rate; m epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters;
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
2. The method for predicting the distance between protein residues based on the self-attention mechanism according to claim 1, wherein in step 1), 6890 proteins with sequence lengths L of 30 to 300 are selected as the data set using 30% sequence similarity as the threshold, with 90% of the proteins as the training set and 10% as the validation set.
CN202210245505.9A (filed 2022-03-14, priority 2022-03-14): Method for predicting distance between protein residues based on self-attention mechanism, published as CN114708903A (en), status Withdrawn

Priority Applications (1)

Application Number: CN202210245505.9A; Priority/Filing Date: 2022-03-14; Title: Method for predicting distance between protein residues based on self-attention mechanism

Publications (1)

Publication Number: CN114708903A; Publication Date: 2022-07-05

Family

ID: 82169455

Family Applications (1)

Application Number: CN202210245505.9A; Filing Date: 2022-03-14; Title: Method for predicting distance between protein residues based on self-attention mechanism

Country Status (1)

Country: CN; Document: CN114708903A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115458039A (en) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Single-sequence protein structure prediction method and system based on machine learning
CN115458039B (en) * 2022-08-08 2023-10-10 北京分子之心科技有限公司 Method and system for predicting single-sequence protein structure based on machine learning
CN116206675A (en) * 2022-09-05 2023-06-02 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure
CN116206675B (en) * 2022-09-05 2023-09-15 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure
CN115527605A (en) * 2022-11-04 2022-12-27 南京理工大学 Antibody structure prediction method based on depth map model
CN115527605B (en) * 2022-11-04 2023-12-12 南京理工大学 Antibody structure prediction method based on depth map model
WO2024104490A1 (en) * 2022-11-18 2024-05-23 中国科学院深圳先进技术研究院 Protein residue contact prediction method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220705)