CN114708903A - Method for predicting distance between protein residues based on self-attention mechanism - Google Patents

Method for predicting distance between protein residues based on self-attention mechanism

Info

Publication number: CN114708903A
Application number: CN202210245505.9A
Authority: CN (China)
Prior art keywords: sequence, MSA, layer, attention, features
Legal status: Withdrawn
Other languages: Chinese (zh)
Inventors: Zhang Guijun (张贵军), Zhang Fujin (张福金), Zhao Kailong (赵凯龙), Li Zhangwei (李章维)
Current Assignee: Zhejiang University of Technology ZJUT
Original Assignee: Zhejiang University of Technology ZJUT
Filed by Zhejiang University of Technology ZJUT, priority to CN202210245505.9A

Classifications

    • G16B15/20: ICT specially adapted for analysing two- or three-dimensional molecular structures; protein or domain folding
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G16B30/10: Sequence alignment; homology search
    • G16B40/00: ICT specially adapted for biostatistics; bioinformatics-related machine learning or data mining
    • G16B50/30: Data warehousing; computing architectures

Abstract

A method for predicting the distance between protein residues based on a self-attention mechanism. The HHblits tool is first used to search the Uniclust30 and BFD sequence databases, with a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment (MSA) composed of homologous sequences of the query sequence; the HHsearch tool then searches the PDB70 database with the MSA to obtain templates for the query sequence. A series of features is extracted from the MSA and the templates and input into a network model for training. The network model comprises a triangular multiplication update module, an axial attention module and a convolutional residual module; the training process iterates for 50 generations to obtain a trained model, and the extracted features of a test protein are input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the distance intervals. The method effectively improves the accuracy of inter-residue distance prediction.

Description

Method for predicting distance between protein residues based on self-attention mechanism
Technical Field
The invention relates to the fields of bioinformatics and computer applications, and in particular to a method for predicting the distance between protein residues based on a self-attention mechanism.
Background
Proteins are biological macromolecules and the material basis of all vital activities. The function of a protein depends on its three-dimensional structure, so accurate prediction of protein structure plays a very important role in protein function annotation, drug design and the like. Traditional experimental methods for determining the tertiary structure of a protein, such as X-ray crystallography, nuclear magnetic resonance and cryo-electron microscopy, are time-consuming, labor-intensive and expensive, and for some special proteins the structure is difficult to determine experimentally at all. With the rapid development of sequencing technology over recent decades, great progress has been made in using computational algorithms to predict the tertiary structure of proteins de novo, starting from the amino acid sequence. However, the accuracy of de novo structure prediction is still limited by the inaccuracy of physical force-field energy functions and by insufficient conformational-space sampling capability.
With the development of deep learning in recent years, deep learning techniques have been used to predict the distance distribution between protein residues. Using this distribution as prior knowledge and as a scoring model to guide conformational search can reduce the errors caused by inaccurate force-field energy functions and effectively shrink the conformational search space.
Deep learning can perform representation learning on data. The residual convolutional neural network, a special deep neural network mainly applied to image processing, does not process all of the previous layer's data at once; it only processes the small region covered by a convolution kernel, so its receptive field is small and only local information is available for computing a target pixel. The attention mechanism developed in recent years can relate a target pixel to any other pixel and compute the target pixel as a weighted sum of all pixels, i.e. the receptive field becomes global. However, for multi-dimensional feature maps full attention is computationally heavy and inefficient. The axial attention mechanism solves this problem well: it fuses global information and preserves long-range dependencies while reducing computation. Axial attention applies the attention mechanism along the row direction or the column direction of the feature map, so the receptive field consists of the pixels in the same row or the same column as the target pixel, and combining row attention with column attention fuses global information well, as sketched below.
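To make the trade-off concrete, the following is a minimal sketch of axial attention over an L × L × C feature map, attending first within each row and then within each column; for an L × L map this reduces the cost from O(L⁴) for full 2-D attention to O(L³). The module layout and head count here are illustrative assumptions, not the patent's implementation.

```python
# Sketch: axial (row-then-column) self-attention on an (L, L, C) feature map.
# Illustrative only; layer sizes and head count are assumptions.
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (L, L, C)
        rows, _ = self.row_attn(x, x, x)       # each of the L rows attends within itself
        cols = rows.transpose(0, 1)            # view columns as sequences
        cols, _ = self.col_attn(cols, cols, cols)  # each column attends within itself
        return cols.transpose(0, 1)            # back to (L, L, C)

# Usage: y = AxialAttention(32)(torch.randn(64, 64, 32))
```

After the two passes, every position has indirectly seen every other position through its shared row and column, which is how global information is fused at reduced cost.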
Residue co-evolution is the main principle for predicting inter-residue distances. The currently widespread approach infers co-evolution indirectly, i.e. hand-crafted features such as covariance features are extracted from the multiple sequence alignment (MSA); this indirect strategy, however, often loses a large amount of information, making the predicted inter-residue distances inaccurate. Therefore, using the MSA Transformer and an outer-product aggregation algorithm, co-evolution information can be learned directly from the MSA to estimate inter-residue distances.
Disclosure of Invention
In order to overcome these deficiencies of the prior art, the invention provides a method for predicting the distance between protein residues based on a self-attention mechanism. The method first searches for a multiple sequence alignment and templates of the query sequence, then extracts a series of features from them and inputs the features into a network model for training; the network model mainly comprises a triangular multiplication update module, an axial attention module and a convolutional residual module, and a trained model is obtained after the training process iterates for 50 generations. The extracted features of a test protein are input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the distance intervals. The method effectively improves the accuracy of inter-residue distance prediction.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A method for predicting the inter-residue distances of a protein based on a self-attention mechanism, the method comprising the following steps:
1) constructing a data set: selecting M proteins with sequence lengths L of 30 to 300 from the SCOPe protein database, using a set sequence similarity threshold; the data set is divided into a training set and a validation set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair (Cα atoms for GLY); the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals; a sketch of this binning follows;
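As a concrete illustration of step 2), the sketch below bins pairwise Cβ distances into 37 classes. The exact range values in the patent appeared as equation images and did not survive extraction, so the 2 to 20 Å range and 0.5 Å bin width used here are assumptions following common practice; only the 36 + 1 bin count is fixed by the text.

```python
# Sketch: bin Cbeta-Cbeta distances into 37 classes (36 equal-width bins plus
# one bin for non-contacting pairs). D_MIN/D_MAX are assumed values.
import numpy as np

D_MIN, D_MAX, N_BINS = 2.0, 20.0, 36
edges = np.linspace(D_MIN, D_MAX, N_BINS + 1)        # 37 edges define 36 bins

def distance_labels(coords_cb: np.ndarray) -> np.ndarray:
    """coords_cb: (L, 3) Cbeta coordinates (Calpha for GLY)."""
    diff = coords_cb[:, None, :] - coords_cb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # (L, L) pairwise distances
    labels = np.digitize(dist, edges) - 1            # classes 0..35 inside the range
    labels[dist >= D_MAX] = N_BINS                   # class 36: non-contacting pair
    labels[dist < D_MIN] = 0                         # clamp anything below the range
    return labels                                    # (L, L) ints in [0, 36]
```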
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database with the HHsearch tool, using the MSA obtained in step 3), to obtain templates for the query sequence; a command-line sketch of steps 3) and 4) follows;
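Steps 3) and 4) can be driven from a script. The sketch below uses the HH-suite command-line tools; the database paths are placeholders for local installations, and the -id 90 / -cov 75 filters mirror the 90% identity and 75% coverage settings named above.

```python
# Sketch of steps 3)-4): build the MSA with HHblits, then rank templates with
# HHsearch. Database paths are placeholders, not the patent's actual paths.
import subprocess

def build_msa_and_templates(query_fasta: str) -> None:
    subprocess.run(["hhblits", "-i", query_fasta,
                    "-d", "uniclust30/uniclust30",   # Uniclust30 database (path assumed)
                    "-d", "bfd/bfd",                 # BFD database (path assumed)
                    "-oa3m", "query.a3m",            # MSA of homologous sequences
                    "-id", "90", "-cov", "75"],      # 90% max identity, 75% coverage
                   check=True)
    subprocess.run(["hhsearch", "-i", "query.a3m",
                    "-d", "pdb70/pdb70",             # PDB70 template database
                    "-o", "query.hhr"],              # hits ranked by homology probability
                   check=True)
```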
5) reading the length L of the query sequence;
6) the input features are extracted by the following process:
6.1) extracting MSA-based features: sequence profile, feature dimension L × 42; positional entropy, feature dimension L × 2; secondary structure, feature dimension L × 6; solvent accessibility, feature dimension L × 2; contact potential, feature dimension L × L × 1; MI (mutual information), feature dimension L × L × 1; MIP (mutual information product), feature dimension L × L × 1. Connecting these 7 features together gives a total feature dimension of L × L × 55; a sketch of the MI/MIP computation follows;
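Among the 6.1) features, MI and MIP can be computed directly from the MSA. The sketch below is a minimal unweighted version (no sequence weighting or pseudocounts, which a production pipeline would normally add), and it takes MIP as the average-product-corrected MI, the usual reading of that feature.

```python
# Sketch: mutual information (MI) and its average-product-corrected variant (MIP)
# between MSA columns. Minimal version; weighting/pseudocounts are omitted.
import numpy as np

def mi_mip(msa: np.ndarray):
    """msa: (N, L) integer array, residues encoded 0..20 (20 = gap)."""
    N, L = msa.shape
    A = 21
    f1 = np.stack([np.bincount(msa[:, i], minlength=A) / N for i in range(L)])
    mi = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            f2 = np.zeros((A, A))
            np.add.at(f2, (msa[:, i], msa[:, j]), 1.0 / N)   # joint frequencies
            nz = f2 > 0
            mi[i, j] = mi[j, i] = np.sum(
                f2[nz] * np.log(f2[nz] / (np.outer(f1[i], f1[j])[nz] + 1e-9)))
    col_mean = mi.sum(axis=1) / (L - 1)                      # mean MI per column
    mean_mi = mi.sum() / (L * (L - 1))                       # overall mean MI
    mip = mi - np.outer(col_mean, col_mean) / mean_mi        # APC correction
    return mi, mip
```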
6.2) randomly selecting N homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension N × L × 768 and row attention map features with feature dimension L × L × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension L × L × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension L × L × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension L × L × 1. Connecting the three kinds of features gives a total feature dimension of 10 × L × L × 10 (see the tiling sketch below);
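A sketch of the 6.3) tiling for a single template follows; the shapes match the dimensions above, while the variable names and the concatenation order are illustrative assumptions.

```python
# Sketch: tile per-template features into pairwise (L, L, 10) maps.
import numpy as np

def template_pair_features(scalars, one_d, dist_map, L):
    """scalars: (3,) probability/similarity/identity; one_d: (L, 3) per-residue
    scores; dist_map: (L, L) aligned template distances."""
    s = np.tile(scalars, (L, L, 1))                  # (L, L, 3) scalar channels
    row = np.repeat(one_d[:, None, :], L, axis=1)    # tile 1-D features down columns
    col = np.repeat(one_d[None, :, :], L, axis=0)    # tile 1-D features across rows
    o = np.concatenate([row, col], axis=-1)          # (L, L, 6) one-dimensional channels
    d = dist_map[..., None]                          # (L, L, 1) two-dimensional channel
    return np.concatenate([s, o, d], axis=-1)        # (L, L, 10) per template
```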
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; the weight w_k of each homologous sequence is then calculated by formula (1), and all homologous sequences in the weighted MSA are aggregated together; simultaneously, the outer product g_ij of the dimension-reduced MSA features and the weighted MSA features is calculated by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension L × L × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3), as sketched together with 7.1) below;

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
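The sketch below strings together formulas (1) to (3) for the 7.1)/7.2) preprocessing. Treating each sequence weight as a single scalar, flattening the outer product, and deferring the projection down to 256 channels are assumptions about details the text leaves open.

```python
# Sketch of 7.1)-7.2): sequence weights (1), weighted outer product (2),
# row-attention symmetrization (3). Weighting details are assumptions.
import torch

def pair_features(msa_feat, row_att, W_Q, W_K):
    """msa_feat: (N, L, 32) dimension-reduced MSA features; row_att: (L, L, 144)
    row attention maps; W_Q, W_K: (32, 32) weight matrices."""
    q = msa_feat[0]                                        # query sequence (L, 32)
    scores = torch.einsum("ld,nld->n", q @ W_Q, msa_feat @ W_K)
    w = torch.softmax(scores, dim=0)                       # formula (1): one weight per sequence
    agg = torch.einsum("n,nld->ld", w, msa_feat)           # weighted aggregation over the MSA
    g = torch.einsum("nid,nje,n->ijde", msa_feat, msa_feat, w)
    g = g.reshape(g.shape[0], g.shape[1], -1)              # formula (2): (L, L, 32*32) outer products
    sym = 0.5 * (row_att + row_att.transpose(0, 1))        # formula (3): symmetrized maps
    return g, agg, sym      # later concatenated/projected into the (L, L, 256) pair feature
```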
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is L × L × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model mainly consists of three parts. The first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer; a sketch of this part follows;
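A sketch of the 8.3) convolutional residual part is given below. The normalization type and dropout rate are assumptions, since the patent does not name them, and Softmax is left to the loss/inference stage because CrossEntropyLoss expects raw logits.

```python
# Sketch of the third-part convolutional residual network in 8.3).
# InstanceNorm and the 0.1 dropout rate are assumed, not specified by the patent.
import torch.nn as nn

def conv_block(c_in, c_out, k):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                         nn.InstanceNorm2d(c_out), nn.ELU())

class ResidualBlock(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        # two 3x3 convs, two norms, two ELUs, one Dropout, as in 8.3)
        self.body = nn.Sequential(conv_block(c, c, 3), nn.Dropout2d(0.1),
                                  conv_block(c, c, 3))
    def forward(self, x):
        return x + self.body(x)                          # skip connection

class DistanceHead(nn.Module):
    def __init__(self, c_in=519, c=64, n_bins=37, n_blocks=13):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(c_in, c, 1),                      # input layer: 64 1x1 kernels
            *[ResidualBlock(c) for _ in range(n_blocks)],  # 13 residual blocks
            nn.Conv2d(c, n_bins, 1))                     # output layer: 37 1x1 kernels
    def forward(self, x):                                # x: (B, 519, L, L)
        return self.net(x)                               # logits over the 37 bins
```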
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension L × L × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with an appropriate learning rate; m epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters; a sketch of this loop follows;
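A minimal sketch of the 9) training loop with 5-epoch early stopping follows. The learning rate and epoch budget are taken from the worked example below (0.0001 and 100 epochs); the data-loader and device handling are assumed.

```python
# Sketch of the training loop in 9): Adam + CrossEntropyLoss, early stopping
# after 5 epochs without validation improvement. Loader details are assumed.
import copy
import torch

def train(model, train_loader, val_loader, lr=1e-4, max_epochs=100, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best, best_state, bad = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for feats, labels in train_loader:       # feats: (B, 519, L, L), labels: (B, L, L)
            opt.zero_grad()
            loss_fn(model(feats), labels).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(f), y).item() for f, y in val_loader)
        if val < best:                           # keep the lowest-validation-loss weights
            best, best_state, bad = val, copy.deepcopy(model.state_dict()), 0
        else:
            bad += 1
            if bad >= patience:                  # 5 epochs without improvement
                break
    model.load_state_dict(best_state)
    return model
```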
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
Further, in step 1), 6890 proteins with sequence lengths L of 30 to 300 are selected as the data set using 30% sequence similarity as the threshold, with 90% of the proteins as the training set and 10% as the validation set.
The invention has the following beneficial effects: the MSA features and row attention maps produced by the pre-trained MSA Transformer model are used as input features, which better capture co-evolution information and the relations between residues in a sequence; in addition, template features are added, so that the inter-residue distance information of homologous templates makes the input richer and the prediction better. In the network, the triangular multiplication update keeps the predicted inter-residue distances mutually consistent, and the attention mechanism captures long-range interactions. Together, these improvements in features and network substantially raise the accuracy of inter-residue distance prediction.
Drawings
FIG. 1 is an overall flow chart of a method for predicting the distance between protein residues based on the self-attention mechanism.
FIG. 2 is a diagram of the process for preprocessing the MSA features obtained in 6.2).
FIG. 3 is a diagram of the template feature preprocessing process.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, 2 and 3, a method for predicting the distance between protein residues based on the self-attention mechanism includes the following steps:
1) constructing a data set: in an SCOPE protein database, with 30% sequence similarity as a threshold value, 6890 proteins with sequence length L of 30-300 are selected as a data set; wherein 90% of the protein is used as training set and 10% of the protein is used as validation set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair (Cα atoms for GLY); the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals;
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database by using the MSA obtained in the step 3) by using an HHsearch tool to obtain a template of a query sequence;
5) reading the length L of the query sequence;
6) extracting input features, the process is as follows:
6.1) extracting MSA-based features: sequence profile, feature dimension L × 42; positional entropy, feature dimension L × 2; secondary structure, feature dimension L × 6; solvent accessibility, feature dimension L × 2; contact potential, feature dimension L × L × 1; MI, feature dimension L × L × 1; MIP, feature dimension L × L × 1. Connecting these 7 features together gives a total feature dimension of L × L × 55;
6.2) randomly selecting N homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension N × L × 768 and row attention map features with feature dimension L × L × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension L × L × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension L × L × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension L × L × 1. Connecting the three kinds of features gives a total feature dimension of 10 × L × L × 10;
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; then calculating the weight of each homologous sequence by formula (1) and aggregating all homologous sequences in the weighted MSA, while calculating the outer product of the dimension-reduced and weighted MSA features by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension L × L × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3);

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is L × L × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model consists of three parts; the first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer;
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension L × L × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with an appropriate learning rate; m epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters;
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
This example takes the protein 2APN_A, with a sequence length of 114; the method for predicting the distance between protein residues based on self-attention and a residual network comprises the following steps:
1) constructing a data set: in an SCOPE protein database, with 30% sequence similarity as a threshold value, 6890 proteins with sequence length L of 30-300 are selected as a data set; wherein 90% of the protein is used as training set and 10% of the protein is used as validation set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair (Cα atoms for GLY); the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals;
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database by using the MSA obtained in the step 3) by using an HHsearch tool to obtain a template of a query sequence;
5) read query sequence length 114;
6) the input features are extracted by the following process:
6.1) extracting MSA-based features: sequence profile, feature dimension 114 × 42; positional entropy, feature dimension 114 × 2; secondary structure, feature dimension 114 × 6; solvent accessibility, feature dimension 114 × 2; contact potential, feature dimension 114 × 114 × 1; MI, feature dimension 114 × 114 × 1; MIP, feature dimension 114 × 114 × 1. Connecting these 7 features together gives a total feature dimension of 114 × 114 × 55;
6.2) randomly selecting 64 homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension 64 × 114 × 768 and row attention map features with feature dimension 114 × 114 × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension 114 × 114 × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension 114 × 114 × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension 114 × 114 × 1. Connecting the three kinds of features gives a total feature dimension of 10 × 114 × 114 × 10;
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; then calculating the weight of each homologous sequence by formula (1) and aggregating all homologous sequences in the weighted MSA, while calculating the outer product of the dimension-reduced and weighted MSA features by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension 114 × 114 × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3);

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is 114 × 114 × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model consists of three parts; the first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer;
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension 114 × 114 × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with the learning rate set to 0.0001; 100 epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters;
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
While the foregoing describes the preferred embodiments of the invention, it will be apparent that the invention is not limited to the described embodiments and can be practiced with modifications without departing from its basic spirit.

Claims (2)

1. A method for predicting the distance between protein residues based on the self-attention mechanism, comprising the steps of:
1) constructing a data set: selecting M proteins with the sequence length L of 30-300 as a data set by taking the set sequence similarity as a threshold value in an SCOPE protein database; dividing the data set into a training set and a verification set;
2) making label data: for each protein in the data set, calculating the Euclidean distance between the Cβ atoms of each residue pair; the distance between each residue pair is divided equally into 36 intervals over a fixed range, with one additional interval representing non-contacting residue pairs, for a total of 37 intervals;
3) obtaining a multiple sequence alignment file: searching the Uniclust30 and BFD sequence databases with the HHblits tool, using a maximum sequence similarity of 90% and a coverage of 75%, to obtain a multiple sequence alignment composed of homologous sequences of the query sequence, namely the MSA;
4) acquiring a template file: searching the PDB70 database by using the MSA obtained in the step 3) by using a HHsearch tool to obtain a template of a query sequence;
5) reading the length L of a query sequence;
6) the input features are extracted by the following process:
6.1) extracting MSA-based features: sequence profile, feature dimension L × 42; positional entropy, feature dimension L × 2; secondary structure, feature dimension L × 6; solvent accessibility, feature dimension L × 2; contact potential, feature dimension L × L × 1; MI, feature dimension L × L × 1; MIP, feature dimension L × L × 1; concatenating the above 7 features together results in a total feature dimension of L × L × 55;
6.2) randomly selecting N homologous sequences from the MSA and inputting them into the pre-trained MSA Transformer network model to obtain MSA features with feature dimension N × L × 768 and row attention map features with feature dimension L × L × 144;
6.3) selecting the top 10 templates according to the template homology probability ranking output by HHsearch, and extracting the following features from them: scalar features composed of the homology probability, sequence similarity and sequence identity, with feature dimension L × L × 3 after horizontal and vertical tiling; one-dimensional features composed of the positional similarity, secondary structure score and alignment confidence score, tiled along rows and columns and concatenated, with feature dimension L × L × 6; and the distances between aligned template-structure residue pairs, i.e. a two-dimensional feature, with feature dimension L × L × 1. Connecting the three kinds of features gives a total feature dimension of 10 × L × L × 10;
7) performing characteristic pretreatment, wherein the process is as follows:
7.1) reducing the MSA features obtained in 6.2) from 768 to 32 dimensions through three linear layers with 256, 64 and 32 neurons respectively, to reduce memory consumption; then calculating the weight of each homologous sequence by formula (1) and aggregating all homologous sequences in the weighted MSA, while calculating the outer product of the dimension-reduced and weighted MSA features by formula (2); the outer-product result, the weighted aggregation result and the query sequence of the dimension-reduced MSA are concatenated to obtain a new pair feature with feature dimension L × L × 256;
w_k = softmax[(q·W^Q)(k·W^K)^T]    (1)

wherein q represents the query sequence, k represents the other homologous sequences in the MSA, W^Q and W^K represent weight matrices, and T represents transposition;
g_ij = Σ_k X_k(i) ⊗ (w_k·X_k(j))    (2)

wherein X_k(i) denotes the i-th residue of the k-th homologous sequence in the MSA, X_k(j) denotes the j-th residue of the k-th homologous sequence, and ⊗ denotes the outer product;
7.2) the row attention map features obtained in 6.2) are symmetrized by formula (3);

F'_row_att = (F_row_att + F_row_att^T)/2    (3)

wherein F_row_att denotes the row attention map features and F_row_att^T denotes their transpose;
7.3) the template features obtained in 6.3) are processed by one axial attention layer, and the weight of each template is then calculated; the feature dimension obtained after weighted aggregation is L × L × 64;
8) building a network model, wherein the process is as follows:
8.1) the network model consists of three parts; the first part is a triangular multiplication update network, which updates the information of residue pair ij using the information of residue pairs ik and jk, and is composed of Linear layers and sigmoid gating layers;
8.2) the second part is an axial attention network consisting of 4 attention blocks; each attention block consists of a row attention layer, a column attention layer and a feed-forward layer, and each attention layer has 8 attention heads;
8.3) the third part is a convolutional residual network composed of an input layer, 13 convolutional residual blocks and an output layer. The input layer consists of a two-dimensional convolutional layer with 64 1 × 1 convolution kernels, a normalization layer and an ELU layer; each residual block consists of two two-dimensional convolutional layers with 3 × 3 kernels, two normalization layers, two ELU layers and one Dropout layer; the output layer consists of a two-dimensional convolutional layer with 37 1 × 1 kernels and a Softmax layer;
9) training model parameters: the features obtained in 6.1) and the processed features from 7.1), 7.2) and 7.3) are concatenated and fused into a feature of dimension L × L × 519, which is input into the network model. During training, cross-entropy (CrossEntropyLoss) is used as the loss function and Adam as the optimizer, with an appropriate learning rate; m epochs are set in total, and each epoch runs over the entire training set. Training stops when the validation loss does not decrease for 5 consecutive epochs, and the model parameters giving the lowest validation loss are taken as the final parameters;
10) predicting inter-residue distances: features are extracted for the test set and input into the trained model to obtain the probability distribution of the distance between each protein residue pair over the intervals.
2. The method for predicting the distance between protein residues based on the self-attention mechanism according to claim 1, wherein in step 1), 6890 proteins with sequence lengths L of 30 to 300 are selected as the data set using 30% sequence similarity as the threshold, with 90% of the proteins as the training set and 10% as the validation set.
CN202210245505.9A (filed 2022-03-14, priority 2022-03-14): Method for predicting distance between protein residues based on self-attention mechanism, published as CN114708903A (en), status Withdrawn

Priority Applications (1)

Application Number: CN202210245505.9A; Priority/Filing Date: 2022-03-14; Title: Method for predicting distance between protein residues based on self-attention mechanism

Publications (1)

Publication Number: CN114708903A; Publication Date: 2022-07-05

Family

ID: 82169455

Family Applications (1)

Application Number: CN202210245505.9A; Filing Date: 2022-03-14; Title: Method for predicting distance between protein residues based on self-attention mechanism

Country Status (1)

Country: CN; Document: CN114708903A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115458039A (en) * 2022-08-08 2022-12-09 北京分子之心科技有限公司 Single-sequence protein structure prediction method and system based on machine learning
CN115458039B (en) * 2022-08-08 2023-10-10 北京分子之心科技有限公司 Method and system for predicting single-sequence protein structure based on machine learning
CN116206675A (en) * 2022-09-05 2023-06-02 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure
CN116206675B (en) * 2022-09-05 2023-09-15 北京分子之心科技有限公司 Method, apparatus, medium and program product for predicting protein complex structure
CN115527605A (en) * 2022-11-04 2022-12-27 南京理工大学 Antibody structure prediction method based on depth map model
CN115527605B (en) * 2022-11-04 2023-12-12 南京理工大学 Antibody structure prediction method based on depth map model
WO2024104490A1 (en) * 2022-11-18 2024-05-23 中国科学院深圳先进技术研究院 Protein residue contact prediction method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220705)