CN113223608A - Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement

Info

Publication number: CN113223608A
Application number: CN202110626528.XA
Authority: CN
Inventors: 苗洪江; 汤一珂; 沈邱难; 薛贵荣
Original assignee: Shanghai Tianran Intelligent Technology Co ltd
Current assignee: Shanghai Tianran Intelligent Technology Co ltd
Priority date: 2021-06-04
Filing date: 2021-06-04
Publication date: 2021-08-06

Abstract

The invention provides a method and a system for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement. The method comprises the following steps: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedded vectors, and inputting the embedded vectors into a multi-head neural network based on an attention mechanism; extracting the interaction relation of each pair of amino acids in the protein sequence by operating on the sequence edge and the residue edge separately and letting the two update each other; outputting, from one head of the multi-head neural network, the probability distribution of the distance between each pair of amino acid residues and, from the other head, the probability distribution of the torsion angle of each amino acid residue; and, from these two output distributions, obtaining by a gradient descent method the three-dimensional structure most consistent with the predicted distances and angles as the prediction result. The invention enables the network to extract global information more evenly, thereby predicting the overall three-dimensional structure of the protein more accurately.

Description

Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement

Technical Field

The invention relates to the field of computer software and bioinformatics, in particular to a protein three-dimensional structure prediction method and system based on double-layer mutual reinforcement of an attention mechanism.

Background

Proteins are the main agents of life activities, and many important life processes in organisms involve proteins. A protein consists of peptide chains formed by the dehydration condensation of the 20 common amino acids. The three-dimensional structure of a protein determines its function. Predicting the three-dimensional structure of a protein from its amino acid sequence is a fundamental and unsolved problem in bioinformatics.

To date, methods for obtaining the three-dimensional structure of proteins fall into two main categories: determination by wet-lab experiments, and prediction from the protein sequence. Wet-lab methods include X-ray crystallography, nuclear magnetic resonance (NMR) and cryo-electron microscopy (cryo-EM), but these methods have obvious disadvantages: they are time-consuming and expensive, and their resolution is limited. As a result, the speed of experimental structure determination cannot meet the demand for structure/function analysis at proteome scale, and falls even further behind the speed of gene sequencing. According to the "self-assembly theory" proposed by the Nobel laureate in Chemistry Christian B. Anfinsen, the amino acid sequence of a polypeptide chain contains all the information necessary to form its thermodynamically stable native conformation. How to predict the spatial structure of a protein directly and accurately from its amino acid sequence is therefore a key problem in the study of protein structure.

Methods for predicting the three-dimensional structure of a protein directly from its sequence fall into two main categories: template-based modeling, and template-free de novo prediction (ab initio modeling). De novo prediction is becoming increasingly important because template-based modeling is limited by the number and quality of the structural templates available in protein structure databases. Although these methods have achieved some success in protein structure prediction, their applicability is limited and they still do not allow high-throughput, accurate structure prediction for most proteins.

Protein structure prediction not only helps to deepen human understanding of protein folding mechanisms; it is also of fundamental significance for the design of new proteins. When designing a new protein with a particular function or structure, structure prediction is undoubtedly a tool that shortens the design process. More effective methods are therefore urgently needed to overcome the shortcomings of traditional protein structure prediction.

In the past two years, de novo prediction methods have achieved unprecedented improvements in predicting the tertiary structure of proteins thanks to the latest deep learning techniques. Deep-learning-based protein structure prediction mainly reconstructs the structure by predicting two key attributes: the distances and the angles between amino acids in the protein structure.

(1) Distance: the distance here is the Euclidean distance D(i, j) between any two amino acid residues i and j in the amino acid sequence of the protein; it is generally expressed in Å (angstroms).
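As an illustration of the distance attribute, the following minimal sketch computes the matrix D(i, j) from one three-dimensional point per residue. The coordinates are hypothetical values chosen for easy checking, and the function name `distance_matrix` is an assumption introduced here, not something named in the patent.

```python
import numpy as np

def distance_matrix(coords: np.ndarray) -> np.ndarray:
    """Return D with D[i, j] = Euclidean distance between residues i and j."""
    # Broadcasting builds every pairwise difference vector at once: shape (L, L, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical coordinates in angstroms, one representative atom per residue.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0],
                   [3.0, 4.0, 12.0]])
D = distance_matrix(coords)
```

The result is a symmetric L x L matrix with zeros on the diagonal, which is the shape of target that a distance-predicting network is trained to reproduce.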

(2) Angle: the angle here refers to the geometric angle between backbone atoms formed when one amino acid residue in the protein backbone is bonded to the next. Specifically, it is the torsion angle between the C-alpha atom and the adjacent N and C atoms, generally referred to as the (φ, ψ) angles.
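A torsion angle is defined by four consecutive atoms. The sketch below shows one common way to compute it with NumPy; the function name `torsion_angle`, the sign convention and the toy coordinates are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def torsion_angle(p0, p1, p2, p3) -> float:
    """Torsion (dihedral) angle in degrees defined by four points."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)          # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# Toy geometry (not real protein coordinates): a planar "cis" arrangement
# and an arrangement twisted by a quarter turn around the central bond.
cis = torsion_angle([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
quarter = torsion_angle([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
```

For a backbone torsion angle, the four points would be the consecutive backbone atoms that span the bond being rotated.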

The current protein three-dimensional structure prediction based on deep learning mainly comprises three steps: 1. constructing input of a neural network according to the amino acid sequence of the protein; 2. extracting structural information from the constructed input features through a deep neural network, and predicting the distance/torsion angle between amino acid residues; 3. the reconstruction of the three-dimensional structure of the protein is carried out by a gradient descent method.

Three defects exist in this process, so the prediction results are mediocre and fall short of the accuracy needed for practical use. First, the ResNet network structure is limited by its number of channels and places too much emphasis on local interactions in the amino acid sequence, so it cannot extract global structural information effectively enough. Second, the input of the neural network consists of hand-designed features, which greatly limits the network's ability to extract information and to filter out useless or erroneous information. Third, the input of current neural networks relies entirely on co-evolutionary information derived from amino acid sequences, ignoring potential templates that carry rich information. These three defects significantly reduce the accuracy of protein three-dimensional structure prediction.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method and a system for predicting a three-dimensional protein structure by double-layer mutual reinforcement.

The invention provides a method for predicting a three-dimensional structure of a protein by double-layer mutual reinforcement, which comprises the following steps:

a data input step: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedded vectors, and inputting the embedded vectors into a multi-head neural network based on an attention mechanism;

an extraction step: the attention-based multi-head neural network extracts the interaction relation of each pair of amino acids in the protein sequence by operating on the sequence edge and the residue edge separately and letting the two update each other;

an output step: after the sequence edge and the residue edge have updated each other, one head of the multi-head neural network outputs the probability distribution of the distance between each pair of amino acid residues, and the other head outputs the probability distribution of the torsion angle of each amino acid residue;

a prediction step: from the output probability distribution of the distance between each pair of amino acid residues and the output probability distribution of the torsion angle of each amino acid residue, obtaining by a gradient descent method the three-dimensional structure most consistent with the predicted distances and angles as the prediction result.

Preferably, the multiple sequence alignment (MSA) matrix containing protein co-evolution information is obtained as follows: taking into account evolutionary events such as mutation, insertion, deletion and recombination, a plurality of amino acid sequences are aligned and compared column by column, so as to build, from the database, a set of sequences similar to the protein sequence to be predicted.

Preferably, the protein template comprises: proteins of known structure in the protein structure database that are similar to the target protein.

Preferably, the attention-based multi-headed neural network includes:

on the sequence edge, distinguishing, within the embedded vector matrix M of the input protein MSA matrix, sequences rich in information from sequences containing irrelevant information; attention operates on M, and the updated M is used to update the residue-edge embedded vector matrix E:

E' = f(E + W)

wherein E' is the updated residue-edge embedded vector matrix, f is the activation function, and W is the weight matrix based on M:

wherein D is the depth of the protein multi-sequence arrangement matrix.

Preferably, the attention-based multi-headed neural network includes:

on the residue edge, extracting the interaction relation between each pair of amino acid residues; attention operates on E, and the updated E is used to update the sequence-edge embedded vector matrix M:

M' = α(M)

wherein M' is the updated sequence-edge embedded vector matrix, and α is the weight matrix based on E:

α_ij = F(E_ij)

where F is the activation function.

The invention provides a protein three-dimensional structure prediction system with double layers mutually reinforced, which comprises:

Preferably, the attention-based multi-headed neural network includes:

E’＝f(E+W)

wherein D is the depth of the protein multi-sequence arrangement matrix.

Preferably, the attention-based multi-headed neural network includes:

M’＝α(M)

α_ij＝F(E_ij)

where F is the activation function.

Compared with the prior art, the invention has the following beneficial effects:

The invention enables the network to extract global information more evenly, thereby predicting the overall three-dimensional structure of the protein more accurately. The invention directly uses a multiple sequence alignment (MSA) matrix containing protein co-evolution information as input, so that the network itself extracts, more comprehensively, the hidden-layer information needed to predict distances and angles. In addition, proteins similar to the target protein are retrieved from the library of known protein structures (Protein Data Bank) according to the target amino acid sequence and used as templates; the templates serve as input features of the neural network, and the accuracy of distance and angle prediction is improved because the network learns from multiple kinds of input features during training.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a schematic diagram of the present invention;

FIG. 2 is a schematic diagram of a neural network according to the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, and all such changes and modifications fall within the scope of the present invention.

As shown in FIG. 1, the method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement provided by the present invention comprises:

A data input step: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedded vectors, and inputting the embedded vectors into a multi-head neural network based on an attention mechanism.

An extraction step: the attention-based multi-head neural network extracts the interaction relation of each pair of amino acids in the protein sequence by operating on the sequence edge and the residue edge separately and letting the two update each other.

An output step: after the sequence edge and the residue edge have updated each other, one head of the multi-head neural network outputs the probability distribution of the distance between each pair of amino acid residues, and the other head outputs the probability distribution of the torsion angle of each amino acid residue.

The input of the neural network of the invention is a multiple sequence alignment (MSA) matrix containing protein co-evolution information together with protein templates similar to the target protein; each kind of information is converted into embedded vectors (embeddings) that serve as the input of the network.

The protein multiple sequence alignment (MSA) matrix is obtained by aligning and comparing a plurality of amino acid sequences column by column while taking into account evolutionary events such as mutation, insertion, deletion and recombination, so as to build, from the database, a set of sequences similar to the protein sequence to be predicted. If a multiple sequence alignment is viewed as a two-dimensional table in which each row represents an amino acid sequence and each column represents a residue position, the sequences to be aligned are filled into the table according to the following rules: (a) the relative order of all residues within a sequence remains unchanged; (b) identical or similar residues in different sequences are placed in the same column as far as possible, so that they line up vertically across the sequences. In our procedure, HHblits and JackHMMER are used to search each metagenomic protein sequence database step by step, and the results are merged to obtain the deepest possible multiple sequence alignment matrix. Unlike the method described in the Background, which builds hand-designed features from the MSA matrix as input, the invention directly encodes the protein MSA matrix and converts it into embedded vectors as the input of the neural network.
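The two-dimensional table view of an alignment, and its direct encoding into embedded vectors, can be sketched as follows. The patent does not specify the encoding, so everything here is an assumption: the 21-symbol alphabet (20 amino acids plus a gap), the toy aligned sequences, the helper names `encode_msa` and `embed`, and the random lookup table that stands in for a learned embedding.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"            # 20 amino acids plus the gap symbol
INDEX = {aa: k for k, aa in enumerate(ALPHABET)}

def encode_msa(rows):
    """Turn aligned sequences (equal length) into an integer table of shape (D, L)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return np.array([[INDEX[aa] for aa in row] for row in rows])

def embed(tokens, table):
    """Look up one embedding vector per table cell: result has shape (D, L, C)."""
    return table[tokens]

# Toy alignment: each row is a sequence, each column a residue position.
msa = ["MKTAYIA",
       "MKSAY-A",
       "MRTA-IA"]
tokens = encode_msa(msa)

rng = np.random.default_rng(0)
table = rng.normal(size=(len(ALPHABET), 8))   # stand-in for a learned embedding table
M = embed(tokens, table)                      # sequence-edge embedding, shape (3, 7, 8)
```

In this sketch the first axis of `M` is the depth of the alignment (number of sequences) and the second axis is the residue position, matching the row and column roles described above.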

A template is a protein of known structure in the protein structure database (PDB) that is similar to the target protein. The information of the retrieved protein templates, such as the amino acid sequence and the distance matrix, is converted into embedded vectors that serve as input of the neural network.

Referring to fig. 2, the neural network of the present invention is an innovative structure based on a Multi-head Attention mechanism (Multi-head Attention) and cyclically updated at Sequence Edge and Residue Edge.

The invention adopts a multi-head attention network to extract the interaction relation of each pair of amino acids in the protein sequence, and the information extraction is completed in a novel way: the network operates on the sequence edge and the residue edge separately, and the two update each other. On the sequence edge, the attention network can better distinguish, within the embedded vector matrix (M) of the input protein MSA matrix, sequences rich in information from sequences containing irrelevant information. Attention operates on M, and the updated M is then used to update the residue-edge embedded vector matrix (E):

E' = f(E + W)

where E' is the updated residue-edge embedded vector matrix, f is the activation function, e.g., the commonly used ELU, and W is the weight matrix based on M:

wherein D is the depth of the protein multi-sequence arrangement matrix.
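The update E' = f(E + W) can be sketched numerically as below. The defining formula for W is not reproduced in this text, so the sketch substitutes a hypothetical one: W_ij is the inner product of the embeddings at positions i and j, averaged over the D sequences of the alignment. That choice, the array shapes and the function names are assumptions for illustration only; f is taken to be ELU, as the description suggests.

```python
import numpy as np

def elu(x, a=1.0):
    """Exponential linear unit, used here as the activation f."""
    return np.where(x > 0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))

def weight_from_msa(M):
    """ASSUMED form of W (not from the patent): depth-averaged inner product.

    M has shape (D, L, C); the result has shape (L, L).
    """
    depth = M.shape[0]
    return np.einsum("dic,djc->ij", M, M) / depth

def update_residue_edge(E, M):
    """E' = f(E + W), with W computed from the sequence-edge embedding M."""
    return elu(E + weight_from_msa(M))

rng = np.random.default_rng(0)
depth, length, channels = 4, 5, 8
M = rng.normal(size=(depth, length, channels))   # sequence-edge embedding
E = np.zeros((length, length))                   # residue-edge embedding (one channel)
E_new = update_residue_edge(E, M)
```

Whatever the exact form of W, the point of the step is the same: information gathered across the rows of the alignment is injected into the pairwise (residue-by-residue) representation.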

On the residue edge, the attention network allows better extraction of the interaction between each pair of amino acid residues. Attention operates on E, and the updated E is then used to update the sequence-edge embedded vector matrix (M):

M' = α(M)

where M' is the updated sequence-edge embedded vector matrix. M' can be updated in various ways; here the update is performed by attention. α is the weight matrix based on E:

α_ij = F(E_ij)

where F is the activation function.
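A minimal sketch of the update M' = α(M) with α_ij = F(E_ij) follows. The patent only states that F is an activation function; the sketch assumes, for illustration, that F is a softmax over j so that each row of α sums to one, and that applying α to M means taking, for every sequence, an α-weighted mixture over residue positions. The shapes and function names are likewise assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_sequence_edge(M, E):
    """M'[d, i] = sum_j alpha[i, j] * M[d, j], with alpha = F(E) (ASSUMED softmax)."""
    alpha = softmax(E, axis=-1)                       # shape (L, L), rows sum to 1
    return np.einsum("ij,djc->dic", alpha, M), alpha

rng = np.random.default_rng(0)
depth, length, channels = 4, 5, 8
M = rng.normal(size=(depth, length, channels))        # sequence-edge embedding
E = np.zeros((length, length))                        # residue-edge embedding
M_new, alpha = update_sequence_edge(M, E)
```

With E set to zero, every residue attends equally to all positions, so each updated vector is simply the mean over positions; a trained E would instead concentrate the weights on residue pairs that interact.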

Through this mutually updating network structure, the rich information contained in the protein MSA matrix and in the similar templates is passed back and forth between the sequence edge and the residue edge, so that it is fully extracted.

We adopt a two-headed neural network architecture. After the sequence edge and the residue edge have been updated, the back end of the network has two output heads: one head uses the preceding output (M, E) to generate the probability distribution d(i, j) of the distance between each pair of amino acid residues, and the other head uses the preceding output (M, E) to generate the probability distribution a(i) of the torsion angle of each amino acid residue. The network can also be extended to more output heads, simultaneously predicting other structural or property features of the protein beyond distance and angle.
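The two output heads can be sketched as two small projections followed by a softmax over discrete bins. This is a simplified illustration under stated assumptions: the patent says each head uses (M, E), whereas here the distance head reads only E and the angle head reads only the first row of M (assumed to be the target sequence); the bin counts, random weights and function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_head(E, W_dist):
    """d(i, j): distribution over distance bins for every residue pair."""
    logits = E @ W_dist                                    # (L, L, n_dist_bins)
    logits = 0.5 * (logits + logits.transpose(1, 0, 2))    # enforce d(i, j) == d(j, i)
    return softmax(logits)

def angle_head(M, W_ang):
    """a(i): distribution over torsion-angle bins for every residue."""
    return softmax(M[0] @ W_ang)                           # (L, n_ang_bins)

rng = np.random.default_rng(0)
n_seq, n_res, n_ch = 4, 6, 8
n_dist_bins, n_ang_bins = 10, 12                           # assumed bin counts
M = rng.normal(size=(n_seq, n_res, n_ch))                  # sequence-edge output
E = rng.normal(size=(n_res, n_res, n_ch))                  # residue-edge output
d = distance_head(E, rng.normal(size=(n_ch, n_dist_bins)))
a = angle_head(M, rng.normal(size=(n_ch, n_ang_bins)))
```

Further heads for other structural or property features would follow the same pattern: one more projection of the shared representation onto the bins of the quantity being predicted.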

After the inter-residue distances and torsion angles of the protein have been predicted, a gradient descent method similar to the one mentioned in the Background is applied to obtain the three-dimensional structure that best matches the predicted distances and angles; this structure is the predicted three-dimensional structure for the target amino acid sequence.
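The reconstruction step can be illustrated with a plain gradient descent that moves one point per residue until the pairwise distances match a target matrix. This is a deliberately reduced sketch, not the patent's procedure: it assumes a single target distance per residue pair (for example, a point estimate taken from each predicted distribution), ignores the torsion angles, and uses a squared-error objective, step size and function name chosen here for illustration.

```python
import numpy as np

def fit_coordinates(target, n_steps=500, lr=0.01, seed=0, eps=1e-9):
    """Gradient descent on sum over pairs of (|x_i - x_j| - target_ij)^2."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(target.shape[0], 3))       # random starting structure
    losses = []
    for _ in range(n_steps):
        diff = X[:, None, :] - X[None, :, :]        # (L, L, 3) difference vectors
        dist = np.sqrt((diff ** 2).sum(axis=-1) + eps)
        err = dist - target
        np.fill_diagonal(err, 0.0)                  # a residue has no distance to itself
        losses.append(0.5 * (err ** 2).sum())       # each pair appears twice, hence 0.5
        # Gradient w.r.t. x_i: 2 * sum_j err_ij * (x_i - x_j) / dist_ij
        grad = 2.0 * ((err / dist)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X, losses

# Hypothetical target distances, derived from made-up coordinates (angstroms).
true_xyz = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0],
                     [0.0, 3.8, 0.0], [0.0, 3.8, 3.8]])
target = np.linalg.norm(true_xyz[:, None, :] - true_xyz[None, :, :], axis=-1)
X, losses = fit_coordinates(target)
```

Because distances are unchanged by rotation, translation and reflection, the recovered coordinates are only defined up to those transformations; in a fuller treatment the torsion-angle terms would contribute to the objective as well.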

Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A method for predicting a three-dimensional structure of a protein with double-layer mutual reinforcement is characterized by comprising the following steps:

2. The method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement according to claim 1, wherein the multiple sequence alignment (MSA) matrix containing protein co-evolution information is obtained as follows: taking into account evolutionary events such as mutation, insertion, deletion and recombination, a plurality of amino acid sequences are aligned and compared column by column, so as to build, from the database, a set of sequences similar to the protein sequence to be predicted.

3. The method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement according to claim 1, wherein the protein template comprises: proteins of known structure in the protein structure database that are similar to the target protein.

4. The method for predicting the three-dimensional structure of the protein based on the double-layer mutual reinforcement of claim 1, wherein the attention-based multi-headed neural network comprises:

E’＝f(E+W)

wherein D is the depth of the protein multi-sequence arrangement matrix.

5. The method for predicting the three-dimensional structure of the protein based on the double-layer mutual reinforcement of claim 4, wherein the attention-based multi-headed neural network comprises:

M’＝α(M)

α_ij＝F(E_ij)

where F is the activation function.

6. A protein three-dimensional structure prediction system with double layers mutually enhanced is characterized by comprising:

7. The system for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement according to claim 6, wherein the multiple sequence alignment (MSA) matrix containing protein co-evolution information is obtained as follows: taking into account evolutionary events such as mutation, insertion, deletion and recombination, a plurality of amino acid sequences are aligned and compared column by column, so as to build, from the database, a set of sequences similar to the protein sequence to be predicted.

8. The system of claim 6, wherein the protein template comprises: proteins of known structure in the protein structure database that are similar to the target protein.

9. The system for predicting the three-dimensional structure of a protein based on the double-layer mutual reinforcement of claim 6, wherein the attention-based multi-headed neural network comprises:

E’＝f(E+W)

wherein D is the depth of the protein multi-sequence arrangement matrix.

10. The system for predicting the three-dimensional structure of a protein based on two-layer mutual reinforcement according to claim 9, wherein the attention-based multi-headed neural network comprises:

M’＝α(M)

α_ij＝F(E_ij)

where F is the activation function.