CN113223608A - Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement - Google Patents

Publication number
CN113223608A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110626528.XA
Other languages
Chinese (zh)
Inventor
苗洪江
汤一珂
沈邱难
薛贵荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tianran Intelligent Technology Co ltd
Original Assignee
Shanghai Tianran Intelligent Technology Co ltd
Application filed by Shanghai Tianran Intelligent Technology Co ltd filed Critical Shanghai Tianran Intelligent Technology Co ltd
Priority to CN202110626528.XA priority Critical patent/CN113223608A/en
Publication of CN113223608A publication Critical patent/CN113223608A/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a system for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement, comprising the following steps: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedding vectors, and inputting the embedding vectors into an attention-based multi-head neural network; extracting the interaction between each pair of amino acids in the protein sequence by applying attention separately on the sequence side and the residue side and having the two sides update each other; outputting, from one head of the multi-head neural network, a probability distribution over the distance between each pair of amino acid residues and, from another head, a probability distribution over the torsion angles of each amino acid residue; and, from these output distributions, obtaining by gradient descent the three-dimensional structure most consistent with the predicted distances and angles as the prediction result. The invention enables the network to extract global information more evenly and thereby to predict the overall three-dimensional structure of the protein more accurately.

Description

Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement
Technical Field
The invention relates to the field of computer software and bioinformatics, in particular to a protein three-dimensional structure prediction method and system based on double-layer mutual reinforcement of an attention mechanism.
Background
Proteins are the main actors in life activities, and many important biological processes involve proteins. A protein consists of one or more peptide chains formed by dehydration condensation of the 20 common amino acids. The three-dimensional structure of a protein determines its function. Predicting the three-dimensional structure of a protein from its amino acid sequence is a fundamental and unsolved problem in bioinformatics.
To date, methods for determining the three-dimensional structure of proteins fall into two main categories: determination by wet-lab experiments, and prediction from the protein sequence. Wet-lab methods include X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM), but they have obvious disadvantages: they are time-consuming, expensive, and limited in resolution. As a result, the speed of experimental structure determination cannot meet the needs of structure/function analysis at proteome scale, and lags far behind the speed of gene sequencing. According to the "self-assembly theory" proposed by Christian B. Anfinsen, Nobel laureate in Chemistry, the amino acid sequence of a polypeptide chain contains all the information necessary to form its thermodynamically stable native conformation. How to predict the spatial structure of a protein directly and accurately from its amino acid sequence is therefore a key problem in protein structure research.
Methods for predicting the three-dimensional structure of proteins directly from sequence fall into two main categories: template-based modeling and template-free de novo prediction (ab initio modeling). Because template-based modeling is limited by the number and quality of structural templates available in protein structure databases, de novo prediction is becoming increasingly important. Although these methods have achieved some success in protein structure prediction, their applicability is limited, and they still do not allow high-throughput, accurate structure prediction for most proteins.
Protein structure prediction helps to deepen human understanding of protein folding mechanisms. It is also of fundamental significance for the design of new proteins: when designing a new protein with a particular function or structure, structure prediction is a tool that shortens the design process. More effective methods are therefore urgently needed to overcome the shortcomings of traditional protein structure prediction.
In the past two years, de novo prediction methods have achieved unprecedented improvements in predicting the tertiary structure of proteins by using the latest deep learning techniques. Deep-learning-based protein structure prediction mainly reconstructs the structure by predicting two key attributes of the protein structure: the distances and the angles between amino acids.
(1) Distance: the distance here is the Euclidean distance D(i, j) between any two amino acid residues i and j in the amino acid sequence of the protein, generally measured in Å (angstroms).
(2) Angle: the angle here refers to the geometric angle between backbone atoms where one amino acid residue in the protein backbone is bonded to the next. Specifically, it is the torsion angle about the bonds between the C-alpha atom and the neighboring N and C atoms, generally referred to as the (φ, ψ) angles.
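The two quantities defined above can be illustrated with a minimal sketch using only the Python standard library. The function names are hypothetical and not taken from the patent. The torsion angle uses the standard four-point dihedral formula; for a residue i, φ is conventionally computed from the atoms C(i-1), N(i), Cα(i), C(i), and ψ from N(i), Cα(i), C(i), N(i+1).

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def distance_matrix(coords):
    """D(i, j): Euclidean distance between every pair of residue coordinates."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))
```

In this convention a cis arrangement of the four points gives 0 degrees and a trans arrangement gives a magnitude of 180 degrees.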
Current deep-learning-based protein three-dimensional structure prediction mainly comprises three steps: 1. constructing the input of a neural network from the amino acid sequence of the protein; 2. extracting structural information from the constructed input features with a deep neural network and predicting the distances and torsion angles between amino acid residues; 3. reconstructing the three-dimensional structure of the protein by gradient descent.
This process has three defects, which leave the prediction results mediocre and short of the accuracy needed for practical use. The first defect: the ResNet network structure is limited by its number of channels and places too much emphasis on local interactions in the amino acid sequence, so it cannot extract global structural information effectively enough. The second defect: the input of the neural network consists of hand-designed features, which greatly limits the network's ability to extract information and to filter out useless or wrong information. The third defect: the input of current neural networks relies entirely on co-evolution information derived from amino acid sequences and ignores potentially information-rich templates. These three defects greatly reduce the accuracy of protein three-dimensional structure prediction.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for predicting a three-dimensional protein structure by double-layer mutual reinforcement.
The invention provides a method for predicting a three-dimensional structure of a protein by double-layer mutual reinforcement, which comprises the following steps:
a data input step: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedding vectors, and inputting the embedding vectors into an attention-based multi-head neural network;
an extraction step: the attention-based multi-head neural network extracts the interaction between each pair of amino acids in the protein sequence by applying attention separately on the sequence side and the residue side and having the two sides update each other;
an output step: after the sequence side and the residue side have updated each other, one head of the multi-head neural network outputs a probability distribution over the distance between each pair of amino acid residues, and another head outputs a probability distribution over the torsion angles of each amino acid residue;
a prediction step: from the output distance and torsion-angle probability distributions, obtaining by gradient descent the three-dimensional structure most consistent with the predicted distances and angles as the prediction result.
Preferably, the MSA matrix containing protein co-evolution information comprises: a data set of database sequences similar to the protein sequence to be predicted, constructed by aligning and comparing a plurality of amino acid sequences column by column while taking into account evolutionary events such as mutation, insertion, deletion, and recombination.
Preferably, the protein template comprises: proteins of known structure in a protein structure database that are similar to the target protein.
Preferably, the attention-based multi-head neural network includes:
on the sequence side, distinguishing, within the embedding vector matrix M obtained from the input protein MSA matrix, sequences containing rich information from sequences containing irrelevant information, and using the M updated by the attention operation to update the residue-side embedding vector matrix E:
E' = f(E + W)
wherein E' is the updated residue-side embedding vector matrix, f is an activation function, and W is a weight matrix based on M:
[Formula image BDA0003101410450000031: definition of W in terms of M and D, not reproduced in this text]
wherein D is the depth of the protein MSA matrix.
Preferably, the attention-based multi-head neural network includes:
on the residue side, extracting the interaction between each pair of amino acid residues, and using the E updated by the attention operation to update the sequence-side embedding vector matrix M:
M' = α(M)
wherein M' is the updated sequence-side embedding vector matrix, and α is a weight matrix based on E:
α_ij = F(E_ij)
where F is an activation function.
The invention provides a protein three-dimensional structure prediction system with double layers mutually reinforced, which comprises:
a data input step: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedding vectors, and inputting the embedding vectors into an attention-based multi-head neural network;
an extraction step: the attention-based multi-head neural network extracts the interaction between each pair of amino acids in the protein sequence by applying attention separately on the sequence side and the residue side and having the two sides update each other;
an output step: after the sequence side and the residue side have updated each other, one head of the multi-head neural network outputs a probability distribution over the distance between each pair of amino acid residues, and another head outputs a probability distribution over the torsion angles of each amino acid residue;
a prediction step: from the output distance and torsion-angle probability distributions, obtaining by gradient descent the three-dimensional structure most consistent with the predicted distances and angles as the prediction result.
Preferably, the MSA matrix containing protein co-evolution information comprises: a data set of database sequences similar to the protein sequence to be predicted, constructed by aligning and comparing a plurality of amino acid sequences column by column while taking into account evolutionary events such as mutation, insertion, deletion, and recombination.
Preferably, the protein template comprises: proteins of known structure in a protein structure database that are similar to the target protein.
Preferably, the attention-based multi-head neural network includes:
on the sequence side, distinguishing, within the embedding vector matrix M obtained from the input protein MSA matrix, sequences containing rich information from sequences containing irrelevant information, and using the M updated by the attention operation to update the residue-side embedding vector matrix E:
E' = f(E + W)
wherein E' is the updated residue-side embedding vector matrix, f is an activation function, and W is a weight matrix based on M:
[Formula image BDA0003101410450000041: definition of W in terms of M and D, not reproduced in this text]
wherein D is the depth of the protein MSA matrix.
Preferably, the attention-based multi-head neural network includes:
on the residue side, extracting the interaction between each pair of amino acid residues, and using the E updated by the attention operation to update the sequence-side embedding vector matrix M:
M' = α(M)
wherein M' is the updated sequence-side embedding vector matrix, and α is a weight matrix based on E:
α_ij = F(E_ij)
where F is an activation function.
Compared with the prior art, the invention has the following beneficial effects:
The invention enables the network to extract global information more evenly and thereby to predict the overall three-dimensional structure of the protein more accurately. The invention directly uses a multiple sequence alignment (MSA) matrix containing protein co-evolution information as input, so that the network extracts more comprehensively the hidden-layer information needed to predict distances and angles. In addition, proteins similar to the target protein are searched for in a library of known protein structures (Protein Data Bank) according to the amino acid sequence of the target protein and used as templates. The templates serve as input features of the neural network, and the accuracy of distance and angle prediction is improved because the network learns from several kinds of input features during training.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of the present invention;
FIG. 2 is a schematic diagram of a neural network according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, and all of these fall within the scope of the present invention.
As shown in fig. 1, the method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement provided by the present invention comprises:
A data input step: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedding vectors, and inputting the embedding vectors into an attention-based multi-head neural network.
An extraction step: the attention-based multi-head neural network extracts the interaction between each pair of amino acids in the protein sequence by applying attention separately on the sequence side and the residue side and having the two sides update each other.
An output step: after the sequence side and the residue side have updated each other, one head of the multi-head neural network outputs a probability distribution over the distance between each pair of amino acid residues, and another head outputs a probability distribution over the torsion angles of each amino acid residue.
A prediction step: from the output distance and torsion-angle probability distributions, obtaining by gradient descent the three-dimensional structure most consistent with the predicted distances and angles as the prediction result.
The neural network input of the invention is an MSA matrix containing protein co-evolution information together with similar protein templates. Each is converted into embedding vectors (Embedding) that serve as the input of the network.
The protein MSA matrix is built by aligning and comparing a plurality of amino acid sequences column by column while taking into account evolutionary events such as mutation, insertion, deletion, and recombination, which yields a data set of database sequences similar to the protein sequence to be predicted. If a multiple sequence alignment is viewed as a two-dimensional table in which each row represents an amino acid sequence and each column represents a residue position, the sequences to be aligned are filled into the table according to the following rules: (a) the relative positions of all residues within a sequence remain unchanged; (b) identical or similar residues in different sequences are placed in the same column, so that they line up vertically across sequences as far as possible. In our procedure, HHblits and JackHMMER are used to search each metagenomic protein sequence database step by step, and the results are merged to obtain the deepest possible MSA matrix. Unlike the methods mentioned in the background, which construct hand-designed features from the MSA matrix as input, the invention encodes the protein MSA matrix directly and converts it into embedding vectors as input to the neural network.
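As a rough illustration of treating the alignment as a table and encoding it directly, the sketch below turns aligned rows into integer tokens and one-hot vectors. The patent does not specify the encoding, so the 21-symbol alphabet (20 amino acids plus a gap symbol), the function names, and the toy alignment are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 common amino acids
ALPHABET = AMINO_ACIDS + "-"           # assumption: "-" marks an alignment gap
TOKEN = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_msa(rows):
    """Turn aligned sequences into a D x L grid of integer tokens."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned rows must have the same number of columns")
    return [[TOKEN[ch] for ch in row] for row in rows]

def one_hot(tokens):
    """Expand a D x L token grid into D x L x 21 one-hot vectors."""
    size = len(ALPHABET)
    return [[[1.0 if k == t else 0.0 for k in range(size)] for t in row]
            for row in tokens]

# Toy alignment: 3 sequences (depth D = 3), 6 columns (length L = 6).
msa = ["MKV-LA", "MRVQLA", "MKI-LS"]
tokens = encode_msa(msa)
features = one_hot(tokens)
```

In a trained network the one-hot vectors would normally be multiplied by a learned embedding matrix; that step is omitted here.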
A template is a protein of known structure in a protein structure database (PDB) that is similar to the target protein. The template information found by the search, such as the amino acid sequence and the distance matrix, is converted into embedding vectors that serve as input to the neural network.
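One possible way to turn a template's distance matrix into discrete input features is to bin the pairwise distances. The sketch below is only an assumption about how such a feature could be built: the bin edges, the function names, and the use of C-alpha coordinates are hypothetical and are not stated in the patent.

```python
import math

def bin_index(distance, edges):
    """Index of the first bin whose upper edge exceeds the distance."""
    for k, edge in enumerate(edges):
        if distance < edge:
            return k
    return len(edges)            # overflow bin for anything beyond the last edge

def template_features(ca_coords, edges=(4.0, 8.0, 12.0, 16.0)):
    """L x L grid of distance-bin indices computed from template coordinates."""
    return [[bin_index(math.dist(p, q), edges) for q in ca_coords]
            for p in ca_coords]

# Toy template with three residues placed on a line (coordinates in angstroms).
grid = template_features([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

Each bin index can then be embedded in the same way as a sequence token.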
Referring to fig. 2, the neural network of the present invention is an innovative structure based on a multi-head attention mechanism (Multi-head Attention) that is updated cyclically between the sequence side (Sequence Edge) and the residue side (Residue Edge).
The invention uses a multi-head attention network to extract the interaction between each pair of amino acids in the protein sequence. Information extraction is carried out in a novel way: attention is applied separately on the sequence side and the residue side, and the two sides update each other. On the sequence side, the attention network can better distinguish, within the embedding vector matrix (M) obtained from the input protein multiple sequence alignment matrix, sequences that contain rich information from sequences that contain irrelevant information. The M updated by the attention operation is then used to update the residue-side embedding vector matrix (E):
E' = f(E + W)
where E' is the updated residue-side embedding vector matrix, f is an activation function, e.g., the commonly used ELU, and W is a weight matrix based on M:
[Formula image BDA0003101410450000071: definition of W in terms of M and D, not reproduced in this text]
wherein D is the depth of the protein multiple sequence alignment matrix.
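The update E' = f(E + W) can be sketched as follows. Because the formula for W appears only as an image in the source, the definition used here is an assumption: W_ij is taken as the average, over the D sequences, of the dot product between the embeddings at positions i and j. The function names and the scalar-valued E are likewise simplifications for illustration, with ELU as the activation f.

```python
import math

def elu(x):
    """ELU activation with alpha = 1."""
    return x if x > 0 else math.exp(x) - 1.0

def pair_weights(M):
    """Assumed W: mean over the D sequences of dot(M[d][i], M[d][j])."""
    D, L = len(M), len(M[0])
    return [[sum(sum(a * b for a, b in zip(M[d][i], M[d][j]))
                 for d in range(D)) / D
             for j in range(L)] for i in range(L)]

def update_residue_side(E, M, f=elu):
    """E' = f(E + W), applied element by element."""
    W = pair_weights(M)
    L = len(E)
    return [[f(E[i][j] + W[i][j]) for j in range(L)] for i in range(L)]

# Toy input: depth D = 2, length L = 2, embedding size 2.
M = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [1.0, 0.0]]]
E = [[0.0, -2.0], [-2.0, 0.0]]
E_new = update_residue_side(E, M)
```

Dividing by D keeps the scale of W independent of how many sequences the alignment contains, which is one plausible reason for the depth to appear in the formula.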
On the residue side, the attention network allows better extraction of the interaction between each pair of amino acid residues. The E updated by the attention operation is used to update the sequence-side embedding vector matrix (M):
M' = α(M)
wherein M' is the updated sequence-side embedding vector matrix. M' can be updated in several ways; here the update is performed as an attention operation, and α is a weight matrix based on E:
α_ij = F(E_ij)
where F is an activation function.
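A minimal sketch of the update M' = α(M) follows. The patent leaves F unspecified, so two assumptions are made here: F is a row-wise softmax over E, and α is applied as attention weights that mix the embeddings of all positions j into position i within each sequence. Function names are hypothetical.

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def update_sequence_side(M, E):
    """M'[d][i] = sum over j of alpha[i][j] * M[d][j], with alpha = softmax(E)."""
    alpha = [softmax(row) for row in E]      # assumed form of alpha_ij = F(E_ij)
    L, C = len(E), len(M[0][0])
    return [[[sum(alpha[i][j] * seq[j][c] for j in range(L)) for c in range(C)]
             for i in range(L)] for seq in M]

# Toy input: one sequence, two positions, embedding size 2.
M = [[[1.0, 0.0], [0.0, 1.0]]]
mixed = update_sequence_side(M, [[0.0, 0.0], [0.0, 0.0]])        # uniform weights
kept = update_sequence_side(M, [[50.0, -50.0], [-50.0, 50.0]])   # near-identity weights
```

With uniform scores every position receives the average embedding, while strongly diagonal scores leave M almost unchanged.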
Through this mutually updating network structure, the rich information contained in the protein multiple sequence alignment matrix and the similar templates is passed back and forth between the sequence side and the residue side, so that it is extracted thoroughly.
We adopt a two-headed neural network architecture. After the sequence side and the residue side have been updated, the back end of the network has two output heads: one head uses the preceding output information (M, E) to generate a probability distribution d(i, j) over the distance between each pair of amino acid residues, and the other head uses the preceding output information (M, E) to generate a probability distribution a(i) over the torsion angles of each amino acid residue. Our network can also have more than two output heads, simultaneously predicting structural or property features of proteins other than distance and angle.
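The two output heads can be sketched as a pair of softmax classifiers over discretized distance and angle bins. Everything below is an illustrative assumption: the bin counts, the weight matrices, and the simplification that the distance head reads only E and the angle head reads only the first row of M, whereas the patent states that both heads use (M, E).

```python
import math

def softmax(logits):
    """Normalize logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights):
    """One output per weight row; weights is a hypothetical learned matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def distance_head(E, w_dist):
    """d(i, j): a distribution over distance bins for every residue pair."""
    return [[softmax(linear(pair, w_dist)) for pair in row] for row in E]

def angle_head(M, w_angle):
    """a(i): a distribution over torsion-angle bins for every residue.

    Assumption: only the first MSA row (the target sequence) is read.
    """
    return [softmax(linear(res, w_angle)) for res in M[0]]

# Toy shapes: L = 2 residues, 2 channels, 3 distance bins, 4 angle bins.
E = [[[0.0, 0.0], [1.0, -1.0]],
     [[1.0, -1.0], [0.0, 0.0]]]
M = [[[0.5, 0.5], [2.0, 0.0]]]
w_dist = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_angle = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

d = distance_head(E, w_dist)   # L x L x 3
a = angle_head(M, w_angle)     # L x 4
```

Further heads for other structural properties would follow the same pattern, each with its own weight matrix.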
After the distances between amino acid residues and the torsion angles of the protein have been predicted, a gradient descent (Gradient Descent) method similar to that mentioned in 2.2.3 is applied to obtain the three-dimensional structure that best matches the predicted distances and angles. This structure is the predicted three-dimensional structure for the target amino acid sequence.
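The reconstruction step can be illustrated with plain gradient descent on residue coordinates. This sketch is a simplification under stated assumptions: it fits only distances (not torsion angles), it uses a single target distance per pair (for example the most probable bin of the predicted distribution), and the squared-error loss, learning rate, and function names are hypothetical rather than taken from the patent.

```python
import math

def loss_and_grad(coords, target):
    """Sum of squared distance errors over all pairs, with its gradient."""
    n = len(coords)
    loss = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = [a - b for a, b in zip(coords[i], coords[j])]
            dist = math.sqrt(sum(d * d for d in diff)) or 1e-9  # avoid dividing by zero
            err = dist - target[i][j]
            loss += err * err
            for k in range(3):
                g = 2.0 * err * diff[k] / dist
                grad[i][k] += g
                grad[j][k] -= g
    return loss, grad

def fold(coords, target, lr=0.05, steps=500):
    """Move the coordinates downhill until pair distances match the targets."""
    coords = [list(p) for p in coords]
    for _ in range(steps):
        _, grad = loss_and_grad(coords, target)
        for p, g in zip(coords, grad):
            for k in range(3):
                p[k] -= lr * g[k]
    return coords

# Toy problem: three residues that should all end up 3.8 angstroms apart.
side = 3.8
target = [[0.0 if i == j else side for j in range(3)] for i in range(3)]
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
folded = fold(start, target)
initial_loss, _ = loss_and_grad(start, target)
final_loss, _ = loss_and_grad(folded, target)
```

A fuller version would score candidate structures against the whole predicted distributions and add a torsion-angle term, but the optimization loop has the same shape.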
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for predicting a three-dimensional structure of a protein with double-layer mutual reinforcement is characterized by comprising the following steps:
a data input step: acquiring a multiple sequence alignment (MSA) matrix containing protein co-evolution information and a protein template, converting each into embedding vectors, and inputting the embedding vectors into an attention-based multi-head neural network;
an extraction step: the attention-based multi-head neural network extracts the interaction between each pair of amino acids in the protein sequence by applying attention separately on the sequence side and the residue side and having the two sides update each other;
an output step: after the sequence side and the residue side have updated each other, one head of the multi-head neural network outputs a probability distribution over the distance between each pair of amino acid residues, and another head outputs a probability distribution over the torsion angles of each amino acid residue;
a prediction step: from the output distance and torsion-angle probability distributions, obtaining by gradient descent the three-dimensional structure most consistent with the predicted distances and angles as the prediction result.
2. The method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement according to claim 1, wherein the MSA matrix containing protein co-evolution information comprises: a data set of database sequences similar to the protein sequence to be predicted, constructed by aligning and comparing a plurality of amino acid sequences column by column while taking into account evolutionary events such as mutation, insertion, deletion, and recombination.
3. The method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement according to claim 1, wherein the protein template comprises: proteins of known structure in a protein structure database that are similar to the target protein.
4. The method for predicting the three-dimensional structure of the protein based on the double-layer mutual reinforcement of claim 1, wherein the attention-based multi-headed neural network comprises:
on the sequence side, distinguishing, within the embedding vector matrix M obtained from the input protein MSA matrix, sequences containing rich information from sequences containing irrelevant information, and using the M updated by the attention operation to update the residue-side embedding vector matrix E:
E' = f(E + W)
wherein E' is the updated residue-side embedding vector matrix, f is an activation function, and W is a weight matrix based on M:
[Formula image FDA0003101410440000011: definition of W in terms of M and D, not reproduced in this text]
wherein D is the depth of the protein MSA matrix.
5. The method for predicting the three-dimensional structure of a protein by double-layer mutual reinforcement according to claim 4, wherein the attention-based multi-head neural network comprises:
on the residue side, extracting the interaction between each pair of amino acid residues, and using the E updated by the attention operation to update the sequence-side embedding vector matrix M:
M' = α(M)
wherein M' is the updated sequence-side embedding vector matrix, and α is a weight matrix based on E:
α_ij = F(E_ij)
where F is an activation function.
6. A double-layer mutual reinforcement system for predicting the three-dimensional structure of a protein, characterized by comprising:
a data input step: acquiring a multiple sequence alignment matrix containing protein co-evolution information and a protein template, converting each of them into embedding vectors, and inputting the embedding vectors into an attention-based multi-head neural network;
an extraction step: the attention-based multi-head neural network extracts the interaction relationship of each pair of amino acid sequences in the protein by processing the sequence side and the residue side separately and having the two sides update each other;
an output step: after the sequence side and the residue side have updated each other, one head of the multi-head neural network outputs the probability distribution of the distance between each pair of amino acid residues, and another head outputs the probability distribution of the torsion angles of each amino acid residue;
a prediction step: obtaining, by gradient descent, the three-dimensional structure most consistent with the predicted distances and angles as the prediction result, according to the output probability distribution of the distance between each pair of amino acid residues and the output probability distribution of the torsion angles of each amino acid residue.
7. The double-layer mutual reinforcement system for predicting the three-dimensional structure of a protein according to claim 6, wherein the multiple sequence alignment matrix containing protein co-evolution information comprises: a data set constructed by aligning and comparing, column by column, a plurality of amino acid sequences in the database that are similar to the sequence of the protein to be predicted, taking into account evolutionary events such as mutation, insertion, deletion, and recombination.
8. The system of claim 6, wherein the protein template comprises: proteins of known structure in a protein structure database that are similar to the target protein.
9. The double-layer mutual reinforcement system for predicting the three-dimensional structure of a protein according to claim 6, wherein the attention-based multi-head neural network comprises:
on the sequence side, distinguishing, among the sequences embedded in the embedding vector matrix M of the input protein multiple sequence alignment matrix, the informative sequences from the sequences containing irrelevant information, applying attention to update M, and using the updated M to update the embedding vector matrix E of the residue side:
E' = f(E + W)
wherein E' is the updated residue-side embedding vector matrix, f is the activation function, and W is the weight matrix based on M:
[Formula image FDA0003101410440000031]
wherein D is the depth of the protein multiple sequence alignment matrix.
10. The double-layer mutual reinforcement system for predicting the three-dimensional structure of a protein according to claim 9, wherein the attention-based multi-head neural network comprises:
on the residue side, extracting the interaction relationship between each pair of amino acid residues, applying attention to update E, and using the updated E to update the embedding vector matrix M of the sequence side as follows:
M' = α(M)
wherein M' is the updated sequence-side embedding vector matrix, and α is the weight matrix based on E:
α_ij = F(E_ij)
wherein F is the activation function.
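The prediction step of claim 6 recovers coordinates by gradient descent from the predicted distance distributions. The sketch below shows the idea on a toy example under stated simplifications that are not taken from the patent: each per-pair distribution is collapsed to its expected distance, the objective is a plain squared error rather than a likelihood-based potential, torsion angles are ignored, and each residue is a single point. The function names and the bin layout are hypothetical.

```python
import numpy as np

def expected_distance(probs, bin_centers):
    """Collapse per-pair distance distributions (L, L, B) to expected distances (L, L)."""
    return (probs * bin_centers).sum(axis=-1)

def loss_and_grad(X, target, eps=1e-9):
    """Squared-error loss between pairwise distances of X (L, 3) and a symmetric target (L, L)."""
    diff = X[:, None, :] - X[None, :, :]              # (L, L, 3)
    dist = np.sqrt((diff ** 2).sum(axis=-1) + eps)    # eps avoids division by zero
    err = dist - target
    np.fill_diagonal(err, 0.0)                        # no self-distance term
    loss = 0.5 * (err ** 2).sum()
    # Each unordered pair appears twice in the sum, hence the factor 2.
    grad = 2.0 * ((err / dist)[:, :, None] * diff).sum(axis=1)
    return loss, grad

def fit_coordinates(target, steps=2000, lr=0.01, seed=0):
    """Plain gradient descent from a random start; returns coordinates and loss history."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(target.shape[0], 3))
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(X, target)
        history.append(loss)
        X = X - lr * grad
    return X, history

# Toy demo: fabricate "predicted" distributions around the distances of a random chain.
rng = np.random.default_rng(1)
n_res = 6
true_X = rng.normal(size=(n_res, 3)) * 3.0
true_dist = np.linalg.norm(true_X[:, None] - true_X[None, :], axis=-1)
bin_centers = np.linspace(0.0, 20.0, 41)              # hypothetical 0.5-unit bins
logits = -((true_dist[..., None] - bin_centers) ** 2) / 2.0
probs = np.exp(logits)
probs /= probs.sum(axis=-1, keepdims=True)

target = expected_distance(probs, bin_centers)
X_fit, history = fit_coordinates(target)
print("loss before:", history[0], "loss after:", history[-1])
```

Because pairwise distances are unchanged by rotation, translation, and reflection, the fitted coordinates match the reference only up to those transformations; the torsion-angle head named in the claim is one source of the extra information needed to resolve the mirror image.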
CN202110626528.XA 2021-06-04 2021-06-04 Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement Pending CN113223608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110626528.XA CN113223608A (en) 2021-06-04 2021-06-04 Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110626528.XA CN113223608A (en) 2021-06-04 2021-06-04 Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement

Publications (1)

Publication Number Publication Date
CN113223608A (en) 2021-08-06

Family

ID=77082848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110626528.XA Pending CN113223608A (en) 2021-06-04 2021-06-04 Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement

Country Status (1)

Country Link
CN (1) CN113223608A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283878A (en) * 2021-08-27 2022-04-05 腾讯科技(深圳)有限公司 Method and apparatus for training matching model, predicting amino acid sequence and designing medicine
CN115312119A (en) * 2022-10-09 2022-11-08 之江实验室 Method and system for identifying protein structural domain based on protein three-dimensional structure image
CN115497553A (en) * 2022-09-29 2022-12-20 水木未来(杭州)科技有限公司 Protein three-dimensional structure modeling method and device, electronic device and storage medium
TWI804229B (en) * 2021-09-27 2023-06-01 美商圖策智能科技有限公司 Method and system for estimating protein binding free energy based on protein mutation prediction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689918A (en) * 2019-09-24 2020-01-14 上海宽慧智能科技有限公司 Method and system for predicting tertiary structure of protein
CN110796252A (en) * 2019-10-30 2020-02-14 上海天壤智能科技有限公司 Prediction method and system based on double-head or multi-head neural network
CN111667884A (en) * 2020-06-12 2020-09-15 天津大学 Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
CN112085247A (en) * 2020-07-22 2020-12-15 浙江工业大学 Protein residue contact prediction method based on deep learning
CN112233723A (en) * 2020-10-26 2021-01-15 上海天壤智能科技有限公司 Protein structure prediction method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
H. ZHU, I. YOSHIHARA AND K. YAMAMORI: "Prediction of protein secondary structure by multi-modal neural networks", 《2002 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS》 *
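赵志山: "Research on protein secondary structure prediction based on deep learning", 《China Excellent Doctoral Dissertations Full-text Database (Master's), Basic Sciences Series》 *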

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283878A (en) * 2021-08-27 2022-04-05 腾讯科技(深圳)有限公司 Method and apparatus for training matching model, predicting amino acid sequence and designing medicine
TWI804229B (en) * 2021-09-27 2023-06-01 美商圖策智能科技有限公司 Method and system for estimating protein binding free energy based on protein mutation prediction
CN115497553A (en) * 2022-09-29 2022-12-20 水木未来(杭州)科技有限公司 Protein three-dimensional structure modeling method and device, electronic device and storage medium
CN115312119A (en) * 2022-10-09 2022-11-08 之江实验室 Method and system for identifying protein structural domain based on protein three-dimensional structure image
US11908140B1 (en) 2022-10-09 2024-02-20 Zhejiang Lab Method and system for identifying protein domain based on protein three-dimensional structure image

Similar Documents

Publication Publication Date Title
CN113223608A (en) Method and system for predicting three-dimensional structure of protein by double-layer mutual reinforcement
CN112233723B (en) Protein structure prediction method and system based on deep learning
CN114446383B (en) Quantum calculation-based ligand-protein interaction prediction method
Li et al. Protein loop modeling using deep generative adversarial network
Nguyen et al. New deep learning methods for protein loop modeling
Alipanahi et al. Error tolerant NMR backbone resonance assignment and automated structure generation
CN113257357B (en) Protein residue contact map prediction method
Skolnick et al. Computational studies of protein folding
CN112085245B (en) Protein residue contact prediction method based on depth residual neural network
Liu et al. Refinepocket: An attention-enhanced and mask-guided deep learning approach for protein binding site prediction
US6370479B1 (en) Method and apparatus for extracting and evaluating mutually similar portions in one-dimensional sequences in molecules and/or three-dimensional structures of molecules
CN116453584A (en) Protein three-dimensional structure prediction method and system
CN116758978A (en) Controllable attribute totally new active small molecule design method based on protein structure
CN116189776A (en) Antibody structure generation method based on deep learning
CN115588463A (en) Prediction method for mining protein interaction type based on deep learning
Einav et al. Quantitatively visualizing bipartite datasets
JP3867863B2 (en) 3D structure processing equipment
Kurniawan et al. Prediction of protein tertiary structure using pre-trained self-supervised learning based on transformer
Garayalde et al. Real‐time topology optimization via learnable mappings
Atasever et al. 3-State Protein Secondary Structure Prediction based on SCOPe Classes
Chen et al. Contactlib-att: a structure-based search engine for homologous proteins
Sarkar et al. Scalable Adaptive Protein Ensemble Refinement Integrating Flexible Fitting
Du et al. From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network
Bao et al. Discover the Binding Domain of Transmembrane Proteins Based on Structural Universality
Liu et al. Neuproofreader: An Interactive Proofreading System with Suggestive Prompts for Connectomics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210806