CN116386724A - Method and device for predicting protein interaction, electronic device and storage medium - Google Patents


Info

Publication number
CN116386724A
CN116386724A
Authority
CN
China
Prior art keywords
network
protein
vector
sub
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310219211.3A
Other languages
Chinese (zh)
Inventor
杨森
程鹏
舒文杰
王升启
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
Academy of Military Medical Sciences AMMS of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Military Medical Sciences AMMS of PLA filed Critical Academy of Military Medical Sciences AMMS of PLA
Priority to CN202310219211.3A priority Critical patent/CN116386724A/en
Publication of CN116386724A publication Critical patent/CN116386724A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention provides a protein interaction prediction method, an apparatus, an electronic device and a storage medium, comprising the following steps: acquiring a first amino acid sequence corresponding to a first protein and a second amino acid sequence corresponding to a second protein; and performing protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result. The target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network and a prediction sub-network, and the protein prediction result is used to characterize the probability of interaction between the first protein and the second protein. The method can predict whether an interaction occurs between proteins efficiently and accurately, and can also predict protein interactions across species.

Description

Method and device for predicting protein interaction, electronic device and storage medium
Technical Field
The present invention relates to the field of protein interaction prediction technology, and in particular, to a method, an apparatus, an electronic device, and a storage medium for predicting protein interaction.
Background
The prediction of interactions between proteins helps reveal the life-activity processes of cells and is an important means of mining and validating markers such as functional genes from massive data. Conventional protein prediction techniques include wet experimental methods and computational methods. Yeast two-hybrid, co-immunoprecipitation and fluorescence resonance energy transfer are typical wet experimental methods, but they generally require large amounts of sample and highly purified proteins, and are excessively time- and cost-intensive. Advances in computational biology and bioinformatics provide new approaches for exploring protein interactions, but existing computational approaches often suffer from the out-of-distribution (OOD) problem, which makes it difficult to make accurate predictions for proteins of unknown species.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a method, an apparatus, an electronic device and a storage medium for predicting protein interactions, which can efficiently and accurately predict whether interactions occur between proteins, and which can also predict interactions between proteins of different species.
In a first aspect, embodiments of the present invention provide a method for predicting protein interactions, comprising: acquiring a first amino acid sequence corresponding to a first protein and a second amino acid sequence corresponding to a second protein; and performing protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result; wherein the target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network and a prediction sub-network, and the protein prediction result is used to characterize the probability of interaction between the first protein and the second protein.
In one embodiment, the vector encoding sub-network comprises a first vector encoding sub-network and a second vector encoding sub-network; the output of the amino acid embedding sub-network is connected to the inputs of the first and second vector encoding sub-networks, and the outputs of the first and second vector encoding sub-networks are connected to the prediction sub-network. Performing protein prediction based on the first amino acid sequence and the second amino acid sequence through the pre-trained target prediction network to obtain a protein prediction result comprises: embedding the first amino acid sequence into a vector space through the amino acid embedding sub-network to obtain a first embedded vector corresponding to the first protein, and embedding the second amino acid sequence into the vector space to obtain a second embedded vector corresponding to the second protein; encoding the first embedded vector through the first vector encoding sub-network to obtain a first encoding result corresponding to the first protein, and encoding the second embedded vector through the second vector encoding sub-network to obtain a second encoding result corresponding to the second protein; and performing protein prediction based on the first encoding result and the second encoding result through the prediction sub-network to obtain the protein prediction result.
In one embodiment, the first vector encoding sub-network includes a projection unit and a plurality of encoding units. Encoding the first embedded vector through the first vector encoding sub-network to obtain the first encoding result corresponding to the first protein includes: compressing the first embedded vector from its current dimension to a specified dimension through the projection unit; performing self-attention computation and feed-forward neural computation on the first embedded vector of the specified dimension (or on the output vector of the previous encoding unit) through each encoding unit to obtain that encoding unit's output vector; and determining the output vector of the final encoding unit as the first encoding result corresponding to the first protein.
In one embodiment, each encoding unit includes a multi-head attention layer and a feed-forward neural layer. Performing self-attention computation and feed-forward neural computation on the output vector of the previous encoding unit through the encoding unit to obtain the encoding unit's output vector includes: performing a multi-head attention operation on the output vector of the previous encoding unit through the multi-head attention layer, and normalizing the sum of the output vector of the previous encoding unit and the result of the multi-head attention operation to obtain an intermediate vector; and determining, through the feed-forward neural layer, the output vector of the encoding unit from the intermediate vector according to the following formula:

h_out^l = LN(h_mid^l + W_2 · ReLU(W_1 · h_mid^l + b_1) + b_2)

where h_out^l denotes the output vector, h_mid^l denotes the intermediate vector, LN denotes the layer-normalization operation, and W_1, W_2, b_1 and b_2 are network parameters of the feed-forward neural layer.
In one embodiment, the first and second vector encoding sub-networks employ a twin network architecture, and the first and second vector encoding sub-networks share network parameters.
In one embodiment, performing protein prediction based on the first encoding result and the second encoding result to obtain a protein prediction result includes: performing an average-pooling operation on the first encoding result to obtain a first average-pooling result, and performing an average-pooling operation on the second encoding result to obtain a second average-pooling result; determining the Hadamard product of the first average-pooling result and the second average-pooling result; and determining the protein prediction result from the Hadamard product and the network parameters of the prediction sub-network using a Softmax function. If the probability characterized by the protein prediction result is greater than a preset threshold, it is determined that an interaction occurs between the first protein and the second protein; if the probability is less than the preset threshold, it is determined that no interaction occurs between the first protein and the second protein.
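The pooling, Hadamard-product and Softmax steps just described can be sketched in NumPy as follows. This is a minimal illustration only: the single weight matrix `w` stands in for the prediction sub-network's learned parameters, and all names are illustrative rather than taken from the patent.

```python
import numpy as np

def predict_interaction(enc_a, enc_b, w, threshold=0.5):
    # Average-pool each protein's encoding over its residues.
    pooled_a = enc_a.mean(axis=0)            # (d,)
    pooled_b = enc_b.mean(axis=0)            # (d,)
    # Hadamard (element-wise) product fuses the two proteins.
    fused = pooled_a * pooled_b              # (d,)
    # Two-class Softmax; index 1 = "interaction occurs".
    logits = w @ fused                       # (2,)
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return probs[1], bool(probs[1] > threshold)
```

The returned boolean mirrors the preset-threshold rule: the pair is reported as interacting only when the predicted probability exceeds the threshold.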
In one embodiment, the method further comprises: training the network parameters of the amino acid embedding sub-network using a first training data set, wherein the amino acid embedding sub-network employs the ProtT5 model; and freezing the network parameters of the amino acid embedding sub-network and training the network parameters of the vector encoding sub-network and the prediction sub-network using a second training data set to obtain the target prediction network.
In a second aspect, embodiments of the present invention further provide a protein interaction prediction apparatus, comprising: a sequence acquisition module for acquiring a first amino acid sequence corresponding to a first protein and a second amino acid sequence corresponding to a second protein; and a prediction module for performing protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result; wherein the target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network and a prediction sub-network, and the protein prediction result is used to characterize the probability of interaction between the first protein and the second protein.
In a third aspect, an embodiment of the present invention further provides an electronic device comprising a processor and a memory storing computer-executable instructions executable by the processor to implement the method of any one of the first aspects.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of the first aspects.
According to the protein interaction prediction method, apparatus, electronic device and storage medium provided by the embodiments of the present invention, a first amino acid sequence corresponding to a first protein and a second amino acid sequence corresponding to a second protein are first acquired, and protein prediction is then performed based on the two sequences through a pre-trained target prediction network to obtain a protein prediction result. The target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network and a prediction sub-network, and the protein prediction result is used to characterize the probability of interaction between the first protein and the second protein. The provided target prediction network can efficiently and accurately predict from amino acid sequences alone whether interactions occur between proteins, and can also achieve cross-species protein interaction prediction.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for predicting protein interactions according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a target prediction network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another objective prediction network according to an embodiment of the present invention;
FIG. 4 is a flow chart of another method for predicting protein interactions according to an embodiment of the present invention;
FIG. 5 illustrates the predictive performance of a model provided by an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a protein interaction prediction apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described in conjunction with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Proteins are important biological macromolecules: long chains of amino acids that fold in space into three-dimensional structures. Proteins generally do not act alone; rather, they perform biological functions at the cellular and tissue level by interacting with other proteins to form "molecular machines". Protein interactions, also known as protein-protein interactions, refer to physical contact between proteins through non-random collisions, such as charge attraction, hydrogen bonding and hydrophobic forces, that dock at specific sites to form complexes. Whether two proteins interact thus depends on the geometry and physicochemical properties of their surfaces, and is a complex process. Protein interaction is one of the basic units of function execution at the molecular level and runs through all links of life processes such as gene expression, signal transduction and immune regulation. For example, the spike protein of the novel coronavirus interacts with and binds to the host cell's ACE2 receptor protein to effect fusion with the host cell membrane, thereby completing infection of the host cell.
Conventional protein prediction techniques include wet experimental methods and computational methods. Yeast two-hybrid, co-immunoprecipitation and fluorescence resonance energy transfer are typical wet experimental methods, but such methods generally require large amounts of sample and highly purified proteins, and are excessively time- and cost-intensive. In recent years, advances in computational biology and bioinformatics have provided new methods for exploring protein interactions, such as computer simulation and molecular dynamics simulation based on protein sequences and structures, which can predict interactions between proteins and provide information for exploring their mechanisms and kinetics. The development of deep learning offers a further approach: deep learning models can process large-scale data, learn features from massive protein sequence and structure data, and predict protein interactions. However, deep learning models generally suffer from the out-of-distribution (OOD) problem: when the distribution of the test set differs from that of the training set, the prediction accuracy of the model drops significantly. On the protein interaction prediction task, the OOD problem manifests as the difficulty of interaction prediction models in making accurate predictions for proteins of unknown species.
Based on this, embodiments of the present invention provide a method, an apparatus, an electronic device and a storage medium for predicting protein interactions, which can efficiently and accurately predict whether interactions occur between proteins and can also achieve cross-species protein interaction prediction.
To facilitate understanding of the present embodiment, the protein interaction prediction method disclosed herein is first described in detail. Referring to the flow chart of a protein interaction prediction method shown in FIG. 1, the method mainly includes the following steps S102 to S104:
step S102, a first amino acid sequence corresponding to the first protein and a second amino acid sequence corresponding to the second protein are obtained. The first protein and the second protein may be proteins of the same species or proteins of different species.
Step S104: perform protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result. The protein prediction result is used to characterize the probability of interaction between the first protein and the second protein. Optionally, when the probability is greater than a preset threshold, it indicates that an interaction occurs between the first protein and the second protein; conversely, when the probability is less than the preset threshold, it indicates that no interaction occurs between them.
In one embodiment, the target prediction network, which may be referred to as PPITrans, includes an amino acid embedding sub-network, a vector encoding sub-network and a prediction sub-network. The amino acid embedding sub-network may employ the ProtT5 model and is used to embed a protein's amino acid sequence into a vector space to obtain an embedded vector (also called an embedded representation) of the protein; its inputs are the first amino acid sequence of the first protein and the second amino acid sequence of the second protein, and its outputs are the first embedded vector of the first protein and the second embedded vector of the second protein. The vector encoding sub-network is used to encode the embedded vectors to obtain encoding results; its inputs are the first embedded vector and the second embedded vector, and its outputs are the first encoding result of the first protein and the second encoding result of the second protein. The prediction sub-network performs an average-pooling operation on the proteins' encoding results and predicts the probability of interaction between the two proteins based on the Hadamard product of the average-pooling results; its inputs are the first encoding result and the second encoding result, and its output is the protein prediction result.
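The data flow through the three sub-networks can be sketched end to end as follows. This is a shape walk-through only: the random per-letter embedding table stands in for the ProtT5 contextual embeddings, the encoder is reduced to a bare projection, and all parameter names are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

EMB_DIM, HID_DIM = 1024, 256
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
emb_table = {aa: rng.normal(size=EMB_DIM) for aa in AMINO_ACIDS}

# Twin-network design: both proteins share the same parameters.
w_proj = rng.normal(size=(HID_DIM, EMB_DIM)) * 0.01
w_pred = rng.normal(size=(2, HID_DIM)) * 0.01

def embed(seq):
    # Stand-in for the amino acid embedding sub-network: one fixed
    # vector per residue letter instead of contextual embeddings.
    return np.stack([emb_table[aa] for aa in seq])   # (len, 1024)

def encode(emb):
    # Stand-in for a vector encoding sub-network: projection only,
    # omitting the stacked encoding units.
    return emb @ w_proj.T                            # (len, 256)

def ppitrans(seq_a, seq_b):
    pooled_a = encode(embed(seq_a)).mean(axis=0)
    pooled_b = encode(embed(seq_b)).mean(axis=0)
    logits = w_pred @ (pooled_a * pooled_b)          # Hadamard product
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[1]                                  # P(interaction)
```

Because `w_proj` and `w_pred` are shared between the two branches, swapping the order of the two input sequences yields the same probability, which is the point of the twin architecture.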
The method for predicting protein interaction provided by the embodiment of the invention provides a target prediction network, which comprises an amino acid embedding sub-network, a vector coding sub-network and a prediction sub-network, and can be used for efficiently and accurately predicting whether interaction occurs between proteins according to the amino acid sequence of the proteins and also can realize cross-species protein interaction prediction.
For the understanding of the foregoing embodiments, an embodiment of the present invention provides a target prediction network, referring to a schematic structure of a target prediction network shown in fig. 2, fig. 2 illustrates that the target prediction network includes an amino acid embedding sub-network, a first vector encoding sub-network, a second vector encoding sub-network, and a prediction sub-network, an output of the amino acid embedding sub-network is connected to inputs of the first vector encoding sub-network and the second vector encoding sub-network, respectively, and outputs of the first vector encoding sub-network and the second vector encoding sub-network are connected to the prediction sub-network.
In one embodiment, in the network construction stage, the network parameters of the amino acid embedded sub-network may be trained using a first training data set, the network parameters of the amino acid embedded sub-network may be frozen, and the network parameters of the vector encoded sub-network and the predicted sub-network may be trained using a second training set to obtain the target predicted network. For easy understanding, the embodiment of the present invention provides a specific structure of a target prediction network, and referring to a schematic structural diagram of another target prediction network shown in fig. 3, the specific structure is as follows:
(1) The amino acid embedding sub-network: this embodiment employs the ProtT5 model to dynamically embed amino acid sequences. The ProtT5 model contains 3 billion parameters and was first pre-trained on the BFD (Big Fantastic Database, a protein sequence database) dataset and then fine-tuned on the UniRef50 dataset; the UniRef50 and BFD datasets contain 45 million and 2.122 billion protein sequences, respectively, covering a wide range of biological species. After embedding, each residue in the protein is represented by a 1024-dimensional vector that contains the position and context information of the residue.
(2) The first and second vector encoding sub-networks: these employ a twin network architecture and share network parameters. Taking the first vector encoding sub-network as an example, it includes a projection unit and a plurality of encoding units, each encoding unit including a multi-head attention layer and a feed-forward neural layer. Illustratively, the first and second vector encoding sub-networks may each include 6 encoding units. In practice, because fine-tuning the ProtT5 model requires substantial computational resources, PPITrans freezes its parameters and stacks encoding units (a Transformer encoder may be used) on top of it to further encode the protein sequence. In the pre-trained amino acid embedding sub-network, the embedded vectors already have size 1024, which would force the encoding units to consume more computational resources without improving their performance; therefore, the PPITrans provided by this embodiment adds a projection unit before the encoding units to convert the embedded-vector size to the encoder hidden size of 256. Since each input sample contains two proteins, this embodiment adopts a twin network architecture for protein encoding, i.e., the vector encoding sub-network actually includes a first vector encoding sub-network and a second vector encoding sub-network, and the two share network parameters.
(3) The prediction sub-network: the prediction sub-network performs average-pooling operations on the first and second encoding results respectively, and predicts the interaction probability of the two proteins from the Hadamard product of the two average-pooling results.
On the basis of the target prediction network, the embodiment of the invention provides an implementation manner for obtaining a protein prediction result by performing protein prediction based on a first amino acid sequence and a second amino acid sequence through a pre-trained target prediction network, which is described in the following steps 1 to 3:
step 1, embedding a first amino acid sequence into a vector space through an amino acid embedding sub-network to obtain a first embedded vector corresponding to a first protein, and embedding a second amino acid sequence into the vector space to obtain a second embedded vector corresponding to a second protein. In an alternative embodiment, a first amino acid sequence is input into an amino acid embedding sub-network (PortT 5 model), and the first amino acid sequence is embedded into a vector space by using the amino acid embedding sub-network to obtain a first embedded vector; then inputting the second amino acid sequence into an amino acid intercalator network, and using the amino acid intercalator network to intercalate the second amino acid The sequence is embedded into a vector space to obtain a second embedded vector. Wherein each amino acid of the protein is embedded as a vector of length 1024
Figure BDA0004116027280000101
The embedded vector of the protein is expressed as->
Figure BDA0004116027280000102
Step 2, encoding a first embedded vector through a first vector encoding sub-network to obtain a first encoding result corresponding to a first protein; and encoding the second embedded vector through a second vector encoding sub-network to obtain a second encoding result corresponding to the second protein. For easy understanding, the embodiment of the present invention provides an implementation manner of encoding, by using a first vector encoding sub-network, a first embedded vector to obtain a first encoding result corresponding to a first protein, which specifically refers to the following steps 2.1 to 2.3:
step 2.1, compressing the first embedded vector from the current dimension to the specified dimension by the projection unit. In one embodiment, the projection unit compresses the first embedded vector from 1024 dimensions to 265 dimensions:
Figure BDA0004116027280000103
wherein,,
Figure BDA0004116027280000104
is the parameter matrix of the projection unit, LN represents the layer normalization operation,/->
Figure BDA0004116027280000105
A first embedded vector representing 265 dimensions, +.>
Figure BDA0004116027280000106
Representing a 1024-dimensional first embedded vector.
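The projection unit's computation, h_i = LN(W_p · e_i), can be sketched in NumPy as follows. The layer-norm epsilon and the absence of learnable scale/shift parameters in LN are assumptions for the sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN over the feature dimension of each residue vector.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def project(embeddings, w_p):
    # Compress each 1024-d embedding to the 256-d encoder hidden
    # size, then layer-normalize: h_i = LN(W_p · e_i).
    return layer_norm(embeddings @ w_p.T)    # (n, 1024) -> (n, 256)
```

Applied to a whole protein at once, the projection maps an (n, 1024) embedding matrix to an (n, 256) matrix whose rows each have zero mean.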
Step 2.2: perform self-attention computation and feed-forward neural computation on the first embedded vector of the specified dimension, or on the output vector of the previous encoding unit, through the encoding unit to obtain the encoding unit's output vector. In one embodiment, the protein is further encoded using a Transformer encoder, each layer of which comprises two main steps: (1) self-attention computation; and (2) feed-forward neural computation. Specifically, the first encoding unit sequentially performs self-attention computation and feed-forward neural computation on the 256-dimensional first embedded vector, while each subsequent encoding unit does so on the output vector of the previous encoding unit.
Step 2.3: determine the output vector of the final encoding unit as the first encoding result corresponding to the first protein.
In order to facilitate understanding of step 2.2, the embodiment of the present invention further provides an implementation manner of performing self-attention calculation and feedforward neural calculation on an output vector of a previous encoding unit by using an encoding unit to obtain the output vector of the encoding unit, which is referred to as steps a to b below:
And a step a, performing a multi-head attention operation on the output vector of the previous encoding unit through the multi-head attention layer, and performing a normalization operation on the sum of the output vector of the previous encoding unit and the result of the multi-head attention operation to obtain an intermediate vector. Specifically, the intermediate vector may be determined according to the following formula:

$$\tilde{H}^{(l)} = \mathrm{LN}\left(H^{(l-1)} + \mathrm{MultiHead}\left(H^{(l-1)}\right)\right)$$

wherein $\tilde{H}^{(l)}$ represents the intermediate vector of the $l$-th encoding unit, $H^{(l-1)}$ represents the output vector of the $(l-1)$-th encoding unit, i.e. the output vector of the previous encoding unit, LN represents the layer normalization operation, and MultiHead represents the multi-head attention operation.
Step b, determining the output vector of the encoding unit from the intermediate vector through the feedforward neural layer according to the following formula:

$$H^{(l)} = \mathrm{LN}\left(\tilde{H}^{(l)} + W_2\,\mathrm{ReLU}\left(W_1 \tilde{H}^{(l)} + b_1\right) + b_2\right)$$

wherein $H^{(l)}$ represents the output vector, $\tilde{H}^{(l)}$ represents the intermediate vector, LN represents the layer normalization operation, and $W_1 \in \mathbb{R}^{4d \times d}$, $W_2 \in \mathbb{R}^{d \times 4d}$, $b_1 \in \mathbb{R}^{4d}$ and $b_2 \in \mathbb{R}^{d}$ are all learnable network parameters of the feedforward neural layer. Assuming that the lengths of the two proteins are $m$ and $n$ respectively, after encoding, the first encoding result is $H_a \in \mathbb{R}^{m \times d}$ and the second encoding result is $H_b \in \mathbb{R}^{n \times d}$.
And 3, carrying out protein prediction based on the first coding result and the second coding result through a prediction sub-network to obtain a protein prediction result. See in particular steps 3.1 to 3.3 below:
And 3.1, carrying out an average pooling operation on the first encoding result to obtain a first average pooling result, and carrying out an average pooling operation on the second encoding result to obtain a second average pooling result. In one embodiment, the encodings of the proteins are subjected to the average pooling operation:

$$h_a = \mathrm{AvgPool}\left(H_a\right), \qquad h_b = \mathrm{AvgPool}\left(H_b\right)$$

wherein $a$ denotes the first protein, $b$ denotes the second protein, $h_a$ is the first average pooling result, and $h_b$ is the second average pooling result.
And 3.2, determining the Hadamard product of the first average pooling result and the second average pooling result.
And 3.3, determining a protein prediction result based on the Hadamard product and the network parameters of the prediction sub-network by using a Softmax function. In one embodiment, a multi-layer perceptron may be used to predict based on the average pooling results, see the following formula:

$$\hat{y} = \mathrm{Softmax}\left(W_c\left(h_a \odot h_b\right) + b_c\right)$$

wherein $\odot$ denotes the Hadamard product, $W_c$ and $b_c$ are parameters of the prediction module, and $\hat{y}$ is the probability predicted by the model. In one embodiment, if the probability characterized by the protein prediction result is greater than a preset threshold, it is determined that an interaction occurs between the first protein and the second protein; if the probability is less than the preset threshold, it is determined that no interaction occurs. For example, when $\hat{y} > 0.5$, the two proteins are considered to interact; otherwise, they are considered not to interact.
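Steps 3.1 to 3.3 amount to a small prediction head, which the following NumPy sketch illustrates (the two-class softmax layout, the 0.5 cut-off, and the toy shapes are assumptions for illustration only):

```python
import numpy as np

def predict_interaction(Ha, Hb, Wc, bc, threshold=0.5):
    # Step 3.1: average-pool each encoding over the sequence dimension
    ha, hb = Ha.mean(axis=0), Hb.mean(axis=0)
    # Step 3.2: Hadamard (element-wise) product of the pooled vectors
    z = ha * hb
    # Step 3.3: linear layer + Softmax over {no interaction, interaction}
    logits = Wc @ z + bc
    e = np.exp(logits - logits.max())
    p_interact = (e / e.sum())[1]
    return p_interact, p_interact > threshold

rng = np.random.default_rng(0)
Ha = rng.standard_normal((7, 256))     # first encoding result, length m = 7
Hb = rng.standard_normal((9, 256))     # second encoding result, length n = 9
Wc = 0.02 * rng.standard_normal((2, 256))
bc = np.zeros(2)
p, interacts = predict_interaction(Ha, Hb, Wc, bc)
print(round(float(p), 3), interacts)
```

Note that the Hadamard product requires both pooled vectors to have the same dimension $d$, which the average pooling guarantees even when the two proteins have different lengths.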
For an understanding of the foregoing embodiments, embodiments of the present invention provide a specific implementation of a method for predicting protein interactions, referring to a schematic flow chart of another method for predicting protein interactions shown in fig. 4, the method mainly includes the following steps S402 to S408:
step S402, obtaining an embedded representation of the protein using an amino acid embedding sub-network. Wherein the embedded representation is the first embedded vector and the second embedded vector.
Step S404, compressing the embedded representation of the protein from 1024 dimensions to 256 dimensions using projection units in the vector encoding sub-network.
Step S406, encoding the embedded representation of the protein using a Transformer encoder in the vector encoding sub-network.
And step S408, predicting according to the protein coding result by using a prediction sub-network to obtain a protein prediction result.
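At the level of tensor shapes, the flow of steps S402 to S408 can be traced end-to-end as below (a hypothetical sketch: random arrays stand in for the ProtT5 embeddings and the trained weights, and the Transformer encoder of step S406 is reduced to an identity stand-in that merely preserves the shape):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_in, d = 12, 8, 1024, 256

# S402: embedded representations of the two proteins (stand-ins for ProtT5 output)
Ea, Eb = rng.standard_normal((m, d_in)), rng.standard_normal((n, d_in))

# S404: shared projection from 1024 to 256 dimensions
W_p = 0.02 * rng.standard_normal((d, d_in))
Ha, Hb = Ea @ W_p.T, Eb @ W_p.T

# S406: encoder stand-in; a real model applies stacked Transformer units here,
# which keep the (length, d) shape unchanged
W_enc = np.eye(d)
Ha, Hb = Ha @ W_enc, Hb @ W_enc

# S408: pool, Hadamard product, linear layer, softmax
ha, hb = Ha.mean(axis=0), Hb.mean(axis=0)
Wc, bc = 0.02 * rng.standard_normal((2, d)), np.zeros(2)
logits = Wc @ (ha * hb) + bc
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(Ha.shape, Hb.shape, probs.shape)  # (12, 256) (8, 256) (2,)
```

The sketch makes visible why the architecture handles proteins of unequal lengths: only the pooling step removes the length dimension, and everything after it works on fixed-size vectors.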
According to the embodiment of the invention, PPITrans is compared with existing deep-learning-based protein interaction prediction models on a standard data set. The positive samples of the training set and the test set are both derived from the STRING database; the negative samples are constructed by random pairing and are ten times the scale of the positive samples. FIG. 5 shows the prediction effect of all models on biological data of six model organisms; the evaluation indexes include the F1 score and the AUPR score, because these two are more discriminative when evaluating unbalanced classification. Experimental results show that the classification performance of PPITrans greatly exceeds that of traditional interaction prediction models on the test sets of the six model organisms: the AUPR score is above 0.9 on the Human, Mouse and Drosophila (Fly) test sets, and the F1 score is above 0.8 on the test sets of all model organisms other than Yeast and E. coli, indicating that PPITrans predicts the vast majority of test samples correctly. More importantly, PPITrans has a greater advantage over traditional methods on the test sets other than human: compared with the previous best model PIPR+DSCRIPT, its AUPR score on the Drosophila test set is higher by 0.34 and its F1 score by 0.31, indicating stronger generalization capability.
In summary, the present invention proposes and implements a protein-sequence-based prediction model, called PPITrans. PPITrans comprises an amino acid embedding sub-network, a vector encoding sub-network and a prediction sub-network: the amino acid embedding sub-network uses a pretrained ProtT5 model to convert an amino acid sequence into a vector representation; the vector encoding sub-network adopts a Transformer-based twin network structure to further encode the embedded vectors of the two input proteins; and the prediction sub-network adopts a multi-layer perceptron that judges whether an interaction occurs according to the Hadamard product of the protein encodings. The embodiment of the invention thus provides a protein interaction prediction model based on the Transformer network structure, which can efficiently judge whether an interaction occurs between proteins from their amino acid sequences alone and can also make cross-species predictions.
For the method for predicting protein interactions provided in the foregoing embodiment, the embodiment of the present invention provides a device for predicting protein interactions, referring to a schematic structural diagram of a device for predicting protein interactions shown in fig. 6, the device mainly includes the following parts:
A sequence acquisition module 602, configured to acquire a first amino acid sequence corresponding to a first protein and a second amino acid sequence corresponding to a second protein;
the prediction module 604 is configured to perform protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network, so as to obtain a protein prediction result; wherein the target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network, and a prediction sub-network, the protein prediction result being used to characterize the probability of interaction between the first protein and the second protein.
The protein interaction prediction device provided by the embodiment of the invention provides a target prediction network, which comprises an amino acid embedding sub-network, a vector coding sub-network and a prediction sub-network, and can be used for efficiently and accurately predicting whether interaction occurs between proteins according to the amino acid sequence of the proteins and also can realize cross-species protein interaction prediction.
In one embodiment, the vector encoding sub-network comprises a first vector encoding sub-network and a second vector encoding sub-network, the outputs of the amino acid embedding sub-network are respectively connected with the inputs of the first vector encoding sub-network and the second vector encoding sub-network, and the outputs of the first vector encoding sub-network and the second vector encoding sub-network are respectively connected with the prediction sub-network; the prediction module 604 is further configured to: embedding a first amino acid sequence into a vector space through an amino acid embedding sub-network to obtain a first embedded vector corresponding to a first protein, and embedding a second amino acid sequence into the vector space to obtain a second embedded vector corresponding to a second protein; encoding the first embedded vector through a first vector encoding sub-network to obtain a first encoding result corresponding to the first protein; and encoding the second embedded vector through a second vector encoding sub-network to obtain a second encoding result corresponding to the second protein; and carrying out protein prediction based on the first coding result and the second coding result through a prediction sub-network to obtain a protein prediction result.
In one embodiment, the first vector encoding sub-network includes a projection unit and a plurality of encoding units; the prediction module 604 is further configured to: compress the first embedded vector from the current dimension to the specified dimension through the projection unit; perform, through each encoding unit, self-attention computation and feedforward neural computation on the first embedded vector of the specified dimension or on the output vector of the previous encoding unit, so as to obtain the output vector of that encoding unit; and determine the output vector of the encoding unit located at the tail end as the first encoding result corresponding to the first protein.
In one embodiment, the encoding unit includes a multi-head attention layer and a feedforward neural layer; the prediction module 604 is further configured to: perform a multi-head attention operation on the output vector of the previous encoding unit through the multi-head attention layer, and perform a normalization operation on the sum of the output vector of the previous encoding unit and the result of the multi-head attention operation to obtain an intermediate vector; and determine the output vector of the encoding unit from the intermediate vector through the feedforward neural layer according to the following formula:

$$H^{(l)} = \mathrm{LN}\left(\tilde{H}^{(l)} + W_2\,\mathrm{ReLU}\left(W_1 \tilde{H}^{(l)} + b_1\right) + b_2\right)$$

wherein $H^{(l)}$ represents the output vector, $\tilde{H}^{(l)}$ represents the intermediate vector, LN represents the normalization operation, and $W_1$, $W_2$, $b_1$, $b_2$ are all network parameters of the feedforward neural layer.
In one embodiment, the first and second vector encoding sub-networks employ a twin network architecture, and the first and second vector encoding sub-networks share network parameters.
In one embodiment, the prediction module 604 is further configured to: carrying out average pooling operation on the first coding result to obtain a first average pooling result, and carrying out average pooling operation on the second coding result to obtain a second average pooling result; determining a Hadamard product of the first average pooling result and the second average pooling result; determining a protein prediction result based on the Hadamard product and network parameters of the prediction sub-network by using a Softmax function; wherein if the probability characterized by the protein prediction is greater than a predetermined threshold, an interaction is determined to occur between the first protein and the second protein, and if the probability characterized by the protein prediction is less than the predetermined threshold, no interaction is determined to occur between the first protein and the second protein.
In one embodiment, the apparatus further includes a training module configured to: train network parameters of the amino acid embedding sub-network by using a first training data set, wherein the amino acid embedding sub-network adopts a ProtT5 model; and freeze the network parameters of the amino acid embedding sub-network, and train the network parameters of the vector encoding sub-network and the prediction sub-network by using a second training set to obtain the target prediction network.
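The two-stage training scheme (freeze the embedding sub-network, then train only the encoding and prediction sub-networks) can be illustrated with a toy parameter-update step (all names and shapes here are hypothetical; a real implementation would use the training framework's own parameter-freezing facilities):

```python
import numpy as np

# Toy illustration of stage-two training: the embedding sub-network parameters
# are frozen while the encoder/prediction parameters continue to update.
rng = np.random.default_rng(0)
params = {"embed": rng.standard_normal((4, 4)),
          "encode": rng.standard_normal((4, 4)),
          "predict": rng.standard_normal((4,))}
frozen = {"embed"}

def sgd_step(params, grads, lr=0.1):
    # Apply gradients only to the trainable (non-frozen) sub-networks
    return {k: v if k in frozen else v - lr * grads[k]
            for k, v in params.items()}

grads = {k: np.ones_like(v) for k, v in params.items()}
before = {k: v.copy() for k, v in params.items()}
params = sgd_step(params, grads)
print((params["embed"] == before["embed"]).all())    # True: frozen, unchanged
print((params["encode"] == before["encode"]).all())  # False: updated
```

Freezing the pretrained embedding both preserves what ProtT5 learned from large sequence corpora and keeps the number of trainable parameters small.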
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brevity, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.
The embodiment of the invention provides electronic equipment, which comprises a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, performs the method of any of the embodiments described above.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device 100 includes: a processor 70, a memory 71, a bus 72 and a communication interface 73, said processor 70, communication interface 73 and memory 71 being connected by bus 72; the processor 70 is arranged to execute executable modules, such as computer programs, stored in the memory 71.
The memory 71 may include a high-speed random access memory (Random Access Memory, RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is achieved via at least one communication interface 73 (which may be wired or wireless), which may use the internet, a wide area network, a local area network, a metropolitan area network, etc.
Bus 72 may be an ISA bus, a PCI bus, an EISA bus, or the like. Buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 7, but this does not mean that there is only one bus or only one type of bus.
The memory 71 is configured to store a program, and the processor 70 executes the program after receiving an execution instruction; the method executed by the apparatus disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 70 or implemented by the processor 70.
The processor 70 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software in the processor 70. The processor 70 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but may also be a digital signal processor (Digital Signal Processing, DSP for short), application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA for short), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 71 and the processor 70 reads the information in the memory 71 and in combination with its hardware performs the steps of the method described above.
The computer program product of the readable storage medium provided by the embodiment of the present invention includes a computer readable storage medium storing a program code, where the program code includes instructions for executing the method described in the foregoing method embodiment, and the specific implementation may refer to the foregoing method embodiment and will not be described herein.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions of some of the technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for predicting protein interactions, comprising:
acquiring a first amino acid sequence corresponding to a first protein and a second amino acid sequence corresponding to a second protein;
carrying out protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result; wherein the target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network, and a prediction sub-network, the protein prediction result being used to characterize the probability of interaction between the first protein and the second protein.
2. The method of claim 1, wherein the vector coding sub-network comprises a first vector coding sub-network and a second vector coding sub-network, the outputs of the amino acid embedding sub-network being connected to inputs of the first vector coding sub-network and the second vector coding sub-network, respectively, the outputs of the first vector coding sub-network and the second vector coding sub-network being connected to the prediction sub-network;
performing protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result, wherein the method comprises the following steps of:
embedding the first amino acid sequence into a vector space through the amino acid embedding sub-network to obtain a first embedded vector corresponding to the first protein, and embedding the second amino acid sequence into the vector space to obtain a second embedded vector corresponding to the second protein;
encoding the first embedded vector through the first vector encoding sub-network to obtain a first encoding result corresponding to the first protein; and encoding the second embedded vector through the second vector encoding sub-network to obtain a second encoding result corresponding to the second protein;
And carrying out protein prediction based on the first coding result and the second coding result through the prediction sub-network to obtain a protein prediction result.
3. The method of predicting protein interactions of claim 2, wherein the first vector encoding subnetwork comprises a projection unit and a plurality of encoding units;
encoding the first embedded vector through the first vector encoding sub-network to obtain a first encoding result corresponding to the first protein, including:
compressing the first embedded vector from a current dimension to a specified dimension by the projection unit;
performing self-attention calculation and feedforward neural calculation on the first embedded vector with the specified dimension or the output vector of the previous coding unit through the coding unit to obtain the output vector of the coding unit;
and determining the output vector of the coding unit at the tail end as a first coding result corresponding to the first protein.
4. A method of predicting protein interactions in accordance with claim 3, wherein the coding unit comprises a multi-headed attention layer and a feed forward neural layer;
and performing self-attention calculation and feedforward neural calculation on the output vector of the previous coding unit by the coding unit to obtain an output vector of the coding unit, wherein the method comprises the following steps:
performing a multi-head attention operation on the output vector of a previous coding unit through the multi-head attention layer, and performing a normalization operation on the sum of the output vector of the previous coding unit and the result of the multi-head attention operation to obtain an intermediate vector;
determining, by the feedforward neural layer, the output vector of the coding unit from the intermediate vector according to the following formula:

$$H^{(l)} = \mathrm{LN}\left(\tilde{H}^{(l)} + W_2\,\mathrm{ReLU}\left(W_1 \tilde{H}^{(l)} + b_1\right) + b_2\right)$$

wherein $H^{(l)}$ represents the output vector, $\tilde{H}^{(l)}$ represents the intermediate vector, LN represents the normalization operation, and $W_1$, $W_2$, $b_1$, $b_2$ are all network parameters of the feedforward neural layer.
5. The method of claim 2, wherein the first vector encoding sub-network and the second vector encoding sub-network employ a twin network architecture and the first vector encoding sub-network and the second vector encoding sub-network share network parameters.
6. The method of predicting protein interactions according to claim 2, wherein predicting protein based on the first encoding result and the second encoding result, results in a protein prediction result, comprises:
carrying out average pooling operation on the first coding result to obtain a first average pooling result, and carrying out average pooling operation on the second coding result to obtain a second average pooling result;
Determining a hadamard product of the first average pooling result and the second average pooling result;
determining a protein prediction result based on the hadamard product and network parameters of the prediction subnetwork by using a Softmax function;
wherein if the probability characterized by the protein prediction result is greater than a preset threshold, an interaction is determined to occur between the first protein and the second protein, and if the probability characterized by the protein prediction result is less than the preset threshold, no interaction is determined to occur between the first protein and the second protein.
7. The method of predicting protein interactions of any one of claims 1-6, further comprising:
training network parameters of the amino acid embedding sub-network by using a first training data set; wherein the amino acid embedding sub-network adopts a ProtT5 model;
and freezing network parameters of the amino acid embedded sub-network, and training the network parameters of the vector coding sub-network and the prediction sub-network by using a second training set to obtain a target prediction network.
8. A protein interaction prediction apparatus comprising:
The sequence acquisition module is used for acquiring a first amino acid sequence corresponding to the first protein and a second amino acid sequence corresponding to the second protein;
the prediction module is used for carrying out protein prediction based on the first amino acid sequence and the second amino acid sequence through a pre-trained target prediction network to obtain a protein prediction result; wherein the target prediction network comprises an amino acid embedding sub-network, a vector encoding sub-network, and a prediction sub-network, the protein prediction result being used to characterize the probability of interaction between the first protein and the second protein.
9. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 7.
CN202310219211.3A 2023-03-06 2023-03-06 Method and device for predicting protein interaction, electronic device and storage medium Pending CN116386724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310219211.3A CN116386724A (en) 2023-03-06 2023-03-06 Method and device for predicting protein interaction, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310219211.3A CN116386724A (en) 2023-03-06 2023-03-06 Method and device for predicting protein interaction, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN116386724A true CN116386724A (en) 2023-07-04

Family

ID=86968436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310219211.3A Pending CN116386724A (en) 2023-03-06 2023-03-06 Method and device for predicting protein interaction, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN116386724A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117393050A (en) * 2023-10-17 2024-01-12 哈尔滨工业大学(威海) Protein function recognition method, device and readable storage medium
CN118398079A (en) * 2024-06-25 2024-07-26 中国人民解放军军事科学院军事医学研究院 Computer device, method and application for predicting amino acid mutation effect or carrying out design modification on protein


Similar Documents

Publication Publication Date Title
CN113593631B (en) Method and system for predicting protein-polypeptide binding site
CN116386724A (en) Method and device for predicting protein interaction, electronic device and storage medium
CN111210871B (en) Protein-protein interaction prediction method based on deep forests
CN111312329B (en) Transcription factor binding site prediction method based on deep convolution automatic encoder
Kaur et al. A neural network method for prediction of β-turn types in proteins using evolutionary information
CN108009405A (en) A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter
CN113257357B (en) Protein residue contact map prediction method
Chen et al. Cascaded bidirectional recurrent neural networks for protein secondary structure prediction
Chen et al. Predicting coding potential of RNA sequences by solving local data imbalance
CN112116950A (en) Protein folding identification method based on depth measurement learning
Yu et al. SOMPNN: an efficient non-parametric model for predicting transmembrane helices
CN118038995A (en) Method and system for predicting small open reading window coding polypeptide capacity in non-coding RNA
CN112259157B (en) Protein interaction prediction method
Penić et al. Rinalmo: General-purpose rna language models can generalize well on structure prediction tasks
Saraswathi et al. Fast learning optimized prediction methodology (FLOPRED) for protein secondary structure prediction
Chen et al. Domain-based predictive models for protein-protein interaction prediction
CN114758721B (en) Deep learning-based transcription factor binding site positioning method
CN116259358A (en) Protein interaction prediction method, device and storage medium
Yue et al. A systematic review on the state-of-the-art strategies for protein representation
Milosavljević Discovering dependencies via algorithmic mutual information: A case study in DNA sequence comparisons
CN117037917A (en) Cell type prediction model training method, cell type prediction method and device
KR101636995B1 (en) Improvement method of gene network using domain-specific phylogenetic profiles similarity
Clark et al. Vector quantization kernels for the classification of protein sequences and structures
CN113782094A (en) Modification site prediction method, modification site prediction device, computer device, and storage medium
Xu et al. An integrated prediction method for identifying protein-protein interactions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination