CN115206423A - Label guidance-based protein action relation prediction method - Google Patents

Label guidance-based protein action relation prediction method

Info

Publication number
CN115206423A
Authority
CN
China
Prior art keywords
protein
graph
representation
node
label
Prior art date
Legal status
Pending
Application number
CN202210828104.6A
Other languages
Chinese (zh)
Inventor
朱小飞
王新生
Current Assignee
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date
Filing date
Publication date
Application filed by Chongqing University of Technology
Priority to CN202210828104.6A
Publication of CN115206423A
Legal status: Pending (current)

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 — ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations


Abstract

The invention relates to a label guidance-based protein action relation prediction method, which comprises the following steps: obtaining a pair of proteins to be detected; inputting the pair of proteins to be detected into a trained prediction model, and outputting a corresponding predicted relation. The prediction model firstly performs graph data enhancement based on the proteins to be detected to obtain a multi-scale graph representation; secondly, the multi-scale graph representation is input into a graph neural network to obtain multi-scale protein feature representations, and contrastive learning is introduced to eliminate the differences between protein feature representations of different scales; then a self-learning label relationship graph is constructed and the relations among labels are learned to obtain the label feature representation; finally, the protein feature representation is corrected through the label feature representation so as to guide the prediction of the protein action relation. The predicted relation of the pair of proteins to be detected is taken as the protein action relation prediction result. The method can improve the generalization capability of the protein feature representation and the classification accuracy of the prediction model, thereby improving the prediction effect of protein action relation prediction.

Description

Label guidance-based protein action relation prediction method
Technical Field
The invention relates to the technical field of biological information and natural language processing, in particular to a protein action relation prediction method based on label guidance.
Background
Protein-protein interactions play key roles in a wide range of biological processes, such as DNA replication, transcription, translation, and transmembrane signal transduction. Therefore, detection of Protein-Protein Interactions (PPIs) and the type of Protein Interactions are critical to understanding the cellular biological processes in normal and disease states, and such studies are also helpful in the identification of therapeutic targets and the design of new drugs. In early work on protein action relationships, laboratory-based methods were used, mainly involving yeast two-hybrid screening, protein chip and mass spectrometry protein complex identification, and the like. Laboratory experiments are often time consuming and labor intensive, resulting in inefficient identification of protein action relationships, while laboratory-based methods generate incomplete protein action relationship data due to limitations of laboratory experiments.
In existing research on predicting protein action relations with deep learning algorithms, a Convolutional Neural Network (CNN) is mainly used to extract local features of a protein, or a Recurrent Neural Network (RNN) is used to capture long-distance contextual dependencies. However, such deep learning algorithms still have many problems, such as the inability to efficiently filter and aggregate local protein features, difficulty in simultaneously retaining important context and amino acid information of the sequence, and failure to utilize the interaction between protein pairs. With the development of Graph Neural Networks (GNNs), the prior art began to predict by constructing protein action network graphs and introducing graph neural networks. Such methods not only consider the influence of protein pairs, but can also enhance the feature representation through the relations between protein pairs, thereby further improving the effect of protein action relation prediction.
However, the applicant found in actual research that conventional graph-neural-network-based methods for predicting protein action relations only construct the protein action network graph and the protein feature representation from the original data set, and do not fully exploit the original data set, so that the generalization capability of the protein feature representation is insufficient and the prediction effect is poor. Meanwhile, multiple action relations often exist between proteins, and these action relations may carry correlation information with one another; existing graph-neural-network-based prediction methods do not consider this correlation information among the action relations, so the classification accuracy of the protein action relation prediction model is insufficient. Therefore, how to design a method capable of improving the generalization capability of the protein feature representation and the classification accuracy of the prediction model is a technical problem to be urgently solved.
Disclosure of Invention
Aiming at the defects of the prior art, the technical problems to be solved by the invention are as follows: how to provide a protein action relation prediction method based on label guidance to improve the generalization capability of protein characteristic representation and the classification accuracy of a prediction model, thereby improving the prediction effect of the protein action relation and further better analyzing the cell biological process of a subject to which the protein belongs under normal and disease states.
In order to solve the technical problem, the invention adopts the following technical scheme:
the protein action relation prediction method based on label guidance comprises the following steps:
s1: obtaining a pair of proteins to be predicted;
s2: inputting a pair of proteins to be detected into the trained prediction model, and outputting a corresponding prediction relation;
firstly, the prediction model performs graph data enhancement based on the proteins to be detected to obtain a multi-scale graph representation; secondly, the multi-scale graph representation is input into a graph neural network to obtain multi-scale protein feature representations, and contrastive learning is introduced to eliminate the differences between protein feature representations of different scales; then a self-learning label relationship graph is constructed and the relations among labels are learned to obtain the label feature representation; finally, the protein feature representation is corrected through the label feature representation to guide the prediction of the protein action relation, and the corresponding predicted relation is then output;
s3: and taking the prediction relationship of the pair of proteins to be detected as the prediction result of the protein action relationship, and further analyzing the cell biological process of the main body to which the proteins to be detected belong in normal and disease states based on the prediction result of the protein action relationship.
Preferably, in step S2, the prediction model includes a protein feature encoder module for extracting local features and global features of the protein, a multi-scale graph neural network module for performing data enhancement, graph neural network processing and contrast learning, a self-learning label relationship graph module for learning the relationship between labels, and a multi-label loss calculation module for performing a self-supervised learning task and a supervised learning task.
Preferably, in step S2, the prediction model is trained by the following steps:
s201: acquiring a pair of proteins for training and inputting the proteins into a prediction model;
s202: extracting local features and global features of the protein through a protein feature encoder module to obtain protein feature representation with local information and global information;
s203: constructing an original graph of the protein action relation based on the protein characteristic representation; disturbing the original graph through a multi-scale graph neural network module to obtain a corresponding disturbed graph; then inputting the original graph and the disturbance graph into a graph neural network, and outputting original node characteristic representation and disturbance node characteristic representation, namely multi-scale protein characteristic representation; then fusing the original node characteristic representation and the disturbance node characteristic representation in a comparative learning mode to obtain a fusion node characteristic representation; finally, fusion edge feature representation is obtained through fusion node feature representation calculation;
s204: acquiring label name embedding representation through a self-learning label relation graph module, and constructing a label relation graph; then inputting the label relation graph into a graph convolution neural network, and outputting label node characteristic representation;
s205: the fused edge feature representation is corrected through the label node feature representation, obtaining the protein relation graph edge feature representation;
s206: the multi-label loss calculation module carries out self-supervision learning through original node characteristic representation and disturbance node characteristic representation to obtain a self-supervision learning loss function; then, supervised learning is carried out through protein relational graph continuous edge feature representation to obtain a supervised learning loss function; finally, calculating based on the self-supervised learning loss function and the supervised learning loss function to obtain a training loss function, and optimizing and updating parameters of the prediction model through the training loss function;
s207: steps S201 to S206 are repeatedly performed until the prediction model converges.
Preferably, in step S202, the protein feature encoder module includes a local feature encoder and a global feature encoder;
the local feature encoder comprises a convolutional neural network and a maximum pooling layer, and extracts input protein by the following formula
Figure BDA0003744774440000031
Local feature in (1) represents h i
h i =f GMP (f CNN (p i ;θ CNN ));
Figure BDA0003744774440000032
In the formula: f. of CNN Represents a convolution operation; f. of GMP Represents a max pooling layer operation;
Figure BDA0003744774440000033
represents a collection of proteins;
Figure BDA0003744774440000034
represents a defined vocabulary of amino acids; a is j Represents an amino acid in the amino acid vocabulary; theta CNN Training parameters representing convolution operations;
the global feature encoder comprises a bidirectional gating circulation unit and a global average pooling layer, and extracts an input local feature representation h through the following formula i To obtain a protein feature representation x having local information and global information i ∈X;
x i =f GAP (f BiGRU (h i ;θ BiGRU ));
In the formula: f. of BiGRU Representing a bidirectional gated loop operation; f. of GAP Representing a global average pooling layer operation; theta BiGRU Training parameters representing a bi-directional gated loop operation; x represents the protein feature representation obtained based on the protein feature encoder module.
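The two-stage encoder described above can be sketched as follows. This is a minimal illustrative NumPy implementation, not the patent's actual code: the amino-acid vocabulary, embedding size, filter count and kernel width are all assumed for the example, and only the local feature encoder (convolution followed by global max pooling) is shown; the global encoder would apply a BiGRU and global average pooling to h_i in the same spirit.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard amino acids (assumed vocabulary V)
EMB_DIM, N_FILTERS, KERNEL = 8, 16, 3   # assumed example dimensions

emb = rng.normal(size=(len(VOCAB), EMB_DIM))        # amino-acid embedding table
W = rng.normal(size=(N_FILTERS, KERNEL * EMB_DIM))  # conv filters, stands in for theta_CNN

def f_cnn(seq):
    """1-D convolution over the embedded sequence p_i = (a_1, ..., a_n)."""
    x = emb[[VOCAB.index(a) for a in seq]]                    # (n, EMB_DIM)
    windows = np.stack([x[t:t + KERNEL].ravel()               # sliding windows
                        for t in range(len(seq) - KERNEL + 1)])
    return np.maximum(windows @ W.T, 0.0)                     # ReLU feature map

def f_gmp(feat):
    """Global max pooling: one value per filter -> local feature h_i."""
    return feat.max(axis=0)

h_i = f_gmp(f_cnn("MKTAYIAKQR"))
print(h_i.shape)   # (16,) - a fixed-size local feature regardless of sequence length
```

Global max pooling is what makes the output size independent of the protein length, which is why variable-length sequences can share one downstream graph model.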
Preferably, in step S203, the original graph is defined as G = (X, A), with node feature matrix X ∈ R^(N×d) and adjacency matrix A ∈ R^(N×N);
1) The multi-scale graph neural network module first applies stochastic graph data enhancement functions T_1 and T_2 from two different views to perturb the edges and the node features of the original graph G = (X, A) respectively, obtaining a first perturbed graph G_1 = (X, A_1) and a second perturbed graph G_2 = (X_2, A):
G_1 = (X, A_1) = T_1(G);
ε_1 ~ Bernoulli(N, 1 − δ_1);
G_2 = (X_2, A) = T_2(G);
ε_2 ~ Uniform(N, δ_2);
In the formula: ε_1 represents the N Bernoulli draws obtained from the hyperparameter δ_1 ∈ (0, 1), used as a mask over the edge set E of the original graph; A_1 represents the adjacency matrix of the first perturbed graph G_1 = (X, A_1) obtained through the graph data enhancement function T_1; E represents the set of edges of the original graph; Bernoulli represents the Bernoulli distribution; δ_1 ∈ (0, 1) is a hyperparameter representing the ratio of deleted edges; ε_2 represents the result of the uniform distribution obtained from the hyperparameter δ_2 ∈ (0, 1); X_2 represents the node features of the second perturbed graph G_2 = (X_2, A) obtained through the graph data enhancement function T_2; X represents the node features of the original graph; Uniform represents the uniform distribution; δ_2 ∈ (0, 1) is a hyperparameter representing the ratio of node features set to 0;
2) The original graph G = (X, A), the first perturbed graph G_1 = (X, A_1) and the second perturbed graph G_2 = (X_2, A) are respectively input into the graph neural network, which outputs the original node feature representation Z_0, the first perturbation node feature representation Z_1 and the second perturbation node feature representation Z_2; the graph neural network with k iterations is expressed as:
a_v^(k) = AGG({z_u^(k−1) : u ∈ N(v)});
z_v^(k) = UPDATE(z_v^(k−1), a_v^(k)) = MLP^(k)((1 + ω) · z_v^(k−1) + a_v^(k));
In the formula: a_v^(k) represents the representation obtained by node v after aggregating the features of its neighbor nodes; AGG represents the function that aggregates node features; z_u^(k−1) represents the result of k−1 iterations of node u in the graph convolution network; N(v) represents the neighbor set of node v; UPDATE represents the node feature update function; z_v^(k−1) represents the result of k−1 iterations of node v in the graph convolution network; z_v^(k) represents the feature representation of the k-th iteration of node v; MLP represents a multi-layer perceptron neural network; ω is a learnable parameter or constant;
3) The original node feature representation Z_0, the first perturbation node feature representation Z_1 and the second perturbation node feature representation Z_2 are fused by the following formula to obtain the fused node feature representation Z′:
Z′ = f_Fusion([Z_0, Z_1, Z_2]);
In the formula: f_Fusion represents the fusion function;
4) The fused edge feature representation E is then obtained from the fused node feature representation Z′:
e_ij = z′_i ⊙ z′_j, e_ij ∈ E;
In the formula: ⊙ represents the Hadamard product; z′_i ∈ Z′ and z′_j ∈ Z′ represent the feature representations of node i and node j respectively.
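Steps 1) to 4) above can be sketched end-to-end on a toy graph. The NumPy code below is illustrative only, with assumed shapes and ratios: edge dropping and node-feature masking stand in for the two enhancement functions, a single GIN-style layer stands in for the k-iteration graph neural network, and a simple mean stands in for f_Fusion.

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 6, 8                                   # toy graph: 6 proteins, 8-dim features
X = rng.normal(size=(N, D))
A = (rng.random((N, N)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                # symmetric adjacency, no self-loops

# 1) graph data enhancement: edge dropping (view T1) and feature masking (view T2)
def drop_edges(A, delta1=0.2):
    keep = rng.random(A.shape) > delta1       # Bernoulli(1 - delta1) mask per edge
    keep = np.triu(keep, 1); keep = keep + keep.T
    return A * keep

def mask_features(X, delta2=0.2):
    keep = (rng.random(X.shape[0]) > delta2).astype(float)  # zero whole node rows
    return X * keep[:, None]

A1, X2 = drop_edges(A), mask_features(X)

# 2) one GIN-style iteration: z_v = MLP((1 + w) * z_v + sum of neighbor features)
W_mlp = rng.normal(size=(D, D)) / np.sqrt(D)
def gin_layer(X, A, w=0.1):
    agg = A @ X                               # AGG: sum over neighbors N(v)
    return np.maximum(((1 + w) * X + agg) @ W_mlp, 0.0)   # UPDATE via one-layer MLP

Z0, Z1, Z2 = gin_layer(X, A), gin_layer(X, A1), gin_layer(X2, A)

# 3) fusion: a simple mean of the three views stands in for f_Fusion
Z = (Z0 + Z1 + Z2) / 3.0

# 4) fused edge features via the Hadamard product of the endpoint representations
e_01 = Z[0] * Z[1]
print(Z.shape, e_01.shape)    # (6, 8) (8,)
```

Because the Hadamard product is symmetric, e_ij = e_ji, which matches an undirected protein action network.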
Preferably, in step S204, the self-learning label relationship graph module obtains the label name embedding representation through the pre-trained model BERT:
X_L = BERT(L_NAME);
In the formula: L_NAME represents the set of label names; X_L represents the word vectors of the label names, i.e., the label name embedding representation;
the label relationship graph G_L = (A_L, X_L) is constructed from the label name embedding representation X_L and a learnable parameter matrix A_L;
the label relationship graph G_L = (A_L, X_L) is input into a graph convolutional neural network, which outputs the label node feature representation Z_L:
Z_L^(l) = σ(D^(−1/2) A_L D^(−1/2) Z_L^(l−1) W^(l−1));
In the formula: the initialization is Z_L^(0) = X_L; D represents the degree matrix; W^(l−1) represents a learnable parameter matrix; σ represents the sigmoid activation function; A_L is initialized as an identity matrix.
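A single propagation step of the label graph convolution can be sketched as follows. The NumPy code is illustrative only: the label count, dimensions and random stand-in for the BERT embeddings are assumed for the example (seven relation types are used purely as a plausible setting).

```python
import numpy as np

rng = np.random.default_rng(7)
T, D = 7, 16                                  # assumed: 7 relation types, 16-dim embeddings
X_L = rng.normal(size=(T, D))                 # stands in for BERT(label names)
A_L = np.eye(T)                               # learnable adjacency, initialised to identity
W = rng.normal(size=(D, D)) / np.sqrt(D)      # learnable GCN weight matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_layer(A, Z, W):
    """Z_L = sigma(D^{-1/2} A D^{-1/2} Z W) with degree matrix D of A."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return sigmoid(D_inv_sqrt @ A @ D_inv_sqrt @ Z @ W)

Z_L = gcn_layer(A_L, X_L, W)                  # label node feature representation
print(Z_L.shape)    # (7, 16)
```

With A_L starting as the identity, the first updates behave like a per-label transformation; as A_L is learned, off-diagonal entries let correlated labels exchange information.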
Preferably, in step S205, the protein relation graph edge feature representation is calculated by the following formula:
Ê = E · Z_L^⊤;
In the formula: Ê represents the protein relation graph edge feature representation, i.e., the protein relation graph with edge features; E represents the fused edge feature representation; Z_L represents the label node feature representation.
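One plausible reading of this correction step, consistent with the Softmax over the T label categories applied later, is a projection of each fused edge feature onto the label node features. The sketch below assumes this dot-product form and illustrative shapes; it is not taken from the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
M, D, T = 5, 16, 7                 # assumed: 5 edges, 16-dim features, 7 label classes
E = rng.normal(size=(M, D))        # fused edge feature representation
Z_L = rng.normal(size=(T, D))      # label node feature representation
E_hat = E @ Z_L.T                  # project each edge feature onto the label space
print(E_hat.shape)   # (5, 7) - one score per edge per relation type
```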
Preferably, in step S206, in the self-supervised learning task, the multi-label loss calculation module calculates a first perturbation loss function L_cl1 from the original node feature representation Z_0 and the first perturbation node feature representation Z_1, and a second perturbation loss function L_cl2 from the original node feature representation Z_0 and the second perturbation node feature representation Z_2; the first perturbation loss function L_cl1 and the second perturbation loss function L_cl2 together constitute the self-supervised learning loss function:
ℓ(z_0,i, z_1,i) = −log( exp(θ(z_0,i, z_1,i)/τ) / Σ_{j=1..N} exp(θ(z_0,i, z_1,j)/τ) );
L_cl1 = (1/N) Σ_{i=1..N} ℓ(z_0,i, z_1,i);
ℓ(z_0,i, z_2,i) = −log( exp(θ(z_0,i, z_2,i)/τ) / Σ_{j=1..N} exp(θ(z_0,i, z_2,j)/τ) );
L_cl2 = (1/N) Σ_{i=1..N} ℓ(z_0,i, z_2,i);
In the formula: (z_0,i, z_1,i) represents a positive sample pair; z_0,i ∈ Z_0, z_1,i ∈ Z_1; θ(z_0,i, z_1,i) represents the cosine similarity of z_0,i and z_1,i; τ represents the temperature parameter; N represents the set of all nodes.
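The perturbation losses are InfoNCE-style contrastive objectives. The NumPy sketch below illustrates the computation with assumed toy shapes and a temperature of 0.5; z_0,i and z_1,i form the positive pair, and the remaining rows of the perturbed view act as negatives.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 6, 8
Z0 = rng.normal(size=(N, D))                     # original node representations
Z1 = Z0 + 0.05 * rng.normal(size=(N, D))         # perturbed view (close to Z0)

def info_nce(Za, Zb, tau=0.5):
    """Mean InfoNCE loss with (z_a,i, z_b,i) as the positive pair."""
    Za = Za / np.linalg.norm(Za, axis=1, keepdims=True)
    Zb = Zb / np.linalg.norm(Zb, axis=1, keepdims=True)
    sim = Za @ Zb.T / tau                        # cosine similarities theta(., .)/tau
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log softmax of the positive pair

loss_aligned = info_nce(Z0, Z1)
loss_random = info_nce(Z0, rng.normal(size=(N, D)))
print(loss_aligned < loss_random)    # aligned views give the smaller loss
```

This is the mechanism that pulls the different-scale representations of the same protein together while pushing apart representations of different proteins.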
Preferably, in step S206, in the supervised learning task, the multi-label loss calculation module calculates the following supervised learning loss function from the protein relation graph edge feature representation:
L_sup = −(1/|E_train|) Σ_{(i,j)∈E_train} Σ_{c=1..T} ( y_ij,c · log p_ij,c + (1 − y_ij,c) · log(1 − p_ij,c) );
p_ij = Softmax(e_ij);
ŷ_ij = argmax(p_ij);
In the formula: L_sup represents the supervised learning loss; T represents the number of label categories; E_train represents the set of training edges; p_ij represents the relation probability distribution between proteins i and j; ŷ_ij represents the predicted relation between proteins i and j; c represents a specific label category; y_ij,c represents the true label of proteins i and j in category c; p_ij,c represents the prediction result of proteins i and j in category c; argmax represents taking the subscript of the largest element in the set.
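The supervised multi-label loss can be sketched as follows, with toy shapes and random labels assumed for the example. The patent's formula applies Softmax to the edge features; a multi-label setting more commonly uses a per-class sigmoid, but the source formula is followed here.

```python
import numpy as np

rng = np.random.default_rng(5)
M, T = 4, 7                                   # assumed: 4 training edges, 7 relation types

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

E_hat = rng.normal(size=(M, T))               # edge features in the label space
P = softmax(E_hat)                            # p_ij: relation probability distribution
Y = (rng.random((M, T)) < 0.3).astype(float)  # toy multi-label ground truth y_ij,c

eps = 1e-9                                    # numerical guard inside the logs
loss = -np.mean(np.sum(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps), axis=1))
y_pred = P.argmax(axis=1)                     # predicted relation index per edge
print(float(loss), y_pred.shape)
```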
Preferably, in step S206, the training loss function is expressed by the following formula:
L = L_sup + λ_1 · L_cl1 + λ_2 · L_cl2;
In the formula: L represents the training loss; L_sup represents the supervised learning loss; L_cl1 represents the first perturbation loss; L_cl2 represents the second perturbation loss; λ_1 and λ_2 represent set hyperparameters.
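Combining the three terms is then a weighted sum; λ_1 = λ_2 = 0.5 below are assumed example values, not values given in the source.

```python
# training objective: supervised loss plus weighted self-supervised perturbation losses
def training_loss(l_sup, l_cl1, l_cl2, lam1=0.5, lam2=0.5):
    return l_sup + lam1 * l_cl1 + lam2 * l_cl2

print(training_loss(1.2, 0.4, 0.6))   # 1.2 + 0.5*0.4 + 0.5*0.6
```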
The protein action relation prediction method based on label guidance has the following beneficial effects:
the prediction model generates a multi-scale graph representation in a graph data enhancement mode, learns multi-scale protein feature representation by enhancing self feature representation through neighbor nodes in a graph neural network, eliminates the difference of different scale protein feature representations by introducing contrast learning and further improves the protein characterization capability, namely, an original data set is fully explored through graph data enhancement, graph neural network processing and contrast learning, the generalization capability of the protein feature representation can be improved, the prediction effect of the protein action relation can be improved, and the cell biological process of a main body to which the protein belongs in normal and disease states can be better analyzed.
Meanwhile, label information is introduced into the prediction model, the relationship among the labels is learned by constructing a self-learned label relationship diagram to obtain label characteristic representation, and then the learning of the protein interaction relationship is guided by the label characteristic representation, namely the correlation information generated by various interaction relationships among the proteins can be fully explored by learning the relationship among the labels, so that the classification accuracy of the prediction model can be improved, the prediction effect of the protein interaction relationship can be further improved, and the cell biological process of a main body to which the proteins belong under normal and disease states can be better analyzed.
Drawings
For a better understanding of the objects, solutions and advantages of the present invention, reference will now be made in detail to the present invention, which is illustrated in the accompanying drawings, in which:
FIG. 1 is a logic diagram of a tag-based protein interaction relationship prediction method;
FIG. 2 is a diagram of a network architecture of a predictive model (LGMG-PPI);
FIG. 3 is a schematic diagram of SL-LRG topology validation;
FIG. 4 is a diagram illustrating a feature validity verification of a SL-LRG node.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings or orientations or positional relationships that the present product is conventionally placed in use, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the device or element referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance. Furthermore, the terms "horizontal", "vertical" and the like do not imply that the components are required to be absolutely horizontal or pendant, but rather may be slightly inclined. For example, "horizontal" merely means that the direction is more horizontal than "vertical" and does not mean that the structure must be perfectly horizontal, but may be slightly inclined. In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly and may, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. 
The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
The following is further detailed by the specific embodiments:
example (b):
the embodiment discloses a protein action relation prediction method based on label guidance.
As shown in FIG. 1, the method for predicting the protein action relationship based on the label guidance comprises the following steps:
s1: obtaining a pair of proteins to be detected;
s2: inputting a pair of proteins to be detected into the trained prediction model, and outputting a corresponding prediction relation;
firstly, the prediction model performs graph data enhancement based on the proteins to be detected to obtain a multi-scale graph representation; secondly, the multi-scale graph representation is input into a graph neural network to obtain multi-scale protein feature representations, and contrastive learning is introduced to eliminate the differences between protein feature representations of different scales; then a self-learning label relationship graph is constructed and the relations among labels are learned to obtain the label feature representation; finally, the protein feature representation is corrected through the label feature representation to guide the prediction of the protein action relation, and the corresponding predicted relation is then output;
in this embodiment, as shown in fig. 2, the prediction model includes a protein feature encoder module for extracting local features and global features of the protein, a Multi-Scale Graph Data Augmentation (MS-GDA) module for performing data enhancement and obtaining the multi-scale protein feature representation, a Self-Learning Label Relationship Graph (SL-LRG) module for learning the relations between labels, and a multi-label loss calculation module for performing the self-supervised learning task and the supervised learning task.
S3: and taking the prediction relationship of the pair of proteins to be detected as the prediction result of the protein action relationship, and further analyzing the cell biological process of the main body to which the proteins to be detected belong in normal and disease states based on the prediction result of the protein action relationship.
In this embodiment, the cell biological process of the target subject (human or animal) in normal and disease states can be analyzed based on the protein action relationship prediction result of the protein to be detected of the target subject (human or animal), and then the identification of the therapeutic target and the design of a new drug can be realized based on the analyzed data. The invention improves the effects of the analysis of the cell biological process, the identification of the therapeutic target and the design of the new medicine by obtaining a better prediction result of the protein action relationship.
The prediction model generates a multi-scale graph representation in a graph data enhancement mode, learns multi-scale protein feature representation by enhancing self feature representation through neighbor nodes in a graph neural network, eliminates the difference of different scale protein feature representations by introducing contrast learning and further improves the protein characterization capability, namely, an original data set is fully explored through graph data enhancement, graph neural network processing and contrast learning, the generalization capability of the protein feature representation can be improved, the prediction effect of the protein action relation can be improved, and the cell biological process of a main body to which the protein belongs in normal and disease states can be better analyzed.
Meanwhile, label information is introduced into the prediction model, the relationship among the labels is learned by constructing a self-learned label relationship diagram to obtain label characteristic representation, and then the learning of the protein interaction relationship is guided through the label characteristic representation, namely, the correlation information generated by various action relationships among the proteins can be fully explored through the relationship among the learning labels, the classification accuracy of the prediction model can be improved, the prediction effect of the protein action relationship can be further improved, and the cell biological process of a main body to which the proteins belong under normal and disease states can be better analyzed.
In a specific implementation process, a prediction model is trained through the following steps:
s201: acquiring a pair of proteins for training and inputting the proteins into a prediction model;
s202: extracting local features and global features of the protein through a protein feature encoder module to obtain protein feature representation with local information and global information;
s203: constructing an original graph of the protein action relationship based on the protein feature representation; perturbing the original graph through the multi-scale graph neural network module to obtain corresponding perturbed graphs; then inputting the original graph and the perturbed graphs into the graph neural network and outputting the original node feature representation and the perturbed node feature representations, namely the multi-scale protein feature representations; then fusing the original node feature representation and the perturbed node feature representations in a contrastive learning mode to obtain the fused node feature representation; finally, calculating the fused edge feature representation from the fused node feature representation;
s204: acquiring the label name embedding representation through the self-learning label relation graph module and constructing the label relation graph; then inputting the label relation graph into a graph convolutional neural network and outputting the label node feature representation;
s205: modifying the fused edge feature representation through the label node feature representation to obtain the protein relation graph connecting-edge feature representation;
s206: the multi-label loss calculation module performs self-supervised learning through the original node feature representation and the perturbed node feature representations to obtain a self-supervised learning loss function; then performs supervised learning through the protein relation graph connecting-edge feature representation to obtain a supervised learning loss function; finally, a training loss function is calculated based on the self-supervised learning loss function and the supervised learning loss function, and the parameters of the prediction model are optimized and updated through the training loss function;
s207: steps S201 to S206 are repeatedly performed until the prediction model converges.
When the prediction model is trained, multi-scale graph representations are generated by means of graph data enhancement, multi-scale protein feature representations are learned by enhancing each node's own feature representation with those of its neighbor nodes in a graph neural network, and contrastive learning is introduced to eliminate the differences between protein feature representations at different scales, further improving the protein characterization capability. Meanwhile, label information is introduced: the relationships among labels are learned by constructing a self-learning label relation graph to obtain a label feature representation, which then guides the learning of protein interaction relationships. That is, the original data are fully explored through graph data enhancement, graph neural network processing and contrastive learning, and the correlation information generated by the various action relationships between proteins is fully explored by learning the relationships among labels, which improves both the generalization capability of the protein feature representations and the classification accuracy of the prediction model, and therefore further improves the prediction effect on protein action relationships.
It should be noted that the prediction model of the present invention can be regarded as a Label-Guided Multi-scale Graph neural network protein action relationship prediction model (LGMG-PPI).
Proteins are composed of amino acids, of which 20 kinds are common. Define the amino acid vocabulary A = {a_1, a_2, …, a_20} and the protein set P = {p_1, p_2, …, p_N}, where p_i = (a_1, a_2, …, a_l_i), a_j ∈ A.
Define X = {x_ij | p_i, p_j ∈ P, i ≠ j} as the set of PPIs (protein action relationships), where I indicates whether a relationship exists between two proteins: if I(x_ij) = 1, protein p_i and protein p_j have an action relationship; if I(x_ij) = 0, protein p_i and protein p_j have no action relationship, or no action relationship between them has yet been found in current research work. With the above definitions, the PPIs graph G = (P, X) is constructed by taking proteins as nodes and PPIs as connecting edges.
The protein action relationship only indicates whether two proteins interact; however, multiple action relationships may exist between proteins. The task of the invention is to predict the multiple action relationships existing between proteins, which is a multi-label classification task. The invention defines the label set of PPIs as L = {l_1, l_2, …, l_t}, where t represents the number of action relationship types.
In a specific implementation process, the protein feature encoder module comprises a local feature encoder and a global feature encoder.
The local feature encoder comprises a Convolutional Neural Network (CNN) and a Global Max Pooling layer (GMP), and extracts the local feature representation h_i of an input protein p_i ∈ P by the following formula:
h_i = f_GMP(f_CNN(p_i; θ_CNN));
In the formula: f_CNN represents the convolution operation; f_GMP represents the max pooling layer operation; P represents the protein set; p_i = (a_1, a_2, …, a_l_i), where a_j represents an amino acid in the defined amino acid vocabulary A; θ_CNN represents the training parameters of the convolution operation.
The global feature encoder comprises a Bidirectional Gated Recurrent Unit (BiGRU) and a Global Average Pooling layer (GAP), and processes the input local feature representation h_i by the following formula to obtain the protein feature representation x_i ∈ X having both local information and global information:
x_i = f_GAP(f_BiGRU(h_i; θ_BiGRU));
In the formula: f_BiGRU represents the bidirectional gated recurrent operation; f_GAP represents the global average pooling layer operation; θ_BiGRU represents the training parameters of the bidirectional gated recurrent operation; X represents the set of protein feature representations obtained by the protein feature encoder module.
According to the invention, the local features and global features of the protein are extracted by feature encoding to obtain a protein feature representation carrying both local information and global information, which better improves the characterization capability of the protein.
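As an illustration of the encoder described above, the following NumPy sketch implements only the local stage (convolution followed by global max pooling), with hypothetical dimensions and randomly initialized parameters standing in for θ_CNN; the BiGRU and global-average-pooling stage would then be applied to the resulting representation in the same manner.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 20                 # size of the amino-acid vocabulary
EMB, OUT, K = 8, 16, 3     # embedding dim, conv channels, kernel width (assumed)

embed = rng.normal(size=(VOCAB, EMB))    # amino-acid embedding table
kernel = rng.normal(size=(OUT, K, EMB))  # convolution parameters (theta_CNN)

def f_cnn(seq):
    """1-D convolution over the embedded amino-acid sequence (valid padding)."""
    x = embed[seq]                           # (L, EMB)
    out = np.empty((len(seq) - K + 1, OUT))
    for t in range(out.shape[0]):
        window = x[t:t + K]                  # (K, EMB)
        out[t] = np.tensordot(kernel, window, axes=([1, 2], [0, 1]))
    return out

def f_gmp(features):
    """Global max pooling over sequence positions."""
    return features.max(axis=0)

p_i = rng.integers(0, VOCAB, size=30)    # a toy protein of 30 amino-acid indices
h_i = f_gmp(f_cnn(p_i))                  # local feature representation h_i
print(h_i.shape)                         # (16,)
```

Global max pooling makes h_i independent of the protein length, which is what allows proteins with different sequence lengths to share one downstream network.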
In the specific implementation process, the Multi-Scale Graph Data Augmentation (MS-GDA) module mainly comprises two graph data enhancement functions.
Define the original graph G = (X, A), with node features X ∈ R^(N×d) (the protein feature representations obtained by the protein feature encoder module serve as node features) and adjacency matrix A ∈ R^(N×N).
1) The multi-scale graph neural network module first applies the stochastic graph data enhancement functions τ_1 and τ_2 from two different viewing angles to perturb the connecting edges and node features of the original graph G = (X, A) respectively, obtaining the first perturbed graph G_1 = (X, A_1) and the second perturbed graph G_2 = (X_2, A).
τ_1 perturbs the connecting edges of the original graph G = (X, A) by randomly deleting connecting edges from the original graph's topology; τ_2 perturbs the node features of the original graph G = (X, A) by randomly setting some columns of the node features to 0:
A_1 = A ⊙ ε_1, ε_1 ~ Bernoulli(N, 1 − δ_1);
X_2 = X ⊙ ε_2, ε_2 = I(u ≥ δ_2), u ~ Uniform(0, 1);
In the formula: ε_1 represents the result of N Bernoulli draws obtained based on the hyper-parameter δ_1 ∈ (0, 1); A_1 represents the adjacency matrix obtained from the original graph by the graph data enhancement function τ_1; Bernoulli represents the Bernoulli distribution; δ_1 ∈ (0, 1) is a hyper-parameter representing the ratio of deleted connecting edges among the connecting edges of the original graph; ε_2 represents a column mask obtained from a uniform distribution based on the hyper-parameter δ_2 ∈ (0, 1); X_2 represents the node features obtained from the original graph by the graph data enhancement function τ_2; X represents the node features of the original graph; Uniform represents the uniform distribution; δ_2 ∈ (0, 1) is a hyper-parameter representing the ratio of node feature columns set to 0.
2) The original graph G = (X, A), the first perturbed graph G_1 = (X, A_1) and the second perturbed graph G_2 = (X_2, A) are respectively input into the graph convolution network (GIN is adopted in this embodiment), outputting the original node feature representation Z_0, the first perturbed node feature representation Z_1 and the second perturbed node feature representation Z_2.
GNN is one of the most effective graph representation learning methods at present; its main idea is to update a node's own feature representation by aggregating the features of its neighbor nodes. Through k iterations of aggregation and updating, a node's representation aggregates the representations of its k-hop neighbor nodes.
The graph neural network with k iterations is represented as:
a_v^(k) = AGG^(k)({h_u^(k−1) : u ∈ N(v)});
h_v^(k) = UPDATE^(k)(h_v^(k−1), a_v^(k)) = MLP^(k)((1 + ω^(k)) · h_v^(k−1) + a_v^(k));
In the formula: a_v^(k) represents the representation obtained by node v after aggregating the features of its neighbor nodes; AGG represents the function aggregating node features; h_u^(k−1) represents the result of k−1 iterations of node u based on the graph convolution network; N(v) represents the neighbor set of node v; UPDATE represents the node feature update function; h_v^(k−1) represents the result of k−1 iterations of node v based on the graph convolution network; h_v^(k) represents the feature representation of node v at the k-th iteration; MLP represents a multi-layer perceptron neural network; ω is a learnable parameter or constant.
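A single GIN-style iteration of the aggregation-and-update scheme above can be sketched as follows; the two-layer MLP, the random weights and ω = 0 are illustrative assumptions, not the patent's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def gin_layer(H, A, W1, W2, omega=0.0):
    """One GIN iteration: h_v <- MLP((1 + omega) * h_v + sum of neighbor features)."""
    agg = A @ H                          # AGG: sum over neighbors u in N(v)
    z = (1.0 + omega) * H + agg          # input of the UPDATE step
    return np.maximum(z @ W1, 0.0) @ W2  # two-layer MLP with ReLU

# a 3-node path graph and random initial node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = rng.normal(size=(3, 4))
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))

H1 = gin_layer(H0, A, W1, W2)   # after k = 1, each node has seen its 1-hop neighbors
H2 = gin_layer(H1, A, W1, W2)   # after k = 2, information from 2-hop neighbors arrives
print(H2.shape)                 # (3, 4)
```

Stacking k such layers realizes exactly the k-hop aggregation described in the text: after the second application, the two end nodes of the path have indirectly exchanged information through the middle node.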
3) The original node feature representation Z_0, the first perturbed node feature representation Z_1 and the second perturbed node feature representation Z_2 are fused by the following formula to obtain the fused node feature representation Z′:
Z′ = f_Fusion([Z_0, Z_1, Z_2]);
In the formula: f_Fusion represents the fusion function.
4) The fused edge feature representation E is obtained by calculation from the fused node feature representation Z′:
e_ij = z′_i ⊙ z′_j, e_ij ∈ E;
In the formula: ⊙ denotes the Hadamard product; z′_i and z′_j respectively represent the feature representations of node i and node j.
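The fusion and edge-construction steps can be sketched as follows; the patent does not specify f_Fusion, so concatenation followed by a linear projection is used here as one plausible choice, with Wf as a hypothetical parameter.

```python
import numpy as np

rng = np.random.default_rng(3)

N, d = 4, 6
Z0 = rng.normal(size=(N, d))     # original node representations
Z1 = rng.normal(size=(N, d))     # first perturbed-view representations
Z2 = rng.normal(size=(N, d))     # second perturbed-view representations

Wf = rng.normal(size=(3 * d, d)) # hypothetical parameters of the fusion function

def f_fusion(views):
    """One plausible fusion: concatenate the views, then project linearly."""
    return np.concatenate(views, axis=1) @ Wf

Zp = f_fusion([Z0, Z1, Z2])      # fused node feature representation Z'

def edge_feature(i, j):
    """e_ij = z'_i * z'_j (Hadamard product of the endpoint representations)."""
    return Zp[i] * Zp[j]

e01 = edge_feature(0, 1)
```

Because the Hadamard product is commutative, e_ij equals e_ji, so the edge representation does not depend on node ordering; this suits undirected PPI edges.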
In the specific implementation process, the invention adopts a self-learning mode to obtain the relation representation among labels and constructs a Self-Learning Label Relation Graph (SL-LRG).
First, a learnable parameter A_L ∈ R^(T×T) is set, where T represents the number of label categories, and A_L is initialized to the identity matrix as the initial topological structure of the label relation graph.
Then, the label name embedding representation is acquired through the pre-trained model BERT:
X_L = BERT(L_NAME);
In the formula: L_NAME represents the label names; X_L represents the word vectors of the label names, i.e., the label name embedding representation.
The label relation graph G_L = (A_L, X_L) is constructed from the label name embedding representation X_L and the learnable parameter matrix A_L.
The label relation graph G_L = (A_L, X_L) is input into a Graph Convolutional Network (GCN), outputting the label node feature representation Z_L:
Z_L^(l) = σ(D̃^(−1/2) A_L D̃^(−1/2) Z_L^(l−1) W^(l−1));
In the formula: the initialization is Z_L^(0) = X_L; D̃ represents the degree matrix; W^(l−1) represents a learnable parameter matrix; σ represents the sigmoid activation function; A_L is initialized to the identity matrix. The parameter A_L is updated through gradient back-propagation during model training, thereby learning the label relations hidden in the data and achieving the purpose of self-learning the label relation graph.
In the specific implementation process, the protein relation graph connecting-edge feature representation is calculated by the following formula:
Ê = E(Z_L)^T;
In the formula: Ê represents the protein relation graph connecting-edge feature representation; E represents the fused edge feature representation; Z_L represents the label node feature representation.
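The label branch can be sketched end to end as follows. The GCN layer follows the normalized propagation rule described above; random vectors stand in for the BERT label-name embeddings, and the edge correction is shown as a projection of each fused edge representation onto the label space, which is one plausible reading of the formula rather than a confirmed detail of the patent.

```python
import numpy as np

rng = np.random.default_rng(4)

T, d = 7, 6                          # number of label categories, embedding dim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_layer(A, Z, W):
    """Z <- sigma(D^{-1/2} A D^{-1/2} Z W), D being the degree matrix of A."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return sigmoid(A_norm @ Z @ W)

X_L = rng.normal(size=(T, d))        # stand-in for BERT label-name embeddings
A_L = np.eye(T)                      # learnable label adjacency, identity-initialized
W = rng.normal(size=(d, d))

Z_L = gcn_layer(A_L, X_L, W)         # label node feature representation

# label-guided correction: project each fused edge representation onto label space
E = rng.normal(size=(5, d))          # 5 fused edge representations
E_hat = E @ Z_L.T                    # per-edge, per-category scores, shape (5, 7)
```

With the identity initialization of A_L the GCN layer initially reduces to a per-label transform of the name embeddings; off-diagonal entries learned during training are what let related labels exchange information.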
In a specific implementation process, real-world data often contain noise, which can prevent a model from accurately representing the distribution of the original data and seriously affect its learning effect. To solve this problem, the invention introduces a self-supervised learning task into the prediction model, with the aim of increasing the accuracy of the main learning task and improving model performance by adding an auxiliary task.
The multi-label loss calculation module, in the self-supervised learning task, calculates a first perturbation loss function L_cl1 through the original node feature representation Z_0 and the first perturbed node feature representation Z_1, and calculates a second perturbation loss function L_cl2 through the original node feature representation Z_0 and the second perturbed node feature representation Z_2; the first perturbation loss function L_cl1 and the second perturbation loss function L_cl2 together constitute the self-supervised learning loss function:
ℓ(z_0,i, z_1,i) = −log( exp(θ(z_0,i, z_1,i)/τ) / Σ_{k=1}^{N} exp(θ(z_0,i, z_1,k)/τ) );
L_cl1 = (1/N) Σ_{i=1}^{N} ℓ(z_0,i, z_1,i);
ℓ(z_0,i, z_2,i) = −log( exp(θ(z_0,i, z_2,i)/τ) / Σ_{k=1}^{N} exp(θ(z_0,i, z_2,k)/τ) );
L_cl2 = (1/N) Σ_{i=1}^{N} ℓ(z_0,i, z_2,i);
In the formula: (z_1,i, z_0,i) represents a positive sample pair; z_0,i ∈ Z_0, z_1,i ∈ Z_1; θ(z_0,i, z_1,i) represents the cosine similarity of z_0,i and z_1,i; τ represents a temperature parameter whose function is to control the model's discrimination of negative samples, with smaller values paying more attention to hard negative samples; N represents the number of nodes.
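A minimal version of this contrastive (InfoNCE-style) objective, under the assumption that row i of each view forms the positive pair and all other rows of the opposite view act as negatives:

```python
import numpy as np

def info_nce(Z0, Z1, tau=0.5):
    """Cross-view InfoNCE loss: row i of Z0 and row i of Z1 are the positive pair."""
    Z0n = Z0 / np.linalg.norm(Z0, axis=1, keepdims=True)
    Z1n = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    sim = Z0n @ Z1n.T / tau                              # theta(z0_i, z1_k) / tau
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # -log softmax of positives

Z0 = np.eye(3)                         # three orthonormal node embeddings
aligned = info_nce(Z0, Z0)             # perturbed view identical to the original
shuffled = info_nce(Z0, Z0[[1, 2, 0]]) # positives deliberately misaligned
print(aligned < shuffled)              # True: aligned views yield the lower loss
```

The loss therefore pulls the two views of the same node together and pushes views of different nodes apart, which is exactly how the module eliminates the differences between the multi-scale representations.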
In a specific implementation process, the multi-label loss calculation module, in the supervised learning task, calculates the following supervised learning loss function from the protein relation graph connecting-edge feature representation:
p_ij = sigmoid(ê_ij);
ŷ_ij = argmax(p_ij);
L_sup = −(1/T) Σ_{c=1}^{T} Σ_{(i,j)∈E_train} [ y_ij^c · log p_ij^c + (1 − y_ij^c) · log(1 − p_ij^c) ];
In the formula: L_sup represents the supervised learning loss; T represents the number of label categories; E_train represents the connecting-edge set of the training set; p_ij represents the relation probability distribution between proteins i and j; ŷ_ij represents the predicted relations between proteins i and j; c represents a specific label category; y_ij^c represents the true label of proteins i and j in category c; p_ij^c represents the prediction result of proteins i and j in category c; argmax denotes taking the index of the largest element in the set.
In the specific implementation process, the training loss function is expressed by the following formula:
L = L_sup + λ_1 · L_cl1 + λ_2 · L_cl2;
In the formula: L represents the training loss; L_sup represents the supervised learning loss; L_cl1 represents the first perturbation loss; L_cl2 represents the second perturbation loss; λ_1 and λ_2 represent the set hyper-parameters.
In order to better illustrate the advantages of the technical solution of the present invention, the following experiments are also disclosed in this example.
1. Data set
This experiment follows the dataset settings of previous work (disclosed in LV G F, HU Z Q, BI Y G, et al. Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction) and evaluates the model using PPIs data from the STRING database (disclosed in SZKLARCZYK D, GABLE A L, LYON D, et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets).
The STRING database collects, scores and integrates most of the published PPIs data and establishes a comprehensive, objective PPIs network. Furthermore, Chen et al. (disclosed in CHEN M, JU C J T, ZHOU G, et al. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN) extracted two sub-datasets from STRING, called SHS27k and SHS148k respectively. Specific information on the three data sets is shown in Table 1. Each original data set is a protein network relation graph in which nodes represent proteins and connecting edges represent action relationships between proteins; in addition, since proteins are composed of amino acid sequences, this experiment counted the average length of the amino acid sequences constituting the proteins in each data set, also shown in Table 1.
Table 1 data set statistics
2. Experimental setup and evaluation index
In the experiment, 20% of the data in each data set is randomly selected as the test set, and, to eliminate the influence of the randomness of the data split on the performance of the PPI method, the experiment is repeated under 3 different random seeds. This experiment uses amino acid sequence-based protein features, referring to the amino acid embedding method used by Chen et al. (disclosed in CHEN M, JU C J T, ZHOU G, et al. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN) to represent each amino acid. The model updates all trainable parameters using the Adam algorithm. This experiment follows the experimental setup of previous work (disclosed in LV G F, HU Z Q, BI Y G, et al. Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction) and uses micro-F1 as the evaluation index.
3. Reference method
3.1 machine learning reference method
In the present experiment, three representative Machine Learning (ML) algorithms are selected as reference methods, namely the Support Vector Machine (SVM) (disclosed in GUO Y, YU L, WEN Z, et al. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences), Logistic Regression (LR) (disclosed in SILBERBERG Y, KUPIEC M, SHARAN R. A method for predicting protein-protein interactions), and Random Forest (RF) (disclosed in WONG L, YOU Z H, LI S, et al. Detection of protein-protein interactions).
3.2 deep learning reference method
The present experiment selects four Deep Learning (DL) algorithms for the PPIs prediction task, namely DPPI (disclosed in HASHEMIFAR S, NEYSHABUR B, KHAN A, et al. Predicting protein-protein interactions through sequence-based deep learning), DNN-PPI (disclosed in HASHEMIFAR S, NEYSHABUR B, KHAN A, et al. Predicting protein-protein interactions through sequence-based deep learning), PIPR (disclosed in CHEN M, JU C J T, ZHOU G, et al. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN), and GNN-PPI (disclosed in LV G F, HU Z Q, BI Y G, et al. Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction).
4. Comparative experiment
Table 2 shows the performance of the different methods on the different data sets; the result format is the micro-F1 mean ± standard deviation under three different random seeds, where LGMG-PPI is the model method proposed in this experiment.
TABLE 2 comparative study of the results
The following results were observed and analyzed:
1) The performance of the deep learning algorithms is generally superior to that of the machine learning algorithms, which shows that deep-learning-based techniques have an advantage in encoding various types of information about protein pairs (such as amino acid composition and its co-occurrence patterns) and automatically extracting robust information suited to the learning target. Second, as the size of the data set increases, the performance of every class of method also increases, because a larger amount of data allows the model to learn more sufficiently and gives it stronger generalization ability.
2) Compared with the optimal reference method GNN-PPI, the model method proposed in this experiment (LGMG-PPI) achieves a better and more stable prediction effect on all data sets: the micro-F1 score is raised by 2.01% on the SHS27k data set, 0.94% on the SHS148k data set, and 0.93% on the STRING data set. Since the optimal reference method is already quite strong, the fact that the proposed model method improves further upon it demonstrates its superiority.
5. Ablation experiment
In order to further analyze the effect of each module in the model, experiments are carried out by deleting different modules, and then the effectiveness of each module is verified. Thus, this experiment sets up the following ablation experiments:
(1) w/o τ_1: removing the τ_1 type of data enhancement in the multi-scale graph neural network module, i.e., not using the edge-perturbation data enhancement method;
(2) w/o τ_2: removing the τ_2 type of data enhancement in the multi-scale graph neural network module, i.e., not using the node-feature-perturbation data enhancement method;
(3) w/o MS-GDA: removing the multi-scale graph neural network module entirely, i.e., not using any graph data enhancement strategy;
(4) w/o SL-LRG: removing the label relation graph module, i.e., not using label information to guide learning.
TABLE 3 ablation experiment
The results of the experiment are shown in Table 3. From the experimental results, the data enhancement method that perturbs node features is slightly better than the one that perturbs connecting edges, and both graph data enhancement methods are beneficial to the model. This shows that graph data enhancement can strengthen the generalization capability of the model by perturbing the original graph data. In addition, when the label relation graph module is removed, the effect of the model decreases on all data sets: introducing the label relation graph module learns the implicit relationships between labels, obtains the hidden states of the labels, and guides the final prediction result. In general, each sub-module of the proposed model is beneficial to the model as a whole.
6. Self-learning tag relational graph effectiveness experiment
6.1 topological Structure validation experiment
The self-learning label graph further learns the label features by introducing a self-learned topological structure. To verify the effectiveness of this topology, the topological structure of the labels is not used and the GCN is replaced with a Multi-Layer Perceptron (MLP); specifically, the formula Z_L^(l) = σ(D̃^(−1/2) A_L D̃^(−1/2) Z_L^(l−1) W^(l−1)) is replaced by Z_L = f_MLP(X_L).
The results of the experiment are shown in FIG. 3. From experimental results, the effect of introducing the topological structure of the tag is obviously better. Therefore, certain relations exist among the labels of the PPIs prediction tasks, the implicit relation among the labels can be well learned through the self-learning label relation graph, and the effectiveness of the method provided by the invention is further proved.
6.2 node characteristic effectiveness test
The initial representation of the self-learning label relation graph node features is a word embedding representation, obtained in this experiment through the pre-trained model BERT (disclosed in DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding). In this section, the performance of the model under different word embedding representations is evaluated; specifically, the model effects under the BERT and One-Hot embedding representations are compared.
The results of the experiment are shown in FIG. 4. As can be seen from the figure, the multi-tag recognition accuracy is not significantly affected when different word embeddings are used as inputs to the GCN. This indicates that the effect enhancement achieved by the model does not come entirely from the semantic information derived from word embedding. Furthermore, using a powerful word embedding representation may lead to better performance. One possible reason is that word embedding learned from large text corpora retains certain semantic information, and the word embedding has certain relation in the embedding space, and the model can further improve the prediction capability of the model by using the implicit relation.
7. Summary of the invention
The invention provides a label-guided protein action relationship prediction method based on a multi-scale graph neural network. Multi-scale graph representations are obtained through graph data enhancement, the multi-scale graphs are input into a graph neural network to obtain multi-scale protein feature representations, and contrastive learning is introduced to further improve the characterization capability of the protein. In addition, a self-learning label relation graph is constructed to learn the relationships among labels, obtain an information representation of the labels, and guide the learning of the final protein relation prediction. Experimental results on 3 public data sets show that the model is effective on the protein action relationship prediction task, with a prediction effect superior to that of the optimal reference method.
It should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the technical solutions, and those skilled in the art should understand that the technical solutions of the present invention can be modified or substituted with equivalent solutions without departing from the spirit and scope of the technical solutions, and all should be covered in the claims of the present invention.

Claims (10)

1. The protein action relation prediction method based on label guidance is characterized by comprising the following steps:
s1: obtaining a pair of proteins to be predicted to be detected;
s2: inputting a pair of proteins to be detected into the trained prediction model, and outputting a corresponding prediction relation;
the prediction model firstly performs graph data enhancement based on the proteins to be detected to obtain multi-scale graph representations; secondly, the multi-scale graph representations are input into a graph neural network to obtain multi-scale protein feature representations, and contrastive learning is introduced to eliminate the differences between protein feature representations at different scales; then a self-learning label relation graph is constructed and the relations among labels are learned to obtain a label feature representation; finally, the protein feature representation is modified through the label feature representation to guide the prediction of the protein action relationship, and the corresponding prediction relation is output;
s3: and taking the prediction relationship of the pair of proteins to be detected as the prediction result of the protein action relationship, and further analyzing the cell biological process of the main body to which the proteins to be detected belong in normal and disease states based on the prediction result of the protein action relationship.
2. The method of claim 1 for predicting the protein action relationship based on label guidance, wherein: in step S2, the prediction model includes a protein feature encoder module for extracting local features and global features of the protein, a multi-scale graph neural network module for performing data enhancement, graph neural network processing, and contrastive learning, a self-learning label relation graph module for learning the relationships between labels, and a multi-label loss calculation module for performing a self-supervised learning task and a supervised learning task.
3. The method of claim 2 for predicting the protein action relationship based on the label guidance, wherein the method comprises the following steps: in step S2, the prediction model is trained by the following steps:
s201: acquiring a pair of proteins for training and inputting the proteins into a prediction model;
s202: extracting local features and global features of the protein through a protein feature encoder module to obtain protein feature representation with local information and global information;
s203: constructing an original graph of the protein action relationship based on the protein feature representation; perturbing the original graph through the multi-scale graph neural network module to obtain corresponding perturbed graphs; then inputting the original graph and the perturbed graphs into the graph neural network and outputting the original node feature representation and the perturbed node feature representations, namely the multi-scale protein feature representations; then fusing the original node feature representation and the perturbed node feature representations in a contrastive learning mode to obtain the fused node feature representation; finally, calculating the fused edge feature representation from the fused node feature representation;
s204: acquiring the label name embedding representation through the self-learning label relation graph module and constructing the label relation graph; then inputting the label relation graph into a graph convolutional neural network and outputting the label node feature representation;
s205: modifying the fused edge feature representation through the label node feature representation to obtain the protein relation graph connecting-edge feature representation;
s206: the multi-label loss calculation module performs self-supervised learning through the original node feature representation and the perturbed node feature representations to obtain a self-supervised learning loss function; then performs supervised learning through the protein relation graph connecting-edge feature representation to obtain a supervised learning loss function; finally, a training loss function is calculated based on the self-supervised learning loss function and the supervised learning loss function, and the parameters of the prediction model are optimized and updated through the training loss function;
s207: steps S201 to S206 are repeatedly performed until the prediction model converges.
4. The method of claim 3 for predicting the protein action relationship based on label guidance, wherein: in step S202, the protein feature encoder module includes a local feature encoder and a global feature encoder;
the local feature encoder comprises a convolutional neural network and a max pooling layer, and extracts the local feature representation $h_i$ of an input protein $p_i \in \mathcal{P}$ by the following formula:
$h_i = f_{GMP}(f_{CNN}(p_i; \theta_{CNN})), \quad p_i = (a_1, a_2, \dots),\ a_j \in \mathcal{A};$
in the formula: $f_{CNN}$ denotes the convolution operation; $f_{GMP}$ denotes the max pooling operation; $\mathcal{P}$ denotes the protein set; $\mathcal{A}$ denotes the defined amino acid vocabulary; $a_j$ denotes an amino acid in the amino acid vocabulary; $\theta_{CNN}$ denotes the trainable parameters of the convolution operation;
the global feature encoder comprises a bidirectional gated recurrent unit (BiGRU) and a global average pooling layer, and processes the input local feature representation $h_i$ by the following formula to obtain a protein feature representation $x_i \in X$ with both local and global information:
$x_i = f_{GAP}(f_{BiGRU}(h_i; \theta_{BiGRU}));$
in the formula: $f_{BiGRU}$ denotes the bidirectional gated recurrent operation; $f_{GAP}$ denotes the global average pooling operation; $\theta_{BiGRU}$ denotes the trainable parameters of the bidirectional gated recurrent operation; $X$ denotes the set of protein feature representations produced by the protein feature encoder module.
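The local half of this encoder, $h_i = f_{GMP}(f_{CNN}(p_i; \theta_{CNN}))$, can be sketched as follows. This is a minimal NumPy illustration: all sizes (20-letter amino-acid vocabulary, embedding dimension 8, 16 filters of width 3) are assumptions not fixed by the claim, and the BiGRU/global-average-pooling stage is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (the claim does not fix kernel widths or dimensions).
VOCAB, EMB, FILTERS, WIDTH = 20, 8, 16, 3
embed = rng.normal(size=(VOCAB, EMB))                # amino-acid embedding table
theta_cnn = rng.normal(size=(FILTERS, WIDTH * EMB))  # convolution weights

def f_cnn(seq_ids):
    """1-D convolution over the embedded amino-acid sequence."""
    x = embed[seq_ids]                               # (L, EMB)
    windows = np.stack([x[i:i + WIDTH].ravel()       # sliding windows of width 3
                        for i in range(len(seq_ids) - WIDTH + 1)])
    return windows @ theta_cnn.T                     # (L-WIDTH+1, FILTERS)

def f_gmp(feat):
    """Global max pooling over sequence positions."""
    return feat.max(axis=0)                          # (FILTERS,)

protein = rng.integers(0, VOCAB, size=50)            # toy protein of 50 residues
h_i = f_gmp(f_cnn(protein))                          # local feature representation
```

The max pooling collapses the position axis, so `h_i` has one entry per filter regardless of sequence length.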
5. The method of claim 3 for predicting the protein action relationship based on label guidance, wherein: in step S203, the original graph is defined as $G=(X,A)$, with node features $X \in \mathbb{R}^{N \times d}$ and adjacency matrix $A \in \{0,1\}^{N \times N}$;
1) The multi-scale graph neural network module first applies stochastic graph data enhancement functions $T_1$ and $T_2$ from two different views to perturb the edges and the node features of the original graph $G=(X,A)$ respectively, obtaining a first perturbed graph $G_1=(X,A_1)$ and a second perturbed graph $G_2=(X_2,A)$:
$G_1 = T_1(G) = (X, A_1), \quad \varepsilon_1 \sim \mathrm{Bernoulli}(N, 1-\delta_1);$
$G_2 = T_2(G) = (X_2, A), \quad X_2 = X \odot \varepsilon_2, \quad \varepsilon_2 \sim \mathrm{Uniform}(N, \delta_2);$
in the formula: $\varepsilon_1$ denotes the result of $N$ Bernoulli draws based on the hyper-parameter $\delta_1 \in (0,1)$; $A_1$ denotes the adjacency matrix obtained by deleting edges of the original graph's edge set $\mathcal{E}$ according to $\varepsilon_1$; Bernoulli denotes the Bernoulli distribution; $\delta_1 \in (0,1)$ is a hyper-parameter denoting the ratio of deleted edges; $\varepsilon_2$ denotes the result of a uniform-distribution draw based on the hyper-parameter $\delta_2 \in (0,1)$; $X_2$ denotes the protein features obtained through the graph data enhancement function $T_2$; $X$ denotes the node features of the original graph; Uniform denotes the uniform distribution; $\delta_2 \in (0,1)$ is a hyper-parameter denoting the ratio of node features set to 0;
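The two enhancement views can be sketched in NumPy as follows: edge dropping at ratio $\delta_1$ and node-feature masking at ratio $\delta_2$. The symmetric edge masking and the exact sampling scheme are assumptions, since the claim only fixes the two ratios:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4
X = rng.normal(size=(N, d))                      # node features
A = (rng.random((N, N)) < 0.5).astype(int)       # toy undirected adjacency
A = np.triu(A, 1); A = A + A.T

delta1, delta2 = 0.2, 0.3                        # hyper-parameters in (0, 1)

# T1: edge perturbation -- keep each edge with probability 1 - delta1.
keep = rng.binomial(1, 1 - delta1, size=A.shape)
keep = np.triu(keep, 1); keep = keep + keep.T    # mask edges symmetrically
A1 = A * keep                                    # first perturbed graph G1 = (X, A1)

# T2: node-feature masking -- zero out a delta2 fraction of feature entries.
mask = (rng.random(X.shape) >= delta2).astype(float)
X2 = X * mask                                    # second perturbed graph G2 = (X2, A)
```

Note that `T1` only removes edges (never adds them), matching the "deleted edges" ratio interpretation of $\delta_1$.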
2) The original graph $G=(X,A)$, the first perturbed graph $G_1=(X,A_1)$ and the second perturbed graph $G_2=(X_2,A)$ are respectively input into the graph neural network, which outputs the original node feature representation $Z_0$, the first perturbed node feature representation $Z_1$ and the second perturbed node feature representation $Z_2$; the graph neural network with $k$ iterations is expressed as:
$a_v^{(k)} = \mathrm{AGG}\big(\{ z_u^{(k-1)} : u \in \mathcal{N}(v) \}\big);$
$z_v^{(k)} = \mathrm{UPDATE}\big(z_v^{(k-1)}, a_v^{(k)}\big) = \mathrm{MLP}\big((1+\omega)\cdot z_v^{(k-1)} + a_v^{(k)}\big);$
in the formula: $a_v^{(k)}$ denotes the representation obtained by node $v$ after aggregating the features of its neighbor nodes; AGG denotes the neighbor-feature aggregation function; $z_u^{(k-1)}$ denotes the result of $k-1$ iterations of node $u$ based on the graph convolution network; $\mathcal{N}(v)$ denotes the neighbor set of node $v$; UPDATE denotes the node feature update function; $z_v^{(k-1)}$ denotes the result of $k-1$ iterations of node $v$ based on the graph convolution network; $z_v^{(k)}$ denotes the feature representation of node $v$ at the $k$-th iteration; MLP denotes a multi-layer perceptron neural network; $\omega$ is a learnable parameter or constant;
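The aggregate-then-update step above has the shape of a GIN-style layer; a one-iteration NumPy sketch under that assumption (sum aggregation, a two-layer MLP, and a toy 5-node ring graph as the illustrative input):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8
# Toy undirected graph: a 5-node ring.
A = np.zeros((N, N), int)
for v in range(N):
    A[v, (v + 1) % N] = A[(v + 1) % N, v] = 1

Z = rng.normal(size=(N, d))                      # node features z_v^(k-1)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
omega = 0.1                                      # the learnable scalar in the claim

def gnn_layer(Z, A, omega):
    agg = A @ Z                                  # AGG: sum over neighbors N(v)
    h = (1 + omega) * Z + agg                    # UPDATE: (1 + omega) * z_v + a_v
    return np.maximum(h @ W1, 0) @ W2            # MLP with one ReLU hidden layer

Z_next = gnn_layer(Z, A, omega)                  # z_v^(k) for every node v
```

Running the layer `k` times on each of $G$, $G_1$, $G_2$ would yield $Z_0$, $Z_1$, $Z_2$.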
3) The original node feature representation $Z_0$, the first perturbed node feature representation $Z_1$ and the second perturbed node feature representation $Z_2$ are fused by the following formula to obtain the fused node feature representation $Z'$:
$Z' = f_{Fusion}([Z_0, Z_1, Z_2]);$
in the formula: $f_{Fusion}$ denotes the fusion function;
4) The fused edge feature representation $E$ is obtained from the fused node feature representation $Z'$:
$e_{ij} = z'_i \odot z'_j, \quad e_{ij} \in E;$
in the formula: $\odot$ denotes the Hadamard product; $z'_i$ and $z'_j$ denote the feature representations of node $i$ and node $j$ respectively.
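A sketch of steps 3) and 4): the claim does not specify $f_{Fusion}$, so concatenation followed by a linear projection is assumed here; the edge feature is the Hadamard product exactly as stated:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 6
Z0, Z1, Z2 = (rng.normal(size=(N, d)) for _ in range(3))
Wf = rng.normal(size=(3 * d, d))                      # assumed fusion projection

# f_Fusion: assumed to be concatenation + linear projection back to d dims.
Z_prime = np.concatenate([Z0, Z1, Z2], axis=1) @ Wf   # (N, d)

# Edge feature e_ij: element-wise (Hadamard) product of the endpoint vectors.
e_01 = Z_prime[0] * Z_prime[1]
```

The Hadamard product keeps the edge feature in the same $d$-dimensional space as the node features, which is what allows the later label correction to score each edge against each label embedding.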
6. The method of claim 5 for predicting the protein action relationship based on label guidance, wherein: in step S204, the self-learning label relation graph module obtains the label name embedding representation through the pre-trained model BERT:
$X_L = \mathrm{BERT}(L_{NAME});$
in the formula: $L_{NAME}$ denotes the label names; $X_L$ denotes the word vectors of the label names, i.e., the label name embedding representation;
a label relation graph $G_L = (A_L, X_L)$ is constructed from the label name embedding representation $X_L$ and a learnable parameter matrix $A_L$;
the label relation graph $G_L = (A_L, X_L)$ is input into a graph convolutional neural network, which outputs the label node feature representation $Z_L$:
$Z_L^{(l)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} A_L \tilde{D}^{-\frac{1}{2}} Z_L^{(l-1)} W^{(l-1)}\big);$
in the formula: the initialization is $Z_L^{(0)} = X_L$; $\tilde{D}$ denotes the degree matrix; $W^{(l-1)}$ denotes a learnable parameter matrix; $\sigma$ denotes the sigmoid activation function; $A_L$ is initialized as an identity matrix.
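One layer of this label-graph GCN can be sketched as follows; the BERT name embeddings are replaced by random vectors and the dimensions (5 labels, width 8) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                                    # 5 labels; embedding width shrunk to 8
X_L = rng.normal(size=(T, d))                  # stand-in for BERT(L_NAME)
A_L = np.eye(T)                                # learnable; initialized as identity
W = rng.normal(size=(d, d))                    # learnable layer weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = np.diag(A_L.sum(axis=1))                   # degree matrix of A_L
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
# One GCN propagation step with symmetric normalization and sigmoid activation.
Z_L = sigmoid(D_inv_sqrt @ A_L @ D_inv_sqrt @ X_L @ W)
```

With `A_L` still at its identity initialization the propagation is trivial (each label only sees itself); during training, learned off-diagonal entries of `A_L` let label features mix.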
7. The method of claim 6 for predicting the protein action relationship based on label guidance, wherein: in step S205, the protein relation graph edge feature representation is calculated by the following formula:
$\tilde{E} = E \cdot Z_L^{\top};$
in the formula: $\tilde{E}$ denotes the protein relation graph edge feature representation; $E$ denotes the fused edge feature representation; $Z_L$ denotes the label node feature representation.
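The exact correction operator is not recoverable from the garbled source formula; the sketch below assumes the dot-product scoring $\tilde{E} = E Z_L^{\top}$, which yields one score per label class for each edge and is consistent with the per-class Softmax used in claim 9:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, T = 3, 8, 5                    # 3 edges, feature dim 8, 5 label classes
E = rng.normal(size=(M, d))          # fused edge feature representation
Z_L = rng.normal(size=(T, d))        # label node feature representation
E_tilde = E @ Z_L.T                  # each edge scored against every label
```

Each row of `E_tilde` is one edge's compatibility with the `T` relation labels, ready for a Softmax.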
8. The method of claim 7 for predicting the protein action relationship based on label guidance, wherein: in step S206, in the self-supervised learning task the multi-label loss calculation module calculates a first perturbation loss function $\mathcal{L}_1$ from the original node feature representation $Z_0$ and the first perturbed node feature representation $Z_1$, and a second perturbation loss function $\mathcal{L}_2$ from the original node feature representation $Z_0$ and the second perturbed node feature representation $Z_2$; the first perturbation loss function $\mathcal{L}_1$ and the second perturbation loss function $\mathcal{L}_2$ together constitute the self-supervised learning loss function:
$\mathcal{L}_1 = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\theta(z_{0,i}, z_{1,i})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\theta(z_{0,i}, z_{1,j})/\tau\big)};$
$\mathcal{L}_2 = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\theta(z_{0,i}, z_{2,i})/\tau\big)}{\sum_{j=1}^{N} \exp\big(\theta(z_{0,i}, z_{2,j})/\tau\big)};$
in the formula: $(z_{0,i}, z_{1,i})$ denotes a positive sample pair; $z_{0,i} \in Z_0$, $z_{1,i} \in Z_1$; $\theta(z_{0,i}, z_{1,i})$ denotes the cosine similarity of $z_{0,i}$ and $z_{1,i}$; $\tau$ denotes a temperature parameter; $N$ denotes the set of all nodes.
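The two perturbation losses are standard InfoNCE-style contrastive objectives; a NumPy sketch of one of them, where the positive pairs sit on the diagonal of the cosine-similarity matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, tau = 8, 16, 0.5               # toy sizes; tau is the temperature
Z0 = rng.normal(size=(N, d))         # original node representations
Z1 = rng.normal(size=(N, d))         # first-perturbation representations

def cos_sim(a, b):
    """Pairwise cosine similarity: theta(z_0,i, z_1,j) for all i, j."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T                   # (N, N)

sim = np.exp(cos_sim(Z0, Z1) / tau)
# Row i: anchor z_0,i; diagonal entry is its positive pair z_1,i,
# the rest of the row are the in-batch negatives.
loss = -np.mean(np.log(np.diag(sim) / sim.sum(axis=1)))
```

$\mathcal{L}_2$ is computed identically with `Z2` in place of `Z1`.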
9. The method of claim 8 for predicting the protein action relationship based on label guidance, wherein: in step S206, in the supervised learning task the multi-label loss calculation module calculates the following supervised learning loss function from the protein relation graph edge feature representation:
$\mathcal{L}_{sup} = -\sum_{(i,j)\in \mathcal{E}_{train}} \sum_{c=1}^{T} y_{ij}^{c} \log\big(p_{ij}^{c}\big);$
$p_{ij} = \mathrm{Softmax}(e_{ij});$
$\hat{y}_{ij} = \arg\max(p_{ij});$
in the formula: $\mathcal{L}_{sup}$ denotes the supervised learning loss; $T$ denotes the number of label categories; $\mathcal{E}_{train}$ denotes the edge set of the training set; $p_{ij}$ denotes the relation probability distribution between proteins $i$ and $j$; $\hat{y}_{ij}$ denotes the predicted relation between proteins $i$ and $j$; $c$ denotes a specific label category; $y_{ij}^{c}$ denotes the true label of proteins $i$ and $j$ in category $c$; $p_{ij}^{c}$ denotes the predicted result for proteins $i$ and $j$ in category $c$; argmax denotes taking the index of the largest element in the set.
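A NumPy sketch of this supervised edge loss: a per-edge Softmax over the $T$ classes, cross-entropy against one-hot labels, and an argmax prediction; the sizes and random inputs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 4, 5                               # 4 training edges, 5 relation classes
E_tilde = rng.normal(size=(M, T))         # label-corrected edge features e_ij
y = np.eye(T)[rng.integers(0, T, size=M)] # one-hot true labels y_ij^c

def softmax(x):
    z = np.exp(x - x.max(axis=1, keepdims=True))   # numerically stable
    return z / z.sum(axis=1, keepdims=True)

p = softmax(E_tilde)                      # p_ij: per-edge class distribution
loss_sup = -np.sum(y * np.log(p))         # cross-entropy over training edges
y_hat = p.argmax(axis=1)                  # predicted relation class per edge
```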
10. The method of claim 9 for predicting the protein action relationship based on label guidance, wherein: in step S206, the training loss function is expressed by the following formula:
$\mathcal{L} = \mathcal{L}_{sup} + \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2;$
in the formula: $\mathcal{L}$ denotes the training loss; $\mathcal{L}_{sup}$ denotes the supervised learning loss; $\mathcal{L}_1$ denotes the first perturbation loss; $\mathcal{L}_2$ denotes the second perturbation loss; $\lambda_1$ and $\lambda_2$ denote set hyper-parameters.
CN202210828104.6A 2022-07-13 2022-07-13 Label guidance-based protein action relation prediction method Pending CN115206423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210828104.6A CN115206423A (en) 2022-07-13 2022-07-13 Label guidance-based protein action relation prediction method


Publications (1)

Publication Number Publication Date
CN115206423A true CN115206423A (en) 2022-10-18

Family

ID=83582218


Country Status (1)

Country Link
CN (1) CN115206423A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117672353A (en) * 2023-12-18 2024-03-08 南京医科大学 Space-time proteomics deep learning prediction method for protein subcellular migration



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination