CN115206423A

CN115206423A - Label guidance-based protein action relation prediction method

Info

Publication number: CN115206423A
Application number: CN202210828104.6A
Authority: CN
Inventors: 朱小飞; 王新生
Original assignee: Chongqing University of Technology
Current assignee: Chongqing University of Technology
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2022-10-18

Abstract

The invention specifically relates to a protein action relation prediction method based on label guidance, which comprises the following steps: obtaining a pair of proteins to be predicted; inputting a pair of proteins to be detected into the trained prediction model, and outputting a corresponding prediction relation; firstly, enhancing graph data based on protein to be detected to obtain a multi-scale graph representation; secondly, inputting the multi-scale graph representation into a graph neural network to obtain multi-scale protein feature representation, and introducing contrast learning to eliminate the difference of different scale protein feature representations; then constructing a self-learning label relation graph and learning the relation among labels to obtain label characteristic representation; finally, the characteristic representation of the modified protein is represented through the label characteristic and the prediction of the protein action relation is guided; and taking the predicted relationship of a pair of proteins to be detected as the prediction result of the protein action relationship. The method can improve the generalization capability of protein characteristic representation and the classification accuracy of the prediction model, thereby improving the prediction effect of the protein action relation.

Description

Label guidance-based protein action relation prediction method

Technical Field

The invention relates to the technical field of biological information and natural language processing, in particular to a protein action relation prediction method based on label guidance.

Background

Protein-protein interactions play key roles in a wide range of biological processes, such as DNA replication, transcription, translation, and transmembrane signal transduction. Therefore, detection of Protein-Protein Interactions (PPIs) and the type of Protein Interactions are critical to understanding the cellular biological processes in normal and disease states, and such studies are also helpful in the identification of therapeutic targets and the design of new drugs. In early work on protein action relationships, laboratory-based methods were used, mainly involving yeast two-hybrid screening, protein chip and mass spectrometry protein complex identification, and the like. Laboratory experiments are often time consuming and labor intensive, resulting in inefficient identification of protein action relationships, while laboratory-based methods generate incomplete protein action relationship data due to limitations of laboratory experiments.

In the existing research work for predicting the protein action relationship of the deep learning algorithm, a Convolutional Neural Network (CNN) is mainly used for extracting local features of the protein or a Recurrent Neural Network (RNN) is used for storing context long-distance dependence information. However, such deep learning algorithms still have many problems such as inability to efficiently filter and aggregate local features of proteins, difficulty in simultaneously retaining important context and amino acid information of sequences, no utilization of interaction between protein pairs, and the like. With the development of Graph Neural Networks (GNNs), the prior art began to predict by constructing protein action Network graphs and introducing Graph Neural networks. The method not only considers the influence of protein pairs, but also can enhance the characteristic representation of the method through the relationship between the protein pairs, thereby further improving the effect of predicting the protein action relationship.

However, the applicant found in actual research that the conventional method for predicting the protein action relationship based on the graph neural network only constructs a protein action network graph and a protein feature representation based on an original data set, and does not fully search the original data set, so that the generalization ability of the protein feature representation is insufficient, and the effect of predicting the protein action relationship is not good. Meanwhile, a plurality of action relations often exist among the proteins, the action relations may have mutual correlation information, and the existing method for realizing the prediction of the protein action relations based on the graph neural network does not consider the mutual correlation information among the proteins, so that the classification accuracy of the protein action relation prediction model is insufficient. Therefore, how to design a method capable of improving the generalization capability of protein feature representation and the classification accuracy of a prediction model is an urgent technical problem to be solved.

Disclosure of Invention

Aiming at the defects of the prior art, the technical problems to be solved by the invention are as follows: how to provide a protein action relation prediction method based on label guidance to improve the generalization capability of protein characteristic representation and the classification accuracy of a prediction model, thereby improving the prediction effect of the protein action relation and further better analyzing the cell biological process of a subject to which the protein belongs under normal and disease states.

In order to solve the technical problem, the invention adopts the following technical scheme:

the protein action relation prediction method based on label guidance comprises the following steps:

s1: obtaining a pair of proteins to be predicted;

s2: inputting a pair of proteins to be detected into the trained prediction model, and outputting a corresponding prediction relation;

firstly, enhancing graph data based on the protein to be detected by the prediction model to obtain a multi-scale graph representation; secondly, the multi-scale graph is represented and input into a graph neural network to obtain multi-scale protein feature representation, and contrast learning is introduced to eliminate the difference of different scale protein feature representations; then constructing a self-learning label relation graph and learning the relation among labels to obtain label characteristic representation; finally, the characteristic representation of the modified protein is represented through the label characteristic, the prediction of the action relation of the protein is guided, and then the corresponding prediction relation is output;

s3: and taking the prediction relationship of the pair of proteins to be detected as the prediction result of the protein action relationship, and further analyzing the cell biological process of the main body to which the proteins to be detected belong in normal and disease states based on the prediction result of the protein action relationship.

Preferably, in step S2, the prediction model includes a protein feature encoder module for extracting local features and global features of the protein, a multi-scale graph neural network module for performing data enhancement, graph neural network processing and contrast learning, a self-learning label relationship graph module for learning the relationship between labels, and a multi-label loss calculation module for performing a self-supervised learning task and a supervised learning task.

Preferably, in step S2, the prediction model is trained by the following steps:

s201: acquiring a pair of proteins for training and inputting the proteins into a prediction model;

s202: extracting local features and global features of the protein through a protein feature encoder module to obtain protein feature representation with local information and global information;

s203: constructing an original graph of the protein action relation based on the protein characteristic representation; disturbing the original graph through a multi-scale graph neural network module to obtain a corresponding disturbed graph; then inputting the original graph and the disturbance graph into a graph neural network, and outputting original node characteristic representation and disturbance node characteristic representation, namely multi-scale protein characteristic representation; then fusing the original node characteristic representation and the disturbance node characteristic representation in a comparative learning mode to obtain a fusion node characteristic representation; finally, fusion edge feature representation is obtained through fusion node feature representation calculation;

s204: acquiring label name embedding representation through a self-learning label relation graph module, and constructing a label relation graph; then inputting the label relation graph into a graph convolution neural network, and outputting label node characteristic representation;

s205: the feature representation of the modified fusion edge is represented through the feature representation of the label node, and the feature representation of the protein relational graph connecting the edges is obtained;

s206: the multi-label loss calculation module carries out self-supervision learning through original node characteristic representation and disturbance node characteristic representation to obtain a self-supervision learning loss function; then, supervised learning is carried out through protein relational graph continuous edge feature representation to obtain a supervised learning loss function; finally, calculating based on the self-supervised learning loss function and the supervised learning loss function to obtain a training loss function, and optimizing and updating parameters of the prediction model through the training loss function;

s207: steps S201 to S206 are repeatedly performed until the prediction model converges.

Preferably, in step S202, the protein feature encoder module includes a local feature encoder and a global feature encoder;

the local feature encoder comprises a convolutional neural network and a maximum pooling layer, and extracts input protein by the following formula

Local feature in (1) represents h _i ；

h _i ＝f _GMP (f _CNN (p _i ；θ _CNN ))；

In the formula: f. of _CNN Represents a convolution operation; f. of _GMP Represents a max pooling layer operation;

represents a collection of proteins;

represents a defined vocabulary of amino acids; a is _j Represents an amino acid in the amino acid vocabulary; theta _CNN Training parameters representing convolution operations;

the global feature encoder comprises a bidirectional gating circulation unit and a global average pooling layer, and extracts an input local feature representation h through the following formula _i To obtain a protein feature representation x having local information and global information _i ∈X；

x _i ＝f _GAP (f _BiGRU (h _i ；θ _BiGRU ))；

In the formula: f. of _BiGRU Representing a bidirectional gated loop operation; f. of _GAP Representing a global average pooling layer operation; theta _BiGRU Training parameters representing a bi-directional gated loop operation; x represents the protein feature representation obtained based on the protein feature encoder module.

Preferably, in step S203, an original graph G = (X, a), node characteristics are defined

And adjacency matrix

1) The multi-scale graph neural network module first applies a stochastic graph data enhancement function through two different viewing angles

And

disturbing the continuous edges and node characteristics of the original graph G = (X, A) respectively to obtain a first disturbed graph G1= (X, A) ₁ ) And a second disturbing image G2= (X) ₂ ,A)；

∈ ₁ ～Bernoulli(N,1-δ ₁ )；

In the formula: e is a ₁ The representation is based on a hyper-parameter delta ₁ E (0, 1) isThe resulting N Bernoulli distribution;

representation based on the first perturbation graph G1= (X, a) ₁ ) Enhancing functions by graph data

The obtained protein characteristics are expressed;

representing a set of connected edges of an original graph; bernoulli represents Bernoulli distribution; delta. For the preparation of a coating ₁ E (0, 1) is a hyper-parameter and represents the ratio of deleting continuous edges;

the representation is based on a hyper-parameter delta ₂ The result of uniform distribution obtained by epsilon (0, 1); x ₂ Representation based on the first perturbation graph G1= (X, a) ₁ ) Enhancing functions by graph data

The obtained protein characteristics are expressed; x represents the node characteristics of the original graph; unifrom denotes uniform distribution; delta ₂ E (0, 1) is a hyper-parameter and represents the ratio of the node characteristic set to 0;

2) An original graph G = (X, A), a first perturbative graph G1= (X, A) ₁ ) And a second disturbing graph G2= (X) ₂ And A) inputting the data into a neural network of the graph respectively and outputting the characteristic representation of the original node

First perturbation node signature representation

And a second perturbation node signature representation

The graph neural network with k iterations is represented as:

in the formula:

representing the representation obtained by the node v after aggregating the characteristics of the neighbor nodes; AGG represents a function of the aggregation node characteristics;

or

Representing the result of k-1 iterations of the node u based on the graph convolution network;

a neighbor set representing node v; UPDATE represents a node feature UPDATE function;

or

Representing the result of k-1 iterations of the node v based on the graph convolution network;

or

Representing a feature representation of the kth iteration of node v; MLP represents a multi-layer perceptron neural network; ω is a learnable parameter or constant;

3) Fusing raw node feature representations by the following formula

First perturbation node characterization

And the second disturbance node characteristic representation to obtain a fusion node characteristic representation

Z'＝f _Fusion ([Z ₀ ,Z ₁ ,Z ₂ ])；

In the formula: f. of _Fusion Representing a fusion function;

4) Obtaining a fused edge feature representation by fusing a node feature representation Z

e _ij ＝z′ _i ⊙z′ _j ，e _ij ∈E；

In the formula: an indication of a hadamard product;

and

respectively representing the feature representations of node i and node j.

Preferably, in step S204, the self-learning label relationship graph module obtains the label name embedding representation through the pre-training model BERT;

X _L ＝BERT(L _NAME )；

in the formula:

represents a label name;

a word vector representing the tag name, i.e., a tag name embedded representation;

embedding representation X by tag name _L And learnable parameter matrix

Constructing a tag relationship graph G _L ＝(A _L ,X _L )；

Map G of label relationship _L ＝(A _L ,X _L ) Output label node feature representation in input graph convolution neural network

In the formula: initialization Z _L ⁽⁰⁾ ＝X _L ；

A representation degree matrix; w is a group of ^(l-1) Representing a learnable parameter matrix; sigma represents a sigmoid activation function; a. The _L Is initialized to an identity matrix.

Preferably, in step S205, the protein relationship graph edge connection feature representation is calculated by the following formula;

in the formula:

representing a protein relation graph with edge characteristics; e represents fused edge feature representation;

and (4) label node feature representation.

Preferably, in step S206, the multi-label loss calculation module represents the original node feature in the self-supervised learning task

And a first perturbation node characterization

Calculating a first disturbance loss function

By original node feature representation

And a second perturbation node characterization

Calculating a second disturbance loss function

First disturbance loss function

And a second disturbance loss function

Namely, the function of the loss of the self-supervision learning is obtained;

in the formula: (z) _1,i ,z _0,i ) Represents an active sample pair; z is a radical of formula _0,i ∈Z ₀ ，z _1,i ∈Z ₁ ；θ(z _0,i ,z _1,i ) Representing the calculation of z _0,i And z _1,i Cosine similarity of (a); τ represents a temperature parameter; n represents all node sets.

Preferably, in step S206, the multi-label loss calculating module calculates the following supervised learning loss function through the protein relation graph edge-to-edge feature representation in the supervised learning task;

p _ij ＝Softmax(e _ij )；

in the formula:

represents a supervised learning loss; t represents the number of label categories;

representing a training set of connected edge sets; p is a radical of formula _ij Representing a relational probability distribution between proteins i and j;

a predicted relationship representing the relationship between proteins i and j; c represents a specific category of the label;

true labels that represent proteins i and j in class c;

represents the predicted results of proteins i and j in class c; argmax denotesThe largest subscript in the set is taken to indicate.

Preferably, in step S206, the training loss function is expressed by the following formula;

in the formula:

represents a loss of training;

represents a supervised learning loss;

representing a first perturbation loss;

representing a second perturbation loss; lambda [ alpha ] ₁ And λ ₂ Indicating the set hyper-parameter.

The protein action relation prediction method based on label guidance has the following beneficial effects:

the prediction model generates a multi-scale graph representation in a graph data enhancement mode, learns multi-scale protein feature representation by enhancing self feature representation through neighbor nodes in a graph neural network, eliminates the difference of different scale protein feature representations by introducing contrast learning and further improves the protein characterization capability, namely, an original data set is fully explored through graph data enhancement, graph neural network processing and contrast learning, the generalization capability of the protein feature representation can be improved, the prediction effect of the protein action relation can be improved, and the cell biological process of a main body to which the protein belongs in normal and disease states can be better analyzed.

Meanwhile, label information is introduced into the prediction model, the relationship among the labels is learned by constructing a self-learned label relationship diagram to obtain label characteristic representation, and then the learning of the protein interaction relationship is guided by the label characteristic representation, namely the correlation information generated by various interaction relationships among the proteins can be fully explored by learning the relationship among the labels, so that the classification accuracy of the prediction model can be improved, the prediction effect of the protein interaction relationship can be further improved, and the cell biological process of a main body to which the proteins belong under normal and disease states can be better analyzed.

Drawings

For a better understanding of the objects, solutions and advantages of the present invention, reference will now be made in detail to the present invention, which is illustrated in the accompanying drawings, in which:

FIG. 1 is a logic diagram of a tag-based protein interaction relationship prediction method;

FIG. 2 is a diagram of a network architecture of a predictive model (LGMG-PPI);

FIG. 3 is a schematic diagram of SL-LRG topology validation;

FIG. 4 is a diagram illustrating a feature validity verification of a SL-LRG node.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings or orientations or positional relationships that the present product is conventionally placed in use, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the device or element referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance. Furthermore, the terms "horizontal", "vertical" and the like do not imply that the components are required to be absolutely horizontal or pendant, but rather may be slightly inclined. For example, "horizontal" merely means that the direction is more horizontal than "vertical" and does not mean that the structure must be perfectly horizontal, but may be slightly inclined. In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly and may, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

The following is further detailed by the specific embodiments:

example (b):

the embodiment discloses a protein action relation prediction method based on label guidance.

As shown in FIG. 1, the method for predicting the protein action relationship based on the label guidance comprises the following steps:

s1: obtaining a pair of proteins to be predicted to be detected;

firstly, enhancing graph data based on the protein to be detected by the prediction model to obtain a multi-scale graph representation; secondly, the multi-scale graph is represented and input into a graph neural network to obtain multi-scale protein feature representation, and contrast learning is introduced to eliminate the difference of different scale protein feature representations; then constructing a self-learning label relation graph and learning the relation among labels to obtain the characteristic representation of the labels; finally, the characteristic representation of the modified protein is expressed through the label characteristic, the prediction of the action relationship of the protein is guided, and the corresponding prediction relationship is output;

in this embodiment, as shown in fig. 2, the prediction model includes a protein feature encoder module for extracting local features and global features of the protein, a Multi-Scale map Data enhancement (MS-GDA) module for performing Data enhancement and obtaining Multi-Scale protein feature representation, a Self-Learning Label relationship Graph (SL-LRG) module for Learning a relationship between labels, and a Multi-Label loss calculation module for performing a Self-Learning task and a supervised Learning task.

In this embodiment, the cell biological process of the target subject (human or animal) in normal and disease states can be analyzed based on the protein action relationship prediction result of the protein to be detected of the target subject (human or animal), and then the identification of the therapeutic target and the design of a new drug can be realized based on the analyzed data. The invention improves the effects of the analysis of the cell biological process, the identification of the therapeutic target and the design of the new medicine by obtaining a better prediction result of the protein action relationship.

Meanwhile, label information is introduced into the prediction model, the relationship among the labels is learned by constructing a self-learned label relationship diagram to obtain label characteristic representation, and then the learning of the protein interaction relationship is guided through the label characteristic representation, namely, the correlation information generated by various action relationships among the proteins can be fully explored through the relationship among the learning labels, the classification accuracy of the prediction model can be improved, the prediction effect of the protein action relationship can be further improved, and the cell biological process of a main body to which the proteins belong under normal and disease states can be better analyzed.

In a specific implementation process, a prediction model is trained through the following steps:

s203: constructing an original graph of the protein action relation based on the protein characteristic representation; disturbing the original graph through a multi-scale graph neural network module to obtain a corresponding disturbance graph; then inputting the original graph and the disturbance graph into a graph neural network, and outputting original node characteristic representation and disturbance node characteristic representation, namely multi-scale protein characteristic representation; then fusing the original node characteristic representation and the disturbance node characteristic representation in a comparative learning mode to obtain a fusion node characteristic representation; finally, fusion edge feature representation is obtained through fusion node feature representation calculation;

s206: the multi-label loss calculation module carries out self-supervision learning through original node characteristic representation and disturbance node characteristic representation to obtain a self-supervision learning loss function; then, supervised learning is carried out through protein relational graph continuous edge feature representation to obtain a supervised learning loss function; finally, calculating based on the self-supervised learning loss function and the supervised learning loss function to obtain a training loss function, and further optimizing and updating parameters of the prediction model through the training loss function;

When a prediction model is trained, a multi-scale graph representation is generated in a graph data enhancement mode, multi-scale protein feature representation is learned by enhancing own feature representation through neighbor nodes in a graph neural network, and the difference of different scale protein feature representations is eliminated by introducing contrast learning, so that the protein characterization capability is further improved; meanwhile, label information is introduced, the relation among labels is learned by constructing a self-learned label relation graph to obtain label characteristic representation, and then the learning of the protein interaction relation is guided through the label characteristic representation, namely, original data can be fully explored through graph data enhancement, graph neural network processing and comparison learning, the mutual association information generated by various action relations among proteins can be fully explored through the relation among the learning labels, and further the generalization capability of the protein characteristic representation and the classification accuracy of a prediction model can be improved, so that the prediction effect of the protein action relation can be further improved.

It should be noted that the prediction model of the present invention can be regarded as a protein action relationship (LGMG-PPI) prediction model based on Label-Guided Multi-scale Graph Neural networks.

Proteins are composed of amino acids, and the common amino acids are 20 kinds. Defining a vocabulary of amino acids

Protein assembly

Wherein

Definition of

Is a collection of PPIs (protein action relationships), where I denotes whether a relationship exists between two proteins, if I (x) _ij ) =1, represents protein p _i And protein p _j There is an action relationship between them; if I (x) _ij ) =0, represents protein p _i And protein p _j There is no functional relationship between them, or it means that there is no functional relationship between them found in the current research work. By the above definition, the PPIs graph is constructed by taking proteins as nodes and PPIs as continuous edges

The protein action relation only indicates whether the protein has the interaction relation or not, however, a plurality of action relations may exist among the proteins, and the task of the invention is to predict the plurality of action relations existing among the proteins and is a multi-label classification task. The invention defines the label set of PPIs as

Wherein t represents t action relationships.

In a specific implementation process, the protein characteristic encoder module comprises a local characteristic encoder and a global characteristic encoder;

the local feature encoder comprisesConvolutional Neural Network (CNN) and maximum Pooling layer (Global Max Pooling, GMP), the input protein was extracted by the following formula

Local feature in (1) represents h _i ；

h _i ＝f _GMP (f _CNN (p _i ；θ _CNN ))；

In the formula: f. of _CNN Representing a convolution operation; f. of _GMP Representing a max pooling level operation;

represents a collection of proteins;

representing a defined vocabulary of amino acids; a is _j Represents an amino acid in the amino acid vocabulary; theta _CNN Training parameters representing convolution operations;

the Global feature encoder comprises a Bidirectional Gate recovery Unit (BiGRU) and a Global average Pooling layer (GAP), and extracts an input local feature representation h through the following formula _i To obtain a protein feature representation x having local information and global information _i ∈X；

x _i ＝f _GAP (f _BiGRU (h _i ；θ _BiGRU ))；

In the formula: f. of _BiGRU Representing a bidirectional gated loop operation; f. of _GAP Representing a global average pooling layer operation; theta.theta. _BiGRU Training parameters representing a bi-directional gated loop operation; x represents the protein feature representation obtained based on the protein feature encoder module.

According to the invention, the local features and the global features of the protein are extracted in a feature coding mode to obtain the protein feature representation with the local information and the global information, so that the characterization capability of the protein can be better improved.

In the specific implementation process, a Multi-Scale Graph Data enhancement (MS-GDA) module mainly comprises two Graph Data enhancement functions.

Defining an original graph G = (X, A), node characteristics

(protein features obtained by the protein feature encoder module are expressed as node features) and adjacency matrix

And

disturbing the continuous edges and node characteristics of the original graph G = (X, A) respectively to obtain a first disturbed graph G1= (X, A) ₁ ) And a second disturbing graph G2= (X) ₂ ,A)；

And (3) disturbing the continuous edges of the original graph G = (X, A), and randomly deleting the continuous edges of the topology structure of the original graph.

Perturbing the node characteristics of the original graph G = (X, a), and randomly setting some columns of the node characteristics of the original graph to 0.

∈ ₁ ～Bernoulli(N,1-δ ₁ )；

In the formula: e is the same as ₁ The representation is based on a hyper-parameter delta ₁ The result of N Bernoulli distribution obtained by epsilon (0, 1);

The obtained protein characteristics are expressed;

representing a set of connected edges of an original graph; bernoulli represents Bernoulli distribution; delta ₁ Epsilon (0, 1) is a hyper-parameter and represents the ratio of deleting continuous edges;

the representation is based on a hyper-parameter delta ₂ The result of uniform distribution obtained by epsilon (0, 1); x ₂ Representation is based on the first perturbative graph G1= (X, a) ₁ ) Enhancing functions by graph data

The obtained protein characteristics are expressed; x represents a radicalA node characteristic of the starting graph; unifrom denotes uniform distribution; delta ₂ The epsilon (0, 1) is a hyper-parameter and represents the ratio of the node characteristic set to 0;

2) An original graph G = (X, A), a first perturbative graph G1= (X, A) ₁ ) And a second disturbing image G2= (X) ₂ A) inputting the data into the graph convolution network (GIN is adopted in the embodiment) respectively, and outputting the original node feature representation

First perturbation node signature representation

And a second perturbation node signature representation

GNN is one of the most effective graph representation learning methods at present, and the main idea is to update the feature representation of the own node by aggregating the features of neighboring nodes. Through k iterations of aggregation and updating, the node representation aggregates representations of its k-hop neighbor nodes.

The graph neural network with k iterations is represented as:

in the formula:

or

or

or

3) Fusing raw node feature representations by the following formula

First perturbation node signature representation

And second perturbation node feature representation to obtain fusion node feature representation

Z'＝f _Fusion ([Z ₀ ,Z ₁ ,Z ₂ ])；

In the formula: f. of _Fusion Representing a fusion function;

e _ij ＝z′ _i ⊙z′ _j ，e _ij ∈E；

In the formula: an indication of a hadamard product;

and

respectively representing the feature representations of node i and node j.

In the specific implementation process, the invention adopts a Self-Learning mode to obtain the Relation representation among the labels and construct a Self-Learning Label Relation Graph (SL-LRG).

First, a learnable parameter is set

T represents the number of the types of the labels, and AL is initialized to be an identity matrix to be used as an initial topological structure of the label relation graph.

Then, acquiring label name embedding representation through a pre-training model BERT;

X _L ＝BERT(L _NAME )；

in the formula:

represents a label name;

representation X by tag name embedding _L And learnable parameter matrix

Constructing a tag relationship graph G _L ＝(A _L ,X _L )；

Map G of label relationship _L ＝(A _L ,X _L ) Inputting into Graph Convolution neural Network (GCN), and outputting tag nodeSymbolization

In the formula: initialization Z _L ⁽⁰⁾ ＝X _L ；

A representation degree matrix; w ^(l-1) Representing a learnable parameter matrix; sigma represents a sigmoid activation function; a. The _L Is initialized to an identity matrix. Updating parameter A through gradient feedback in model training process _L And then learning the label relation hidden in the data, thereby achieving the purpose of self-learning the label relation graph.

In the specific implementation process, the protein relational graph connection edge characteristic expression is calculated through the following formula;

in the formula:

representing a protein relation graph with edge characteristics; e represents a fused edge feature representation;

and (4) label node characteristic representation.

In a specific implementation process, in real life, data often contain noises, and the noises can cause that a model cannot accurately represent the distribution of original data, so that the learning effect of the model is seriously influenced. In order to solve the problem, the invention introduces an automatic supervision learning task into the prediction model, and aims to increase the accuracy of the main learning task and improve the performance of the model by adding an auxiliary task.

Loss of multiple tagsThe computing module represents through the characteristics of the original nodes in the self-supervision learning task

And a first perturbation node characterization

Calculating a first disturbance loss function

By original node feature representation

And a second perturbation node characterization

Calculating a second disturbance loss function

First disturbance loss function

And a second disturbance loss function

Namely, the function of the loss of the self-supervision learning is obtained;

in the formula: (z) _1,i ,z _0,i ) Represents an active sample pair; z is a radical of _0,i ∈Z ₀ ，z _1,i ∈Z ₁ ；θ(z _0,i ,z _1,i ) Representing the calculation of z _0,i And z _1,i Cosine similarity of (a); tau represents a temperature parameter which has the function of controlling the discrimination of the model to the negative samples, and the negative samples which are difficult to pay attention to are represented as the value is smaller; n represents all node sets.

In a specific implementation process, a multi-label loss calculation module calculates the following supervised learning loss function through protein relation graph continuous edge feature representation in a supervised learning task;

in the formula:

representing a training set of connected edge sets; p is a radical of _ij Representing a relational probability distribution between proteins i and j;

true labels that represent proteins i and j in class c;

the predicted results of proteins i and j in class c; argmax is expressed by taking the largest subscript in the set.

In the specific implementation process, a training loss function is expressed by the following formula;

in the formula:

represents a loss of training;

represents a supervised learning loss;

representing a first perturbation loss;

In order to better illustrate the advantages of the technical solution of the present invention, the following experiments are also disclosed in this example.

1. Data set

This experiment followed the setting of a dataset of previous work (disclosed in LV G F, HU Z Q, BI Y G et al, learning Unknown from Correlations: graph Neural Network for Inter-level-protein Interaction Prediction), and the evaluation of the model using the PPIs data in a database using STRING (disclosed in SZKLARCZYK D, GABLE A L, LYON D, et al, STRING v11: protein-protein association networks with associated updates, supporting functional conversion in genome-side experimental data).

The STRING database collects, scores and integrates most of the published PPIs data and establishes a comprehensive objective PPIs network. Furthermore, chen et al (published in CHEN M, JU C J T, ZHOU G, et al. Multi-faceted protein-protein interaction prediction based on Simase residual RCNN) extracted two subdata sets from STRING, called SHS27k and SHS148k, respectively. Specific information of the three data sets is shown in table 1, wherein the original data set is a protein network relational graph, nodes represent proteins, and connecting edges represent action relations among the proteins; secondly, since the protein is composed of amino acid sequences, the present experiment counted that the average length of the amino acid sequences constituting the protein in each data set is shown in table 1.

Table 1 data set statistics

2. Experimental setup and evaluation index

In the experiment, 20% of data in a data set is randomly selected as a test set, and then, in order to eliminate the influence of the randomness of data division on the performance of the PPI method, the experiment result is repeated under 3 different random seeds. This experiment uses amino acid sequence-based protein characterization, referring to the amino acid insertion method used by Chen et al (disclosed in CHEN M, JU C J T, ZHOU G, et al, multifaceted protein-protein interaction prediction based on size residual RCNN) to represent each amino acid. The model updates all trainable parameters using Adam's algorithm. This experiment was carried out using the experimental setup of the conventional work (disclosed in LV G F, HU Z Q, BI Y G et al, learning Unknown from relations: graph Neural Network for Inter-novel-protein Interaction Prediction) and using micro-F1 as an evaluation index.

3. Reference method

3.1 machine learning reference method

In the present experiment, three representative Machine Learning (ML) algorithms are selected as reference methods, namely Support Vector Machine (SVM) (disclosed in GUO Y, YU L, WEN Z, et al. Using a supported Vector Machine combined with auto-vary to predict protein-protein interactions from protein sequences), logistic Regression (LR) (disclosed in silverberg Y, kuporc, shaan r.a. method for predicting protein-protein interactions), and Random Forest (RF) (disclosed in WONG L, YOU Z H, LI S, et al. Detection of protein-protein interactions).

3.2 deep learning reference method

The present experiment selects the Deep Learning (DL) algorithm for the four PPIs Prediction tasks, respectively DPPI (published in HAHEMIFAR S, NEYSHABUR B, KHAN A, et al. Predictingprotein-protein interactions third-base Prediction), DNN-PPI (published in HAHEMIFAR S, NEYSHAR B, KHAN A, et al. Predictingprotein-protein interactions third-base Prediction), PIPR (published in CHEN M, JU J T, ZHOU G, et al. Multifaceted protein-protein interactions Prediction result in texture), and INTERNAL-N-Neural Network (published in Heart I, heart I).

4. Comparative experiment

Table 2 shows the performance of different calculation methods on different data sets, and the result format is the micro-F1 mean value + -standard deviation under three different random seeds, wherein LGMG-PPI is the model method proposed by the experiment.

TABLE 2 comparative study of the results

The following results were observed and analyzed:

1) The performance of the deep learning algorithm is generally superior to that of the machine learning algorithm, which shows that the deep learning-based technology has superiority in packaging various types of information (such as amino acid composition and co-occurrence condition thereof) of protein pairs and automatically extracting robust information suitable for learning targets. Second, as the size of the data set increases, the performance of each type of method also increases. This is because the amount of data increases, so that the model learns more sufficiently and the generalization ability of the model is stronger.

2) Compared with the optimal reference method GNN-PPI, the model method (LGMG-PPI) provided by the experiment has better prediction effect on all types of data and more stable effect. Wherein the micro-F1 score was raised by 2.01% on the SHS27k data set, 0.94% on the SHS148k data set, and 0.93% on the STRING data set. The optimal reference method is quite reliable, so that the model method provided by the experiment can be further improved on the basis of the optimal reference method, and the model method provided by the experiment is superior.

5. Ablation experiment

In order to further analyze the effect of each module in the model, experiments are carried out by deleting different modules, and then the effectiveness of each module is verified. Thus, this experiment sets up the following ablation experiments:

(1)

representing in de-scaling multi-scale-map neural network modules

The type of data enhancement, namely a data enhancement method without using a disturbing picture connecting edge;

(2)

representing in de-scaling multi-scale-map neural network modules

Type (B)The data enhancement of (2), namely a data enhancement method without using disturbing graph node characteristics;

(3) w/o MS-GDA: representing the complete removal of the multi-scale graph neural network module, i.e. without using a graph data enhancement strategy;

(4) w/o SL-LRG: and the label removing relation graph module is shown, namely, the label information is not used for guiding learning.

TABLE 3 ablation experiment

The results of the experiment are shown in table 3. From experimental results, the data enhancement method of the node characteristics of the perturbation graph is slightly better than the data enhancement method of the continuous edges of the perturbation graph, and both graph data enhancement methods are beneficial to the model. This shows that the graph data enhancement method can enhance the generalization capability of the model by perturbing the original graph data. In addition, when the label relationship graph module is removed, the effect of the model on all data sets is reduced. The introduction of the label relation graph module can learn the implicit relation between labels, further obtain the hidden state of the labels, and guide the final prediction result. In general, each submodule of the model provided by the experiment is beneficial to the whole model.

6. Self-learning tag relational graph effectiveness experiment

6.1 topological Structure validation experiment

The self-adaptive label graph further learns the label characteristics by introducing a self-learning topological structure. To verify the validity of the topology, the topology of the labellings would not be used, replacing the GCN with a Multi-Layer Perceptron (MLP). Specifically, the formula

Is replaced by Z _L ＝f _MLP (X _L )。

The results of the experiment are shown in FIG. 3. From experimental results, the effect of introducing the topological structure of the tag is obviously better. Therefore, certain relations exist among the labels of the PPIs prediction tasks, the implicit relation among the labels can be well learned through the self-learning label relation graph, and the effectiveness of the method provided by the invention is further proved.

6.2 node characteristic effectiveness test

The initial representation of the self-learning tag-relation graph node features is an embedded representation of words, which was obtained by the Pre-trained model BERT (disclosed in DEVLIN J, CHANG M W, LEE K, et al. Bert: pre-training of deep biological transformations for language embedding) used in this experiment. In this section, the performance of the model under different word-embedded representations will be evaluated. Specifically, the model effects in the BERT and One-Hot embedding representation were compared experimentally.

The results of the experiment are shown in FIG. 4. As can be seen from the figure, the multi-tag recognition accuracy is not significantly affected when different word embeddings are used as inputs to the GCN. This indicates that the effect enhancement achieved by the model does not come entirely from the semantic information derived from word embedding. Furthermore, using a powerful word embedding representation may lead to better performance. One possible reason is that word embedding learned from large text corpora retains certain semantic information, and the word embedding has certain relation in the embedding space, and the model can further improve the prediction capability of the model by using the implicit relation.

7. Summary of the invention

The invention provides a label-guided protein action relation prediction method of a multi-scale graph neural network, which comprises the steps of obtaining a graph representation of multiple scales by enhancing graph data, inputting the graphs of the multiple scales into the graph neural network to obtain protein characteristic representations of the multiple scales, and introducing contrast learning to further improve the characterization capability of protein; in addition, a self-learning label relation graph is constructed, the relation among the labels is learned, information representation of the labels is further obtained, and the final prediction of the protein relation is guided to learn. The experimental results on 3 public data sets show that the experimental model is effective in predicting the protein action relationship task, and the prediction effect is superior to that of the optimal reference method.

It should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the technical solutions, and those skilled in the art should understand that the technical solutions of the present invention can be modified or substituted with equivalent solutions without departing from the spirit and scope of the technical solutions, and all should be covered in the claims of the present invention.

Claims

1. The protein action relation prediction method based on label guidance is characterized by comprising the following steps:

s1: obtaining a pair of proteins to be predicted to be detected;

firstly, enhancing graph data based on the protein to be detected by the prediction model to obtain a multi-scale graph representation; secondly, inputting the multi-scale graph representation into a graph neural network to obtain multi-scale protein feature representation, and introducing contrast learning to eliminate the difference of different scale protein feature representations; then constructing a self-learning label relation graph and learning the relation among labels to obtain label characteristic representation; finally, the characteristic representation of the modified protein is represented through the label characteristic, the prediction of the action relation of the protein is guided, and then the corresponding prediction relation is output;

2. The method of claim 1 for predicting the protein action relationship based on the label guidance, wherein the method comprises the following steps: in step S2, the prediction model includes a protein feature encoder module for extracting local features and global features of the protein, a multi-scale graph neural network module for performing data enhancement, graph neural network processing, and comparative learning, a self-learning label relationship graph module for learning the relationship between labels, and a multi-label loss calculation module for performing a self-supervised learning task and a supervised learning task.

3. The method of claim 2 for predicting the protein action relationship based on the label guidance, wherein the method comprises the following steps: in step S2, the prediction model is trained by the following steps:

s203: constructing an original graph of the protein action relation based on the protein characteristic representation; disturbing the original graph through a multi-scale graph neural network module to obtain a corresponding disturbance graph; then inputting the original graph and the disturbance graph into a graph neural network, and outputting original node characteristic representation and disturbance node characteristic representation, namely multi-scale protein characteristic representation; then fusing the original node characteristic representation and the disturbance node characteristic representation in a comparative learning mode to obtain fusion node characteristic representation; finally, fusion edge feature representation is obtained through fusion node feature representation calculation;

s204: acquiring label name embedding representation through a self-learning label relation graph module, and constructing a label relation graph; then inputting the label relation graph into a graph convolution neural network, and outputting a label node characteristic representation;

s206: the multi-label loss calculation module carries out self-supervision learning through original node feature representation and disturbance node feature representation to obtain a self-supervision learning loss function; then, supervised learning is carried out through protein relational graph continuous edge feature representation to obtain a supervised learning loss function; finally, calculating based on the self-supervised learning loss function and the supervised learning loss function to obtain a training loss function, and further optimizing and updating parameters of the prediction model through the training loss function;

4. The method of claim 3 for predicting the protein action relationship based on the label guidance, wherein the method comprises the following steps: in step S202, the protein feature encoder module includes a local feature encoder and a global feature encoder;

Local feature in (1) represents h _i ；

h _i ＝f _GMP (f _CNN (p _i ；θ _CNN ))；

represents a protein set;

x _i ＝f _GAP (f _BiGRU (h _i ；θ _BiGRU ))；

In the formula: f. of _BiGRU Representing a bidirectional gated loop operation; f. of _GAP Representing a global averagePerforming a pooling layer operation; theta.theta. _BiGRU Training parameters representing a bi-directional gated loop operation; x represents the protein feature representation obtained based on the protein feature encoder module.

5. The method of claim 3 for predicting the protein action relationship based on the label guidance, wherein the method comprises the following steps: in step S203, an original graph G = (X, a), node characteristics are defined

And an adjacency matrix

And

disturbing the continuous edge and node characteristics of the original graph G = (X, A) respectively to obtain a first disturbed graph G1= (X, A) ₁ ) And a second disturbing graph G2= (X) ₂ ,A)；

v ₁ ～Bernoulli(N,1-δ ₁ )；

In the formula: e is a ₁ The representation is based on a hyper-parameter delta ₁ The result of N Bernoulli distribution obtained by epsilon (0, 1);

representation is based on the first perturbative graph G1= (X, a) ₁ ) Enhancing functions by graph data

The obtained protein characteristics are expressed;

representing a set of connected edges of an original graph; bernoulli represents Bernoulli distribution; delta ₁ E (0, 1) is a hyper-parameter and represents the ratio of deleting continuous edges;

the representation is based on a hyper-parameter delta ₂ The result of uniform distribution obtained by epsilon (0, 1); x ₂ Representation is based on the first perturbative graph G1= (X, a) ₁ ) Enhancing function T by graph data ₂ The obtained protein characteristics are expressed; x represents the node characteristics of the original graph; unifrom denotes uniform distribution; delta ₂ E (0, 1) is a hyper-parameter and represents the ratio of the node characteristic set to 0;

2) The original graph G = (X, a), the first disturbing graph G1= (X, a) ₁ ) And a second disturbing graph G2= (X) ₂ And A) inputting the data into a neural network of the graph respectively and outputting the characteristic representation of the original node

First perturbation node signature representation

And a second perturbation node signature representation

The graph neural network with k iterations is represented as:

in the formula:

or

or

or

3) Fusing raw node feature representations by

First perturbation node characterization

Z'＝f _Fusion ([Z ₀ ,Z ₁ ,Z ₂ ])；

In the formula: f. of _Fusion Representing a fusion function;

e _ij ＝z′ _i ⊙z′ _j ，e _ij ∈E；

In the formula: an indication of a hadamard product;

and

respectively representing the characteristic representations of node i and node j.

6. The method of claim 5 for predicting the effect relationship of a protein based on the label guidance, wherein: in step S204, the self-learning label relation graph module obtains label name embedding representation through a pre-training model BERT;

X _L ＝BERT(L _NAME )；

in the formula:

represents a label name;

representation X by tag name embedding _L And learnable parameter matrix

Constructing a tag relationship graph G _L ＝(A _L ,X _L )；

In the formula: initialization Z _L ⁽⁰⁾ ＝X _L ；

A representation degree matrix; w ^(l-1) Representing a learnable parameter matrix; sigma represents a sigmoid activation function; a. The _L Is initialized to an identity matrix.

7. The method of claim 6 for predicting the effect relationship of a protein based on the label guidance, wherein: in step S205, the protein relational graph connecting edge feature representation is calculated through the following formula;

in the formula:

representing a protein relation graph with edge characteristic representation; e represents fused edge feature representation;

and (4) label node characteristic representation.

8. The method of claim 7 for predicting the effect relationship of a protein based on the label guidance, wherein: in step S206, the multi-label loss calculation module represents the original node characteristics in the self-supervision learning task

And a first perturbation node signature representation

Calculating a first disturbance loss function

By raw node feature representation

And a second perturbation node signature representation

Calculating a second disturbance loss function

First disturbance loss function

And a second disturbance loss function

Namely, the function is the self-supervision learning loss function;

in the formula: (z) _1,i ,z _0,i ) Represents a positive sample pair; z is a radical of _0,i ∈Z ₀ ，z _1,i ∈Z ₁ ；θ(z _0,i ,z _1,i ) Representing the calculation of z _0,i And z _1,i Cosine similarity of (d); τ represents a temperature parameter; n represents all node sets.

9. The method of claim 8 for predicting the effect relationship of a protein based on the label guidance, wherein: in step S206, the multi-label loss calculation module calculates the following supervised learning loss function through the protein relation graph connection edge feature representation in the supervised learning task;

p _ij ＝Softmax(e _ij )；

in the formula:

true labels that represent proteins i and j in class c;

represents the predicted results of proteins i and j in class c; argmax is expressed by taking the largest subscript in the set.

10. The method of claim 9 for predicting the protein action relationship based on the label guidance, wherein the method comprises the following steps: in step S206, a training loss function is expressed by the following formula;

in the formula:

represents a loss of training;

represents a supervised learning loss;

representing a first perturbation loss;