CN114678081A

CN114678081A - Compound-protein interaction prediction method fusing network topology information

Info

Publication number: CN114678081A
Application number: CN202210491027.XA
Authority: CN
Inventors: 刘宏生; 于笑雪; 张力; 徐鑫
Original assignee: Liaoning University
Current assignee: Liaoning University
Priority date: 2022-05-07
Filing date: 2022-05-07
Publication date: 2022-06-28

Abstract

The invention relates to a compound-protein interaction prediction method fusing network topology information, comprising the following steps: step 1: preprocessing the data; step 2: constructing an interaction network and calculating the centrality measure of each node in the network; step 3: for each compound-protein pair in the data set, calculating a correlation measure of the compound to the protein using a method based on the number of common neighbors; step 4: constructing a transformer-based model and adding the centrality of each node to the node features; step 5: using the correlation of each pair of nodes as a bias term in the cross-attention module; step 6: outputting the prediction probability using a fully connected layer. The invention takes the topological information of the interaction network into account, fuses the properties of the protein and the compound with this topological information, and uses it to improve the accuracy of compound-protein interaction prediction.

Description

Compound-protein interaction prediction method fusing network topology information

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a compound-protein interaction prediction method fusing network topology information.

Background

Proteins are the basis of life activities and play broad and important roles in organisms. Drugs are generally compounds with specific properties that affect the function of a protein by binding to it, thereby producing a therapeutic effect. The study of compound-protein interactions is an important part of drug design and is of great significance for drug development. To improve the efficiency of drug development, many deep learning-based prediction models have been developed, but existing models fail to explicitly fuse network topology information into the model.

Disclosure of Invention

The invention aims to provide a compound-protein interaction prediction method fused with network topology information, which can effectively improve the accuracy of compound-protein interaction prediction.

In order to achieve the purpose, the invention adopts the following technical scheme:

a method for predicting compound-protein interaction fused with network topology information, comprising the following steps:

step 1: preprocessing the data;

step 2: constructing a compound-protein interaction network according to the data set, and calculating the degree of each node in the interaction network as the centrality measurement of the node;

step 3: for each compound-protein pair in the data set, calculating the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, as a correlation measure of the compound to the protein; obtaining a correlation measure of the protein to the compound in the same way;

step 4: constructing a transformer-based binary classification model, assigning each node a real-valued embedding vector according to its centrality measure, and adding the embedding vector to the node features;

step 5: assigning a learnable scalar to each possible value of the correlation of each pair of nodes, and using the learnable scalar as a bias term of the cross-attention module of the model in step 4;

step 6: finally, outputting the prediction probability using a fully connected layer.

Further, the step 1 specifically comprises:

step 1.1: preprocessing the compound-protein interaction data, the protein sequence information and the compound SMILES data, removing abnormal values, randomly generating negative examples, and randomly dividing a data set;

step 1.2: encoding the protein sequence using the SeqVec model;

step 1.3: using RDKit to extract the compound features and the adjacency matrix of the compound graph.
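The random negative sampling and random split of step 1.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that any compound-protein pair absent from the positive set may serve as a negative example, that as many negatives as positives are drawn, and that the split ratio is 8:1:1. The function names `sample_negatives` and `split_dataset`, the ratio and the toy pairs are all hypothetical.

```python
import random

def sample_negatives(positives, seed=0):
    """Draw random (compound, protein) pairs that are not known positives.

    Assumption: an unobserved pair is treated as a negative example, and
    the number of negatives equals the number of positives when possible.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    compounds = sorted({c for c, _ in positive_set})
    proteins = sorted({p for _, p in positive_set})
    # Never ask for more negatives than there are unobserved pairs.
    available = len(compounds) * len(proteins) - len(positive_set)
    target = min(len(positive_set), available)
    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into training, validation and test sets.

    Assumption: the 8:1:1 ratio is illustrative only.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_valid = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Toy data (hypothetical identifiers).
positives = [("c1", "p1"), ("c1", "p2"), ("c2", "p1"), ("c3", "p3")]
negatives = sample_negatives(positives)
samples = [(c, p, 1) for c, p in positives] + [(c, p, 0) for c, p in negatives]
train, valid, test = split_dataset(samples)
```

Fixing the seed keeps the sampled negatives and the split reproducible between runs.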

Further, the step 2 specifically comprises:

step 2.1: taking each compound and each protein in the original data set as a node and the positive interactions between paired compounds and proteins as edges, constructing a compound-protein interaction network;

step 2.2: calculating the number of neighbor nodes of each node in the network as the degree centrality of the node.
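Step 2 can be sketched in a few lines. The sketch below assumes the positive interactions are available as (compound, protein) identifier pairs; the function names and the toy identifiers are hypothetical. Nodes are tagged with their type so that a compound and a protein with the same identifier cannot collide.

```python
from collections import defaultdict

def build_interaction_network(positive_pairs):
    """Step 2.1: undirected bipartite network stored as neighbor sets."""
    neighbors = defaultdict(set)
    for compound, protein in positive_pairs:
        c_node = ("compound", compound)
        p_node = ("protein", protein)
        neighbors[c_node].add(p_node)
        neighbors[p_node].add(c_node)
    return dict(neighbors)

def degree_centrality(neighbors):
    """Step 2.2: the degree of a node is its number of neighbor nodes."""
    return {node: len(adjacent) for node, adjacent in neighbors.items()}

# Toy data (hypothetical identifiers).
positive_pairs = [("c1", "p1"), ("c1", "p2"), ("c2", "p1"),
                  ("c2", "p2"), ("c3", "p2")]
network = build_interaction_network(positive_pairs)
degrees = degree_centrality(network)
```

Using sets for the neighbor lists means a duplicated interaction record does not inflate the degree.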

Further, the step 3 specifically includes:

step 3.1: calculating and storing the number of common neighbors between every two proteins and the number of common neighbors between every two compounds in the interaction network;

step 3.2: for each compound-protein pair in the data set, finding, from the results stored in step 3.1, the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, and recording the maximum value as the correlation measure of the compound to the protein;

step 3.3: for each compound-protein pair in the data set, finding, from the results stored in step 3.1, the number of common neighbors between the compound and each neighboring node of the protein in the interaction network, and recording the maximum value as the correlation measure of the protein to the compound.
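Steps 3.1 to 3.3 can be sketched as follows. The toy network is an assumption chosen to be consistent with the x1/y1 example in the description, not a copy of FIG. 1. Two further assumptions are made: the paired node itself is excluded from the candidate neighbors, as in that example, and the correlation is 0 when no other neighbor exists. All function names are hypothetical.

```python
from collections import defaultdict

def build_neighbors(positive_pairs):
    """Neighbor sets of the bipartite network, kept separately per side."""
    compound_nbrs, protein_nbrs = defaultdict(set), defaultdict(set)
    for compound, protein in positive_pairs:
        compound_nbrs[compound].add(protein)
        protein_nbrs[protein].add(compound)
    return compound_nbrs, protein_nbrs

def common_neighbor_counts(nbrs):
    """Step 3.1: common-neighbor count for every pair of same-type nodes."""
    nodes = sorted(nbrs)
    counts = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            shared = len(nbrs[a] & nbrs[b])
            counts[(a, b)] = counts[(b, a)] = shared
    return counts

def correlation(source, target, source_nbrs, target_side_counts):
    """Steps 3.2/3.3: maximum common-neighbor count between `target` and
    each neighbor of `source` other than `target` (0 if there is none)."""
    candidates = [n for n in source_nbrs.get(source, ()) if n != target]
    return max((target_side_counts.get((n, target), 0) for n in candidates),
               default=0)

# Toy network (assumed): x* on one side, y* on the other.
pairs = [("x1", "y1"), ("x1", "y2"), ("x2", "y1"), ("x2", "y2"), ("x3", "y2")]
compound_nbrs, protein_nbrs = build_neighbors(pairs)
protein_counts = common_neighbor_counts(protein_nbrs)    # shared compounds
compound_counts = common_neighbor_counts(compound_nbrs)  # shared proteins

x1_to_y1 = correlation("x1", "y1", compound_nbrs, protein_counts)
y1_to_x1 = correlation("y1", "x1", protein_nbrs, compound_counts)
```

In this toy network both directions evaluate to 2, matching the worked example in the description: y2 and y1 share x1 and x2, and x2 and x1 share y1 and y2.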

Further, the step 4 specifically includes:

step 4.1: constructing a conventional transformer model, removing the positional encoding of the decoder, and changing the mask from a lower triangular matrix to an adjacency matrix so that the decoder can only see adjacent nodes;

step 4.2: assigning each node a real-valued embedding vector according to its centrality measure and adding it to the node features, as follows:

F = X + Z_deg (1)

where F is the resulting new feature vector, X is the initial feature vector of an amino acid or atom, and Z_deg is a learnable embedding vector specified by the degree of the protein or compound node.
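Formula (1) can be illustrated with a small lookup table. This sketch rests on an interpretation: the same degree embedding of a protein (or compound) node is added to the feature vector of every amino acid (or atom) belonging to it. The class `DegreeEmbedding`, the clipping of large degrees to `max_degree`, and the random initialisation are assumptions; in a real model the table entries would be trained parameters.

```python
import random

class DegreeEmbedding:
    """One embedding vector per degree value (random here, learnable in a model)."""

    def __init__(self, max_degree, dim, seed=0):
        rng = random.Random(seed)
        self.max_degree = max_degree
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(max_degree + 1)]

    def __call__(self, degree):
        # Assumption: degrees above max_degree share the last table entry.
        return self.table[min(degree, self.max_degree)]

def add_degree_encoding(features, degree, embedding):
    """F = X + Z_deg, applied to every row of the feature matrix X."""
    z_deg = embedding(degree)
    return [[x + z for x, z in zip(row, z_deg)] for row in features]

# Toy feature matrix: two amino acids (or atoms), three features each.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
embedding = DegreeEmbedding(max_degree=10, dim=3)
F = add_degree_encoding(X, degree=2, embedding=embedding)
```

Because Z_deg depends only on the degree, all nodes with the same degree share one embedding vector.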

Further, the step 5 specifically includes:

wherein the attention weight is calculated in the conventional way with an added bias term; the function φ is defined by the correlation between nodes; and the bias term is a learnable scalar, indexed by the output value of the function φ and shared among all layers.
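The attention formula itself is not reproduced in this text, so the following is only a sketch that follows the verbal description: a conventional scaled dot-product score plus a learnable scalar looked up by the value of φ. How φ is assigned to each query-key pair is an assumption here (it is passed in as an integer matrix `phi`), and `bias_table` stands in for the learnable scalars with made-up values.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(queries, keys, phi, bias_table):
    """score[i][j] = q_i . k_j / sqrt(d) + bias_table[phi[i][j]], then softmax."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  + bias_table[phi[i][j]]
                  for j, k in enumerate(keys)]
        weights.append(softmax(scores))
    return weights

# Toy inputs (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phi = [[0, 2, 1], [1, 0, 2]]   # assumed correlation value per query-key pair
bias_table = [0.0, 0.5, 1.0]   # one scalar per possible correlation value

weights = biased_attention_weights(queries, keys, phi, bias_table)
unbiased = biased_attention_weights(queries, keys, phi, [0.0, 0.0, 0.0])
```

Comparing `weights` with `unbiased` shows the effect of the bias: pairs with a larger correlation value receive a larger share of the attention. Sharing `bias_table` across layers means the same table is passed to every cross-attention layer.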

Compared with the prior art, the invention has the following beneficial effects:

1. the invention takes network topology information into account and fuses node centrality encoding and correlation encoding into the model, so that the model contains more useful information;

2. a transformer-based binary classification model is constructed, and its cross-attention mechanism is used to process the relationship between the protein features and the compound features, so as to fuse multi-modal information and improve the accuracy of interaction prediction.

Drawings

FIG. 1 is a schematic diagram of a node relevance computation method;

FIG. 2 is a schematic diagram of a system;

FIG. 3a is a line graph comparing the performance of an embodiment of the present invention with other methods on a human data set;

FIG. 3b is a line graph comparing the performance of an embodiment of the present invention with other methods on a C. elegans data set.

Detailed Description

The invention is further described with reference to the following figures and examples.

As shown in FIG. 1, the invention provides a compound-protein interaction prediction method fusing network topology information. The method first calculates the centrality of each node and the correlation of each compound-protein pair from the topology of the compound-protein interaction network, then builds a model to predict potential interactions between compounds and proteins, and finally evaluates the performance of the model with corresponding metrics.

FIG. 1 illustrates the correlation calculation. A compound-protein interaction network is constructed from the data set and then projected. For a pair (node 1, node 2), the number of common neighbors between node 2 and each neighboring node of node 1 in the interaction network is calculated to obtain the correlation measure of node 1 to node 2. The upper half of the figure shows the original interaction network, which is essentially a bipartite graph. The lower half shows the projections of the original network onto the set of compounds and onto the set of proteins: two protein nodes are connected if they are linked to the same compound, and the weight of the edge is the number of compounds they share. Compound nodes are treated in the same way.

The correlation is calculated from these projections. To obtain the correlation of x1 to y1 in FIG. 1, the neighbor of x1, other than y1, that has the most common neighbors with y1 is found first, and that number of common neighbors is taken. Here x1 has only two neighbors, so only y2 qualifies; y2 and y1 have 2 common neighbors, namely x1 and x2, so the correlation of x1 to y1 is 2. Likewise, the neighbor of y1, other than x1, that has the most common neighbors with x1 is found, which is x2 in FIG. 1, and the number of common neighbors of x2 and x1, namely 2, is taken as the correlation of y1 to x1.

Finally, a learnable scalar is assigned to each possible correlation value and used as a bias term in the cross-attention module; that is, the cross-attention module encodes the correlation, which adds more useful information to the model. The invention aims to predict potential interactions with a compound-protein interaction prediction model that fuses centrality encoding and correlation encoding.

As shown in fig. 2, the flow of the embodiment of the present invention is as follows:

Step 1: compound-protein interaction pairs are first obtained from the human and Caenorhabditis elegans data sets. The human data set contains 3369 positive interactions between 1052 compounds and 852 proteins; the Caenorhabditis elegans data set contains 4000 positive interactions between 1434 compounds and 2504 proteins. After abnormal values are removed, negative samples equal in number to the positive samples are randomly generated, and the data are randomly divided into a training set, a validation set and a test set. Amino acid embeddings are obtained with the SeqVec model, an ELMo-based model pre-trained on a large protein database. Atom embeddings are obtained from the molecular SMILES descriptors with RDKit, an open-source cheminformatics toolkit.

Step 2: for each of the two data sets, each compound and each protein is taken as a node and the positive interactions between paired compounds and proteins are taken as edges to construct a compound-protein interaction network; the number of neighbor nodes of each node in the network is calculated as the degree centrality of the node.

Step 3: for each compound-protein pair in the data set, the number of common neighbors between the protein and each neighboring node of the compound in the interaction network is calculated and used as the correlation measure of the compound to the protein. This measure reflects how many compounds the target protein shares with the other proteins that interact with the compound; intuitively, two protein nodes that already share many compounds tend to share more compounds in the future. The correlation measure of the protein to the compound is obtained in the same way.

Step 4: a transformer-based binary classification model is constructed; the positional encoding of the decoder is removed, and the mask is changed from a lower triangular matrix to an adjacency matrix so that the decoder can only see adjacent nodes. Each node is assigned a real-valued embedding vector according to its centrality measure; this vector is added to the original feature matrix of the node, and the sum is input into the model as the new feature matrix, according to formula (1).

F＝X+Z_deg (1)
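The adjacency-matrix mask of step 4 can be sketched as a masked softmax: positions that are not adjacent are excluded before normalisation, so each node attends only to its neighbors. Allowing a node to attend to itself is an assumption made here so that isolated nodes still have a valid distribution; the function name and the toy graph are hypothetical.

```python
import math

def masked_attention_weights(scores, adjacency):
    """Softmax over each row of `scores`, restricted to adjacent nodes.

    Assumption: every node may also attend to itself (self-loop).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = [j for j in range(n) if adjacency[i][j] or i == j]
        m = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - m) for j in visible}
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0
                        for j in range(n)])
    return weights

# Toy graph with three nodes in a chain: 0 - 1 - 2.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
scores = [[0.0] * 3 for _ in range(3)]  # equal raw scores for clarity
weights = masked_attention_weights(scores, adjacency)
```

With equal raw scores, node 0 splits its attention between itself and node 1 and gives none to node 2, whereas a lower triangular mask would instead hide every position to the right of the diagonal regardless of the graph structure.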

Step 5: a learnable scalar is assigned to each possible value of the correlation of each pair of nodes and used as a bias term of the cross-attention module of the model in step 4, as in formula (2).


Verification of the effectiveness of the invention:

In comparative experiments, the performance of the invention was evaluated on five metrics, and the comparison with other methods is shown in FIG. 3a and FIG. 3b. The best results of the invention on the test set are: precision 0.997, recall 1, accuracy 0.999, F1 score 0.998 and AUC 1. The results show that the method outperforms the other methods.
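For reference, the five metrics can be computed from the predicted probabilities as in the sketch below. The 0.5 decision threshold is an assumption, AUC is computed by direct pair counting with ties counted as one half, and the labels and scores are made-up toy values, not results from the experiments above.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, accuracy, F1 and AUC for binary labels."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # AUC: probability that a random positive outscores a random negative.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "auc": auc}

# Toy labels and predicted probabilities (illustrative only).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
metrics = classification_metrics(y_true, y_score)
```

Precision, recall, accuracy and F1 depend on the chosen threshold, while AUC depends only on the ranking of the scores.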

Claims

1. A method for predicting compound-protein interaction fused with network topology information, which is characterized by comprising the following steps:

step 1: preprocessing the data;

step 2: constructing a compound-protein interaction network according to the data set, and calculating the degree of each node in the interaction network as the centrality measure of the node;

step 4: constructing a transformer-based binary classification model, assigning each node a real-valued embedding vector according to its centrality measure, and adding the embedding vector to the node features;

step 5: assigning a learnable scalar to each possible value of the correlation of each pair of nodes, and using the learnable scalar as a bias term of the cross-attention module of the model in step 4;

2. The method for predicting a compound-protein interaction fused with network topology information according to claim 1, wherein the step 1 is specifically:

step 1.1: preprocessing compound-protein interaction data, protein sequence information and compound SMILES data, removing abnormal values, randomly generating negative examples, and randomly dividing a data set;

step 1.2: encoding the protein sequence using the SeqVec model;

3. The method for predicting a compound-protein interaction fused with network topology information according to claim 1, wherein the step 2 is specifically:

step 2.1: constructing a compound-protein interaction network by taking each compound and each protein in the original data set as nodes and taking the positive interaction of the paired compounds and proteins as edges;

4. The method for predicting a compound-protein interaction fused with network topology information according to claim 1, wherein the step 3 is specifically:

step 3.1: calculating and storing the number of common neighbors between every two proteins and the number of common neighbors between every two compounds in the interaction network;

step 3.2: for each compound-protein pair in the data set, finding, from the results stored in step 3.1, the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, and recording the maximum value as the correlation measure of the compound to the protein;

step 3.3: for each compound-protein pair in the data set, finding, from the results stored in step 3.1, the number of common neighbors between the compound and each neighboring node of the protein in the interaction network, and recording the maximum value as the correlation measure of the protein to the compound.

5. The method for predicting a compound-protein interaction fused with network topology information according to claim 1, wherein the step 4 is specifically:

step 4.1: constructing a conventional transformer model, removing the positional encoding of the decoder, and changing the mask from a lower triangular matrix to an adjacency matrix so that the decoder can only see adjacent nodes;

step 4.2: assigning each node a real-valued embedding vector according to its centrality measure and adding it to the node features, as follows:

F = X + Z_deg (1)

wherein F is the resulting new feature vector; X is the initial feature vector of an amino acid or atom; and Z_deg is a learnable embedding vector specified by the degree of the protein or compound node.

6. The method for predicting a compound-protein interaction fused with network topology information according to claim 1, wherein the step 5 is specifically:

wherein the bias term is a learnable scalar, indexed by the output value of the function φ, and shared across all layers.